Company: AMD

Location: San Jose, CA

Career Level: Director

Industries: Technology, Software, IT, Electronics

Description

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

THE ROLE

AMD is seeking a Director of Machine Learning Engineering to join our Models and Applications organization. In this role, you will define and execute the technical vision for distributed training of large-scale generative AI and recommendation models on AMD GPUs. You'll guide a world-class engineering team focused on scaling AI training efficiency, optimizing model performance, and advancing AMD's leadership in AI systems.

This position blends deep technical expertise with strategic leadership. You will partner closely with research, hardware, and software teams to shape the roadmap for AMD's AI training stack — driving innovation at both the model and application levels, influencing how next-generation AI models are trained and deployed efficiently on AMD platforms.

THE PERSON

The ideal candidate is a strategic technical leader with a strong foundation in distributed training and AI infrastructure, coupled with experience building or guiding high-impact ML applications such as recommendation systems and ranking models. You combine visionary thinking with execution excellence, thrive in cross-functional collaboration, and are passionate about scaling AI systems that fully leverage AMD GPU performance across both model and application layers.

KEY RESPONSIBILITIES

Strategic Leadership & Vision: Define and drive AMD's distributed training strategy for large-scale generative and recommendation models. Align technical initiatives with broader AI platform goals and business impact.
Technical Direction & Innovation: Architect and optimize distributed training pipelines (pre-training, SFT, RL, etc.) for large-scale models. Explore new approaches for efficient training and inference of LLMs and ranking systems.
Execution & Delivery: Lead development of high-performance, reliable training pipelines that scale across thousands of GPUs. Ensure world-class efficiency, stability, and model convergence.
Cross-Functional Collaboration: Partner with compiler, runtime, system software, and hardware architecture teams to co-design solutions that maximize end-to-end performance.
Team Leadership & Development: Build, mentor, and empower a team of expert engineers focused on innovation, collaboration, and technical excellence.
Open Source & External Engagement: Drive AMD's engagement in open-source communities through contributions to frameworks such as PyTorch, JAX, TorchTitan, and Megatron-LM. Represent AMD's leadership in AI system design across industry and research communities.
Research & Trends: Stay ahead of emerging advances in distributed training, LLMs, recommendation systems, and AI infrastructure — and translate them into scalable engineering practices.

PREFERRED EXPERIENCE

10+ years in machine learning, distributed systems, or AI infrastructure; 5+ years in technical leadership or management roles.
Proven experience building and optimizing distributed training systems for large models.
Experience in both model-level and application-level development and optimization is preferred.
Strong familiarity with ML frameworks (PyTorch, JAX, TensorFlow) and distributed frameworks (TorchTitan, Megatron-LM).
Hands-on expertise with LLMs, recommendation systems, or ranking models.
Proficiency in Python and C++, including performance profiling, debugging, and large-scale optimization.
Experience collaborating across hardware, compiler, and system software layers.
Excellent communication, leadership, and problem-solving skills with the ability to influence across organizations and external partners.

ACADEMIC CREDENTIALS

Master's or Ph.D. in Computer Science, Artificial Intelligence, Machine Learning, or a related field.

LOCATION

San Jose, CA or Bellevue, WA preferred. Other U.S. locations near AMD offices may be considered.

Benefits offered are described in "AMD benefits at a glance."

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.

Director of Machine Learning Engineering -- Training and Performance Job Listing at AMD in San Jose, CA (Job ID 73067-en-us)
