Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
THE ROLE
AMD is seeking a Director of Machine Learning Engineering to join our Models and Applications organization. In this role, you will define and execute the technical vision for distributed training of large-scale generative AI and recommendation models on AMD GPUs. You'll guide a world-class engineering team focused on scaling AI training efficiency, optimizing model performance, and advancing AMD's leadership in AI systems.
This position blends deep technical expertise with strategic leadership. You will partner closely with research, hardware, and software teams to shape the roadmap for AMD's AI training stack, driving innovation at both the model and application levels and influencing how next-generation AI models are trained and deployed efficiently on AMD platforms.
The ideal candidate is a strategic technical leader with a strong foundation in distributed training and AI infrastructure, coupled with experience building or guiding high-impact ML applications such as recommendation systems and ranking models. You combine visionary thinking with execution excellence, thrive in cross-functional collaboration, and are passionate about scaling AI systems that fully leverage AMD GPU performance across both model and application layers.
KEY RESPONSIBILITIES
- Strategic Leadership & Vision: Define and drive AMD's distributed training strategy for large-scale generative and recommendation models. Align technical initiatives with broader AI platform goals and business impact.
- Technical Direction & Innovation: Architect and optimize distributed training pipelines (pre-training, SFT, RL, etc.) for large-scale models. Explore new approaches for efficient training and inference of LLMs and ranking systems.
- Execution & Delivery: Lead development of high-performance, reliable training pipelines that scale across thousands of GPUs. Ensure world-class efficiency, stability, and model convergence.
- Cross-Functional Collaboration: Partner with compiler, runtime, system software, and hardware architecture teams to co-design solutions that maximize end-to-end performance.
- Team Leadership & Development: Build, mentor, and empower a team of expert engineers focused on innovation, collaboration, and technical excellence.
- Open Source & External Engagement: Drive AMD's engagement in open-source communities through contributions to frameworks such as PyTorch, JAX, TorchTitan, and Megatron-LM. Represent AMD's leadership in AI system design across industry and research communities.
- Research & Trends: Stay ahead of emerging advances in distributed training, LLMs, recommendation systems, and AI infrastructure — and translate them into scalable engineering practices.
PREFERRED EXPERIENCE
- 10+ years in machine learning, distributed systems, or AI infrastructure; 5+ years in technical leadership or management roles.
- Proven experience building and optimizing distributed training systems for large models.
- Experience in both model-level and application-level development and optimization is preferred.
- Strong familiarity with ML frameworks (PyTorch, JAX, TensorFlow) and distributed frameworks (TorchTitan, Megatron-LM).
- Hands-on expertise with LLMs, recommendation systems, or ranking models.
- Proficiency in Python and C++, including performance profiling, debugging, and large-scale optimization.
- Experience collaborating across hardware, compiler, and system software layers.
- Excellent communication, leadership, and problem-solving skills with the ability to influence across organizations and external partners.
ACADEMIC CREDENTIALS
Master's or Ph.D. in Computer Science, Artificial Intelligence, Machine Learning, or a related field.
LOCATION
San Jose, CA or Bellevue, WA preferred. Other U.S. locations near AMD offices may be considered.
Benefits offered are described in AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.