Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
Responsibilities
- Design, develop, and optimize core training operators on AMD GPUs, including GEMM, Grouped GEMM, Attention, DeepEP, and related kernels, with a strong focus on maximizing performance (a minimal kernel sketch follows this list).
- Analyze performance bottlenecks in large-scale model training workloads and drive end-to-end system-level optimizations.
- Work closely with hardware, compiler, runtime, and framework teams to continuously improve the performance, stability, and usability of the ROCm ecosystem.
- Contribute to advanced research and development initiatives, including next-generation GPU architectures, compute–communication fusion, and AGI-driven automatic generation of high-performance operators.
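To make the flavor of this operator work concrete, below is a minimal sketch of a naive HIP SGEMM kernel, the kind of baseline this role would profile and then optimize. It assumes the ROCm hipcc toolchain; all names are illustrative, not AMD production code.

```cpp
#include <hip/hip_runtime.h>

// Naive SGEMM: C = A * B for row-major MxK and KxN matrices, one thread
// per output element. Deliberately unoptimized: left as a baseline for
// profiling and iterative tuning.
__global__ void sgemm_naive(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Launch: 16x16 thread blocks covering the MxN output.
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (M + 15) / 16);
// sgemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
```

A production kernel would tile A and B through LDS (shared memory), vectorize global loads, and accumulate with MFMA instructions to approach peak throughput on CDNA-class GPUs.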
QUALIFICATIONS:
- Solid foundation in computer architecture and high-performance computing.
- Strong proficiency in C/C++, with hands-on experience in GPU programming and parallel development using HIP, CUDA, and Triton, and strong engineering implementation capabilities.
- Deep understanding of parallel computing principles and GPU execution models, with proven skills in performance profiling, analysis, and optimization.
- Practical experience with large-scale model training pipelines and operator-level performance optimization.
- Strong collaboration skills and the ability to work effectively across teams and technical domains.
PREFERRED EXPERIENCE:
- Familiarity with modern GPU architectures (e.g., AMD CDNA4, NVIDIA Blackwell) and associated performance tuning techniques.
- Demonstrated experience optimizing high-performance kernels such as GEMM, Attention, Grouped GEMM, and DeepEP.
- Experience with collective communication primitives (e.g., AllReduce, All-to-All, ReduceScatter) and their performance optimization.
- Experience in one or more of the following areas:
  - Low-precision computing (FP8 / FP4)
  - Compute–communication overlap (see the sketch after this list)
  - Compiler optimizations
  - Automatic generation of high-performance operators
- Experience developing or optimizing large-scale training systems such as Megatron-LM, TorchTitan, or similar frameworks.
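As a minimal sketch of the compute–communication overlap item above, reduced to its simplest single-device form: overlapping a kernel on one HIP stream with an asynchronous host-to-device copy on another. Buffer names and sizes are illustrative, and error checking is omitted for brevity.

```cpp
#include <hip/hip_runtime.h>

__global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned host buffers: hipMemcpyAsync only overlaps with compute
    // when the host side is page-locked.
    float *h_a, *h_b, *d_a, *d_b;
    hipHostMalloc((void**)&h_a, bytes, hipHostMallocDefault);
    hipHostMalloc((void**)&h_b, bytes, hipHostMallocDefault);
    hipMalloc((void**)&d_a, bytes);
    hipMalloc((void**)&d_b, bytes);

    hipStream_t s0, s1;
    hipStreamCreate(&s0);
    hipStreamCreate(&s1);

    // Stream s0 copies then computes on buffer a; stream s1 stages
    // buffer b concurrently, hiding its transfer behind s0's kernel.
    hipMemcpyAsync(d_a, h_a, bytes, hipMemcpyHostToDevice, s0);
    scale<<<(n + 255) / 256, 256, 0, s0>>>(d_a, 2.0f, n);
    hipMemcpyAsync(d_b, h_b, bytes, hipMemcpyHostToDevice, s1);

    hipDeviceSynchronize();  // wait for both streams before teardown
    hipStreamDestroy(s0); hipStreamDestroy(s1);
    hipFree(d_a); hipFree(d_b);
    hipHostFree(h_a); hipHostFree(h_b);
    return 0;
}
```

At training scale the same pattern generalizes: collectives (e.g., RCCL AllReduce) are issued on side streams so gradient exchange hides behind backward-pass compute.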
ACADEMIC CREDENTIALS:
- Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent
Benefits offered are described: AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's “Responsible AI Policy” is available here.
This posting is for an existing vacancy.