Company: AMD
Location: Hyderabad, Telangana, India
Career Level: Mid-Senior Level
Industries: Technology, Software, IT, Electronics

Description



WHAT YOU DO AT AMD CHANGES EVERYTHING 

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond.  Together, we advance your career.  



MTS SOFTWARE SYSTEM DESIGN ENGINEER (AI/ML, GPU, Drivers, Firmware)

 

OVERVIEW
We are seeking an experienced and versatile professional with expertise in validation strategy, automation, and quality for AI/ML model serving, GPU software stacks, device drivers, firmware, and cross-platform systems (Linux/Windows). You will build test frameworks, drive CI quality gates, perform performance and reliability testing, and lead cross-stack triage to ensure robust releases in a rapidly evolving environment.

 

KEY RESPONSIBILITIES: 

  • Own end-to-end test strategy for AI/ML workflows (PyTorch, vLLM), GPU runtimes, drivers, and firmware across kernel and user space.
  • Develop scalable automation frameworks spanning unit, integration, HIL (hardware-in-the-loop), system, and end-to-end tests.
  • Implement and maintain CI quality gates (GitHub Actions/Workflows, Jenkins), including automated build, test execution, artifact management, reporting, and flake reduction.
  • Design and execute performance, stress, reliability, soak, and long-haul tests targeting GPU compute, memory, I/O, and serving throughput/latency.
  • Validate cross-platform compatibility (Linux/Windows), covering driver interfaces, kernel interactions, firmware behavior, and runtime stability.
  • Create reproducible environments with containers/orchestration; instrument telemetry and observability for data-driven QA.
  • Apply agentic AI techniques to accelerate test generation, triage, and root cause analysis; integrate intelligent diagnostics into pipelines.
  • Develop rigorous test cases for low-level features (PCIe, DMA, interrupts, memory management), error handling, recovery, and fault injection.
  • Define and track quality KPIs (coverage, defect escape rate, MTTR, performance regressions) and drive continuous improvement.
  • Lead defect triage across hardware, firmware, driver, runtime, and model layers; collaborate with engineering to resolve issues rapidly.
  • Produce comprehensive documentation: test plans, procedures, fixtures, coverage maps, readiness criteria, and retrospectives.

MINIMUM QUALIFICATIONS: 

  • 8–12 years in QA/Test for systems software or platform engineering, with at least 4 years focused on GPU software, device drivers, or firmware validation.
  • Demonstrable ownership of validation for AI/ML pipelines and serving stacks using PyTorch and at least one modern inference framework (e.g., vLLM), including accuracy baselining and performance regression detection.
  • Proven expertise testing drivers and firmware with hands-on work in:
    • PCIe fundamentals (link training, BARs, MSI/MSI-X), DMA engines, interrupt handling, and memory models.
    • Failure modes: error injection, recovery paths, power/thermal events, and persistence across reboot/upgrade cycles.
  • Deep proficiency in Linux (kernel/user space) and practical experience with Windows driver ecosystems; ability to:
    • Read kernel logs and symbols, trace with ftrace/perf/ETW, and perform cross-layer debugging.
    • Build custom kernels/modules and analyze crash dumps (kdump, WinDbg).
  • Strong programming for test automation:
    • Python for framework and orchestration (pytest or equivalent), robust mocking/fixtures, and data-driven test generation.
    • C/C++ for low-level test harnesses, protocol exercisers, and performance micro-benchmarks.
    • Bash/PowerShell for environment setup, CI scripting, and reproducibility.
  • CI/CD mastery with GitHub Actions/Workflows and/or Jenkins:
    • Design gated pipelines with parallelization, artifact management, flaky test quarantine, and automated rollback criteria.
    • Integrate metrics, alerts, and quality reports; enforce go/no-go release thresholds.
  • Performance testing rigor:
    • Methodology for baselining, variance control, and noise isolation; application of statistical techniques (e.g., confidence intervals, A/B comparisons) to detect regressions.
    • GPU-focused profiling and analysis (e.g., perf counters, memory bandwidth, kernel occupancy).
  • Tooling fluency:
    • gdb, perf, ftrace, valgrind, WinDbg, ETW; log/trace correlation; containerized test environments (Docker) and familiarity with Kubernetes for distributed tests.
  • Exploratory testing mindset:
    • Hypothesis-driven investigation, boundary and adversarial testing, fuzzing (protocol/API), chaos/fault injection, and reverse-engineering of interfaces when documentation is limited.
  • Communication and leadership:
    • Clear, concise defect reporting; ability to drive triage across teams; establish and evangelize quality standards; maintain strong documentation discipline.

GOOD TO HAVE:

  • Lab ops for QA: rack mounting, server configuration, BMC/IPMI, BIOS/firmware updates, network/storage setup, power/thermal profiling.
  • Front-end/UI testing experience for internal tools: ReactJS, web UI automation, accessibility and usability checks.
  • Backend/DB validation: REST/gRPC testing, SQL/NoSQL, schema migrations, data integrity, performance tuning.
  • Observability: Prometheus/Grafana, OpenTelemetry; integrating quality signals and alerts into CI/CD and release gates.

EDUCATIONAL QUALIFICATIONS:

  • BS/MS in Computer Science/Computer Engineering, or related discipline.




Benefits offered are described in AMD benefits at a glance.

 

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law.   We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.

 

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position.  AMD's “Responsible AI Policy” is available here.

 

This posting is for an existing vacancy.

