SoC Firmware Engineering Manager, Annapurna Labs Machine Learning Acceleration, AWS
Company: Amazon
Location: Cupertino
Posted on: April 8, 2026
Job Description:
When a new Trainium or Inferentia chip comes back from the fab,
our code is the first software to touch it. We're looking for a
hands-on engineering manager who lives and breathes low-level
software — someone who's debugged register-level issues at 3am and
wants to build a team that does it better.

Our SoC HAL (Hardware
Abstraction Layer) team owns the lowest layer of user-space
software on AWS's custom ML accelerator chips: the firmware that
boots, configures, and manages every hardware block on the SoC.
Your software runs as a shared library on embedded Linux, reaching
into the chip to program PCIe links, initialize HBM controllers,
configure PLLs, manage interrupt controllers, and orchestrate
fabric interconnects across 270 hardware block instances per chip —
all deployed across millions of servers in AWS's global fleet.

Tech stack: C++17, CMake, GoogleTest, Python, SystemVerilog DPI, SPI,
APB/AXI bus protocols, PCIe, UCIe, HBM, PLL, custom IPs

As the SoC Firmware Manager, you will:
- Manage, coach, and grow a team of 6 engineers: set technical
  direction, own hiring, and create an environment where strong
  engineers want to stay
- Coordinate deliverables across chip architects, RTL designers,
  verification engineers, validation engineers, and platform software
  teams. You're the single point of accountability for HAL readiness
  on every new chip program
- Own bring-up for new SoC tape-outs, from first-silicon power-on
  through production fleet deployment
- Prioritize work across multiple concurrent chip programs and
  customer teams, balancing urgent bring-up needs against long-term
  architecture investments
- Drive the architecture of our C++ template metaprogramming
  framework, BUTR (Built-in Unit Test for Registers), and HITL
  (Hardware-in-the-Loop) test infrastructure
- Ship the same C++ codebase to three execution environments:
  SystemVerilog DPI for chip verification, QEMU for emulation, and
  Carbon OS on embedded microcontrollers for production fleet
- Get into the weeds alongside your team: debug register-level HW/SW
  interactions, review code, and write code yourself when it matters

Most firmware teams target one platform and ship to a few thousand
units. We target three platforms from a single source tree and
deploy across AWS's global fleet — where a single register
misconfiguration can impact millions of servers. Our software must
be stateless, survive live-updates on running production servers
without reboots, and be correct down to individual register bits.
The microcontroller can reboot at any time — including during
customer workloads — and the HAL must resume managing the SoC by
querying hardware state on-demand. No cached state, no assumptions.

Your pre-silicon software runs in simulation and emulation months
before real silicon arrives. When the chip comes back from the fab,
you validate those predictions on real hardware — and when they
don't match, you figure out whether it's a silicon bug or a
software bug. For Trainium3, our HAL enabled a full ML training
workload within 12 hours of first power-on:
https://www.aboutamazon.com/news/aws/trainium-3-ultraserver-faster-ai-training-lower-cost

No ML background needed. Your firmware is the foundation that
enables ML training across clusters of thousands of interconnected
accelerators — you'll work on components like PCIe and HBM, but
won't need to understand ML itself.

This role can be based in Cupertino, CA or Austin, TX. The team is
split between the two sites.

Qualifications:
- 3 years of engineering team management experience
- 7 years of professional software development in C or C++, including
  embedded, firmware, or systems-level development
- 4 years of designing or architecting software systems (abstraction
  layers, hardware/software interfaces)
- Experience developing software that interfaces directly with
  hardware: SoC, ASIC, FPGA, or embedded microcontrollers
- Experience with register-level programming and hardware debug
  (waveform analysis, bus-level tracing, or similar)
- Experience recruiting, hiring, mentoring, coaching, and managing
  teams of software engineers to improve their skills and make them
  more effective product software engineers
- Experience with silicon bring-up or pre/post-silicon software
  validation
- Experience shipping software across multiple target platforms
  (simulation, emulation, production hardware)
- Familiarity with bus protocols (APB, AXI, PCIe) or memory subsystems
  (HBM, DDR)
- Experience with C++ template metaprogramming or code generation
  frameworks
- Experience building or maintaining hardware abstraction layers or
  board support packages

Amazon is an equal
opportunity employer and does not discriminate on the basis of
protected veteran status, disability, or other legally protected
status.

Los Angeles County applicants: Job duties for this position
include: work safely and cooperatively with other employees,
supervisors, and staff; adhere to standards of excellence despite
stressful conditions; communicate effectively and respectfully with
employees, supervisors, and staff to ensure exceptional customer
service; and follow all federal, state, and local laws and Company
policies. Criminal history may have a direct, adverse, and negative
relationship with some of the material job duties of this position.
These include the duties and responsibilities listed above, as well
as the abilities to adhere to company policies, exercise sound
judgment, effectively manage stress and work safely and
respectfully with others, exhibit trustworthiness and
professionalism, and safeguard business operations and the
Company’s reputation. Pursuant to the Los Angeles County Fair
Chance Ordinance, we will consider for employment qualified
applicants with arrest and conviction records.

Our inclusive
culture empowers Amazonians to deliver the best results for our
customers. If you have a disability and need a workplace
accommodation or adjustment during the application and hiring
process, including support for the interview or onboarding process,
please visit
https://amazon.jobs/content/en/how-we-hire/accommodations for more
information. If the country/region you’re applying in isn’t listed,
please contact your Recruiting Partner.

The base salary range for
this position is listed below. Your Amazon package will include
sign-on payments and restricted stock units (RSUs). Final
compensation will be determined based on factors including
experience, qualifications, and location. Amazon also offers
comprehensive benefits including health insurance (medical, dental,
vision, prescription, Basic Life & AD&D insurance and option
for Supplemental life plans, EAP, Mental Health Support, Medical
Advice Line, Flexible Spending Accounts, Adoption and Surrogacy
Reimbursement coverage), 401(k) matching, paid time off, and
parental leave. Learn more about our benefits at
https://amazon.jobs/en/benefits .

USA, CA, Cupertino - 212,700.00 - 287,700.00 USD annually
USA, TX, Austin - 184,900.00 - 250,200.00 USD annually