
Posted 12 hours ago
Engineering Manager - ML Platform and Infrastructure
Applied Intuition
Perks & benefits
Health Insurance, Medical Insurance, Paid Leave
Requirements
3+ years engineering management experience, Experience leading infrastructure or platform teams, Deep experience with distributed systems or GPU computing, Experience operating large GPU clusters (1,000+ GPUs), Understanding of PyTorch Distributed, Megatron-LM, or DeepSpeed, Familiarity with InfiniBand, RDMA, Slurm, or Kubernetes
Skills
Machine Learning, Distributed Systems
About the role
Responsibilities
- Grow and manage a team of world-class infrastructure and systems engineers to deliver a best-in-class ML platform.
- Own the design and evolution of frameworks for orchestrating distributed training and inference jobs across thousands of GPUs.
- Drive the buildout and scaling of GPU cluster infrastructure, making critical decisions on architecture, scheduling, and networking.
- Lead efforts to optimize training and inference performance, including throughput, fault tolerance, and GPU utilization.
- Set team goals and roadmaps in alignment with research milestones and production deployment requirements.
- Partner with research and stack development teams to accelerate iteration speed and remove bottlenecks.
Requirements
- 3+ years of engineering management experience, ideally leading infrastructure or platform teams.
- Deep experience with distributed systems, GPU computing, or large-scale ML infrastructure.
- Direct experience building or operating large GPU clusters (1,000+ GPUs).
- Strong understanding of distributed training frameworks such as PyTorch Distributed, Megatron-LM, or DeepSpeed.
- Familiarity with high-performance networking (InfiniBand, RDMA) and resource scheduling (Slurm, Kubernetes).
- Proven track record of building and operating systems that run reliably at massive scale.
Preferred Qualifications
- Background in training optimization techniques like mixed-precision training, pipeline/tensor/data parallelism, or checkpointing.
- Experience with inference optimization, including batching, model serving, quantization, or compiler-level optimizations.
- Familiarity with Physical AI domains such as autonomous driving, robotics, or simulation.
- Contributions to open-source ML infrastructure projects.
About the Company
Applied Intuition is powering the future of physical AI. Founded in 2017, we are creating the digital infrastructure needed to bring intelligence to every moving machine on the planet, servicing the automotive, defense, trucking, construction, mining, and agriculture industries.
Applied Intuition · Sunnyvale
