Posted 12 hours ago

Engineering Manager - ML Platform and Infrastructure

Applied Intuition

Perks & benefits

Health Insurance, Medical Insurance, Paid Leave

Requirements

3+ years engineering management experience
Experience leading infrastructure or platform teams
Deep experience with distributed systems or GPU computing
Experience operating large GPU clusters (1,000+ GPUs)
Understanding of PyTorch Distributed, Megatron-LM, or DeepSpeed
Familiarity with InfiniBand, RDMA, Slurm, or Kubernetes

Skills

Machine Learning, Distributed Systems

About the role

Responsibilities

Grow and manage a team of world-class infrastructure and systems engineers to deliver a best-in-class ML platform.
Own the design and evolution of frameworks for orchestrating distributed training and inference jobs across thousands of GPUs.
Drive the buildout and scaling of GPU cluster infrastructure, making critical decisions on architecture, scheduling, and networking.
Lead efforts to optimize training and inference performance, including throughput, fault tolerance, and GPU utilization.
Set team goals and roadmaps in alignment with research milestones and production deployment requirements.
Partner with research and stack development teams to accelerate iteration speed and remove bottlenecks.

Requirements

3+ years of engineering management experience, ideally leading infrastructure or platform teams.
Deep experience with distributed systems, GPU computing, or large-scale ML infrastructure.
Direct experience building or operating large GPU clusters (1,000+ GPUs).
Strong understanding of distributed training frameworks such as PyTorch Distributed, Megatron-LM, or DeepSpeed.
Familiarity with high-performance networking (InfiniBand, RDMA) and resource scheduling (Slurm, Kubernetes).
Proven track record of building and operating systems that run reliably at massive scale.

Preferred Qualifications

Background in training optimization techniques like mixed-precision training, pipeline/tensor/data parallelism, or checkpointing.
Experience with inference optimization, including batching, model serving, quantization, or compiler-level optimizations.
Familiarity with Physical AI domains such as autonomous driving, robotics, or simulation.
Contributions to open-source ML infrastructure projects.

About the Company

Applied Intuition is powering the future of physical AI. Founded in 2017, we are creating the digital infrastructure needed to bring intelligence to every moving machine on the planet, servicing the automotive, defense, trucking, construction, mining, and agriculture industries.

Applied Intuition · Sunnyvale