ML Infrastructure Service Reliability Engineer at Apple
Apple
Posted 4 days ago

ML Infrastructure Service Reliability Engineer

Requirements

  • 5+ years of experience in cloud scaling
  • Deep expertise in Kubernetes
  • Proficiency in Python, Go, or Rust
  • Experience with Amazon S3 or GCS
  • Strong networking troubleshooting skills
  • Understanding of Linux internals

Skills

Kubernetes, Python, Go, Rust, AWS, GCP, Linux

About the role

Responsibilities

  • Participate in a rotating on-call schedule, including occasional weekend coverage
  • Manage and scale Apple’s largest ML compute platform and multi-cloud storage abstraction
  • Oversee the full infrastructure stack from low-level nodes to complete network architecture
  • Leverage a diverse stack of open-source tools, commercial solutions, and internal systems
  • Drive automation and operational efficiency to ensure high availability and resilience

Requirements

  • 5+ years of experience building, operating, and scaling large applications in cloud environments
  • Deep expertise in Kubernetes, including hands-on experience with GKE or EKS
  • Proficiency in designing and developing code in Python, Go, or Rust
  • Practical experience with object storage technologies such as Amazon S3 or Google Cloud Storage (GCS)
  • Strong background in troubleshooting complex networking issues in public and private clouds
  • Solid understanding of Linux internals, standard networking protocols, and distributed systems

Preferred Qualifications

  • Proven drive to automate manual operations through continuous iteration
  • Experience managing diverse system environments using tools like Spinnaker, Helm, or Flux
  • Expertise in deploying, supporting, and monitoring large-scale distributed application stacks
  • Strong understanding of best practices for deploying large-scale distributed applications

About the Company

Apple creates transformative experiences that reshape entire industries. The ML Infrastructure team is responsible for managing the critical machine learning training workloads that power user-facing features across the Apple ecosystem.

Apple · Bengaluru