
Posted a day ago
Site Reliability Engineer, Frontier Systems Infrastructure
OpenAI
Skills
Kubernetes, Python, Go, Terraform, Linux
About the role
Responsibilities
- Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and lifecycle management
- Build software abstractions that unify multiple clusters to present a seamless interface to training workloads
- Own node bring-up from bare metal through firmware upgrades to ensure repeatable deployment at scale
- Improve operational metrics, such as reducing cluster restart times and accelerating upgrade cycles
- Integrate networking and hardware health systems to deliver end-to-end reliability across servers and switches
- Develop monitoring and observability systems to detect issues early and maintain stability under extreme load
Requirements
- Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
- Deep knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
- Proficiency in Python, Go, or similar programming languages
- Familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation
- Experience with bare-metal Linux environments, GPU hardware, and large-scale networking
Preferred Qualifications
- Background with GPU workloads or high-performance computing (HPC)
- Experience with firmware management and hardware-level automation
Benefits
- Competitive salary range of $255K – $490K plus equity
- Comprehensive medical, dental, and vision insurance
- 401(k) retirement plan with employer match
- Flexible PTO and paid parental leave
- Daily meals in the office and mental health support
- Annual learning and development stipend
About the Company
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of AI capabilities and seek to safely deploy them to the world through our products.
OpenAI · San Francisco
