Requirements

Experience in infrastructure or distributed systems engineering
Deep knowledge of Kubernetes internals
Proficiency in Python or Go
Familiarity with Infrastructure-as-Code tools
Experience with bare-metal Linux environments

Skills

Kubernetes
Python
Go
Terraform
Linux

About the role

Responsibilities

Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and lifecycle management
Build software abstractions that unify multiple clusters to present a seamless interface to training workloads
Own node bring-up from bare metal through firmware upgrades to ensure repeatable deployment at scale
Improve operational metrics, such as reducing cluster restart times and accelerating upgrade cycles
Integrate networking and hardware health systems to deliver end-to-end reliability across servers and switches
Develop monitoring and observability systems to detect issues early and maintain stability under extreme load

Requirements

Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
Deep knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
Proficiency in Python, Go, or similar programming languages
Familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation
Experience with bare-metal Linux environments, GPU hardware, and large-scale networking

Preferred Qualifications

Background with GPU workloads or high-performance computing (HPC)
Experience with firmware management and hardware-level automation

Benefits

Competitive salary range of $255K – $490K plus equity
Comprehensive medical, dental, and vision insurance
401(k) retirement plan with employer match
Flexible PTO and paid parental leave
Daily meals in the office and mental health support
Annual learning and development stipend

About the Company

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of AI capabilities and seek to safely deploy them to the world through our products.

Site Reliability Engineer
