Site Reliability Engineer at OpenAI - ScoutJobs - The AI-curated global job board
Skip to content
OpenAI
Posted a day ago

Site Reliability Engineer

OpenAISite Reliability Engineer, Frontier Systems Infrastructure

Requirements

Experience in infrastructure or distributed systems engineering, Deep knowledge of Kubernetes internals, Proficiency in Python or Go, Familiarity with Infrastructure-as-Code tools, Experience with bare-metal Linux environments

Skills

KubernetesPythonGoTerraformLinux

About the role

Responsibilities

  • Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and lifecycle management
  • Build software abstractions that unify multiple clusters to present a seamless interface to training workloads
  • Own node bring-up from bare metal through firmware upgrades to ensure repeatable deployment at scale
  • Improve operational metrics, such as reducing cluster restart times and accelerating upgrade cycles
  • Integrate networking and hardware health systems to deliver end-to-end reliability across servers and switches
  • Develop monitoring and observability systems to detect issues early and maintain stability under extreme load

Requirements

  • Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
  • Deep knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
  • Proficiency in Python, Go, or similar programming languages
  • Familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation
  • Experience with bare-metal Linux environments, GPU hardware, and large-scale networking

Preferred Qualifications

  • Background with GPU workloads or high-performance computing (HPC)
  • Experience with firmware management and hardware-level automation

Benefits

  • Competitive salary range of $255K – $490K plus equity
  • Comprehensive medical, dental, and vision insurance
  • 401(k) retirement plan with employer match
  • Flexible PTO and paid parental leave
  • Daily meals in the office and mental health support
  • Annual learning and development stipend

About the Company

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of AI capabilities and seek to safely deploy them to the world through our products.

ScoutJobs Agent

Get matches like this delivered daily

Sign up free — we'll pull jobs that fit your CV from across the web and rank them for you.

Get started — it's free

Site Reliability Engineer

OpenAI · San Francisco

Sign up to apply