Software Engineer, Hardware Health at OpenAI - ScoutJobs - The AI-curated global job board
Skip to content
OpenAI
Posted a day ago

Software Engineer, Hardware Health

OpenAI

Requirements

7+ years software or infrastructure engineering experience, Proficiency in Python and shell scripting, Experience with large-scale distributed systems, Ability to use SQL or PromQL for data analysis, Strong systems debugging skills

Skills

PythonSQLLinuxDistributed Systemsinfrastructure

About the role

Responsibilities

  • Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure
  • Build and evolve health checks that detect, remediate, and verify failures at scale
  • Investigate hardware failures and system-level issues across large-scale compute environments
  • Own node lifecycle workflows including drain, quarantine, repair, RMA, and return-to-service processes
  • Build automation and tooling that enables global cluster management with minimal manual intervention
  • Partner with workload, reliability, and provider teams to integrate health signals into training and inference systems

Requirements

  • 7+ years of industry experience in software or infrastructure engineering
  • Strong proficiency with Python and shell scripting
  • Experience building large-scale distributed systems or infrastructure platforms
  • Comfort digging into noisy operational data using SQL, PromQL, or similar tooling
  • Strong systems debugging and operational instincts with an ownership mindset

Preferred Qualifications

  • Experience with low-level hardware systems and Linux tooling (e.g. PCIe, InfiniBand, RoCE, kernel performance tuning)
  • Experience operating or debugging large-scale GPU or accelerator clusters
  • Expertise in network operations, observability, or systems telemetry
  • Experience with automated remediation systems or fleet lifecycle management

Benefits

  • Medical, dental, and vision insurance for you and your family
  • 401(k) retirement plan with employer match
  • Paid parental leave and flexible PTO
  • Mental health and wellness support
  • Annual learning and development stipend
  • Daily meals in our offices and meal delivery credits

About the Company

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products.

ScoutJobs Agent

Get matches like this delivered daily

Sign up free — we'll pull jobs that fit your CV from across the web and rank them for you.

Get started — it's free

Software Engineer, Hardware Health

OpenAI · San Francisco

Sign up to apply