
Posted a day ago
Software Engineer, Hardware Health
OpenAI
Requirements
7+ years software or infrastructure engineering experience, Proficiency in Python and shell scripting, Experience with large-scale distributed systems, Ability to use SQL or PromQL for data analysis, Strong systems debugging skills
Skills
PythonSQLLinuxDistributed Systemsinfrastructure
About the role
Responsibilities
- Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure
- Build and evolve health checks that detect, remediate, and verify failures at scale
- Investigate hardware failures and system-level issues across large-scale compute environments
- Own node lifecycle workflows including drain, quarantine, repair, RMA, and return-to-service processes
- Build automation and tooling that enables global cluster management with minimal manual intervention
- Partner with workload, reliability, and provider teams to integrate health signals into training and inference systems
Requirements
- 7+ years of industry experience in software or infrastructure engineering
- Strong proficiency with Python and shell scripting
- Experience building large-scale distributed systems or infrastructure platforms
- Comfort digging into noisy operational data using SQL, PromQL, or similar tooling
- Strong systems debugging and operational instincts with an ownership mindset
Preferred Qualifications
- Experience with low-level hardware systems and Linux tooling (e.g. PCIe, InfiniBand, RoCE, kernel performance tuning)
- Experience operating or debugging large-scale GPU or accelerator clusters
- Expertise in network operations, observability, or systems telemetry
- Experience with automated remediation systems or fleet lifecycle management
Benefits
- Medical, dental, and vision insurance for you and your family
- 401(k) retirement plan with employer match
- Paid parental leave and flexible PTO
- Mental health and wellness support
- Annual learning and development stipend
- Daily meals in our offices and meal delivery credits
About the Company
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products.
ScoutJobs Agent
Get matches like this delivered daily
Sign up free — we'll pull jobs that fit your CV from across the web and rank them for you.
Get started — it's freeSoftware Engineer, Hardware Health
OpenAI · San Francisco
