Posted 14 hours ago
HPC Engineer
Institute for Foundation Models
Perks & benefits
Medical Insurance, Health Insurance, Paid Leave
Requirements
Bachelor's degree in CS or a related field, 2+ years of Linux systems administration or SRE experience, strong Linux troubleshooting skills, scripting proficiency in Python or Bash
Skills
Linux, Python, Bash, Kubernetes, AWS, GPU
About the role
Responsibilities
- Monitor the health, performance, and availability of large-scale GPU clusters
- Respond to incidents and perform first-level triage
- Support researchers and troubleshoot job failures
- Execute operational runbooks and recovery procedures
- Validate cluster deployments, upgrades, and maintenance activities
- Track infrastructure utilization and operational metrics
- Develop automation and monitoring tools
- Contribute to documentation and reporting
Requirements
- Bachelor's degree in Computer Science, Computer Engineering, or a related technical discipline
- 2+ years of experience in Linux systems administration, SRE, DevOps, or HPC operations
- Strong Linux troubleshooting skills
- Proficiency in scripting with Python or Bash
Preferred Qualifications
- Experience with Slurm and GPU infrastructure
- Familiarity with AWS, Azure, or GCP
- Experience with monitoring tools such as Grafana, Prometheus, or Datadog
- Knowledge of containers and Kubernetes
- Exposure to AI/ML infrastructure or research computing environments
Benefits
- Comprehensive medical, dental, and vision benefits
- Annual bonus
- 401(k) plan
- Generous paid time off, sick leave, and holidays
- Paid parental leave
- Employee Assistance Program
- Life insurance and disability coverage
About the Company
The Institute for Foundation Models (IFM) operates some of the world's largest AI supercomputing environments.
HPC Engineer
Institute for Foundation Models · Sunnyvale
