
Posted a day ago
Senior Platform and EngOps Engineer - Cluster Operations
NVIDIA Corporation
Requirements
BS or MS in Computer Science or related field, 8+ years experience in cluster and server administration, Expertise in Ansible, Python, and Shell Scripting, Deep understanding of OS and computer networks, Proficiency in Linux fundamentals
Skills
Ansible, Python, Linux, DevOps
About the role
Responsibilities
- Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand
- Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability
- Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal performance
- Manage the rollout and rollback of cluster software and firmware updates to ensure minimal disruption
- Collaborate with Engineering and Product Teams across multiple time zones to align operations with project requirements
Requirements
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field
- 8+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure
- Expert-level automation skills in Ansible, Python, and Shell Scripting
- Deep understanding of operating systems, computer networks, and high-performance applications
- Proficiency with Linux fundamentals
Preferred Qualifications
- Familiarity with resource scheduling managers, preferably Slurm
- Direct experience with industry-standard alerting tools and emergency response practices
- Hands-on experience with GPU-focused hardware and software, such as DGX systems and Compute Clusters
- Proficiency in designing large-scale networks and addressing the associated challenges
- Experience crafting and implementing robust metrics collection and alerting infrastructure
About the Company
NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services.
NVIDIA Corporation · Santa Clara
