Senior Platform and EngOps Engineer - Cluster Operations at NVIDIA Corporation
NVIDIA Corporation
Posted a day ago

Senior Platform and EngOps Engineer - Cluster Operations

Requirements

  • BS or MS in Computer Science or a related field
  • 8+ years of experience in cluster and server administration
  • Expertise in Ansible, Python, and Shell Scripting
  • Deep understanding of OS and computer networks
  • Proficiency in Linux fundamentals

Skills

Ansible, Python, Linux, DevOps

About the role

Responsibilities

  • Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand
  • Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability
  • Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal performance
  • Manage the rollout and rollback of cluster software and firmware updates to ensure minimal disruption
  • Collaborate with Engineering and Product Teams across multiple time zones to align operations with project requirements

Requirements

  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field
  • 8+ years of hands-on experience in deploying and administering clusters, servers, switches, and related infrastructure
  • Expert-level automation skills in Ansible, Python, and Shell Scripting
  • Deep understanding of operating systems, computer networks, and high-performance applications
  • Proficiency with Linux fundamentals

Preferred Qualifications

  • Familiarity with resource scheduling managers, preferably Slurm
  • Direct experience with industry-standard alerting tools and emergency response practices
  • Hands-on experience with GPU-focused hardware and software, such as DGX systems and Compute Clusters
  • Proficiency in designing large-scale networking technologies and addressing the associated challenges
  • Experience crafting and implementing robust metrics collection and alerting infrastructure

About the Company

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services.

NVIDIA Corporation · Santa Clara
