Senior HPC Engineer at MBZUAI - ScoutJobs - The AI-curated global job board
Posted a day ago

Senior HPC Engineer

MBZUAI · Senior HPC Engineer – IFM

Requirements

  • Bachelor's degree in CS or a related field
  • 5+ years in HPC or Linux infrastructure
  • Experience with Slurm and Linux administration
  • Ability to troubleshoot compute, storage, and networking systems

Skills

HPC, Linux, GPU, Slurm, CUDA, InfiniBand, PyTorch

About the role

Responsibilities

  • Lead the operation and optimization of large-scale GPU clusters
  • Drive reliability, scalability, and performance improvements across the infrastructure
  • Perform troubleshooting and root cause analysis for complex compute, storage, and networking issues
  • Design and validate new cluster deployments and system upgrades
  • Collaborate with researchers to optimize distributed AI training workloads
  • Lead vendor engagement and technical reviews
  • Mentor junior engineers and define operational standards, monitoring, and capacity planning processes
  • Participate in major incident management and escalations

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, Software Engineering, IT, Applied Mathematics, Physics, or a related field
  • 5+ years of experience in HPC, Linux infrastructure, cloud infrastructure, distributed systems, or large-scale production environments
  • Proven experience with Slurm and Linux administration
  • Strong ability to troubleshoot complex compute, storage, and networking systems

Preferred Qualifications

  • Experience with GPU cluster operations and NVIDIA technologies (CUDA, NCCL, NVLink, GPUDirect)
  • Proficiency with InfiniBand networking
  • Experience with high-performance storage platforms such as Weka, Lustre, or BeeGFS
  • Familiarity with cloud providers including Azure, AWS, or GCP
  • Experience with Infrastructure-as-Code tools like Terraform or Ansible
  • Knowledge of large-scale AI training environments such as PyTorch Distributed, Megatron-LM, DeepSpeed, or FSDP
  • Master's degree in a relevant discipline

About the Company

MBZUAI's Institute of Foundation Models (IFM) operates one of the world’s largest AI-focused supercomputing environments. We provide the technical foundation for frontier AI research, driving innovation through large-scale GPU infrastructure and groundbreaking research and development.


MBZUAI · United Arab Emirates