Posted a day ago
Senior HPC Engineer
MBZUAI
Senior HPC Engineer – IFM
Requirements
- Bachelor's degree in CS or related field
- 5+ years in HPC or Linux infrastructure
- Experience with Slurm and Linux administration
- Troubleshooting compute, storage, and networking systems
Skills
HPC, Linux, GPU, Slurm, CUDA, InfiniBand, PyTorch
About the role
Responsibilities
- Lead the operation and optimization of large-scale GPU clusters
- Drive reliability, scalability, and performance improvements across the infrastructure
- Perform troubleshooting and root cause analysis for complex compute, storage, and networking issues
- Design and validate new cluster deployments and system upgrades
- Collaborate with researchers to optimize distributed AI training workloads
- Lead vendor engagement and technical reviews
- Mentor junior engineers and define operational standards, monitoring, and capacity planning processes
- Participate in major incident management and escalations
Requirements
- Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, Software Engineering, IT, Applied Mathematics, Physics, or a related field
- 5+ years of experience in HPC, Linux infrastructure, cloud infrastructure, distributed systems, or large-scale production environments
- Proven experience with Slurm and Linux administration
- Strong ability to troubleshoot complex compute, storage, and networking systems
Preferred Qualifications
- Experience with GPU cluster operations and NVIDIA technologies (CUDA, NCCL, NVLink, GPUDirect)
- Proficiency with InfiniBand networking
- Experience with high-performance storage platforms such as Weka, Lustre, or BeeGFS
- Familiarity with cloud providers including Azure, AWS, or GCP
- Experience with Infrastructure-as-Code tools like Terraform or Ansible
- Knowledge of large-scale AI training environments such as PyTorch Distributed, Megatron-LM, DeepSpeed, or FSDP
- Master's degree in a relevant discipline
About the Company
MBZUAI's Institute of Foundation Models (IFM) operates one of the world’s largest AI-focused supercomputing environments. We provide the technical foundation for frontier AI research, driving innovation through large-scale GPU infrastructure and groundbreaking research and development.
MBZUAI · United Arab Emirates
