
Posted 2 days ago
AI/HPC System Performance Engineer
Meta
Requirements
Profiling distributed AI or HPC workloads, Debugging complex multi-layer system performance issues, Designing performance monitoring systems, Driving cross-functional technical projects, Bachelor's degree in Computer Science or related field, 6+ years of system performance engineering experience
Skills
Python, C#, PyTorch, TensorFlow
About the role
Responsibilities
- Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks
- Develop and maintain performance analysis frameworks and dashboards to track GPU utilization, network bandwidth, and latency
- Investigate and resolve performance regressions in distributed AI training environments, including RDMA fabrics and collective communication libraries
- Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new cluster configurations
- Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure
- Build tooling and automation to monitor HPC system health and detect anomalies
- Lead technical design reviews for architecture changes affecting AI workload performance
Requirements
- 6+ years of experience in system performance engineering, network infrastructure engineering, or a related field
- Experience profiling and optimizing distributed AI or HPC workloads (GPU interconnects, RDMA, NCCL, or MPI)
- Experience debugging complex, multi-layer system performance issues across network fabric, OS, and application layers
- Experience designing and implementing performance monitoring systems and telemetry pipelines
- Experience driving cross-functional technical projects from requirements through production deployment
- Bachelor's degree in Computer Science, Computer Engineering, or a relevant technical field
Preferred Qualifications
- Experience developing systems software in C++
- Experience with machine learning frameworks such as PyTorch and TensorFlow
- Understanding of RDMA congestion control mechanisms on InfiniBand (IB) and RoCE networks
- Understanding of AI training workloads and their specific demands on network infrastructure
- Demonstrated ability to integrate AI tools to optimize workflows and drive measurable impact
About the Company
Meta builds technologies that help people connect, find communities, and grow businesses. We are moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology.
Meta · Menlo Park
