
Posted 20 hours ago
Research Engineer - LLM Training Infrastructure
ByteDance · Research Engineer - LLM Training Infrastructure - Seed Infra
Requirements
Large-scale distributed training for LLMs, Python and/or C++, ML systems development, parallelism strategies (DDP, FSDP), PyTorch, CUDA, NCCL
Skills
Python, C++, PyTorch, CUDA, LLM
About the role
Responsibilities
- Conduct research and development on large-scale LLM training infrastructure and efficiency
- Design and optimize distributed training strategies, including parallelism schemes and throughput scaling on large GPU clusters
- Investigate system reliability and resilience techniques such as fast checkpointing and fault tolerance
- Research and optimize network, scheduling, and GPU memory management across the training stack
- Analyze performance bottlenecks in exascale training systems and propose data-driven optimization methods
- Translate cutting-edge research ideas into scalable, real-world AI infrastructure solutions
Requirements
- Experience with large-scale distributed training for LLMs
- Strong programming skills in Python and/or C++
- Strong background in ML systems and training infrastructure development
- Proficiency in parallelism strategies (DDP, FSDP, model/pipeline/expert parallelism)
- Solid understanding of training stack internals including PyTorch, CUDA, and NCCL
- Experience in performance optimization regarding memory, communication, and throughput
Preferred Qualifications
- Hands-on experience with distributed training frameworks and large-scale LLM infrastructure
- Experience leading or mentoring engineering teams or cross-functional projects
- Publications in top-tier AI, systems, or HPC conferences (ICML, OSDI, SOSP, NSDI, SIGCOMM, MLSys)
- Strong open-source contributions to relevant projects
- Familiarity with benchmarking AI accelerators or large-scale LLM evaluation
Benefits
- Medical, dental, and vision insurance
- 401(k) savings plan with company match
- Paid parental leave
- Short-term and long-term disability coverage
- Life insurance and wellbeing benefits
- 10 paid holidays, 10 paid sick days, and 17 days of Paid Personal Time per year
About the Company
ByteDance is a global technology company dedicated to inspiring creativity and enriching life. The ByteDance Seed team is focused on pioneering new paths toward artificial general intelligence, with research spanning MLLM, GenMedia, AI for Science, and Robotics. Our technology powers industry-leading foundation models and serves millions of users and enterprise customers worldwide.
ByteDance · Seattle
