Research Engineer - LLM Training Infrastructure at ByteDance
ByteDance
Posted 20 hours ago

Research Engineer - LLM Training Infrastructure

ByteDance · Research Engineer - LLM Training Infrastructure - Seed Infra

Requirements

Large-scale distributed training for LLMs, Python and/or C++, ML systems development, parallelism strategies (DDP, FSDP), PyTorch, CUDA, NCCL

Skills

Python, C++, PyTorch, CUDA, LLM

About the role

Responsibilities

  • Conduct research and development on large-scale LLM training infrastructure and efficiency
  • Design and optimize distributed training strategies, including parallelism schemes and throughput scaling on large GPU clusters
  • Investigate system reliability and resilience techniques such as fast checkpointing and fault tolerance
  • Research and optimize network, scheduling, and GPU memory management across the training stack
  • Analyze performance bottlenecks in exascale training systems and propose data-driven optimization methods
  • Translate cutting-edge research ideas into scalable, real-world AI infrastructure solutions

Requirements

  • Experience with large-scale distributed training for LLMs
  • Strong programming skills in Python and/or C++
  • Strong background in ML systems and training infrastructure development
  • Proficiency in parallelism strategies (DDP, FSDP, model/pipeline/expert parallelism)
  • Solid understanding of training stack internals including PyTorch, CUDA, and NCCL
  • Experience optimizing performance across memory, communication, and throughput

Preferred Qualifications

  • Hands-on experience with distributed training frameworks and large-scale LLM infrastructure
  • Experience leading or mentoring engineering teams or cross-functional projects
  • Publications in top-tier AI, systems, or HPC conferences (ICML, OSDI, SOSP, NSDI, SIGCOMM, MLSys)
  • Strong open-source contributions to relevant projects
  • Familiarity with benchmarking AI accelerators or large-scale LLM evaluation

Benefits

  • Medical, dental, and vision insurance
  • 401(k) savings plan with company match
  • Paid parental leave
  • Short-term and long-term disability coverage
  • Life insurance and wellbeing benefits
  • 10 paid holidays, 10 paid sick days, and 17 days of Paid Personal Time per year

About the Company

ByteDance is a global technology company dedicated to inspiring creativity and enriching life. The ByteDance Seed team is focused on pioneering new paths toward artificial general intelligence, with research spanning MLLM, GenMedia, AI for Science, and Robotics. Our technology powers industry-leading foundation models and serves millions of users and enterprise customers worldwide.

ByteDance · Seattle