
Posted 9 hours ago
Research Engineer – Multimodal Training Infrastructure (Seed Infra)
ByteDance
Requirements
Expertise in large-scale distributed training of LLMs, strong systems research background, experience with parallelism strategies, strong programming skills, and an understanding of algorithm–system co-design
Skills
LLM, Machine Learning
About the role
Responsibilities
- Conduct research and development on large-scale infrastructure to enable efficient training of foundation models, multimodal LLMs, and image/video generation models
- Design and optimize distributed training strategies, including parallelism schemes, computation/communication optimization, and throughput scaling on large GPU clusters
- Investigate system reliability and resilience techniques such as fast checkpointing, fault tolerance, and failure diagnosis for long-running workloads
- Research and optimize network, scheduling, and GPU memory management across the training stack to drive cross-layer performance improvements
- Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
- Bridge cutting-edge research and large-scale production deployment by translating research ideas into scalable infrastructure solutions
Requirements
- Deep expertise in large-scale distributed training of LLMs and multimodal models
- Strong systems research background with a demonstrated ability to design, build, and optimize large-scale ML systems
- Proven experience with parallelism strategies (e.g., data, model, pipeline, expert parallelism) and performance optimization on large GPU clusters
- Strong programming skills and hands-on experience implementing production-grade ML systems or infrastructure
- Solid understanding of algorithm–system co-design and cross-layer optimization for training efficiency, scalability, and reliability
Benefits
- Medical, dental, and vision insurance
- 401(k) savings plan with company match
- Paid parental leave
- Short-term and long-term disability coverage
- Life insurance and wellbeing benefits
- 10 paid holidays, 10 paid sick days, and 17 days of Paid Personal Time per year
About the Company
ByteDance is a global technology company dedicated to inspiring creativity and enriching life. The ByteDance Seed team is pioneering new paths toward artificial general intelligence, with research spanning MLLM, GenMedia, AI for Science, and Robotics. Our technology powers industry-leading foundation models and serves millions of users across various application scenarios worldwide.
ByteDance · San Jose
