
Posted 14 hours ago
Site Reliability Engineer, Machine Learning Systems
ByteDance
Requirements
Bachelor's degree in Computer Science or a related field; proficiency in Go, Python, or Shell; Linux environment experience; hands-on experience with Kubernetes and containers; 1+ year of operation and maintenance experience
Skills
Go, Python, Kubernetes, Linux, Machine Learning
About the role
Responsibilities
- Ensure ML systems operate efficiently for large model deployment, training, evaluation, and inference
- Maintain the stability of offline tasks and services across multi-data center, multi-region, and multi-cloud scenarios
- Manage computing and storage resources, including resource planning, cost, and budget management
- Oversee global system disaster recovery, cluster machine governance, and resource utilization improvement
- Build software tools and systems to monitor and manage ML infrastructure and services efficiently
- Participate in a global on-call roster to provide system and business support
Requirements
- Bachelor's degree or above in Computer Science, Computer Engineering, or a related field
- Strong proficiency in at least one programming language such as Go, Python, or Shell within a Linux environment
- Hands-on experience with Kubernetes and containers
- At least 1 year of relevant operation and maintenance experience
Preferred Qualifications
- Experience in the operation and maintenance of large-scale ML distributed systems
- Experience in the operation and maintenance of GPU servers
- Strong logical analysis skills and the ability to abstract and decompose business logic
- Excellent communication skills, self-driven attitude, and strong team spirit
- Good habits in writing and maintaining documentation for technical workflows
About the Company
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut and Pico, ByteDance makes it easier and more fun for people to connect with, consume, and create content.
ByteDance · Singapore
