Site Reliability Engineer, Machine Learning Systems at ByteDance - ScoutJobs - The AI-curated global job board
ByteDance
Posted 14 hours ago

Site Reliability Engineer, Machine Learning Systems


Requirements

  • Bachelor's degree in Computer Science or related field
  • Proficiency in Go, Python, or Shell
  • Linux environment experience
  • Hands-on experience with Kubernetes and containers
  • 1+ year of operation and maintenance experience

Skills

Go, Python, Kubernetes, Linux, Machine Learning

About the role

Responsibilities

  • Ensure ML systems operate efficiently for large model deployment, training, evaluation, and inference
  • Maintain the stability of offline tasks and services across multi-data center, multi-region, and multi-cloud scenarios
  • Manage computing and storage resources, including resource planning, cost, and budget management
  • Oversee global system disaster recovery, cluster machine governance, and resource utilization improvement
  • Build software tools and systems to monitor and manage ML infrastructure and services efficiently
  • Participate in a global on-call roster to provide system and business support

Requirements

  • Bachelor's degree or above in Computer Science, Computer Engineering, or a related field
  • Strong proficiency in at least one programming language such as Go, Python, or Shell within a Linux environment
  • Hands-on experience with Kubernetes and containers
  • At least 1 year of relevant operation and maintenance experience

Preferred Qualifications

  • Experience in the operation and maintenance of large-scale ML distributed systems
  • Experience in the operation and maintenance of GPU servers
  • Strong logical analysis skills and the ability to abstract and decompose business logic
  • Excellent communication skills, self-driven attitude, and strong team spirit
  • Good documentation habits for recording and maintaining technical workflows

About the Company

Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut and Pico, ByteDance makes it easier and more fun for people to connect with, consume, and create content.

ByteDance · Singapore
