Senior Site Reliability Engineer at Anyscale - ScoutJobs - The AI-curated global job board
Anyscale
Posted 4 days ago

Senior Site Reliability Engineer

Requirements

5+ years of SRE or DevOps experience; experience with large-scale distributed systems; proficiency in Python or Go; experience with Terraform; Kubernetes architecture and troubleshooting; multi-cloud experience (AWS, GCP, or Azure)

Skills

Kubernetes, Terraform, Python, Go, AWS, GCP, Azure

About the role

Responsibilities

  • Architect and develop a unified perspective on cloud component utilization across the company
  • Design and implement robust observability infrastructure for metrics, logging, and tracing
  • Create monitoring and alerting systems that enable teams to contribute to overall reliability
  • Establish testing infrastructure to support effective test writing and execution
  • Define and champion organization-wide Service Level Objectives (SLOs) and Error Budgets
  • Implement best practices and on-call systems to ensure efficient incident management
  • Coordinate the creation and deployment of cloud-based services and track deployments

Requirements

  • 5+ years of relevant work experience in a Site Reliability or DevOps role
  • Deep experience managing large-scale distributed systems and microservices architectures
  • Proficiency in Python or Go
  • Extensive experience with Infrastructure as Code (IaC) tools like Terraform
  • Hands-on experience architecting and troubleshooting production-grade Kubernetes clusters
  • Experience in multi-cloud environments (AWS, GCP, or Azure)

Preferred Qualifications

  • Demonstrated ability to mentor junior engineers and lead complex technical projects
  • Proven track record of influencing engineering culture in high-growth environments
  • Ability to leverage data from logging and tracing to identify long-term architectural trends

About the Company

Anyscale is on a mission to democratize distributed computing and make it accessible to software developers of all skill levels. We are commercializing Ray, a popular open-source project that creates an ecosystem of libraries for scalable machine learning. Companies like OpenAI, Uber, and Spotify use Ray to accelerate the progress of AI applications.

Anyscale · San Francisco
