
Posted 4 days ago
Senior Site Reliability Engineer
Anyscale
Requirements
5+ years SRE or DevOps experience; experience with large-scale distributed systems; proficiency in Python or Go; experience with Terraform; Kubernetes architecture and troubleshooting; multi-cloud experience (AWS, GCP, or Azure)
Skills
Kubernetes, Terraform, Python, Go, AWS, GCP, Azure
About the role
Responsibilities
- Architect and develop a unified perspective on cloud component utilization across the company
- Design and implement robust observability infrastructure for metrics, logging, and tracing
- Create monitoring and alerting systems that enable teams to contribute to overall reliability
- Establish testing infrastructure to support effective test writing and execution
- Define and champion organization-wide Service Level Objectives (SLOs) and Error Budgets
- Implement best practices and on-call systems to ensure efficient incident management
- Coordinate the creation and deployment of cloud-based services and track deployments
Requirements
- 5+ years of relevant work experience in a Site Reliability or DevOps role
- Deep experience managing large-scale distributed systems and microservices architectures
- Proficiency in Python or Go
- Extensive experience with Infrastructure as Code (IaC) tools like Terraform
- Hands-on experience architecting and troubleshooting production-grade Kubernetes clusters
- Experience in multi-cloud environments (AWS, GCP, or Azure)
Preferred Qualifications
- Demonstrated ability to mentor junior engineers and lead complex technical projects
- Proven track record of influencing engineering culture in high-growth environments
- Ability to leverage data from logging and tracing to identify long-term architectural trends
About the Company
Anyscale is on a mission to democratize distributed computing and make it accessible to software developers of all skill levels. We are commercializing Ray, a popular open-source project that creates an ecosystem of libraries for scalable machine learning. Companies like OpenAI, Uber, and Spotify use Ray to accelerate the progress of AI applications.
Anyscale · San Francisco
