
Posted 4 days ago
Staff Site Reliability Engineer (SRE)
GRAIL
Skills
AWS, Kubernetes, Terraform, Python, Go, CI/CD
About the role
Responsibilities
- Design, build, and operate highly available, fault-tolerant cloud infrastructure across AWS, GCP, and/or Azure
- Architect and maintain scalable CI/CD pipelines and deployment frameworks for enterprise-grade software delivery
- Lead infrastructure-as-code adoption and maturity using tools such as Terraform, CloudFormation, and Ansible
- Own Kubernetes reliability across multi-cluster environments, including upgrades, scaling, and workload lifecycle management
- Establish and evolve observability platforms and define SLO/SLI frameworks across teams
- Lead incident response for critical outages, drive root cause analysis, and implement preventative improvements
- Optimize infrastructure for cost, performance, and scalability
- Mentor engineers and contribute to technical leadership through design reviews and standards
Requirements
- BS in Computer Science, Engineering, or a related field, or equivalent experience
- 8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering
- Strong hands-on experience with at least one major cloud platform (AWS, GCP, or Azure)
- Experience implementing infrastructure-as-code solutions (Terraform, CloudFormation, or similar)
- Experience designing and operating CI/CD pipelines (e.g., GitLab CI, GitHub Actions, Jenkins)
- Hands-on experience with Kubernetes and containerized systems in production environments
- Proficiency in scripting or programming for automation (e.g., Python, Go, Bash)
- Experience with observability and monitoring tools (e.g., Prometheus, Grafana, Datadog)
- Experience working in regulated environments (e.g., HIPAA, SOC 2, ISO 27001, or NIST)
Preferred Qualifications
- 10+ years of experience in SRE, DevOps, or infrastructure engineering
- Experience operating multi-cluster Kubernetes environments (e.g., EKS, GKE) at scale
- Familiarity with GitOps practices (e.g., ArgoCD, Flux)
- Experience with data platforms and pipelines (e.g., Kafka, Airflow, Spark, Snowflake)
- Strong background in cloud security, including IAM and zero-trust architecture
- Experience with compliance-as-code and security tooling (e.g., OPA, Snyk, Checkov)
- Exposure to AI/ML or large-scale data infrastructure workloads
About the Company
GRAIL is a healthcare company pioneering new technologies to advance early cancer detection. We use next-generation sequencing (NGS), population-scale clinical studies, and state-of-the-art computer science to transform cancer care and change the trajectory of cancer mortality.
GRAIL · Menlo Park
