Staff Site Reliability Engineer (SRE) at GRAIL - ScoutJobs - The AI-curated global job board
GRAIL
Posted 4 days ago

Staff Site Reliability Engineer (SRE)

Requirements

  • BS in Computer Science or related field
  • 8+ years SRE or DevOps experience
  • Cloud platform expertise (AWS, GCP, or Azure)
  • Infrastructure-as-code (Terraform, CloudFormation)
  • CI/CD pipeline design
  • Kubernetes production experience
  • Scripting proficiency (Python, Go, Bash)
  • Observability tools (Prometheus, Grafana)
  • Regulated environment experience (HIPAA, SOC 2)

Skills

AWS, Kubernetes, Terraform, Python, Go, CI/CD

About the role

Responsibilities

  • Design, build, and operate highly available, fault-tolerant cloud infrastructure across AWS, GCP, and/or Azure
  • Architect and maintain scalable CI/CD pipelines and deployment frameworks for enterprise-grade software delivery
  • Lead infrastructure-as-code adoption and maturity using tools such as Terraform, CloudFormation, and Ansible
  • Own Kubernetes reliability across multi-cluster environments, including upgrades, scaling, and workload lifecycle management
  • Establish and evolve observability platforms and define SLO/SLI frameworks across teams
  • Lead incident response for critical outages, drive root cause analysis, and implement preventative improvements
  • Optimize infrastructure for cost, performance, and scalability
  • Mentor engineers and contribute to technical leadership through design reviews and standards

Requirements

  • BS in Computer Science, Engineering, or a related field, or equivalent experience
  • 8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering
  • Strong hands-on experience with at least one major cloud platform (AWS, GCP, or Azure)
  • Experience implementing infrastructure-as-code solutions (Terraform, CloudFormation, or similar)
  • Experience designing and operating CI/CD pipelines (e.g., GitLab CI, GitHub Actions, Jenkins)
  • Hands-on experience with Kubernetes and containerized systems in production environments
  • Proficiency in scripting or programming for automation (e.g., Python, Go, Bash)
  • Experience with observability and monitoring tools (e.g., Prometheus, Grafana, Datadog)
  • Experience working in regulated environments (e.g., HIPAA, SOC 2, ISO 27001, or NIST)

Preferred Qualifications

  • 10+ years of experience in SRE, DevOps, or infrastructure engineering
  • Experience operating multi-cluster Kubernetes environments (e.g., EKS, GKE) at scale
  • Familiarity with GitOps practices (e.g., ArgoCD, Flux)
  • Experience with data platforms and pipelines (e.g., Kafka, Airflow, Spark, Snowflake)
  • Strong background in cloud security, including IAM and zero-trust architecture
  • Experience with compliance-as-code and security tooling (e.g., OPA, Snyk, Checkov)
  • Exposure to AI/ML or large-scale data infrastructure workloads

About the Company

GRAIL is a healthcare company pioneering new technologies to advance early cancer detection. We use next-generation sequencing (NGS), population-scale clinical studies, and state-of-the-art computer science to transform cancer care and change the trajectory of cancer mortality.

GRAIL · Menlo Park