Requirements

BS in Computer Science or related field
8+ years SRE or DevOps experience
Cloud platform expertise (AWS, GCP, or Azure)
Infrastructure-as-code (Terraform, CloudFormation)
CI/CD pipeline design
Kubernetes production experience
Scripting proficiency (Python, Go, Bash)
Observability tools (Prometheus, Grafana)
Regulated environment experience (HIPAA, SOC 2)

Skills

AWS, Kubernetes, Terraform, Python, Go, CI/CD

About the role

Responsibilities

Design, build, and operate highly available, fault-tolerant cloud infrastructure across AWS, GCP, or Azure
Architect and maintain scalable CI/CD pipelines and deployment frameworks for enterprise-grade software delivery
Lead infrastructure-as-code adoption and maturity using tools such as Terraform, CloudFormation, and Ansible
Own Kubernetes reliability across multi-cluster environments, including upgrades, scaling, and workload lifecycle management
Establish and evolve observability platforms and define SLO/SLI frameworks across teams
Lead incident response for critical outages, drive root cause analysis, and implement preventative improvements
Optimize infrastructure for cost, performance, and scalability
Mentor engineers and contribute to technical leadership through design reviews and standards

Requirements

BS in Computer Science, Engineering, or a related field, or equivalent experience
8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering
Strong hands-on experience with at least one major cloud platform (AWS, GCP, or Azure)
Experience implementing infrastructure-as-code solutions (Terraform, CloudFormation, or similar)
Experience designing and operating CI/CD pipelines (e.g., GitLab CI, GitHub Actions, Jenkins)
Hands-on experience with Kubernetes and containerized systems in production environments
Proficiency in scripting or programming for automation (e.g., Python, Go, Bash)
Experience with observability and monitoring tools (e.g., Prometheus, Grafana, Datadog)
Experience working in regulated environments (e.g., HIPAA, SOC 2, ISO 27001, or NIST)

Preferred Qualifications

10+ years of experience in SRE, DevOps, or infrastructure engineering
Experience operating multi-cluster Kubernetes environments (e.g., EKS, GKE) at scale
Familiarity with GitOps practices (e.g., ArgoCD, Flux)
Experience with data platforms and pipelines (e.g., Kafka, Airflow, Spark, Snowflake)
Strong background in cloud security, including IAM and zero-trust architecture
Experience with compliance-as-code and security tooling (e.g., OPA, Snyk, Checkov)
Exposure to AI/ML or large-scale data infrastructure workloads

About the Company

GRAIL is a healthcare company pioneering new technologies to advance early cancer detection. We use next-generation sequencing (NGS), population-scale clinical studies, and state-of-the-art computer science to transform cancer care and change the trajectory of cancer mortality.

Staff Site Reliability Engineer (SRE)
