
Posted 4 days ago
Site Reliability Engineer - Big Data
PhonePe
Perks & benefits
Accommodation, Medical Insurance, Mobile Allowance, Paid Leave, Relocation Allowance
Requirements
7+ years of experience in big data ecosystems; expertise in Linux, IP, iptables, and IPsec; proficiency in Perl, Golang, or Python; hands-on Hadoop stack experience; experience with Puppet, Salt, Chef, or Ansible; knowledge of the ELK stack, Grafana, and Prometheus
Skills
Linux, Python, Hadoop, Kafka, Docker, Ansible, Prometheus
About the role
Responsibilities
- Manage, maintain, and support complex, distributed big data ecosystems and Linux/Unix environments.
- Design and implement automation systems for provisioning, scaling, upgrading, and patching big data clusters.
- Lead on-call rotations and incident response, conducting root cause analysis and driving postmortem processes.
- Troubleshoot and resolve complex production issues while identifying and implementing mitigation strategies.
- Design and review scalable, reliable system architectures to ensure high availability and performance.
- Develop tools and scripts to automate operational processes, reducing manual workload and increasing resilience.
- Monitor system performance and resource usage to identify bottlenecks and implement performance tuning.
- Collaborate with development teams to integrate SRE best practices into the software development lifecycle.
Requirements
- Over 7 years of experience managing and maintaining distributed big data ecosystems.
- Strong expertise in Linux, including IP, iptables, and IPsec.
- Proficiency in scripting or programming with Perl, Golang, or Python.
- Hands-on experience with the Hadoop stack (HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot).
- Experience with configuration management tools such as Puppet, Salt, Chef, or Ansible.
- Knowledge of SRE logging and monitoring tools including ELK stack, Grafana, and Prometheus.
- Solid understanding of networking and DevOps tools like Docker and Git.
Preferred Qualifications
- Experience managing infrastructure on public cloud platforms (AWS, Azure, or GCP).
- Experience designing and reviewing system architectures for scalability and reliability at large scale.
- Proficiency with observability tools to visualize and alert on system performance.
Benefits
- Comprehensive insurance coverage including Medical, Critical Illness, Accidental, and Life Insurance.
- Wellness programs including an Employee Assistance Program and onsite medical center.
- Parental support including maternity, paternity, adoption, and day-care assistance.
- Retirement benefits including PF contributions, Gratuity, and NPS.
- Additional perks such as higher education assistance and car lease options.
About the Company
PhonePe is a leading digital payments platform in India, serving over 600 million registered users and 40 million merchants. We process over 330 million transactions daily and are committed to unlocking the flow of money and access to services for every Indian. At PhonePe, we empower our employees to own their work from day one and solve complex problems at a massive scale.
PhonePe · Bangalore
