SRE Leader
This role is for one of Weekday’s clients
Min Experience: 16 years
Location: Remote (India)
Job Type: Full-time
Key Responsibilities
- Lead and mentor a team of Site Reliability Engineers (SREs), fostering a culture of operational excellence and continuous improvement.
- Develop and implement SRE best practices, including monitoring, alerting, and incident response strategies.
- Design and build scalable, highly available, and resilient architectures to ensure system reliability.
- Collaborate closely with engineering teams to optimize system performance, reliability, and capacity planning.
- Drive automation initiatives to minimize manual tasks and enhance operational efficiency.
- Define and enforce SLAs, SLOs, and error budgets to maintain the right balance between reliability and development velocity.
- Lead incident management, root cause analysis, and post-mortem processes, ensuring continuous improvement.
- Work with security teams to uphold compliance standards and implement best practices in infrastructure and operations.
- Research, evaluate, and integrate new tools, technologies, and methodologies to enhance reliability and efficiency.
Requirements
Qualifications & Experience
- 8+ years of experience in Software Engineering, DevOps, or Site Reliability Engineering (SRE).
- 3+ years of leadership experience managing teams in an operational environment.
- Expertise in cloud platforms such as AWS, GCP, or Azure.
- Hands-on experience with Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible.
- Proficiency in programming/scripting languages such as Python, Go, or Bash.
- Strong experience with containers and container orchestration (Docker, Kubernetes).
- In-depth knowledge of monitoring, logging, and observability tools like Prometheus, Grafana, ELK, or Datadog.
- Expertise in CI/CD pipelines, automation, and deployment strategies.
- Strong problem-solving and analytical skills, with a data-driven approach.
- Excellent communication and leadership abilities to drive collaboration and innovation.
Preferred Qualifications
- Experience managing large-scale distributed systems and microservices architectures.
- Strong understanding of networking, security, and performance optimization.
- Knowledge of database reliability, covering both SQL and NoSQL databases.
- Prior experience working with high-traffic, mission-critical applications.