Role purpose
The DevOps & SRE Lead is responsible for ensuring the
reliability, scalability, security, and operational excellence
of enterprise data and AI platforms and applications.
This role combines hands‑on technical leadership with site reliability practices, enabling high‑quality delivery through automation, observability, and strong operational governance.
The role leads DevOps and SRE practices across platforms, works closely with data engineering, AI/ML, and product teams, and establishes standards that enable teams to build and run reliable systems at scale.
Knowledge, experience & capabilities
DevOps & Platform Engineering
Lead the design, implementation, and evolution of CI/CD pipelines.
Define and enforce DevOps standards, tooling, and best practices.
Drive Infrastructure‑as‑Code and environment consistency across QA, staging, and production.
Partner with application, data, and AI teams to embed DevOps practices early in development.
Site Reliability Engineering
Own platform reliability, availability, performance, and scalability.
Define and monitor SLOs, SLIs, error budgets, and reliability KPIs.
Lead incident response, root cause analysis, and post‑incident reviews.
Drive proactive reliability engineering through automation and observability.
Cloud & Infrastructure
Own cloud platform operations (AWS preferred).
Ensure secure, cost‑efficient, and resilient cloud infrastructure.
Drive platform upgrades, patching, and lifecycle management.
Collaborate with security teams on IAM, network security, and compliance.
Observability & Operations
Implement monitoring, logging, alerting, and tracing frameworks.
Ensure high signal‑to‑noise operational alerts.
Continuously improve MTTR, system stability, and operational maturity.
Leadership & Governance
Provide technical leadership and mentoring to DevOps/SRE engineers.
Define operating models, on‑call processes, and support structures.
Work with Product Owners and Architects on roadmap planning.
Act as escalation point for platform and reliability issues.
Skills
CI/CD: GitHub, GitLab, Azure DevOps, Jenkins
Cloud: AWS / Azure / GCP (strong in at least one)
Infrastructure as Code: Terraform, CloudFormation, ARM templates
Containers & Orchestration: Docker, Kubernetes
Observability: CloudWatch, Prometheus, Grafana, Datadog
Scripting: Python, Bash (or equivalent)
Critical success factors & key challenges
Platform uptime and reliability metrics (SLOs).
Domain team adoption rate of the self-serve platform.
Mean Time to Detect (MTTD) and Mean Time to Recovery (MTTR) for platform incidents.
Strong ownership mindset
Excellent collaboration and communication skills
Ability to balance speed, stability, and governance
Qualification & Experience
B.Tech/M.Tech graduate
10–12+ years in DevOps, SRE, or Platform Engineering roles
3+ years in a technical lead or senior ownership role
Experience supporting production‑critical platforms at scale
Innovations
Employee may, as part of his/her role and possibly through multifunctional teams, participate in the creation and design of innovative solutions. In this context, Employee may contribute to inventions, designs, and other work product, including know-how, copyrights, software, innovations, solutions, and other intellectual assets.