About SatSure
SatSure is a deep-tech decision intelligence company operating at the nexus of agriculture, infrastructure, and climate action. We turn earth observation data into actionable insights for governments, financial institutions, and enterprises across the developing world — at scale, with reliability.
Our platform team owns the infrastructure backbone that powers SatSure's AI/ML products: multi-cloud Kubernetes clusters, LLM inference pipelines, geospatial data platforms, and the internal developer tooling used by every engineering team. If you care about infrastructure quality and want your work to have real-world impact, this is the role.
About the Role
We are looking for a Senior DevOps & MLOps Engineer to join our Platform & DevOps team. You will design, build, and operate cloud-native infrastructure that supports ML model serving, data pipelines, and developer platforms across AWS, GCP, and Azure. You will work closely with data science, product engineering, and security teams — and be expected to own large surface areas end-to-end.
This is a hands-on senior IC role. You will architect systems, write Terraform and Helm, debug production incidents, define SLOs, and contribute to platform standards adopted org-wide.
Roles & Responsibilities
ML Platform & LLM Infrastructure
Own and operate the Kubernetes-based ML platform on EKS, supporting LLM inference (KServe), distributed compute (Dask/Ray), and workflow orchestration (Apache Airflow).
Partner with data science and ML teams to design, deploy, and scale ML workloads — including GPU scheduling, autoscaling, resource isolation, and SLO-driven reliability.
Architect, deploy, and optimize Ray clusters on Kubernetes for distributed ML workloads — enabling scalable training, batch inference, and low-latency serving with efficient CPU/GPU utilization.
Multi-Cloud Platform & Infrastructure
Design, build, and maintain cloud-native infrastructure across AWS (primary), GCP, and Azure — using Kubernetes (EKS / GKE / AKS), Terraform, Helm, and ArgoCD.
Drive GitOps adoption and platform standardization — define reusable infrastructure patterns, Helm charts, and deployment workflows used across all product teams.
Manage Kubernetes platform operations — cluster lifecycle, Karpenter-based autoscaling, multi-tenancy, and workload isolation for data science and engineering teams.
Implement and maintain the Istio service mesh, including mTLS enforcement, traffic policies, and observability for inter-service communication.
Maintain and improve the internal developer platform (Backstage IDP) — enabling self-service environments, service catalog, and onboarding workflows for engineering teams.
Observability & Reliability Engineering
Build and maintain full-stack observability infrastructure — metrics (Prometheus / Mimir), logs (Loki), traces (Tempo), and dashboards (Grafana) integrated with OpenTelemetry instrumentation.
Define SLIs, SLOs, and error budget policies for production ML and platform services; lead incident response and post-mortem reviews.
Proactively identify reliability risks and drive engineering improvements to maintain 99.9%+ uptime targets.
FinOps & Cost Engineering
Implement Kubernetes cost attribution and chargeback using Kubecost / OpenCost — driving per-team visibility and FinOps decision-making for AI infrastructure.
Continuously optimize cloud spend through workload right-sizing, spot/preemptible usage, and resource scheduling strategies.
Platform Security & Governance
Manage AWS multi-account governance using Control Tower, SCPs, GuardDuty, and IAM Identity Center — ensuring security posture across all environments.
Own OIDC identity and SSO infrastructure integrated across internal tooling — Backstage, Airflow, and platform services.
Support compliance and audit processes — ISO 27001, CIS Benchmarks, Well-Architected Reviews, and VAPT assessments.
Requirements
Must Have
5+ years of hands-on platform, DevOps, or SRE experience in production environments.
Strong Kubernetes expertise: cluster operations, Helm, RBAC, autoscaling (Karpenter / Cluster Autoscaler), multi-tenancy; EKS experience preferred.
Infrastructure as Code: Terraform (advanced), Ansible; experience managing large, multi-environment IaC codebases.
AWS expertise: EC2, EKS, S3, RDS, IAM, VPC, CloudWatch, Control Tower, GuardDuty; GCP or Azure exposure is a plus.
GitOps & CI/CD: ArgoCD, Bitbucket Pipelines / Jenkins, GitOps workflows at team scale.
Observability: hands-on with Prometheus, Grafana, and at least one of Loki, Tempo, OpenTelemetry, Datadog, or ELK.
Scripting & automation: Python and Bash for tooling, automation, and platform integrations.
Strong understanding of networking, security, and cloud cost management in Kubernetes environments.
Nice to Have
Experience with ML serving infrastructure — KServe, vLLM, Ray Serve, or similar model serving frameworks.
Experience with Apache Airflow, Dask, or other data/ML pipeline orchestration at scale.
Familiarity with Backstage or similar internal developer platforms (IDP).
Istio or Envoy service mesh experience.
FinOps tooling — Kubecost, OpenCost, or cloud provider cost management tools.
OIDC / identity provider experience (Zitadel, Keycloak, or similar).
AWS Certified Solutions Architect or equivalent cloud certification.
Exposure to geospatial data workloads or satellite imagery pipelines.
Minimum Qualification
Bachelor's degree in Computer Science, Information Technology, or a related engineering discipline.
Our Stack
Kubernetes (EKS / GKE / AKS) · AWS · GCP · Azure · Terraform · Helm · ArgoCD · Istio · KServe · Apache Airflow · Dask · Backstage IDP · Prometheus · Grafana · Loki · Tempo · OpenTelemetry · Kubecost · Python · Bash
Why SatSure
Real Production Scale: LLM inference, geospatial data pipelines, and multi-cloud Kubernetes, not toy projects.
High Ownership: You architect systems end-to-end. No tickets-only culture, no hand-holding required.
Meaningful Impact: Your infrastructure powers products used by governments and institutions across the developing world.
Growth & Benefits: Learning allowances, broadband, medical insurance, best-in-class leave policy, and hybrid work from Bengaluru.