Job Title: Site Reliability Engineer (SRE)
Location: Onsite preferred; remote is OK
Experience: 2-5 Years
Technical Qualifications
Must-Have Skills
Experience:
2-5 years in SRE, DevOps, or Systems Engineering roles with a strong focus on AWS.
Cloud Proficiency:
Expert-level knowledge of AWS core services and architecture standards.
Scripting:
Strong proficiency in Python or Shell/Bash for automation.
Cost Tools:
Experience with AWS Cost Explorer, Trusted Advisor, or third-party tools (e.g., CloudHealth) to drive financial efficiency.
Monitoring:
Hands-on experience with tools like Grafana, Prometheus, ELK Stack, or Splunk.
Preferred Qualifications
Experience in Hybrid Cloud environments (AWS + On-Prem/Data Center).
Knowledge of container orchestration (Kubernetes/EKS).
Understanding of database administration and replication (PostgreSQL, MySQL, or DynamoDB).
System Ownership & Reliability
End-to-End Ownership:
Own the health and lifecycle of production systems, ensuring high availability (HA) and meeting strict Service Level Objectives (SLOs).
Deep-Dive Debugging:
Troubleshoot and resolve complex issues across infrastructure, application code, and networking layers. You will be the escalation point for hard-to-solve production incidents.
Incident Management:
Lead Root Cause Analysis (RCA) processes for outages, driving permanent fixes and architectural changes to prevent recurrence.
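To make the SLO language above concrete, the following minimal sketch shows how an availability target translates into an error budget, i.e. the downtime a service can absorb before it misses its objective. The function name, the 30-day window, and the 99.9% target are illustrative assumptions, not figures tied to this role.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_target is a fraction (e.g. 0.999 for "three nines");
    window_days is the length of the measurement window (assumed 30 here).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


if __name__ == "__main__":
    # Hypothetical target: 99.9% availability over a 30-day window.
    print(round(error_budget_minutes(0.999), 1))
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, which is the kind of budget incident response and RCA work is measured against.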
Operational Excellence & Security
Disaster Recovery (DR):
Design and manage DR strategies; conduct periodic failover drills to ensure business continuity.
Security & Compliance:
Oversee OS patching, vulnerability scanning, and adherence to industry compliance standards (SOC2/HIPAA/ISO). Maintain strict IAM policies and security groups.
Observability:
Build and maintain comprehensive monitoring, logging, and alerting frameworks (CloudWatch, Prometheus, Datadog) to ensure early detection of anomalies.
Maintenance:
Define and maintain backup/restore processes and routine maintenance windows with minimal downtime.
SRE & Automation
Eliminate Toil:
Apply SRE principles to automate repetitive operational tasks, reducing manual intervention.
IaC & Tooling:
Develop automation tools and manage infrastructure using Terraform or CloudFormation, along with scripting in Python, Go, or Bash.
Self-Healing Systems:
Implement auto-remediation workflows where systems can detect and resolve common issues (e.g., restarting failed services, rotating bad nodes) without human intervention.
Performance Tuning:
Optimize application runtime parameters, database queries, and system kernel settings for maximum throughput.
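As an illustration of the self-healing idea described above, here is a minimal sketch of a bounded auto-remediation loop: check health, attempt a restart, back off, and give up after a fixed number of tries so a human can be paged. The `remediate` function and its `check` and `restart` callables are hypothetical placeholders; in a real system they would wrap whatever health probe and restart mechanism the platform provides.

```python
import time
from typing import Callable


def remediate(
    check: Callable[[], bool],
    restart: Callable[[], None],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> bool:
    """Try to bring a service back to health without human intervention.

    check   -- hypothetical health probe; returns True when healthy
    restart -- hypothetical remediation action (e.g. restart a service)
    Returns True if the service is healthy when the loop ends,
    False if it should be escalated to on-call.
    """
    for attempt in range(1, max_attempts + 1):
        if check():
            return True
        restart()
        # Linear backoff gives the service time to come up before re-checking.
        time.sleep(backoff_seconds * attempt)
    return check()
```

Capping the number of attempts is a deliberate choice in this sketch: a remediation loop that retries forever can hide a real outage instead of escalating it.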
Cloud & Cost Optimization (FinOps)
AWS Management:
Architect and manage a broad range of AWS services, including EC2, EKS/ECS, RDS, S3, Lambda, VPC, and Route 53.
Cost Efficiency:
Actively monitor cloud spend and drive Cost Optimization initiatives. This includes rightsizing instances, managing Reserved/Spot instances, and identifying idle resources to reduce waste.
Capacity Planning:
Collaborate with engineering teams to forecast infrastructure needs, ensuring we scale to meet demand without over-provisioning.
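The idle-resource work mentioned above often starts with a simple utilization filter. The sketch below assumes CPU utilization samples have already been exported per instance (for example from a monitoring tool); the `find_idle_instances` name, the input shape, and the 5% threshold are assumptions chosen for illustration.

```python
from typing import Dict, List


def find_idle_instances(
    cpu_by_instance: Dict[str, List[float]],
    threshold_pct: float = 5.0,
) -> List[str]:
    """Return instance IDs whose peak CPU stayed below threshold_pct.

    cpu_by_instance maps an instance ID to its CPU utilization samples
    (percent). Instances with no samples are skipped rather than flagged,
    since missing data is not evidence of idleness.
    """
    idle = []
    for instance_id, samples in cpu_by_instance.items():
        if samples and max(samples) < threshold_pct:
            idle.append(instance_id)
    return sorted(idle)


if __name__ == "__main__":
    # Hypothetical sample data.
    usage = {
        "i-aaa": [1.2, 0.8, 2.5],
        "i-bbb": [40.0, 55.3, 12.1],
        "i-ccc": [],
    }
    print(find_idle_instances(usage))
```

Using the peak rather than the average is a conservative choice here: an instance with short bursts of real work is not flagged as a candidate for shutdown.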
Work Environment & Soft Skills
Global Flexibility:
We work with clients across IST, GMT, and EST time zones. You must be flexible with your working hours to accommodate project-specific deployments, overlapping meetings, or on-call rotations.
Team Player:
Willingness to help out with other cloud-related workloads (even outside your primary AWS focus) when the team is under pressure.
Detective Mindset:
You are relentless when debugging and won't stop until you find the root cause.
Financial Awareness:
You treat cloud resources as real money and take pride in running a lean, efficient infrastructure.