About Signzy
Signzy is an AI-powered RPA platform for financial services. However complex your workflows or operations, Signzy can turn your back-operations decision-making process into a fully automated real-time API. This is possible through a combination of Nebula, our no-code AI model builder, and our Fintech API Marketplace of 200+ APIs. Today we work with 90+ financial institutions globally, including the 4 largest banks in India and a top-3 acquiring bank in the US. We have a strong global partnership with MasterCard, and offices in New York and Dubai to serve our customers in those two geographies. Our product team of 120+ people is building a global AI product out of Bangalore.
Working at Signzy
At Signzy we breathe software and use the latest technologies to build the most amazing products. We are a tech-savvy team backed by investors who are enthusiastic about creating solutions using technology.
This is an invitation to be a part of the future!
Role Overview
We are looking for a Site Reliability Engineer (SRE-2) to help design, operate, and improve reliable, scalable systems in cloud and Kubernetes environments. This role involves close collaboration with engineering and platform teams to automate operations, improve observability, and ensure production systems remain stable and performant as they scale.
You will work on infrastructure, deployment pipelines, and operational tooling while actively participating in incident response and long-term reliability improvements.
Responsibilities
Design, deploy, and operate reliable and scalable systems across cloud and Kubernetes environments.
Automate infrastructure provisioning, deployments, and operational workflows.
Build and maintain tools for deployment, monitoring, and system operations.
Monitor system health and performance, and proactively identify areas for improvement.
Troubleshoot and resolve issues across development, test, and production environments.
Participate in incident response, root cause analysis, and reliability improvements.
Collaborate with engineering teams to improve system operability and deployment safety.
Support and operate large-scale systems, including data-intensive or AI-driven workloads.
Requirements
3–5+ years of experience managing and operating production infrastructure and services in cloud environments such as AWS, Azure, or GCP.
Strong hands-on experience with Linux systems in production environments.
Experience working with containerized workloads and Kubernetes in real-world scenarios.
Working knowledge of Infrastructure as Code tools such as Terraform, Terragrunt, or Crossplane.
Experience designing and maintaining CI/CD pipelines using tools such as GitHub Actions, GitLab CI, Jenkins, Azure DevOps, or similar.
Familiarity with GitOps principles and tools such as Argo CD or Flux.
Solid understanding of cloud networking concepts, load balancing, and service connectivity.
Experience with monitoring, logging, and alerting systems such as Prometheus, Grafana, ELK/EFK, Datadog, or equivalent.
Proficiency in at least one scripting or programming language (e.g., Bash, Python).
Experience working with relational databases; exposure to NoSQL or data platforms is a plus.
Experience participating in on-call rotations, responding to production incidents, and performing root cause analysis.
Understanding of SLIs, SLOs, and error budgets, and how they are used to guide reliability and operational decisions.
Strong problem-solving skills and the ability to debug complex production issues.
Good verbal and written communication skills, especially during incidents and technical discussions.
Nice to Have
Experience operating systems at scale or in high-availability environments.
Exposure to on-prem or hybrid infrastructure.
Experience supporting data platforms, analytics, or AI/ML workloads.
What We Value
A strong sense of ownership and responsibility for production systems.
A focus on automation, reliability, and operational simplicity.
The ability to balance speed, stability, and long-term maintainability.
Curiosity and willingness to continuously improve systems and processes.