Job Description Summary
As a Staff Software Engineer (Observability), you will be responsible for defining and implementing the observability strategy across PCS Digital Solutions Cloud Applications.
Job Description
Roles and Responsibilities
In this role, you will:
Define and evolve the observability vision and roadmap for PCS DS applications.
Design and implement/integrate standardized observability frameworks (metrics, logs, traces, events, profiling).
Collaborate with platform, SRE, and product teams to instrument services using OpenTelemetry and other modern observability tooling.
Build and maintain dashboards, alerts, and SLOs that reflect both technical and business health indicators.
Evaluate, integrate, and optimize observability agents (e.g., Prometheus, Fluent Bit, OpenTelemetry, and others).
Design self-remediation solutions leveraging observability tooling.
Implement best practices for using the GenAI tools of observability platforms.
Lead or contribute to incident analysis and postmortem reviews, driving improvements in system resilience and observability coverage.
Conduct Operational Readiness Reviews (ORRs) to validate monitoring, alerting, and rollback strategies before go-live.
Ensure observability practices align with healthcare compliance standards (e.g., HIPAA, GDPR, HITRUST).
Mentor engineers and promote a culture of observability-first development.
Required Qualifications
Bachelor’s or master’s degree in Computer Science, Engineering, or a related technical field.
10+ years of experience in software engineering, SRE, or platform engineering roles.
4+ years of experience contributing to observability solutions in cloud-native environments (Kubernetes, microservices, serverless).
Deep expertise in observability pillars (metrics, logs, traces) and tools like OpenTelemetry, Prometheus, Grafana, Datadog, and Dynatrace.
Strong programming/scripting skills (e.g., Go, Python, Bash, Terraform).
Experience with distributed tracing, SLO/SLI frameworks, and incident response workflows.
Deep expertise in distributed systems, microservices, and cloud platforms (AWS, Azure, GCP).
Experience with AI-powered anomaly detection, automated incident response, and cost optimization for observability at scale.
Familiarity with SRE practices and chaos engineering.
Excellent communication and collaboration skills.
Desired Characteristics
Experience in healthcare or regulated industries.
Knowledge of data privacy and compliance (HIPAA, HITRUST).
Experience with cost optimization and telemetry data governance.
Contributions to open-source observability projects.
Additional Information
Relocation Assistance Provided: No