Key Areas of Responsibility
• Own and support monitoring and SRE operations, ensuring system reliability, availability, and performance.
• Build, enhance, and maintain monitoring solutions using ITRS Geneos, Prometheus, VictoriaMetrics, Elasticsearch, and Grafana.
• Develop, optimize, and maintain alerting rules, dashboards, and observability pipelines.
• Troubleshoot and resolve complex issues during major incidents, providing clear and timely communication.
• Administer and troubleshoot Linux servers (RHEL 7/8/9), including upgrades, configuration, patching, and maintenance, and determine appropriate monitoring requirements for system changes.
• Analyze logs, investigate issues, and perform fault finding to identify performance exceptions.
• Collaborate with engineering, application, and infrastructure teams to improve system resilience, stability, security, efficiency, and scalability.
• Contribute to automation strategies, deployment processes, and continuous operational improvements.
• Participate in on‑call rotations, including off‑hours and scheduled weekend support.
• Participate in Disaster Recovery (DR) and Business Continuity Planning (BCP) drills.
• Continuously research and adopt modern monitoring and SRE tools and practices.
Requirements
• Bachelor’s degree in Computer Science or Engineering.
• Minimum of 8 years’ experience in IT or investment banking.
• Strong experience with monitoring and observability platforms, including ITRS Geneos, Prometheus, VictoriaMetrics, Elasticsearch, Grafana, and Kibana.
• Hands-on experience building and implementing Prometheus pipelines, including exporters, scraping configurations, relabelling, metric routing, and integrations with long‑term storage (e.g., VictoriaMetrics).
• Experience building and maintaining Logstash pipelines, including ingestion, parsing, filtering, enrichment, and routing of logs into Elasticsearch.
• Ability to design, build, and maintain Grafana and Kibana dashboards for metrics, logs, and performance analytics across distributed systems.
• Solid understanding of metrics, logging, alerting, dashboards, and observability pipelines.
• Strong Linux administration skills (RHEL 7/8/9), including troubleshooting, upgrades, configuration, patching, and performance optimization.
• Good understanding of SRE principles, high availability, scalability, incident management, and Disaster Recovery (DR) / Business Continuity Planning (BCP) activities.
• Experience with automation (e.g., Bash, Python, Ansible, CI/CD tools) is an advantage.
• Understanding of networking fundamentals, performance tuning, and troubleshooting distributed systems.
• Prior experience in Production Support, SRE, Monitoring Engineering, or Shared Services Operations with participation in on‑call rotations, including after-hours and weekend support.
• Strong analytical, problem‑solving and communication skills with the ability to work collaboratively under pressure.
• Self-motivated and adaptable, with the ability to prioritize, learn continuously, and manage multiple responsibilities effectively.
• Fluent in English.