Position: Site Reliability Engineering (SRE) Developer
Location: London, UK (Hybrid: 3 days a week onsite)
Duration: Full-time
Job Description:
Key Responsibilities:
The Monitoring and Observability team operates with a global footprint. Its responsibilities include:
Collaborating across various organizations within Citi to understand and develop observability solutions for enterprise-wide deployment at scale.
Managing the legacy monitoring stack across the Production Management organization within Citi.
Driving the strategic delivery of end-to-end observability solutions across Citi.
Providing in-depth analysis with interpretive thinking to define problems and develop innovative solutions.
Directly impacting the business by influencing strategic functional decisions through advice, counsel, or provided services.
Persuading and influencing others through strong and comprehensive communication and diplomacy skills.
Performing other duties and functions as assigned.
Essential Skills:
OpenShift/Kubernetes Administration: Experience deploying, managing, and troubleshooting containerized applications on OpenShift/Kubernetes, including resource management and networking.
Grafana & Observability Stack:
o Proficiency in administering ITRS Geneos at scale.
o Proficiency in administering Grafana (user management, data sources, dashboards, alerts).
o Working knowledge of Grafana backend components: Mimir (metrics), Loki (logs), and Tempo (traces).
o Experience with Prometheus for metric collection and PromQL for querying.
Helm Chart Management: Experience with Helm for deploying applications, including creating, modifying, and managing Helm charts, library charts, and dependencies.
Technical Documentation: Ability to create clear and concise documentation for systems and processes.
Desired Skills:
Application Deployment: Ability to deploy applications using Lightspeed Enterprise.
Google Cloud Operations: Experience with Google Cloud operations.
Scripting & Automation: Experience with Bash or Python scripting for automating operational tasks.