Lead Observability: Evolve how we collect and process telemetry using OpenTelemetry. Focus on actionable, cost-effective metrics and traces.
Define Reliability Standards: Help teams adopt SLIs/SLOs and build a culture of service ownership.
Incident Response: Take charge during incidents, lead post-incident reviews, and drive long-term fixes.
Infrastructure Automation: Use tools like Pulumi, Terraform, and CDK to manage AWS infrastructure and CI/CD pipelines.
Write Code: Build tooling and automation using TypeScript (and optionally Rust). Contribute to shared libraries and platforms.
Mentor and Collaborate: Guide other engineers, share knowledge, and help grow the SRE mindset.
Drive Innovation: Explore new tools and methods in observability and reliability. Lead proof-of-concepts and continuous improvement.
Strong experience in TypeScript or similar languages.
Solid understanding of OpenTelemetry and modern observability tools (e.g. Datadog, Grafana).
Deep knowledge of SRE principles: SLIs/SLOs, automation, monitoring, and incident response.
Hands-on experience with AWS (e.g. Lambda, ECS, S3, DynamoDB).
Comfortable with Linux, the command line, and system debugging.
Experience with infrastructure-as-code tools like Pulumi or Terraform.
Familiar with Kubernetes, CI/CD pipelines, and automated deployments.
Strong problem-solving skills and attention to detail.
Experience with Rust or Go.
Deep understanding of trace sampling and OpenTelemetry at scale.
Experience reducing observability/cloud costs.
Exposure to Google Cloud Platform.
Familiarity with retail tech challenges is a bonus.
Take ownership and build trust.
Communicate openly and support your team.
Stay curious and always look to improve.
Embrace change and challenges.
Think long-term and work collaboratively.