This role can be based out of our London or Leicester offices but will be hybrid (i.e. work from home and office).
We’re searching for an Engineer to join our Site Reliability Engineering (SRE) team. The team is agile, data-driven, and motivated, comprising software and systems engineers. We’re dedicated to creating meaningful observability and monitoring solutions, while automating manual tasks, ensuring product quality and forging operational excellence. Our blame-free DevOps culture and collaboration are at the core of our approach.
What you’ll be doing
As a Site Reliability Engineer at Dunelm, you will become a key member of the SRE team. You are motivated and enthusiastic, and able to use your operational and engineering knowledge to help develop effective tools, observability solutions, pipelines and more, so that the wider engineering and platform teams at Dunelm can create, update and release with confidence.
Responsibilities:
Observability Development: Designing, building, deploying, running and – ultimately – owning observability tooling, such as dashboards, monitoring and alerting.
Embedded Consultancy: Working with other teams throughout the Dunelm technical space to help increase their SRE maturity level – mainly by helping to define Service Level Objectives (SLOs) and Service Level Indicators (SLIs), plus working on the integrations that help those teams produce the telemetry these require.
DevOps Best-Practice Advocacy: Promoting a DevOps culture through ‘shift-left’ testing, non-functional (security, performance etc.) testing, continuous integration and deployment, and working with other teams to share responsibility for the software that is built.
Incident Response: Being available to help investigate incidents in real time – sometimes out of normal working hours as part of our on-call rota. Helping to ascertain impact and find observability gaps during these investigations. Being part of ‘blameless post-mortems’, focusing on collaboration and knowledge-sharing.
Workflow: Helping to create and refine work tickets, breaking down larger pieces of work into actionable pieces. Ensuring relevant knowledge is shared with the rest of the team while working on these tickets and clearly articulating any blocking circumstances.
Code Quality and Risk Mitigation: Reviewing the team’s output to ensure all code is highly maintainable, supportable, and minimises operational risk.
Mentorship and Coaching: Mentoring and guiding other team members, including less experienced engineers, providing feedback and coaching to help them reach their full potential.
Research and Learning: Researching new technologies and architectural patterns by conducting technical Proofs of Concept (PoCs), proposing improvements to existing platforms and developing new solutions. You will also be given the opportunity to do team-based and independent learning on a regular basis, to improve your own and the team’s knowledge.
What we’ll look for in you
Essential skills
Amazon Web Services: We run most of our back-end software in AWS Lambda functions, with some containerised software (ECS / Fargate) and some running on cloud servers (EC2). You will need experience and good knowledge of all of these and of general AWS networking principles (VPC, security groups etc.), plus other AWS services including, but not limited to: S3, EventBridge, SQS / SNS, RDS and DynamoDB.
Development Experience: You will be a solid developer, experienced in building high-quality, testable applications and tools. You will know how to create effective tests (unit and integration) and be familiar and comfortable with different ways of tackling a problem – for example pair and mob programming. Our stack is mainly TypeScript and Python, so experience with both would be distinctly advantageous.
Observability Knowledge: You can explain the fundamental aspects of observability, including telemetry and Real User Monitoring (RUM), in detail, and you know how to use them to observe running software effectively. You also understand sampling and how to apply it effectively.
Problem-Solving Prowess: You possess exceptional problem-solving skills, capable of addressing intricate challenges that may arise within our technology landscape. You are also a ‘detective’, using your skills and knowledge to collect evidence that eliminates what a problem is not, leading you to the most likely cause(s).
Pipeline Expertise: You will understand how to create, deploy and troubleshoot CI / CD pipelines (we use GitLab) to run tests / checks, create builds and ultimately deploy software in various environments.
Technology Proficiency: You have solid knowledge of various technologies, tools, frameworks and patterns related to the previous five points, including, but not limited to: IaC (e.g. Terraform, Pulumi, CDK), Lambda runtime programming languages (e.g. TypeScript, Python), containerised applications (Docker), event-driven architecture and POSIX-based shells (e.g. Bash, zsh).
Tech Enthusiasm: Your passion for technology drives you to explore and embrace the latest innovations continuously. This dedication to growth and learning is essential in staying ahead in our ever-evolving tech landscape.
Desirable skills
OpenTelemetry: Previous experience with, or demonstrable knowledge of, OpenTelemetry would strengthen your observability expertise and enhance our monitoring and diagnostic capabilities.
Google Cloud Platform (GCP): Although Dunelm’s distributed systems are overwhelmingly deployed on AWS, we do have strategic deployments in GCP, so any working knowledge of this platform would demonstrate your breadth of cloud knowledge.