Observability Engineering Lead
<b>Requirements:</b>
<ul><li>Deep expertise in designing, implementing, and configuring modern observability stacks, specifically Prometheus, Grafana, and associated tooling.</li><li>Strong grasp of instrumentation strategy (exporters, service discovery, custom metrics).</li><li>Advanced PromQL skills for complex querying and performance analysis.</li><li>Experience building recording/alerting rules and optimising metric ingestion.</li><li>Knowledge of HA architectures, federation, sharding, and long-term storage (Thanos, Cortex, Mimir).</li><li>Grafana dashboard and panel design focused on performance and operator clarity.</li><li>Best-practice alert configuration and routing.</li><li>Experience with synthetic monitoring (Grafana Synthetic Monitoring, Blackbox exporter).</li><li>Log ingestion and analysis (Loki).</li><li>Familiarity with Real User Monitoring tooling (e.g., Grafana Faro).</li><li>Strong API and automation skills for dashboard provisioning, alert management, and data ingestion.</li><li>Experience integrating the Grafana/Prometheus ecosystem with logging, tracing, and event platforms (Loki, Tempo, OpenTelemetry).</li></ul>
<b>Responsibilities:</b>
<ul><li>Drive the uplift, resilience, and effectiveness of our monitoring ecosystem.</li><li>Partner with engineering teams to deliver world-class insights through metrics, dashboards, alerts, and automation.</li><li>Influence standards, modernise tooling, and enhance visibility across complex distributed systems.</li><li>Collaborate with Application Stewards and SREs to validate which critical assets require monitoring verification and uplift.</li><li>Analyse Prometheus scrape coverage, exporter deployment, and Grafana dashboard availability for critical services.</li><li>Identify and implement improvements across monitoring configurations, alert quality, data models, dashboards, KPIs, SLIs, and SLOs.</li><li>Review roles and responsibilities across observability functions and recommend enhancements aligned to Operational Resilience standards.</li><li>Contribute to delivering automated, end-to-end business flow visibility, surfaced in Grafana through service maps, dependency visualisation, or topology integrations.</li><li>Ensure alerting configurations are reliable, actionable, and low-noise, following Alertmanager best practices.</li></ul>
<b>Technologies:</b>
<ul><li>API</li><li>Flow</li><li>Grafana</li><li>OpenTelemetry</li><li>Prometheus</li><li>DevOps</li></ul>
<p><b>More:</b></p>
<p>We are seeking a highly skilled Observability Engineering Lead to shape how we detect, diagnose, and prevent issues across our critical applications. This is a hands-on technical leadership position with a pivotal role in our team, working onsite 2 days a week. We are committed to fostering an inclusive environment as we strive for excellence in our monitoring practices.</p>
<p>Last updated: week 8 of 2026</p>