50,000 members · Tuesday 02:14
Without observability: flying blind
backend response time: 42 ms
02:14 backend pods timeout — claims processing down
02:14 no metrics. no alert. no page.
02:15 on-call engineer: asleep
03:00 still down. still unknown.
06:14 issue self-resolves. no one knows why.
member-support · #general
Hey, I got an email from a member saying claims have been broken since 2am? Is anyone looking at this?
06:31
⚠ 4h outage · found out from a customer · full-day post-mortem
With Prometheus + Loki + Alertmanager monitoring
coverline-backend · backend pods available: 2 / 2
error rate (last 5 minutes): 0.2%
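Those two panels map to short PromQL queries. A minimal sketch of what could sit behind them, assuming kube-state-metrics is scraped and the backend exports a standard HTTP request counter; the metric and label names here are assumptions, not CoverLine's actual series:

    # pods available for the backend Deployment (kube-state-metrics)
    kube_deployment_status_replicas_available{deployment="coverline-backend"}

    # error rate over the last 5 minutes (assumed counter: http_requests_total)
    sum(rate(http_requests_total{job="coverline-backend", status=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="coverline-backend"}[5m]))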
🔴 BackendDown — FIRING · 0 pods available · severity: critical
PagerDuty · #oncall-alerts
ALERT: BackendDown FIRING. CoverLine backend 0/2 pods. Severity: critical. Assigned: karim
02:15
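A minimal sketch of an alert rule that could produce that page, assuming the Prometheus Operator's PrometheusRule CRD and kube-state-metrics; the rule name matches the incident, but the namespace, threshold hold time, and annotation text are illustrative assumptions:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: coverline-backend-alerts
      namespace: monitoring        # assumed namespace
    spec:
      groups:
        - name: coverline-backend
          rules:
            - alert: BackendDown
              # fire when the Deployment has had zero available pods for 30s,
              # which pages within roughly a minute of the pods going down
              expr: kube_deployment_status_replicas_available{deployment="coverline-backend"} == 0
              for: 30s
              labels:
                severity: critical
              annotations:
                summary: "CoverLine backend has 0/2 pods available"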
02:15 karim paged — opens Grafana on phone
02:17 Loki: OOMKilled — backend exceeded 256Mi limit
02:19 kubectl set resources — memory limit → 512Mi
02:23 pods recovered — claims processing restored
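Both investigative steps in that timeline are one-liners. A sketch, assuming pod logs and Kubernetes events are shipped to Loki under an app label (the label name is an assumption) and that the backend runs as a Deployment:

    # LogQL: find the OOM evidence in Grafana Explore
    {app="coverline-backend"} |= "OOMKilled"

    # stopgap: raise the memory limit on the live Deployment
    kubectl set resources deployment/coverline-backend --limits=memory=512Mi

Changing the limit updates the pod template, so the Deployment rolls out fresh pods under the new limit, which is consistent with the recovery a few minutes later.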
Grafana Alertmanager · #oncall-alerts
✅ RESOLVED: BackendDown. Duration: 9 minutes. Root cause: OOMKill (256Mi limit).
02:23
✓ 9 min downtime · paged at 02:15 · root cause found in Loki in 2 min
THE SAME OUTAGE. TWO OUTCOMES.
Without observability, the team woke up at 6:30 AM to a customer email about a 4-hour outage. With Prometheus + Loki + Alertmanager, the engineer was paged within 60 seconds and the root cause was in the logs 2 minutes later.
Without · total downtime 4h 00m · how the team found out: customer email
With · total downtime 9m · how the team found out: PagerDuty
BackendDown PrometheusRule fired within 60 seconds of pods going down
Loki LogQL query identified OOMKill as root cause in 2 minutes
Fix deployed, alert resolved — total downtime 9 minutes instead of 4 hours (durable manifest fix sketched below)
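The kubectl change above edits live state only; a durable version commits the new limit to the Deployment manifest. A sketch of the relevant excerpt, with an assumed request value that the incident does not state:

    # deployment.yaml (excerpt)
    containers:
      - name: coverline-backend
        resources:
          requests:
            memory: "256Mi"     # assumed request; not stated in the incident
          limits:
            memory: "512Mi"     # raised from 256Mi after the OOMKill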
env: production
phase: 6 — observability stack