50,000 members · Tuesday 02:14
Without observability: flying blind
backend response time: 42 ms
02:14 backend pods timeout — claims processing down
02:14 no metrics. no alert. no page.
02:15 on-call engineer: asleep
03:00 still down. still unknown.
06:14 issue self-resolves. no one knows why.
member-support · #general
Hey, I got an email from a member saying claims have been broken since 2am? Is anyone looking at this?
06:31
⚠ 4h outage · found out from a customer · full-day post-mortem
With Prometheus + Loki + Alertmanager monitoring
coverline-backend · backend pods available: 2 / 2
error rate (last 5 minutes): 0.2%
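Those two panels map to short PromQL queries. A minimal sketch of what could sit behind them, assuming kube-state-metrics is scraped and the backend exports a standard HTTP request counter; the metric and label names here are assumptions, not CoverLine's actual series:

    # pods available for the backend Deployment (kube-state-metrics)
    kube_deployment_status_replicas_available{deployment="coverline-backend"}

    # error rate over the last 5 minutes (assumed counter: http_requests_total)
    sum(rate(http_requests_total{job="coverline-backend", status=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="coverline-backend"}[5m]))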
🔴 BackendDown — FIRING · 0 pods available · severity: critical
PagerDuty · #oncall-alerts
ALERT: BackendDown FIRING. CoverLine backend 0/2 pods. Severity: critical. Assigned: karim
02:15
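A minimal sketch of an alert rule that could produce that page, assuming the Prometheus Operator's PrometheusRule CRD and kube-state-metrics; the rule name matches the incident, but the namespace, threshold hold time, and annotation text are illustrative assumptions:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: coverline-backend-alerts
      namespace: monitoring        # assumed namespace
    spec:
      groups:
        - name: coverline-backend
          rules:
            - alert: BackendDown
              # fire when the Deployment has had zero available pods for 30s,
              # which pages within roughly a minute of the pods going down
              expr: kube_deployment_status_replicas_available{deployment="coverline-backend"} == 0
              for: 30s
              labels:
                severity: critical
              annotations:
                summary: "CoverLine backend has 0/2 pods available"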
02:15 karim paged — opens Grafana on phone
02:17 Loki: OOMKilled — backend exceeded 256Mi limit
02:19 kubectl set resources — memory limit → 512Mi
02:23 pods recovered — claims processing restored
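Both investigative steps in that timeline are one-liners. A sketch, assuming pod logs and Kubernetes events are shipped to Loki under an app label (the label name is an assumption) and that the backend runs as a Deployment:

    # LogQL: find the OOM evidence in Grafana Explore
    {app="coverline-backend"} |= "OOMKilled"

    # stopgap: raise the memory limit on the live Deployment
    kubectl set resources deployment/coverline-backend --limits=memory=512Mi

Changing the limit updates the pod template, so the Deployment rolls out fresh pods under the new limit, which is consistent with the recovery a few minutes later.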
Grafana Alertmanager · #oncall-alerts
✅ RESOLVED: BackendDown. Duration: 9 minutes. Root cause: OOMKill (256Mi limit).
02:23
✓ 9 min downtime · paged at 02:15 · root cause found in Loki in 2 min
THE SAME OUTAGE. TWO OUTCOMES.
Without observability, the team woke up at 6:30 AM to a customer email about a 4-hour outage. With Prometheus + Loki + Alertmanager, the engineer was paged within 60 seconds and the root cause was in the logs 2 minutes later.
Without · total downtime 4h 00m · how the team found out: customer email
With · total downtime 9m · how the team found out: PagerDuty
BackendDown PrometheusRule fired within 60 seconds of pods going down
Loki LogQL query identified OOMKill as root cause in 2 minutes
Fix deployed, alert resolved — total downtime 9 minutes instead of 4 hours (durable manifest fix sketched below)
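The kubectl change above edits live state only; a durable version commits the new limit to the Deployment manifest. A sketch of the relevant excerpt, with an assumed request value that the incident does not state:

    # deployment.yaml (excerpt)
    containers:
      - name: coverline-backend
        resources:
          requests:
            memory: "256Mi"     # assumed request; not stated in the incident
          limits:
            memory: "512Mi"     # raised from 256Mi after the OOMKill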
env: production
phase: 6 — observability stack