without observability
flying blind
backend response time (ms)
42ms
02:14 backend pods time out — claims processing down
02:14 no metrics. no alert. no page.
02:15 on-call engineer: asleep
03:00 still down. still unknown.
06:14 issue self-resolves. no one knows why.
member-support · #general
Hey, I got an email from a member saying claims have been broken since 2am? Is anyone looking at this?
06:31
⚠ 4h outage · found out from a member · full-day post-mortem
with prometheus + loki + alertmanager
monitoring
backend pods available
2 / 2
coverline-backend
error rate
0.2%
last 5 minutes
🔴 BackendDown — FIRING · 0 pods available · severity: critical
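A minimal sketch of the PromQL that could back the two panels above, assuming kube-state-metrics is scraped and the backend exposes a standard request counter (the http_requests_total metric and its labels are assumptions):

```promql
# pods available for the coverline-backend Deployment (kube-state-metrics)
kube_deployment_status_replicas_available{deployment="coverline-backend"}

# 5-minute error rate: share of requests returning 5xx
sum(rate(http_requests_total{app="coverline-backend", code=~"5.."}[5m]))
  / sum(rate(http_requests_total{app="coverline-backend"}[5m]))
```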
PagerDuty · #oncall-alerts
ALERT: BackendDown FIRING. CoverLine backend 0/2 pods. Severity: critical. Assigned: karim
02:15
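A hedged sketch of a Prometheus alerting rule that could produce the BackendDown page above. The expression reuses the pods-available query; routing to PagerDuty is assumed to be configured separately in Alertmanager.

```yaml
groups:
  - name: coverline-backend
    rules:
      - alert: BackendDown
        # no pods available for a full minute -> page immediately
        expr: kube_deployment_status_replicas_available{deployment="coverline-backend"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CoverLine backend has 0 pods available"
          description: "Claims processing is down; check pod status and recent logs in Loki."
```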
02:15 karim paged — opens Grafana on phone
02:17 Loki: OOMKilled — backend exceeded 256Mi limit
02:19 kubectl set resources — memory limit → 512Mi
02:23 pods recovered — claims processing restored
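A rough reconstruction of the 02:17 and 02:19 steps. It assumes Kubernetes event or kubelet logs are shipped to Loki (the label selector is hypothetical); kubectl set resources is the command behind the limit bump.

```bash
# find the kill reason in Loki (selector is an assumption; requires event/kubelet logs in Loki)
logcli query --since=15m '{job="kubernetes-events"} |= "coverline-backend" |= "OOMKilled"'

# raise the backend memory limit from 256Mi to 512Mi
kubectl set resources deployment/coverline-backend --limits=memory=512Mi
```

A limit bump like this is a stopgap; the longer-term fix is sizing requests and limits from the memory usage the dashboards already record.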
Grafana Alertmanager · #oncall-alerts
✅ RESOLVED: BackendDown. Duration: 9 minutes. Root cause: OOMKill (256Mi limit).
02:23
✓ 9 min downtime · paged at 02:15 · root cause found in Loki in 2 min