CoverLine — Open Enrollment Incident

◈ CoverLine SRE · Dashboard

● INCIDENT P0

09:00:00 AM

📱 PagerDuty · now

🚨 P0 — CoverLine Production

Member portal returning 504 errors. Claims API unresponsive. 40,000 members affected.

09:14 AM — Open Enrollment Day 1

cluster nodes

node-1 READY · CPU 22% · Memory 1.2 GB · Pods 5 / 8

node-2 READY · CPU 18% · Memory 1.0 GB · Pods 4 / 8

node-3 READY · CPU 25% · Memory 1.3 GB · Pods 3 / 8

panels: pod health (default namespace) · HTTP 5xx errors / min · live logs

incident metrics

0.1% error rate · 847 req/s · 124 ms p95 latency

pods running

nodes

incident timeline

env: production

⚠ INCIDENT DURATION: 00:00

cluster: platform-eng-lab-will-gke · us-central1

POST-MORTEM

CoverLine Open Enrollment · November 9, 2023

45 minutes down

40k members impacted

3 enterprise clients called

✦ Phase 8 — What we built so this never happens again

✓ Resource limits: no pod can starve its neighbours

✓ Horizontal Pod Autoscaler: backend scales 2 → 8 pods on CPU spike

✓ Cluster Autoscaler: new node provisions in < 4 min, hands-free

✓ Pod Disruption Budget: upgrades never kill more than 1 pod at once
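
To make the Horizontal Pod Autoscaler and Pod Disruption Budget items above concrete, here is a minimal Python sketch of the two decisions involved. It is an illustration, not CoverLine's actual configuration. The replica calculation follows the documented HPA rule (ceil of current replicas × current metric / target metric, clamped to the min/max bounds) and omits the controller's tolerance band and stabilization window. The 2 and 8 bounds and the "1 pod at once" limit come from the list above; the 70% CPU target and the function names `desired_replicas` and `eviction_allowed` are assumptions made for this example.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas=2, max_replicas=8):
    """Simplified HPA rule: scale proportionally to the metric ratio.

    Utilization is a percentage of the pods' CPU *requests*, so it can
    exceed 100 during a spike. The real controller also applies a
    tolerance band and a stabilization window, both omitted here.
    """
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    # Clamp to the configured bounds (2 and 8 in the post-mortem list).
    return max(min_replicas, min(max_replicas, raw))


def eviction_allowed(healthy_pods, desired_pods, max_unavailable=1):
    """Simplified PDB check for a maxUnavailable-style budget.

    A voluntary eviction (for example a node drain during an upgrade)
    is allowed only if the number of unavailable pods after the
    eviction stays within the budget.
    """
    unavailable_after = desired_pods - (healthy_pods - 1)
    return unavailable_after <= max_unavailable


if __name__ == "__main__":
    TARGET = 70  # assumed CPU utilization target, in percent

    # Moderate load on the 2-pod baseline: scale up by one pod.
    print(desired_replicas(2, 95, TARGET))
    # Severe spike: the proportional answer exceeds the cap of 8.
    print(desired_replicas(2, 350, TARGET))
    # Quiet period: never drop below the floor of 2.
    print(desired_replicas(4, 10, TARGET))

    # All 4 pods healthy: one may be evicted.
    print(eviction_allowed(healthy_pods=4, desired_pods=4))
    # One pod already down: a second eviction is refused.
    print(eviction_allowed(healthy_pods=3, desired_pods=4))
```

The clamp is what bounds the "2 → 8" range: however large the spike, the autoscaler asks for at most 8 pods, and the Cluster Autoscaler is what supplies nodes if those pods do not fit on the existing three.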