● INCIDENT P0
09:00:00 AM
📱 PagerDuty · now
🚨 P0 — CoverLine Production
Member portal returning 504 errors. Claims API unresponsive. 40,000 members affected.
09:14 AM — Open Enrollment Day 1
cluster nodes
node-1 READY · CPU 22% · Memory 1.2 GB · Pods 5 / 8
node-2 READY · CPU 18% · Memory 1.0 GB · Pods 4 / 8
node-3 READY · CPU 25% · Memory 1.3 GB · Pods 3 / 8
panels: pod health (default namespace) · http 5xx errors / min · live logs
incident metrics
0.1% error rate · 847 req / s · 124 ms p95 latency · 12 pods running · 3 nodes
incident timeline
env: production
⚠ INCIDENT DURATION: 00:00
cluster: platform-eng-lab-will-gke · us-central1
POST-MORTEM
CoverLine Open Enrollment · November 9, 2023
45 minutes down
40k members impacted
3 enterprise clients called
✦ Phase 8 — What we built so this never happens again
Resource limits — no pod can starve its neighbours
Horizontal Pod Autoscaler — backend scales 2 → 8 pods on CPU spike
Cluster Autoscaler — new node provisions in <4 min, hands-free
Pod Disruption Budget — upgrades never kill more than 1 pod at once
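
To make the 2 → 8 scaling behaviour above concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric), clamped to the configured bounds. The 2 and 8 bounds come from the list above. The 60% CPU target in the usage lines is an assumption for illustration only (the post-mortem does not state the target), and `desired_replicas` is a hypothetical helper name, not part of any Kubernetes client library.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas=2, max_replicas=8, tolerance=0.1):
    """Sketch of the documented HPA rule, clamped to [min_replicas, max_replicas].

    min/max match the "2 -> 8 pods" bounds in the list above; tolerance=0.1
    mirrors the Kubernetes default. target_cpu_pct is supplied by the caller
    and is an assumption here, since the source does not give it.
    """
    ratio = current_cpu_pct / target_cpu_pct
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Assumed 60% CPU target, for illustration only.
print(desired_replicas(2, 90, 60))   # moderate spike
print(desired_replicas(2, 300, 60))  # heavy spike, capped at max_replicas
print(desired_replicas(4, 15, 60))   # quiet period, floored at min_replicas
```

The clamp is the part that matters for an event like Open Enrollment: once the backend reaches 8 pods, further load has to be absorbed by the pods themselves, so the ceiling and the per-pod resource limits need to be sized together.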