During incident
- Roles: incident commander, comms lead, ops/feature SMEs, scribe.
- Declare severity quickly; open shared channel/bridge; timestamp actions.
- Stabilize first: roll back, feature-flag off, scale up, or shed load.
Runbooks & tooling
- Prebuilt runbooks per service: restart/rollback steps, dashboards, logs, feature flags.
- One-click access to dashboards (metrics, traces, logs), recent deploys, and toggles.
- Paging rules with escalation; avoid noisy alerts.
Comms
- Single source of truth: incident doc; external status page if needed.
- Regular updates with impact, scope, mitigation, ETA.
After incident
- Blameless postmortem; timeline, root causes, contributing factors.
- Action items with owners/deadlines; track to completion.
- Add tests/alerts/runbook updates; reduce time-to-detect and time-to-recover.
