During incident

  • Roles: incident commander, comms lead, ops/feature SMEs, scribe.
  • Declare severity quickly; open shared channel/bridge; timestamp actions.
  • Stabilize first: roll back, feature-flag off, scale up, or shed load.

Runbooks & tooling

  • Prebuilt runbooks per service: restart/rollback steps, dashboards, logs, feature flags.
  • One-click access to dashboards (metrics, traces, logs), recent deploys, and toggles.
  • Paging rules with escalation; avoid noisy alerts.

Comms

  • Single source of truth: incident doc; external status page if needed.
  • Regular updates with impact, scope, mitigation, ETA.

After incident

  • Blameless postmortem; timeline, root causes, contributing factors.
  • Action items with owners/deadlines; track to completion.
  • Add tests/alerts/runbook updates; reduce time-to-detect and time-to-recover.