Why this matters
Incidents are unavoidable. How you handle them determines customer trust, uptime, and the team's stress levels. Platform Engineers are often first responders: you triage, mitigate, coordinate, and learn. Solid incident practices shorten downtime, reduce impact, and prevent repeats.
- Real tasks: on-call response, service rollback, failover, paging the right owners, posting customer updates, coordinating engineers, and leading post-incident reviews.
- Outcomes: lower time-to-detect (TTD) and time-to-mitigate (TTM), clearer communications, and actionable improvements.
Concept explained simply
Handling incidents is structured problem-solving under pressure. You do the minimum effective action to reduce impact, tell people what's happening, then dig deeper once things are stable.
Mental model: "Stabilize, then analyze"
- Stabilize: Stop the bleeding. Roll back, fail over, rate-limit, or disable a feature flag.
- Communicate: Set expectations. Say what's impacted, what you're doing, and when you'll update next.
- Analyze: After stability, find contributing factors and fix them for good.
Core workflow
- Detect – Alerts or reports indicate a problem. Verify signal with a quick dashboard glance.
- Declare and classify – Create an incident, set a severity (e.g., Sev1–Sev4), open an incident channel, and assign roles.
- Triage – Isolate scope (which services/regions), identify recent changes, and form a hypothesis.
- Mitigate – Apply safe, reversible actions to reduce impact. Prefer rollbacks and config toggles.
- Communicate – Post clear updates internally and externally with next-update time.
- Recover – Return to normal state with checks and validation.
- Learn – Capture timeline, contributing factors, and follow-ups with owners and deadlines.
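To make the Learn step concrete, here is a minimal sketch in plain Python of an incident record that accumulates a timestamped timeline and follow-up actions as you go. The class and field names are illustrative assumptions, not the schema of any particular incident tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    """Illustrative incident record: severity, running timeline, follow-up actions."""
    ident: str
    severity: str                                   # e.g. "Sev2"
    timeline: list = field(default_factory=list)    # (timestamp, note) pairs
    actions: list = field(default_factory=list)     # follow-ups with owner and due date

    def log(self, note: str) -> None:
        """Append a timestamped entry; these become the post-incident timeline."""
        self.timeline.append((datetime.now(timezone.utc).isoformat(), note))

    def add_action(self, owner: str, task: str, due: str) -> None:
        """Record a follow-up so it is not lost after recovery."""
        self.actions.append({"owner": owner, "task": task, "due": due})

# Usage: jot entries as the incident progresses, not afterwards from memory.
inc = Incident(ident="INC-1234", severity="Sev2")
inc.log("Declared Sev2; IC and Comms assigned")
inc.log("Rollback of service X started")
inc.add_action("alice", "Add integration test for null case", "next sprint")
print(inc.timeline)
```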
What a good severity matrix looks like
- Sev1: Critical customer impact (widespread outage, data loss risk). On-call + leadership paged, updates every 15–30 min.
- Sev2: Major degradation (significant errors/latency). On-call engaged, updates every 30–60 min.
- Sev3: Moderate impact (partial, workarounds exist). Normal business hours, updates every 2–4 h.
- Sev4: Low impact (minor bug, no immediate harm). Track and fix.
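One way to make such a matrix executable, for example to drive paging policy or update reminders, is to encode it as data. This sketch mirrors the matrix above; the dictionary layout and names are our own assumptions, not a specific tool's schema.

```python
# Illustrative encoding of the severity matrix above.
SEVERITY_POLICY = {
    "Sev1": {"page": ["on-call", "leadership"], "update_every_min": (15, 30)},
    "Sev2": {"page": ["on-call"],               "update_every_min": (30, 60)},
    "Sev3": {"page": [],                        "update_every_min": (120, 240)},  # business hours
    "Sev4": {"page": [],                        "update_every_min": None},        # track and fix
}

def next_update_due(severity: str) -> str:
    """Turn the matrix entry into a human-readable update cadence."""
    window = SEVERITY_POLICY[severity]["update_every_min"]
    if window is None:
        return "no fixed cadence; track in the backlog"
    low, high = window
    return f"next update due within {low}-{high} minutes"

print(next_update_due("Sev1"))  # -> "next update due within 15-30 minutes"
```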
Worked examples
Example 1: High 5xx after deploy
- Detect: Alert says 5xx spiked to 12% within 2 min of deploy.
- Declare: Incident created as Sev2. Roles assigned: Incident Commander (IC), Communications (Comms), Ops.
- Triage: Correlate with deployment timeline; logs show null pointer in new code path.
- Mitigate: Roll back to previous version. Confirm error rate drops to baseline.
- Communicate: Initial note: "Increased errors after a recent deploy. Rolling back. Next update in 20 min."
- Recover: Post-rollback verification checks pass; traffic normal.
- Learn: Action items: add integration test for null case, canary rollout for that service, update runbook.
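The rollback-verification step in Example 1 ("confirm error rate drops to baseline") can be scripted against whatever your monitoring exposes. A minimal sketch, assuming a hypothetical fetch_5xx_rate() stand-in and made-up threshold values:

```python
# Sketch of the "confirm error rate drops to baseline" check from Example 1.
BASELINE_5XX = 0.005      # 0.5% assumed normal error rate
TOLERANCE = 2.0           # allow up to 2x baseline before flagging

def fetch_5xx_rate(service: str) -> float:
    # Placeholder: replace with a real query against your monitoring system.
    return 0.004

def rollback_verified(service: str) -> bool:
    """Return True if the post-rollback 5xx rate is back near baseline."""
    current = fetch_5xx_rate(service)
    ok = current <= BASELINE_5XX * TOLERANCE
    print(f"{service}: 5xx={current:.3%} baseline={BASELINE_5XX:.3%} ok={ok}")
    return ok

if __name__ == "__main__":
    rollback_verified("service-x")
```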
Example 2: Region outage at provider
- Detect: Elevated latency and timeouts from one region.
- Declare: Sev1 due to multi-service impact.
- Triage: Health checks failing only in region A; provider dashboard confirms issues.
- Mitigate: Shift traffic away from region A via load balancer policy. Scale up healthy regions.
- Communicate: "We see regional provider issues. Traffic is being rerouted. Next update in 15 min."
- Recover: Keep traffic away until provider resolves; gradually restore.
- Learn: Add automatic regional failover; test quarterly.
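The mitigation in Example 2 is essentially a change of load-balancer weights: drain the unhealthy region and give its share to the healthy ones. A sketch of that redistribution with an illustrative weights dict; in practice the change would go through your load balancer's API or infrastructure-as-code, not a local script.

```python
# Drain a bad region and redistribute its traffic share proportionally.
def drain_region(weights: dict, bad_region: str) -> dict:
    healthy = {r: w for r, w in weights.items() if r != bad_region and w > 0}
    freed = weights.get(bad_region, 0)
    total_healthy = sum(healthy.values())
    new_weights = {bad_region: 0}
    for region, w in healthy.items():
        # Each healthy region takes a proportional share of the drained traffic.
        new_weights[region] = w + freed * (w / total_healthy)
    return new_weights

current = {"region-a": 40, "region-b": 30, "region-c": 30}
print(drain_region(current, "region-a"))
# -> {'region-a': 0, 'region-b': 50.0, 'region-c': 50.0}
```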
Example 3: Database saturation
- Detect: P95 latency spikes; DB CPU hits 95%.
- Declare: Sev2; potential customer degradation.
- Triage: Recent feature doubles query volume; missing index found.
- Mitigate: Temporarily enable request rate-limiting; scale read replicas.
- Communicate: "Elevated latency due to database load. Rate-limiting applied, capacity increased. Next update in 20 min."
- Recover: Apply the missing index during an off-peak window; remove rate limits gradually.
- Learn: Add pre-merge query plan checks; capacity alarms tuned.
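The temporary rate limiting in Example 3 is usually a gateway or proxy setting, but the mechanism behind it is simple. A self-contained token-bucket sketch, with illustrative numbers tuned to what the database can absorb:

```python
import time

class TokenBucket:
    """Minimal token bucket: requests spend tokens, tokens refill at a fixed rate."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec        # tokens refilled per second
        self.capacity = burst           # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """True if a request may pass, False if it should be rejected (e.g. HTTP 429)."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(rate_per_sec=50, burst=100)  # tune to what the DB can absorb
print(sum(limiter.allow() for _ in range(150)))    # ~100 of 150 instant requests pass
```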
Tooling essentials (agnostic)
- On-call and paging: escalation policies, timeouts, backup rotations.
- Runbooks: first 5 minutes, rollback steps, failover steps, known issues.
- Dashboards: golden signals (latency, traffic, errors, saturation) and per-service health.
- SLOs and error budgets: guide severity and change freezes (see the error-budget sketch after this list).
- Incident channel template: auto-create channel with pinned checklist.
- Post-incident template: timeline, contributing factors, action items with owners and due dates.
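To connect the "SLOs and error budgets" item above to severity and change-freeze decisions, here is a small sketch with illustrative numbers; the function name and the freeze threshold are assumptions, not a standard.

```python
# How much of the error budget is left for the current SLO window?
def error_budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (0.0 = exhausted, 1.0 = untouched)."""
    allowed_bad = (1 - slo) * total          # failures the SLO permits this window
    actual_bad = total - good
    if allowed_bad == 0:
        return 0.0
    return max(0.0, 1 - actual_bad / allowed_bad)

# 99.9% SLO, 10M requests this window, 4,000 failures:
remaining = error_budget_remaining(slo=0.999, good=9_996_000, total=10_000_000)
print(f"{remaining:.0%} of error budget left")   # -> 60% of error budget left
# A team might freeze risky changes once this drops below, say, 25%.
```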
Copy-paste: First 5 minutes checklist
- 1) Acknowledge page and declare incident with severity.
- 2) Assign roles (IC, Comms, Ops, SME) and open incident channel.
- 3) Check golden signals; verify scope.
- 4) Identify and roll back last risky change (if applicable).
- 5) Post initial update with next update time.
Communication templates
Incident channel kickoff
INC-#### | Severity: Sev2 | Owner (IC): Name
Impact: Users seeing 5xx in API
Start time: 14:05 UTC
Next update: 14:25 UTC
Roles: IC, Comms, Ops, SME
Actions underway: rollback of service X
Internal update
Update at 14:10 UTC: Elevated 5xx tied to service X deploy. Rollback in progress. Impact: ~10% API requests failing. Next update 14:25 UTC.
Customer-facing note
We're addressing increased errors affecting some API requests. A fix is being applied. We will provide another update by 14:30 UTC.
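The three templates above share the same skeleton: impact, action, next update time. A small formatter sketch (field names are ours) makes it harder to forget one of them:

```python
# Build a status update that always carries impact, action, and next update time.
def status_update(time_utc: str, impact: str, action: str, next_update_utc: str) -> str:
    return (
        f"Update at {time_utc} UTC: {impact} "
        f"{action} "
        f"Next update {next_update_utc} UTC."
    )

print(status_update(
    time_utc="14:10",
    impact="Elevated 5xx tied to service X deploy; ~10% of API requests failing.",
    action="Rollback in progress.",
    next_update_utc="14:25",
))
```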
Exercises
Do these hands-on tasks. They mirror the exercises below.
- ex1 — Create a one-page incident runbook stub
- Define severity levels, first 5 minutes checklist, rollback steps for one service, escalation contacts, and a status update template.
- ex2 — Triage a sudden 5xx spike
- Given metrics and a recent change, decide the likely cause, immediate mitigation, next diagnostics, and write a first update message.
Self-check checklist
- Did you include a clear severity matrix with update cadences?
- Is your first 5 minutes checklist executable by any on-call, not just experts?
- Are rollback steps safe, reversible, and tested?
- Does your status template include impact, action, and next update time?
- For the 5xx spike, did you consider recent deploys first?
Common mistakes and how to self-check
- Delaying incident declaration. Fix: If in doubt, declare anyway; you can downgrade or close it later.
- Chasing root cause before mitigation. Fix: Stabilize first with safe actions.
- Vague communications. Fix: Always include impact, action, and next update time.
- No ownership. Fix: Assign Incident Commander and Comms explicitly.
- Skipping the timeline. Fix: Jot down timestamps as you go; you'll need them later.
- Over-correcting with risky changes. Fix: Prefer rollbacks, toggles, and traffic shifts.
Quick self-audit
- Can a new on-call follow your runbook without context?
- Do your dashboards show golden signals prominently?
- Do incidents have clear closure criteria and follow-ups with owners?
Practical projects
- Build a service runbook: include severity matrix, first 5 minutes, rollback and failover steps, and comms templates.
- Run a 30-minute game day: simulate a bad deploy, practice rollback, write two updates, and capture a timeline.
- Create a "last 24h changes" dashboard: deploys, config toggles, and schema changes to speed triage.
Learning path
- Define severities and communication cadences for your team.
- Create and test a first 5 minutes checklist.
- Add rollback and feature flag procedures to runbooks.
- Instrument golden signals and connect alerts to on-call.
- Practice with a monthly incident drill and iterate on gaps.
Who this is for
- Platform Engineers and SREs participating in on-call.
- Backend Engineers who own services in production.
- Engineering Managers who coordinate incident response.
Prerequisites
- Basic understanding of your stack (HTTP, services, databases, queues).
- Familiarity with monitoring/alerting and dashboards.
- Ability to deploy/rollback (CI/CD and feature flags).
Next steps
- Expand runbooks to include disaster recovery and regional failover.
- Introduce canary releases and progressive delivery to reduce blast radius.
- Adopt blameless post-incident reviews with clear action owners and deadlines.
Mini challenge
In 15 minutes, draft a severity matrix and a first 5 minutes checklist for one service you own. Keep it to one page. Share it with a teammate to validate.
Quick Test
Take the quick test below to check your understanding.