Handling Incidents And Outages

Learn Handling Incidents And Outages for free with explanations, exercises, and a quick test (for Platform Engineers).

Published: January 23, 2026 | Updated: January 23, 2026

Why this matters

Incidents are unavoidable. How you handle them determines customer trust, uptime, and the team's stress levels. Platform Engineers are often first responders: you triage, mitigate, coordinate, and learn. Solid incident practices shorten downtime, reduce impact, and prevent repeats.

  • Real tasks: on-call response, service rollback, failover, paging the right owners, posting customer updates, coordinating engineers, and leading post-incident reviews.
  • Outcomes: shorter time-to-detect (TTD) and time-to-mitigate (TTM), clearer communications, and actionable improvements.

Concept explained simply

Handling incidents is structured problem-solving under pressure. You do the minimum effective action to reduce impact, tell people what's happening, then dig deeper once things are stable.

Mental model: "Stabilize, then analyze"

  • Stabilize: Stop the bleeding. Roll back, fail over, rate-limit, or disable a feature flag.
  • Communicate: Set expectations. Say what's impacted, what you're doing, and when you'll update next.
  • Analyze: After stability, find contributing factors and fix them for good.
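
"Stabilize" usually means one of a small set of reversible actions. As a minimal illustration of the feature-flag option, here is a Python sketch of a kill switch; the in-memory flag store and the flag name are assumptions for illustration, not a specific product.

# Minimal sketch of a feature-flag kill switch (hypothetical in-memory store).
# Disabling a flag is reversible and far safer than an emergency code change.

FLAGS = {"new_checkout_flow": True}  # assumed flag store for illustration

def is_enabled(flag: str) -> bool:
    """Return the current state of a flag; default to off if unknown."""
    return FLAGS.get(flag, False)

def kill_switch(flag: str) -> None:
    """Disable a risky feature during an incident; re-enable after recovery."""
    FLAGS[flag] = False

if __name__ == "__main__":
    if is_enabled("new_checkout_flow"):
        kill_switch("new_checkout_flow")  # stabilize: turn the feature off
    print("new_checkout_flow enabled:", is_enabled("new_checkout_flow"))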

Core workflow

  1. Detect – Alerts or reports indicate a problem. Verify signal with a quick dashboard glance.
  2. Declare and classify – Create an incident, set a severity (e.g., Sev1–Sev4), open an incident channel, and assign roles.
  3. Triage – Isolate scope (which services/regions), identify recent changes, and form a hypothesis.
  4. Mitigate – Apply safe, reversible actions to reduce impact. Prefer rollbacks and config toggles.
  5. Communicate – Post clear updates internally and externally with next-update time.
  6. Recover – Return to normal state with checks and validation.
  7. Learn – Capture timeline, contributing factors, and follow-ups with owners and deadlines.
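
A lightweight way to internalize this workflow is to model the incident record you keep as you go. The sketch below is illustrative only; the field names are assumptions, not a specific tool's schema. Logging timestamps at each step makes the post-incident timeline (step 7) nearly free.

# Sketch of an incident record that mirrors the workflow above (assumed fields).
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    title: str
    severity: str                       # e.g. "Sev1".."Sev4"
    timeline: list = field(default_factory=list)
    open: bool = True

    def log(self, event: str) -> None:
        """Append a timestamped entry; this becomes the post-incident timeline."""
        self.timeline.append((datetime.now(timezone.utc).isoformat(), event))

inc = Incident(title="High 5xx on API", severity="Sev2")
inc.log("Declared; roles assigned (IC, Comms, Ops)")
inc.log("Mitigation: rollback of service X started")
inc.log("Error rate back to baseline; recovery checks passed")
inc.open = False
for ts, event in inc.timeline:
    print(ts, "-", event)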

What a good severity matrix looks like
  • Sev1: Critical customer impact (widespread outage, data loss risk). On-call + leadership paged, updates every 15–30 min.
  • Sev2: Major degradation (significant errors/latency). On-call engaged, updates every 30–60 min.
  • Sev3: Moderate impact (partial, workarounds exist). Normal business hours, updates every 2–4 h.
  • Sev4: Low impact (minor bug, no immediate harm). Track and fix.
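
The matrix translates directly into simple routing rules. The sketch below encodes it as data so tooling can pick a paging target and update cadence from the chosen severity; the targets and numbers mirror the matrix above and are only an example policy.

# Severity matrix as data: who gets paged and how often to post updates.
SEVERITY_POLICY = {
    "Sev1": {"page": ["on-call", "leadership"], "update_minutes": 15},
    "Sev2": {"page": ["on-call"], "update_minutes": 30},
    "Sev3": {"page": [], "update_minutes": 120},   # business hours only
    "Sev4": {"page": [], "update_minutes": None},  # track and fix
}

def policy_for(severity: str) -> dict:
    """Look up paging and update cadence; fall back to the most cautious policy."""
    return SEVERITY_POLICY.get(severity, SEVERITY_POLICY["Sev1"])

print(policy_for("Sev2"))  # {'page': ['on-call'], 'update_minutes': 30}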

Worked examples

Example 1: High 5xx after deploy
  1. Detect: Alert says 5xx spiked to 12% within 2 min of deploy.
  2. Declare: Incident created as Sev2. Roles assigned: Incident Commander (IC), Communications (Comms), Ops.
  3. Triage: Correlate with deployment timeline; logs show null pointer in new code path.
  4. Mitigate: Roll back to previous version. Confirm error rate drops to baseline.
  5. Communicate: Initial note: "Increased errors after a recent deploy. Rolling back. Next update in 20 min."
  6. Recover: Post-rollback verification checks pass; traffic normal.
  7. Learn: Action items: add integration test for null case, canary rollout for that service, update runbook.
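
The mitigation decision in this example can be scripted into the runbook. Below is a hedged sketch: fetch_5xx_rate() and rollback() are placeholders for your metrics and deployment tooling, and the 2% threshold is an assumed baseline, not a universal rule.

# Sketch: roll back when the post-deploy 5xx rate exceeds a baseline threshold.
BASELINE_5XX_PERCENT = 2.0   # assumed normal ceiling for this service

def fetch_5xx_rate() -> float:
    """Placeholder: return the current 5xx rate (percent) from your metrics store."""
    return 12.0  # value from the worked example

def rollback(service: str, to_version: str) -> None:
    """Placeholder: trigger the previous known-good release via your CD system."""
    print(f"Rolling back {service} to {to_version}")

rate = fetch_5xx_rate()
if rate > BASELINE_5XX_PERCENT:
    rollback("service-x", to_version="previous")
    print(f"5xx at {rate:.1f}% (> {BASELINE_5XX_PERCENT}%), rollback initiated")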

Example 2: Region outage at provider
  1. Detect: Elevated latency and timeouts from one region.
  2. Declare: Sev1 due to multi-service impact.
  3. Triage: Health checks failing only in region A; provider dashboard confirms issues.
  4. Mitigate: Shift traffic away from region A via load balancer policy. Scale up healthy regions.
  5. Communicate: "We see regional provider issues. Traffic is being rerouted. Next update in 15 min."
  6. Recover: Keep traffic away until provider resolves; gradually restore.
  7. Learn: Add automatic regional failover; test quarterly.
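
Shifting traffic away from an unhealthy region is usually a weight change at the load balancer. The sketch below models that as plain data; the region names and starting weights are invented for illustration, and apply_weights() stands in for your load balancer or DNS API.

# Sketch: drain an unhealthy region by reweighting traffic across healthy ones.
weights = {"region-a": 34, "region-b": 33, "region-c": 33}  # percent of traffic

def drain(region: str, weights: dict) -> dict:
    """Set the bad region to 0 and spread its share across healthy regions."""
    healthy = [r for r in weights if r != region]
    share = weights[region] // len(healthy)
    new = {r: weights[r] + share for r in healthy}
    new[region] = 0
    return new

def apply_weights(new_weights: dict) -> None:
    """Placeholder: push the new weights to your load balancer policy."""
    print("Applying:", new_weights)

apply_weights(drain("region-a", weights))  # region-a drained, b/c take the load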

Example 3: Database saturation
  1. Detect: P95 latency spikes; DB CPU hits 95%.
  2. Declare: Sev2; potential customer degradation.
  3. Triage: Recent feature doubles query volume; missing index found.
  4. Mitigate: Temporarily enable request rate-limiting; scale read replicas.
  5. Communicate: "Elevated latency due to database load. Rate-limiting applied, capacity increased. Next update in 20 min."
  6. Recover: Apply the missing index off-peak; remove rate limits gradually.
  7. Learn: Add pre-merge query plan checks; tune capacity alarms.
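
Temporary rate-limiting can be as simple as a token bucket in front of the expensive query path. The sketch below is a generic token bucket, not a specific library; the rate and capacity are assumptions to tune against your database headroom.

# Generic token-bucket limiter: allows bursts up to `capacity`, then throttles
# to `rate` requests per second. Numbers here are illustrative.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be shed."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(rate=50, capacity=100)  # assumed safe load for the DB
for i in range(5):
    print("request", i, "allowed" if limiter.allow() else "shed")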

Tooling essentials (agnostic)

  • On-call and paging: escalation policies, timeouts, backup rotations.
  • Runbooks: first 5 minutes, rollback steps, failover steps, known issues.
  • Dashboards: golden signals (latency, traffic, errors, saturation) and per-service health.
  • SLOs and error budgets: guide severity and change freezes (see the error budget sketch after this list).
  • Incident channel template: auto-create channel with pinned checklist.
  • Post-incident template: timeline, contributing factors, action items with owners and due dates.
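
To make the SLO and error-budget bullet concrete, the sketch below computes a monthly error budget in minutes from an availability target and how much of it one incident burned. The 99.9% target and 18-minute incident are examples, not recommendations.

# Error budget in minutes for a monthly availability SLO (example: 99.9%).
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def error_budget_minutes(slo: float) -> float:
    """Total allowed downtime per month for a given availability SLO."""
    return (1 - slo) * MINUTES_PER_MONTH

budget = error_budget_minutes(0.999)  # ~43.2 minutes/month
incident_minutes = 18                 # e.g. time spent at elevated errors
print(f"Budget: {budget:.1f} min, burned: {incident_minutes} min, "
      f"remaining: {budget - incident_minutes:.1f} min")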

Copy-paste: First 5 minutes checklist
  • 1) Acknowledge page and declare incident with severity.
  • 2) Assign roles (IC, Comms, Ops, SME) and open incident channel.
  • 3) Check golden signals; verify scope.
  • 4) Identify and roll back the last risky change (if applicable).
  • 5) Post initial update with next update time.

Communication templates

Incident channel kickoff
INC-#### | Severity: Sev2 | Owner (IC): Name
Impact: Users seeing 5xx in API
Start time: 14:05 UTC
Next update: 14:25 UTC
Roles: IC, Comms, Ops, SME
Actions underway: rollback of service X

Internal update
Update at 14:10 UTC: Elevated 5xx tied to service X deploy. Rollback in progress. Impact: ~10% API requests failing. Next update 14:25 UTC.

Customer-facing note
We're addressing increased errors affecting some API requests. A fix is being applied. We will provide another update by 14:30 UTC.
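
These templates stay consistent if they live in code next to the runbook. The sketch below fills the kickoff and internal-update templates from a single incident dict; the field names mirror the examples above, and posting to a chat tool is deliberately left out.

# Fill the kickoff and internal-update templates from one incident dict,
# so every update carries impact, action, and a next-update time.
KICKOFF = (
    "INC-{number} | Severity: {severity} | Owner (IC): {ic}\n"
    "Impact: {impact}\n"
    "Start time: {start} UTC\n"
    "Next update: {next_update} UTC\n"
    "Actions underway: {action}"
)
UPDATE = "Update at {now} UTC: {summary}. Impact: {impact}. Next update {next_update} UTC."

incident = {
    "number": "0042", "severity": "Sev2", "ic": "A. Oncall",
    "impact": "~10% of API requests failing with 5xx",
    "start": "14:05", "next_update": "14:25",
    "action": "rollback of service X",
}

print(KICKOFF.format(**incident))
print(UPDATE.format(now="14:10",
                    summary="Elevated 5xx tied to service X deploy; rollback in progress",
                    **incident))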

Exercises

Do these hands-on tasks. They mirror the exercises below.

  1. ex1 — Create a one-page incident runbook stub
    • Define severity levels, first 5 minutes checklist, rollback steps for one service, escalation contacts, and a status update template.
  2. ex2 — Triage a sudden 5xx spike
    • Given metrics and a recent change, decide the likely cause, immediate mitigation, next diagnostics, and write a first update message.

Self-check checklist
  • Did you include a clear severity matrix with update cadences?
  • Is your first 5 minutes checklist executable by any on-call, not just experts?
  • Are rollback steps safe, reversible, and tested?
  • Does your status template include impact, action, and next update time?
  • For the 5xx spike, did you consider recent deploys first?

Common mistakes and how to self-check

  • Delaying incident declaration. Fix: If in doubt, declare anyway; you can downgrade or close it later.
  • Chasing root cause before mitigation. Fix: Stabilize first with safe actions.
  • Vague communications. Fix: Always include impact, action, and next update time.
  • No ownership. Fix: Assign Incident Commander and Comms explicitly.
  • Skipping timeline. Fix: Jot timestamps as you go; you'll need them later.
  • Over-correcting with risky changes. Fix: Prefer rollbacks, toggles, and traffic shifts.

Quick self-audit
  • Can a new on-call follow your runbook without context?
  • Do your dashboards show golden signals prominently?
  • Do incidents have clear closure criteria and follow-ups with owners?

Practical projects

  • Build a service runbook: include severity matrix, first 5 minutes, rollback and failover steps, and comms templates.
  • Run a 30-minute game day: simulate a bad deploy, practice rollback, write two updates, and capture a timeline.
  • Create a "last 24h changes" dashboard: deploys, config toggles, and schema changes to speed triage.
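
For the "last 24h changes" project, the core is merging change events from a few sources into one time-sorted view. In the sketch below the sources are faked as in-memory lists; in practice they would come from your CI/CD, feature-flag, and migration systems.

# Merge recent change events (deploys, flag toggles, schema changes) into one
# time-sorted view to speed triage. Sources here are faked in memory.
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
deploys = [{"ts": now - timedelta(hours=2),  "type": "deploy", "what": "service-x v1.4.2"}]
toggles = [{"ts": now - timedelta(hours=5),  "type": "flag",   "what": "new_checkout_flow on"}]
schema  = [{"ts": now - timedelta(hours=30), "type": "schema", "what": "add index on orders"}]

def last_24h(*sources):
    """Return all events newer than 24 hours, most recent first."""
    cutoff = now - timedelta(hours=24)
    events = [e for src in sources for e in src if e["ts"] >= cutoff]
    return sorted(events, key=lambda e: e["ts"], reverse=True)

for e in last_24h(deploys, toggles, schema):
    print(e["ts"].strftime("%H:%M"), e["type"], e["what"])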

Learning path

  1. Define severities and communication cadences for your team.
  2. Create and test a first 5 minutes checklist.
  3. Add rollback and feature flag procedures to runbooks.
  4. Instrument golden signals and connect alerts to on-call (see the sketch after this list).
  5. Practice with a monthly incident drill and iterate on gaps.
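
For step 4, a golden-signals alert is ultimately a threshold over a recent window. The sketch below evaluates error rate and p95 latency against example thresholds; the numbers and the sample data are assumptions, and a real setup would read them from your monitoring system.

# Sketch: evaluate two golden signals (error rate, p95 latency) against
# example thresholds and decide whether to page.
def p95(samples: list) -> float:
    """Rough p95 by index into the sorted samples (fine for a sketch)."""
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

latencies_ms = [120, 135, 150, 180, 200, 240, 900, 160, 175, 140]
requests, errors = 10_000, 350

alerts = []
if errors / requests > 0.02:   # more than 2% of requests failing
    alerts.append("error-rate")
if p95(latencies_ms) > 500:    # p95 latency above 500 ms
    alerts.append("latency-p95")

print("page on-call for:", alerts or "nothing")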

Who this is for

  • Platform Engineers and SREs participating in on-call.
  • Backend Engineers who own services in production.
  • Engineering Managers who coordinate incident response.

Prerequisites

  • Basic understanding of your stack (HTTP, services, databases, queues).
  • Familiarity with monitoring/alerting and dashboards.
  • Ability to deploy/rollback (CI/CD and feature flags).

Next steps

  • Expand runbooks to include disaster recovery and regional failover.
  • Introduce canary releases and progressive delivery to reduce blast radius.
  • Adopt blameless post-incident reviews with clear action owners and deadlines.

Mini challenge

In 15 minutes, draft a severity matrix and a first 5 minutes checklist for one service you own. Keep it to one page. Share it with a teammate to validate.

Quick Test

Take the quick test below to check your understanding.

Practice Exercises

2 exercises to complete

Instructions

Pick one service you know. Create a single-page runbook that includes:

  • Severity matrix with example impacts and update cadences.
  • First 5 minutes checklist.
  • Rollback or disable steps for the most common risky change.
  • Escalation contacts and backups.
  • Status update template with next-update time.

Keep it executable by any on-call, not just experts.

Expected Output
A concise, actionable runbook page covering severity, the first five minutes, rollback, escalation, and a comms template.

Handling Incidents And Outages — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.
