Incident Response Basics

Learn Incident Response Basics with explanations, exercises, and a quick test, written for Backend Engineers.

Published: January 20, 2026 | Updated: January 20, 2026

Why this matters

Incidents are unplanned interruptions to a service or reductions in its quality. As a Backend Engineer, you will be paged, triage problems, restore service, and explain impact. Good incident response cuts downtime, protects customers, and reduces stress.

  • Real tasks you will do: acknowledge alerts, declare incidents, assign roles, stabilize systems, communicate status, and drive a post-incident review.
  • Typical outcomes: faster time to detect (TTD), faster time to mitigate (TTM), clear timelines, and actionable follow-ups.
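
To make TTD and TTM concrete, here is a minimal Python sketch that computes both from incident timestamps; the times are hypothetical, and teams differ on whether these clocks start at impact or at detection:

from datetime import datetime

# Minimal sketch of computing TTD and TTM for one incident.
# All timestamps are hypothetical and in UTC.
impact_start = datetime(2026, 1, 20, 10, 2)   # first user-visible impact
alert_fired = datetime(2026, 1, 20, 10, 4)    # monitoring detected the problem
mitigated = datetime(2026, 1, 20, 10, 12)     # rollback restored service

ttd = alert_fired - impact_start   # time to detect
ttm = mitigated - impact_start     # time to mitigate

print(f"TTD: {ttd.total_seconds() / 60:.0f} min")  # TTD: 2 min
print(f"TTM: {ttm.total_seconds() / 60:.0f} min")  # TTM: 10 min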

Concept explained simply

Incident response is a short, focused loop: detect a problem, gather facts, stop the bleeding, restore service, then learn from it.

Mental model

Think of it as a small emergency room for software: one person leads (Incident Commander), specialists treat (Ops/SMEs), one keeps stakeholders informed (Communications), and one writes everything down (Scribe). You stabilize first, diagnose second, and improve after.

Role cheat sheet

  • Incident Commander (IC): owns decisions and flow; keeps people on task and time-boxed.
  • Operations Lead/SME: does hands-on technical work to mitigate and remediate.
  • Communications: posts clear, regular updates with impact and next update time.
  • Scribe: records timeline, actions, and decisions for later review.
  • Stakeholders: receive updates; avoid joining technical channels unless requested.

Severity scale (example)

  • SEV1: Critical outage, major impact to most users; immediate all-hands.
  • SEV2: Degraded performance or partial outage; elevated attention.
  • SEV3: Minor impact, workaround exists; business hours response.

Pick severity by customer impact and urgency, not by hunch.
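
To make that concrete, here is a minimal Python sketch of a severity picker; the thresholds are hypothetical and should map to your own SLOs:

# Minimal sketch: choose a severity from measured customer impact.
# The thresholds are hypothetical; tune them to your own SLOs and user counts.
def pick_severity(pct_users_affected: float, workaround_exists: bool) -> str:
    if pct_users_affected >= 50:
        return "SEV1"  # critical outage, most users affected
    if pct_users_affected >= 5 or not workaround_exists:
        return "SEV2"  # partial outage or degraded performance
    return "SEV3"      # minor impact with a workaround

print(pick_severity(20, workaround_exists=True))  # SEV2
print(pick_severity(1, workaround_exists=True))   # SEV3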

Incident lifecycle in 5 steps

  1. Detect and declare: acknowledge alert, confirm impact, set SEV, assign IC.
  2. Triage: gather key signals (metrics, logs, recent changes), form hypotheses.
  3. Contain: stop the bleeding (rollback, rate limit, failover, feature flag).
  4. Remediate: fix the cause, validate, and monitor.
  5. Recover and learn: communicate resolved, capture timeline, write follow-ups.

Time-boxing guide

  • Declare or downgrade within 5–10 minutes of detection.
  • Containment attempts should be time-boxed (e.g., 10–15 minutes each).
  • Update cadence: every 15–30 minutes until stable; then every 60 minutes.
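
A minimal Python sketch of the update cadence above; the helper and timestamps are hypothetical:

from datetime import datetime, timedelta, timezone

# Minimal sketch: when is the next status update due?
# Cadence follows the guide above; adjust to your team's policy.
def next_update(now: datetime, stable: bool) -> datetime:
    cadence = timedelta(minutes=60 if stable else 30)
    return now + cadence

now = datetime(2026, 1, 20, 10, 15, tzinfo=timezone.utc)
print(next_update(now, stable=False).strftime("%H:%M UTC"))  # 10:45 UTC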

Worked examples

Example 1: Latency spike after deployment

Situation: API p95 latency doubled right after a deploy.

  • Declare: IC sets SEV2 based on user impact and announces roles.
  • Triage: Check deploy timeline, error rate, DB CPU, queue depth.
  • Contain: Rollback the last deploy; apply read cache TTL reduction.
  • Remediate: Identify the N+1 query introduced by the deploy; fix the code and add a test (sketch below).
  • Recover: Latency returns to baseline; mark resolved; schedule review.
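
Here is a minimal, hypothetical sketch of the N+1 pattern behind Example 1 and the batched fix; the db connection, table, and column names are invented for illustration:

# Minimal sketch of the N+1 pattern from Example 1 and a batched fix.
# `db` is a hypothetical DB-API style connection; table/column names are invented.

def load_orders_n_plus_one(db, user_ids):
    # N+1: one query per user id; this is the pattern that doubled p95 latency.
    orders = []
    for uid in user_ids:
        cur = db.execute("SELECT * FROM orders WHERE user_id = ?", (uid,))
        orders.extend(cur.fetchall())
    return orders

def load_orders_batched(db, user_ids):
    # Fix: one query with an IN clause instead of N separate round trips.
    placeholders = ", ".join("?" for _ in user_ids)
    query = f"SELECT * FROM orders WHERE user_id IN ({placeholders})"
    return db.execute(query, tuple(user_ids)).fetchall()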

Example 2: Partial outage in one region

Situation: 20% of traffic failing in region A.

  • Declare: SEV2 due to regional impact; IC, SME, Comms assigned.
  • Triage: Health checks failing only in region A; upstream dependency degraded.
  • Contain: Shift traffic to region B using weighted routing (sketch below); rate limit heavy endpoints.
  • Remediate: Work with dependency team; apply config fix; verify.
  • Recover: Gradually restore traffic; monitor error budgets; close incident.
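
The traffic shift in Example 2 can be pictured with a minimal Python sketch of weighted routing; in practice the shift usually happens at DNS or the load balancer, and the weights here are hypothetical:

import random

# Minimal sketch of weighted routing during the incident.
# Weights are hypothetical: drain region A to 10% while it recovers.
WEIGHTS = {"region-a": 10, "region-b": 90}

def pick_region() -> str:
    regions, weights = zip(*WEIGHTS.items())
    return random.choices(regions, weights=weights, k=1)[0]

# Roughly 90% of requests now land in region B.
sample = [pick_region() for _ in range(10_000)]
print(round(sample.count("region-b") / len(sample), 2))  # ~0.9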

Example 3: Database saturation at peak

Situation: DB CPU pegged; API timeouts.

  • Declare: SEV1 due to major customer impact; page on-call DBAs.
  • Triage: Identify slow queries and bursty jobs started at the hour.
  • Contain: Kill non-critical batch jobs; enable query-level rate limits (sketch below).
  • Remediate: Add index; tune query; schedule capacity increase.
  • Recover: Error rates normalize; post-resolution validation; capture action items.
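
The query-level rate limit from Example 3 can be sketched as a token bucket in Python; the capacity, refill rate, and the query class being throttled are hypothetical:

import time

# Minimal sketch of a query-level rate limit (token bucket per query class).
# Capacity and refill rate are hypothetical; tune them to what the DB can absorb.
class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Throttle the expensive reporting query class to roughly 5 executions per second.
report_limiter = TokenBucket(capacity=5, refill_per_sec=5)
if report_limiter.allow():
    pass  # run the query
else:
    pass  # reject or queue it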

Tools and templates

Incident declaration template

Title: [SEV#] Short description
Start time (UTC):
Impact: Who is affected? How?
IC / Roles: IC, Ops/SME, Comms, Scribe
Current status: (Investigating / Mitigating / Monitoring)
Next update: HH:MM UTC

Status update template

Status: Investigating | Mitigating | Monitoring | Resolved
Impact: (scope + user-visible symptoms)
What changed since last update: (facts only)
Next steps: (containment/remediation)
Next update: HH:MM UTC

Timeline notes template

HH:MM: Alert fired (source)
HH:MM: IC assigned; SEV set to X
HH:MM: Hypothesis A tested; result
HH:MM: Containment action; result
HH:MM: Resolution verified; metrics stable
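
If your team posts updates from a script or chat bot, the status update template above can be rendered from structured fields. A minimal Python sketch with hypothetical values:

# Minimal sketch: render the status update template from structured fields.
# All field values below are hypothetical examples.
STATUS_TEMPLATE = """\
Status: {status}
Impact: {impact}
What changed since last update: {changes}
Next steps: {next_steps}
Next update: {next_update} UTC"""

update = STATUS_TEMPLATE.format(
    status="Mitigating",
    impact="Checkout latency elevated for ~20% of users in region A",
    changes="Traffic shifted to region B; error rate falling",
    next_steps="Monitor recovery, then restore region A gradually",
    next_update="10:45",
)
print(update)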

Exercises you can do now

  1. Exercise 1 — Build a 5-line timeline

    Scenario: At 10:02 UTC, p95 latency spiked 2x. At 10:05 UTC, deploy 4271 went live. At 10:08 UTC, error rate increased from 0.2% to 3%. At 10:12 UTC, rollback completed. At 10:16 UTC, metrics normalized.

    Task: Write a concise 5-line incident timeline using the provided template.

  2. Exercise 2 — Draft a one-page runbook

    Choose a service you know. Outline Triggers, Quick checks, Safe containment, Rollback/Disable steps, and Verification.

Common mistakes and how to self-check

  • Thrashing without an IC: If two people are leading, no one is leading. Self-check: Name the IC and their next checkpoint time.
  • Silent debugging: Stakeholders are in the dark. Self-check: Do you have a posted next update time?
  • Skipping containment: Hunting root cause before stabilizing. Self-check: Can you pause, rollback, or rate limit now?
  • Vague severity: Over/under-reacting. Self-check: Map SEV to actual customer impact.
  • Poor notes: Hard to learn later. Self-check: Do you have timestamps and outcomes for key actions?

Practical projects

  1. Runbook Zero-to-One: Create a runbook for a service with triggers, commands, and rollback steps. Ask a teammate to follow it cold.
  2. Simulated incident: With a peer, role-play a 20-minute SEV2. One is IC; one is SME. Produce a timeline and a status update.
  3. Alert hygiene: Review one noisy alert and tune thresholds, adding a clear action item to its description.

Mini challenge

In 5 minutes, write a status update for a hypothetical cache cluster outage that degrades checkout by 30%. Include impact, what you are doing now, and the next update time.

Who this is for

Backend and platform engineers who participate in on-call, SREs improving reliability, and tech leads coordinating production response.

Prerequisites

  • Basic familiarity with your service architecture, logs, metrics, and deploy process.
  • Ability to read dashboards and compare metrics to baseline.

Learning path

  • Before this: Monitoring fundamentals, alert design, and deploy/rollback basics.
  • Now: Incident response basics (this lesson): roles, lifecycle, communication, and timelines.
  • Next: Root cause analysis, post-incident reviews, and reliability engineering (error budgets, SLOs).

Next steps

  • Customize the templates above for your team.
  • Schedule a 30-minute tabletop exercise to practice.
  • Draft one improvement task from your last incident and get it prioritized.

Check your understanding

Practice Exercises

Instructions

Scenario: At 10:02 UTC, p95 latency spiked 2x. At 10:05 UTC, deploy 4271 went live. At 10:08 UTC, error rate increased from 0.2% to 3%. At 10:12 UTC, rollback completed. At 10:16 UTC, metrics normalized.

Task: Write a concise 5-line timeline using the template below.

HH:MM: Event
HH:MM: Event
HH:MM: Event
HH:MM: Event
HH:MM: Event

Expected Output
A 5-line, time-ordered timeline capturing spike, deploy, error increase, rollback, and normalization.

Incident Response Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
