

Postmortems Basics

Learn Postmortems Basics for free with explanations, exercises, and a quick test (for Backend Engineers).

Published: January 20, 2026 | Updated: January 20, 2026

Why this matters

Example 1 — Expired TLS certificate

Root cause: Certificate not renewed before expiry.

Contributing factors: No expiry alert; manual renewal; calendar invite expired; lack of peer review for rotation schedule.

What helped: Good runbook for CDN cache flush.

Actions: Automate ACME renewal (Owner: Infra, Due: 14 days); add 30/14/7-day expiry alerts (Owner: SRE, Due: 7 days); add peer-review checklist for cert assets (Owner: Platform, Due: 21 days).
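To make the expiry-alert action concrete, here is a minimal sketch of a daily check that pages when the served certificate is within 30/14/7 days of expiring. It uses only the Python standard library; the hostname and the send_alert hook are placeholders for your own endpoints and paging integration, not part of the incident above.

import datetime
import socket
import ssl

ALERT_THRESHOLDS_DAYS = (30, 14, 7)  # mirrors the 30/14/7-day alert action

def days_until_expiry(hostname: str, port: int = 443) -> int:
    """Connect over TLS and return whole days until the served certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires_at = datetime.datetime.utcfromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"])
    )
    return (expires_at - datetime.datetime.utcnow()).days

def send_alert(message: str) -> None:
    # Placeholder: wire this up to your paging or chat integration.
    print(f"[ALERT] {message}")

def check_and_alert(hostname: str) -> None:
    remaining = days_until_expiry(hostname)
    # Assuming a daily run: page at the 30/14/7-day marks, then every day inside 7 days.
    if remaining in ALERT_THRESHOLDS_DAYS or remaining <= min(ALERT_THRESHOLDS_DAYS):
        send_alert(f"TLS certificate for {hostname} expires in {remaining} days")

if __name__ == "__main__":
    check_and_alert("example.com")  # run from cron/CI on a daily schedule

Automating renewal via ACME remains the primary (preventative) fix; the alert is a detective control in case the automation silently stops running.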

Example 2 — Performance regression after feature flag rollout

Summary: P95 latency increased from 250ms to 1.8s for 42 minutes after enabling a feature flag that added an unindexed query.

Impact: 12% session abandonment increase; 1.1M slow requests; no data loss.

Timeline: 14:00 flag enabled (10%); 14:05 latency alert; 14:09 rollback; 14:12 metrics normal.

Root cause: Query used unselective filter without index.

Contributing factors: Load test used smaller dataset; canary monitoring did not include latency SLO; missing pre-merge query plan check.

Actions: Add composite index (Owner: DB, Due: 3 days); add automated EXPLAIN check to CI (Owner: Backend, Due: 14 days); include latency SLO in canary gate (Owner: SRE, Due: 10 days).
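As a sketch of the CI gate, assuming PostgreSQL and the psycopg2 driver, a job like the one below runs EXPLAIN on a registry of hot-path queries against a production-sized CI database and fails the build on sequential scans or an oversized planner cost. The query list, connection string, and cost budget are illustrative, not taken from the incident.

import json
import psycopg2  # assumes PostgreSQL; swap in your own driver and database

# Queries the gate protects; in practice these might be collected from the ORM
# or a registry of hot-path statements.
CRITICAL_QUERIES = [
    ("SELECT * FROM orders WHERE status = %s", ("pending",)),
]

MAX_ESTIMATED_COST = 10_000  # illustrative planner-cost budget

def explain_plan(conn, query, params):
    """Return the root plan node from EXPLAIN (FORMAT JSON)."""
    with conn.cursor() as cur:
        cur.execute("EXPLAIN (FORMAT JSON) " + query, params)
        doc = cur.fetchone()[0]
        if isinstance(doc, str):  # some drivers return the JSON document as text
            doc = json.loads(doc)
        return doc[0]["Plan"]

def has_seq_scan(plan):
    """Recursively look for a sequential scan anywhere in the plan tree."""
    if plan.get("Node Type") == "Seq Scan":
        return True
    return any(has_seq_scan(child) for child in plan.get("Plans", []))

def main():
    conn = psycopg2.connect("dbname=app_ci")  # CI database with realistic row counts
    failures = []
    for query, params in CRITICAL_QUERIES:
        plan = explain_plan(conn, query, params)
        if has_seq_scan(plan) or plan["Total Cost"] > MAX_ESTIMATED_COST:
            failures.append(query)
    if failures:
        raise SystemExit("EXPLAIN gate failed for:\n" + "\n".join(failures))

if __name__ == "__main__":
    main()

Note that this gate only catches the regression if the CI database has production-like row counts; the smaller load-test dataset was one of the contributing factors above.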

Example 3 — Near-miss: data inconsistency avoided by idempotency

Summary: A spike in duplicate webhook deliveries caused temporary double-processing of payments; idempotency keys prevented most duplicates. 0.03% of orders were double-charged and automatically reversed within 6 minutes.

Impact: 1,200 affected orders; 36 manual support tickets.

Timeline: 20:11 webhook provider retry storm; 20:13 idempotency dashboard alert; 20:16 increased cache TTL; 20:17 rate-limit tuned; 20:19 normal.

Root cause: Retry policy change at provider increased burstiness.

Contributing factors: Our rate-limit window too long; idempotency cache TTL too short for burst.

Actions: Negotiate provider retry backoff (Owner: Partner Eng, Due: 7 days); adaptive rate-limiting (Owner: Backend, Due: 21 days); extend idempotency retention (Owner: Platform, Due: 5 days).
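For context on why this stayed a near-miss, here is a minimal sketch of idempotency-key handling for webhook deliveries, assuming Redis as the key store. The key prefix and TTL are illustrative; the "extend idempotency retention" action amounts to raising the TTL so it outlasts the provider's retry bursts.

import redis  # assumes a Redis instance; any store with atomic set-if-absent works

IDEMPOTENCY_TTL_SECONDS = 48 * 3600  # retention must outlast the provider's retry window

store = redis.Redis()

def handle_webhook(event_id: str, payload: dict) -> str:
    # SET with nx=True and an expiry is atomic: only the first delivery of this
    # event_id wins; later duplicates see the key and are acknowledged unchanged.
    first_delivery = store.set(
        f"webhook:seen:{event_id}", "1", nx=True, ex=IDEMPOTENCY_TTL_SECONDS
    )
    if not first_delivery:
        return "duplicate-ignored"  # e.g. during a retry storm
    process_payment(payload)  # placeholder for the real payment logic
    return "processed"

def process_payment(payload: dict) -> None:
    ...  # charge, record, emit downstream events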

Templates and checklists

Copy-ready postmortem template
Title:
Date:
Severity:
Owner:

Summary:
- What happened in one paragraph (system, symptom, duration, resolution)

Impact:
- Users/services affected, metrics (errors/latency/data), business impact

Timeline (UTC):
- HH:MM Event

Root cause:
- Technical cause
- Contributing factors (process/tooling/organizational)

Detection & response:
- How detected, when paged, mitigations tried, what helped/blocked

What went well / What didn’t:
- Bullets

Action items:
- [Owner] [Due date] [Preventative/Detective/Mitigative] — Description

Follow-up:
- Where actions are tracked, review date, comms plan
  
Postmortem review checklist

  • Summary is blameless and understandable by non-engineers.
  • Impact includes measurable metrics and time bounds.
  • Timeline is chronological and evidence-based.
  • Root cause addresses system/process, not a person.
  • Actions are specific, owned, and dated.
  • Follow-up date set to verify action completion.

Exercises

Exercise 1 — Rewrite to blameless + add impact + action

Original statement: “Alice forgot to renew the certificate, so the site went down.”

  1. Rewrite the summary to be blameless and system-focused.
  2. Add a measurable impact statement (duration + metric).
  3. Propose one preventative action with owner and due date.

Expected output: a blameless summary, a quantifiable impact (time + metric), and one SMART preventative action with an owner and due date.

Solution

Blameless summary: “The edge TLS certificate expired, causing TLS handshakes to fail for incoming requests until a new certificate was installed and caches cleared.”

Impact: “27 minutes; ~240k failed requests; checkout success down from 99.7% to 81.2%.”

Action: “Automate certificate renewal via ACME and add 30/14/7-day expiry alerts. Owner: Infra; Due: 14 days.”

  • No names or blame in the summary.
  • Impact includes time and at least one metric.
  • Action is specific with owner and due date.

Exercise 2 — Build a timeline

Events (unordered, UTC):

  • 09:24 New certificate installed; CDN cache flushed
  • 09:02 Error rate alert fired
  • 09:12 Expired certificate identified on load balancer
  • 09:06 On-call acknowledged alert
  • 09:29 Metrics returned to baseline

  1. Order these into a proper timeline.
  2. Mark detection-to-mitigation and mitigation-to-recovery durations.

Solution

Timeline:

  • 09:02 Error rate alert fired
  • 09:06 On-call acknowledged alert
  • 09:12 Expired certificate identified on load balancer
  • 09:24 New certificate installed; CDN cache flushed (mitigation)
  • 09:29 Metrics returned to baseline (recovery)

Durations: Detection-to-mitigation: 22 minutes (09:02–09:24). Mitigation-to-recovery: 5 minutes (09:24–09:29).
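If you want to sanity-check the arithmetic on longer timelines, a tiny sketch like this works for same-day HH:MM timestamps in UTC:

from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

print(minutes_between("09:02", "09:24"))  # detection-to-mitigation: 22
print(minutes_between("09:24", "09:29"))  # mitigation-to-recovery: 5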

  • All events are chronological.
  • Mitigation and recovery clearly labeled.
  • Durations are computed correctly.

Common mistakes and self-check

  • Blaming individuals. Self-check: Does your summary mention names or focus on behavior over systems?
  • Missing measurable impact. Self-check: Can you state duration and a concrete metric (errors, latency, data)?
  • Non-chronological timeline. Self-check: Do timestamps strictly increase?
  • Vague actions. Self-check: Does each action have an owner and due date, and change how work is done?
  • Stopping at the first “why.” Self-check: Did you ask “why” at least 3–5 times to reach process/tooling gaps?
  • No follow-up. Self-check: Is there a review date to verify action completion?

Mini challenge

In one paragraph, write a blameless summary for this scenario: after a config change, the cache TTL for product pages was set to 0, causing origin overload and 503s for 15 minutes until the TTL was reverted. Include a measurable impact and one action item.

Hints
  • Describe symptom, cause, duration, and recovery.
  • Impact could be requests failed, latency, or conversion drop.
  • Action should prevent or detect TTL misconfig (validation, canary, alert); see the sketch after these hints.
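One possible shape for the action item, sketched as a pre-deploy validation step that rejects a zero or implausible cache TTL before the config reaches production. The config key and bounds are illustrative, not taken from the scenario.

MIN_TTL_SECONDS = 60       # illustrative lower bound for product-page caching
MAX_TTL_SECONDS = 86_400   # upper bound to catch fat-fingered values

def validate_cache_config(config: dict) -> list:
    """Return a list of validation errors; deploy only if the list is empty."""
    errors = []
    ttl = config.get("product_page_cache_ttl_seconds")
    if ttl is None:
        errors.append("product_page_cache_ttl_seconds is missing")
    elif not MIN_TTL_SECONDS <= ttl <= MAX_TTL_SECONDS:
        errors.append(
            f"product_page_cache_ttl_seconds={ttl} is outside "
            f"[{MIN_TTL_SECONDS}, {MAX_TTL_SECONDS}]"
        )
    return errors

# The misconfiguration from the challenge (TTL set to 0) is caught before rollout.
assert validate_cache_config({"product_page_cache_ttl_seconds": 0})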

Who this is for

  • Backend, SRE, Platform engineers who participate in incident response and reliability work.
  • Team leads who facilitate reviews and drive follow-up.

Prerequisites

  • Basic understanding of your service architecture and monitoring tools.
  • Ability to read logs, dashboards, and deployment histories.

Learning path

  1. Learn incident severity and SLO basics.
  2. Practice timelines and evidence gathering.
  3. Apply 5 Whys and write action items.
  4. Facilitate a review and publish learnings.

Practical projects

  • Retro your last on-call page: write a short postmortem even if impact was low.
  • Create a lightweight template and checklist for your team.
  • Add one new detector (alert) inspired by a past incident.

Next steps

  • Do the exercises above, then take the Quick Test to check your understanding.
  • Share a postmortem draft with a peer for review.

Note: The Quick Test is available to everyone. Only logged-in users have their progress saved.


Postmortems Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

