Why this matters
Root cause: Certificate not renewed before expiry.
Contributing factors: No expiry alert; manual renewal process; the calendar reminder for renewal had lapsed; no peer review of the rotation schedule.
What helped: Good runbook for CDN cache flush.
Actions: Automate ACME renewal (Owner: Infra, Due: 14 days); add 30/14/7-day expiry alerts (Owner: SRE, Due: 7 days); add peer-review checklist for cert assets (Owner: Platform, Due: 21 days).
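The detective half of these actions can start small. Below is a minimal sketch of an expiry check, assuming it runs on a schedule (for example, cron) and that its output feeds your alerting tool; the hostname list and thresholds are placeholders, not values from the incident.

```python
import socket
import ssl
from datetime import datetime, timezone

ALERT_THRESHOLDS_DAYS = (30, 14, 7)   # mirrors the 30/14/7-day alert action
HOSTS = ["example.com"]               # hypothetical: replace with your edge hosts


def days_until_expiry(host: str, port: int = 443) -> int:
    """Return whole days until the host's TLS certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires_at - datetime.now(timezone.utc)).days


if __name__ == "__main__":
    for host in HOSTS:
        remaining = days_until_expiry(host)
        crossed = [t for t in ALERT_THRESHOLDS_DAYS if remaining <= t]
        if crossed:
            print(f"ALERT: {host} certificate expires in {remaining} days")
        else:
            print(f"OK: {host} certificate expires in {remaining} days")
```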
Example 2 — Performance regression after feature flag rollout
Summary: P95 latency increased from 250ms to 1.8s for 42 minutes after enabling a feature flag that added an unindexed query.
Impact: 12% session abandonment increase; 1.1M slow requests; no data loss.
Timeline: 14:00 flag enabled (10%); 14:05 latency alert; 14:09 rollback; 14:12 metrics normal.
Root cause: The query filtered on an unselective column with no supporting index.
Contributing factors: Load test used smaller dataset; canary monitoring did not include latency SLO; missing pre-merge query plan check.
Actions: Add composite index (Owner: DB, Due: 3 days); add automated EXPLAIN check to CI (Owner: Backend, Due: 14 days); include latency SLO in canary gate (Owner: SRE, Due: 10 days).
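A sketch of what the automated EXPLAIN check could look like, assuming PostgreSQL and psycopg2; the connection string, table, and query list are placeholders, and a real CI gate would run against a staging database sized like production.

```python
import json
import sys

import psycopg2  # assumes a PostgreSQL database reachable from CI

DSN = "postgresql://ci:ci@localhost:5432/app"              # hypothetical DSN
QUERIES_UNDER_REVIEW = [
    "SELECT * FROM orders WHERE promo_code = 'SPRING25'",  # illustrative query
]


def has_seq_scan(plan: dict) -> bool:
    """Recursively look for sequential scans in a JSON query plan."""
    if plan.get("Node Type") == "Seq Scan":
        return True
    return any(has_seq_scan(child) for child in plan.get("Plans", []))


def main() -> int:
    failed = False
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        for query in QUERIES_UNDER_REVIEW:
            cur.execute("EXPLAIN (FORMAT JSON) " + query)
            raw = cur.fetchone()[0]
            doc = raw if isinstance(raw, list) else json.loads(raw)
            if has_seq_scan(doc[0]["Plan"]):
                print(f"FAIL: sequential scan in plan for: {query}")
                failed = True
    return 1 if failed else 0


if __name__ == "__main__":
    sys.exit(main())
```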
Example 3 — Near-miss: data inconsistency avoided by idempotency
Summary: A spike in duplicate webhook deliveries caused temporary double-processing of payments; idempotency keys prevented most duplicates. 0.03% of orders were double-charged, and the charges were automatically reversed within 6 minutes.
Impact: 1,200 affected orders; 36 manual support tickets.
Timeline: 20:11 webhook provider retry storm began; 20:13 idempotency dashboard alert; 20:16 idempotency cache TTL increased; 20:17 rate limit tuned; 20:19 metrics back to normal.
Root cause: Retry policy change at provider increased burstiness.
Contributing factors: Our rate-limit window too long; idempotency cache TTL too short for burst.
Actions: Negotiate provider retry backoff (Owner: Partner Eng, Due: 7 days); adaptive rate-limiting (Owner: Backend, Due: 21 days); extend idempotency retention (Owner: Platform, Due: 5 days).
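A minimal sketch of the idempotency pattern that limited the blast radius here, assuming Redis via redis-py; the key prefix, TTL, and process_payment stub are illustrative, not the production code. The key point is that the retention window must outlast the provider's worst-case retry window.

```python
import redis  # assumes a reachable Redis instance for the idempotency cache

# Retention must exceed the provider's retry window (assumed 24h here).
IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60

r = redis.Redis(host="localhost", port=6379)


def process_payment(payload: dict) -> None:
    """Placeholder for the real payment side effect."""
    print("charging order", payload.get("order_id"))


def handle_webhook(event_id: str, payload: dict) -> str:
    """Process each event_id at most once within the retention window."""
    # SET with nx=True atomically claims the key; duplicate deliveries get None back.
    claimed = r.set(f"idem:{event_id}", "1", nx=True, ex=IDEMPOTENCY_TTL_SECONDS)
    if not claimed:
        return "duplicate-ignored"
    process_payment(payload)
    return "processed"
```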
Templates and checklists
Copy-ready postmortem template
Title:
Date:
Severity:
Owner:
Summary:
- What happened in one paragraph (system, symptom, duration, resolution)
Impact:
- Users/services affected, metrics (errors/latency/data), business impact
Timeline (UTC):
- HH:MM Event
Root cause:
- Technical cause
- Contributing factors (process/tooling/organizational)
Detection & response:
- How detected, when paged, mitigations tried, what helped/blocked
What went well / What didn’t:
- Bullets
Action items:
- [Owner] [Due date] [Preventative/Detective/Mitigative] — Description
Follow-up:
- Where actions are tracked, review date, comms plan
Review checklist
- Summary is blameless and understandable by non-engineers.
- Impact includes measurable metrics and time bounds.
- Timeline is chronological and evidence-based.
- Root cause addresses system/process, not a person.
- Actions are specific, owned, and dated.
- Follow-up date set to verify action completion.
Exercises
Exercise 1 — Rewrite to blameless + add impact + action
Original statement: “Alice forgot to renew the certificate, so the site went down.”
- Rewrite the summary to be blameless and system-focused.
- Add a measurable impact statement (duration + metric).
- Propose one preventative action with owner and due date.
Solution
Blameless summary: “The edge TLS certificate expired, causing TLS handshakes to fail for incoming requests until a new certificate was installed and caches cleared.”
Impact: “27 minutes; ~240k failed requests; checkout success down from 99.7% to 81.2%.”
Action: “Automate certificate renewal via ACME and add 30/14/7-day expiry alerts. Owner: Infra; Due: 14 days.”
- No names or blame in the summary.
- Impact includes time and at least one metric.
- Action is specific with owner and due date.
Exercise 2 — Build a timeline
Events (unordered, UTC):
- 09:24 New certificate installed; CDN cache flushed
- 09:02 Error rate alert fired
- 09:12 Expired certificate identified on load balancer
- 09:06 On-call acknowledged alert
- 09:29 Metrics returned to baseline
- Order these into a proper timeline.
- Mark detection-to-mitigation and mitigation-to-recovery durations.
Solution
Timeline:
- 09:02 Error rate alert fired
- 09:06 On-call acknowledged alert
- 09:12 Expired certificate identified on load balancer
- 09:24 New certificate installed; CDN cache flushed (mitigation)
- 09:29 Metrics returned to baseline (recovery)
Durations: Detection-to-mitigation: 22 minutes (09:02–09:24). Mitigation-to-recovery: 5 minutes (09:24–09:29). A short script for computing these durations is sketched below.
- All events are chronological.
- Mitigation and recovery clearly labeled.
- Durations are computed correctly.
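If you track these numbers across many incidents, a few lines of code keep the arithmetic honest. A minimal sketch using the timestamps from this exercise (all same-day UTC):

```python
from datetime import datetime

FMT = "%H:%M"
detected = datetime.strptime("09:02", FMT)   # error rate alert fired
mitigated = datetime.strptime("09:24", FMT)  # new certificate installed
recovered = datetime.strptime("09:29", FMT)  # metrics back to baseline

print("Detection-to-mitigation:", mitigated - detected)   # 0:22:00
print("Mitigation-to-recovery:", recovered - mitigated)   # 0:05:00
```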
Common mistakes and self-check
- Blaming individuals. Self-check: Does your summary mention names or focus on individual behavior rather than on systems?
- Missing measurable impact. Self-check: Can you state duration and a concrete metric (errors, latency, data)?
- Non-chronological timeline. Self-check: Do timestamps strictly increase?
- Vague actions. Self-check: Does each action have an owner and due date, and change how work is done?
- Stopping at the first “why.” Self-check: Did you ask “why” at least 3–5 times to reach process/tooling gaps?
- No follow-up. Self-check: Is there a review date to verify action completion?
Mini challenge
In one paragraph, write a blameless summary for this scenario: a config change set the cache TTL for product pages to 0, overloading the origin and causing 503s for 15 minutes until the TTL was reverted. Include a measurable impact and one action item.
Hints
- Describe symptom, cause, duration, and recovery.
- Impact could be requests failed, latency, or conversion drop.
- Action should prevent or detect TTL misconfig (validation, canary, alert); a minimal validation sketch follows below.
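One preventative option for the action item is a pre-deploy validation of the cache config. A minimal sketch, assuming a config dictionary with a field named product_page_ttl_seconds and a floor value, both hypothetical:

```python
# Hypothetical floor agreed with the owning team; a TTL of 0 disables caching.
MIN_PRODUCT_PAGE_TTL_SECONDS = 60


def validate_cache_config(config: dict) -> list[str]:
    """Return validation errors for a proposed cache config change."""
    errors = []
    ttl = config.get("product_page_ttl_seconds")
    if ttl is None:
        errors.append("product_page_ttl_seconds is missing")
    elif ttl < MIN_PRODUCT_PAGE_TTL_SECONDS:
        errors.append(
            f"product_page_ttl_seconds={ttl} is below the floor of "
            f"{MIN_PRODUCT_PAGE_TTL_SECONDS}"
        )
    return errors


if __name__ == "__main__":
    # The misconfiguration from the scenario: TTL set to 0.
    print(validate_cache_config({"product_page_ttl_seconds": 0}))
```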
Who this is for
- Backend, SRE, Platform engineers who participate in incident response and reliability work.
- Team leads who facilitate reviews and drive follow-up.
Prerequisites
- Basic understanding of your service architecture and monitoring tools.
- Ability to read logs, dashboards, and deployment histories.
Learning path
- Learn incident severity and SLO basics.
- Practice timelines and evidence gathering.
- Apply 5 Whys and write action items.
- Facilitate a review and publish learnings.
Practical projects
- Retro your last on-call page: write a short postmortem even if impact was low.
- Create a lightweight template and checklist for your team.
- Add one new detector (alert) inspired by a past incident.
Next steps
- Do the exercises above, then take the Quick Test to check your understanding.
- Share a postmortem draft with a peer for review.
Note: The Quick Test is available to everyone. Only logged-in users have their progress saved.