Who this is for
- Backend engineers who participate in on-call or own services in production.
- Platform/SRE engineers who standardize incident response.
- Team leads who want consistent, low-stress operations.
Prerequisites
- Basic familiarity with your service architecture and deployment process.
- Ability to read alerts and logs; comfort with CLI operations (e.g., kubectl, systemctl, or cloud CLI).
- Awareness of SLOs/SLIs and how your team defines “healthy.”
Why this matters
Operational runbooks turn stressful, ambiguous incidents into repeatable, safe routines. They reduce MTTR, prevent guesswork, and help new responders act confidently.
- Real tasks: handle high latency alerts, clear consumer lag, rotate TLS certificates, recover from a bad deploy, or fail over a database.
- Outcomes: faster recovery, fewer mistakes, clear hand-offs and escalation, better post-incident learning.
Concept explained simply
A runbook is a short, step-by-step guide that tells an on-call responder exactly what to do when a specific operational situation happens. It is concrete, timeboxed, and safe by default.
Mental model
Think of a runbook as a flight checklist: it doesn’t teach aerodynamics; it gives the exact actions to take, in order, with guardrails. It answers three questions: What triggered? What must I check? What can I safely do now?
Runbook anatomy (use this template)
Copy-ready template
Title: [Short, specific] (e.g., API p95 latency high)
Purpose: Why this runbook exists (what it resolves, business impact).
Scope: What this covers and what it does not.
Triggers: Exact alert names, dashboards/metrics to confirm.
Prerequisites: Access, tools, feature flags, backups.
Safety/Warnings: Irreversible actions, known risks, timebox (e.g., 15 min).
Quick Decision Tree:
- If X, go to Step A.
- If Y, go to Step B.
Diagnostics (read-only checks):
1) ... (commands/queries with expected outputs)
2) ...
Remediation Steps (one safe action at a time):
A) Action name (why, how)
- Command(s):
- Expected result:
- Rollback:
Verification: Success criteria (numbers + duration), what to monitor.
Rollback/Undo: Exact steps to revert changes.
Escalation: Who/where, and when (after X minutes or if Y condition).
Notes/Learnings: Common patterns, links to past incident summaries.
Last Reviewed: [date] Owner: [team/person]
Worked examples
Example 1 — HTTP latency p95 high
Purpose: Restore API responsiveness when p95 > 800 ms for 10+ minutes.
Triggers: Alert "api_latency_p95_above_800ms".
Safety: Timebox 15 minutes before escalation. Avoid scaling database writes without confirming capacity.
Diagnostics:
- Check traffic spikes and errors:
  kubectl top pods -n api
  kubectl logs deploy/api -n api --tail=200 | grep -i timeout
- Check DB saturation (read replicas, connections):
  SELECT state, count(*) FROM pg_stat_activity GROUP BY 1;
- Check external dependencies (cache hit rate):
  GET /metrics   # look for cache_hit_ratio
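If the service exposes a Prometheus-style /metrics endpoint over HTTP, one way to read the hit ratio (the host, port, and metric name here are assumptions; substitute your own):
  curl -s http://api.internal:8080/metrics | grep cache_hit_ratio
  # Expected: a ratio of 0.80 or higher; lower suggests a cold or undersized cache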
Remediations:
- Add one API replica (safe first step). Note that kubectl scale takes an absolute count, so set replicas to current + 1 (see the sketch after this list).
  kubectl scale deploy/api -n api --replicas=<current+1>
  Verify CPU per pod decreases.
- Warm cache if hit ratio < 80% (safe): trigger cache warm job.
- Feature-flag heavy endpoint: disable "report_v2" if it dominates slow calls.
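A minimal sketch of the scale step, assuming a bash shell and a deployment named api in the api namespace:
  # Read the current replica count, then scale to one more (kubectl scale requires an absolute value)
  CURRENT=$(kubectl get deploy/api -n api -o jsonpath='{.spec.replicas}')
  kubectl scale deploy/api -n api --replicas=$((CURRENT + 1))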
Verification: p95 < 800 ms for 15 minutes; error rate < 1%.
Rollback: Re-enable feature flag; scale replicas back if needed.
Escalation: After 15 minutes or if DB CPU > 90% persistently, page on-call DBA.
Example 2 — Disk filling up on node
Purpose: Prevent outage due to full disk (> 90%).
Diagnostics:
- Identify the culprit mount:
  df -h
- Large logs?
  du -sh /var/log/* | sort -h | tail
- Check orphaned Docker layers:
  docker system df
Remediations (do one, verify, then proceed):
- Rotate/compress logs safely.
  logrotate -f /etc/logrotate.d/app
- Prune unused images (safe if no deploy in progress).
  docker image prune -f
- Move large temp artifacts to object storage (document path).
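For the object-storage step, a hedged sketch assuming an AWS S3 bucket and an example artifact path (both placeholders; substitute your own and record the destination):
  # Copy the artifact off the node, confirm success, then remove the local copy
  aws s3 cp /data/tmp/large_artifact.tar.gz s3://<backup-bucket>/node-cleanup/ && \
    rm /data/tmp/large_artifact.tar.gz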
Verification: Free space > 20% and stable for 30 minutes.
Rollback: Restore any mistakenly removed files from backup location.
Escalation: If < 15% free after remediation, notify platform on-call.
Example 3 — Kafka consumer lag high
Purpose: Reduce lag to keep data processing near real-time.
Diagnostics:
- Confirm alert and trend (topic, partition).
- Check consumer errors/retries:
  kubectl logs deploy/ingestor -n data --tail=200 | grep -i error
- Check broker health (ISR, under-replicated partitions).
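To confirm lag per partition, the standard Kafka CLI works; a sketch assuming a consumer group named ingestor-group and a reachable broker (both placeholders):
  kafka-consumer-groups.sh --bootstrap-server <broker:9092> --describe --group ingestor-group
  # Expected: the LAG column trends down after remediation; note which partitions dominate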
Remediations:
- Scale consumers by +1 if CPU > 70% and no throttling:
  kubectl scale deploy/ingestor -n data --replicas=<current+1>   # kubectl scale takes an absolute count
- Increase max in-flight or batch size cautiously (include rollback).
- Throttle noisy producers if downstream cannot keep up (with approval).
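If producer throttling is approved, Kafka client quotas are one option; a hedged sketch assuming the kafka-configs.sh tool and a known producer client id (placeholder):
  kafka-configs.sh --bootstrap-server <broker:9092> --alter \
    --add-config 'producer_byte_rate=1048576' \
    --entity-type clients --entity-name <producer-client-id>
  # Remove the quota afterwards with --delete-config 'producer_byte_rate'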
Verification: Lag trending down to baseline; processing delay < 1 min for 20 minutes.
Rollback: Revert config changes; scale back replicas if overprovisioned.
Escalation: If ISR issues persist, escalate to streaming platform on-call.
Quality checklist
Exercises
Exercise 1 — Structure a High CPU alert runbook
You received these messy notes: "CPU high on api pods, maybe a loop in /report, scale? check logs, users complaining. Could roll back last deploy. DB ok?" Convert them into the provided template sections.
Need a hint?
- Group notes into Diagnostics vs Remediation.
- Add objective triggers and verification (numbers + duration).
- Include rollback and escalation timebox.
Show solution
Title: API CPU usage high
Purpose: Reduce CPU saturation to restore stable response times.
Triggers: Alert "api_pod_cpu_over_80pct_5m".
Diagnostics:
- Check hot endpoints: review logs for /report.
  kubectl logs deploy/api -n api --tail=200 | grep "/report"
- Confirm DB is OK (to avoid blind scaling): CPU < 70%, connections stable.
Remediation:
- Scale api deploy by +1 replica.
- If /report dominates, temporarily disable with feature flag.
- If deploy happened < 30 min ago, roll back last version.
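For the rollback step, a minimal sketch assuming the service is a Kubernetes deployment named api in the api namespace:
  kubectl rollout undo deploy/api -n api
  kubectl rollout status deploy/api -n api   # wait until the previous version is fully rolled out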
Verification: CPU < 70% per pod for 10 minutes; p95 < SLO threshold.
Rollback: Re-enable flag; restore previous replica count.
Escalation: After 15 minutes with no improvement, page on-call lead.
Exercise 2 — Add safety guardrails and SLO checks
Given this runbook excerpt: "If DB is slow, increase max connections and restart app." Improve it by adding explicit safety warnings, rollback, and SLO verification.
Need a hint?
- State the risk of raising max connections.
- Define verification using your API SLO (e.g., p95 latency).
- Document rollback to prior setting.
Show solution
Safety: Increasing DB connections can exhaust memory and worsen thrash. Do not exceed 20% above baseline without DBA approval.
Action: Raise app.max_db_connections by +10% and restart one replica at a time (a command sketch follows this solution).
Rollback: Restore the previous value and apply it with the same rolling restart.
Verification: API p95 < 800 ms for 15 minutes; DB CPU < 80%; connection wait time trending down.
Escalation: If no improvement after 10 minutes or errors spike, revert and page DBA.
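A hedged sketch of the Action on Kubernetes, assuming the app reads its limit from an environment variable (MAX_DB_CONNECTIONS is a hypothetical name) and a baseline of 100 connections:
  # Raise the limit by ~10% (100 -> 110); changing the env var triggers a rolling restart
  kubectl set env deploy/api -n api MAX_DB_CONNECTIONS=110
  kubectl rollout status deploy/api -n api
  # Rollback: kubectl set env deploy/api -n api MAX_DB_CONNECTIONS=100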
Exercise self-check
- Did you include clear triggers and success metrics?
- Does every action have rollback?
- Are timeboxes and escalation objective?
- Are commands specific with placeholders where needed?
Common mistakes
- Vague triggers ("system slow"): Use alert names and thresholds.
- Actions without rollback: Always include exact undo steps.
- Long theory: Keep it short and action-focused; link background internally if needed.
- Unsafe batch changes: One change at a time; verify between steps.
- Stale runbooks: Add "Last Reviewed" and revisit after every incident or major change.
Practical projects
- Runbook Library MVP: Create 5 top incident runbooks using the template; add owner and review dates.
- Dry-Run Drill: Pair with a teammate; simulate alerts and follow the runbook step-by-step; note friction points.
- Guardrail Audit: Find and fix missing rollbacks and safety warnings in 3 runbooks.
- ChatOps Snippets: Turn top 3 steps into slash-command snippets (copy-paste commands with placeholders).
Learning path
- Define SLOs/SLIs for your service and map top 5 alerts to runbooks.
- Write runbooks using the template; keep each under ~2 pages.
- Dry-run with teammates; improve clarity and safety.
- Instrument verification metrics and make them obvious in dashboards.
- Operationalize: add owners, review cadence, and incident-to-runbook updates.
Next steps
- Apply the template to one of your current alerts today.
- Run a 20-minute tabletop: pick an example and practice.
- Take the quick test below to confirm understanding.
Mini challenge
Pick one runbook step that currently takes > 8 minutes (e.g., "identify hot endpoint"). Rewrite it so a new on-call can complete it in < 2 minutes with copy-paste commands, explicit expected output, and a success/failure branch.