Why this matters
Runbooks and operational readiness reduce outage time and stress. They turn tribal knowledge into clear, repeatable actions. As a Platform Engineer, you'll use them to ship more safely, respond faster, and help teams meet uptime and latency goals.
- Real tasks: prepare a new service for launch; define alerts, dashboards, and rollback plans.
- Real tasks: handle 2 a.m. pages with a calm, proven checklist.
- Real tasks: run game days and improve runbooks after incidents.
Concept explained simply
A runbook is a short, actionable guide for a specific situation: an alert, routine task, or emergency. Operational readiness is a checklist that ensures a system is safe to run in production (monitoring, on-call ownership, rollback, backups, and more).
Mental model
Think of runbooks as the emergency card in an airplane seat: concise, visual, step-by-step, focused on critical actions and outcomes. Operational readiness is the pre-flight checklist: if anything critical is missing, you don’t take off.
Key components of a good runbook
- Purpose and trigger: what starts this runbook (alert name, threshold, routine task).
- Owner and contacts: who maintains it and who to escalate to.
- Pre-checks: quick context (recent deploys, known incidents, dashboards).
- Step-by-step actions: numbered, short, single action per step.
- Validation: how to confirm it worked (metrics/logs/health checks).
- Rollback/mitigation: safe reversal or minimal viable mitigation.
- Comms template: who to update, where, and how often.
- Post-incident notes: what to capture for follow-up.
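One way to make the component list above checkable is to treat a runbook as structured data and flag the sections that are missing. The sketch below is illustrative only: the `Runbook` class, the section keys, and the sample service name are assumptions, not part of any standard tool.

```python
from dataclasses import dataclass, field

# Section keys mirror the component list above; the names are hypothetical.
REQUIRED_SECTIONS = (
    "trigger", "owner", "pre_checks", "steps",
    "validation", "rollback", "comms", "post_incident",
)


@dataclass
class Runbook:
    title: str
    sections: dict = field(default_factory=dict)  # section name -> list of lines

    def missing_sections(self) -> list:
        """Return required sections that are absent or empty, in checklist order."""
        return [name for name in REQUIRED_SECTIONS if not self.sections.get(name)]


# Hypothetical draft that only covers three of the eight components.
draft = Runbook(
    title="checkout-api: high 5xx rate",
    sections={
        "trigger": ["Alert: 5xx rate above threshold"],
        "owner": ["payments-platform team"],
        "steps": ["Check recent deploys", "Roll back if a deploy matches the spike"],
    },
)
print(draft.missing_sections())
# ['pre_checks', 'validation', 'rollback', 'comms', 'post_incident']
```

A check like this only proves that a section exists; whether the steps are short and actionable still needs a human review.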
Worked examples
Example 1 — DB disk space alert (trigger: disk usage > 90% for 5 min)
- Pre-check: confirm current usage and trend on storage dashboard; note if growth is bursty or steady.
- Identify largest consumers: list the biggest tables or files; check recent job spikes.
- Mitigation (fast): purge safe temp files; rotate logs; compress old logs.
- Mitigation (longer-term): increase volume size or add space to the disk group following storage policy.
- Validation: ensure usage < 80% and trend stabilizes for 10–15 minutes.
- Comms: update incident channel with actions and current headroom.
- Follow-up: schedule partitioning, archival, or retention fixes.
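The pre-check, consumer search, and validation steps in Example 1 can be sketched with the Python standard library. This is a minimal illustration, assuming the 90% trigger and 80% recovery target from the example; the function names and the `/` mount point are placeholders, and a real database host would usually rely on its own storage tooling.

```python
import shutil
from pathlib import Path

TRIGGER_PCT = 90.0    # alert threshold from the example
RECOVERED_PCT = 80.0  # validation target from the example


def used_percent(total_bytes: int, used_bytes: int) -> float:
    """Disk usage as a percentage of total capacity."""
    return 100.0 * used_bytes / total_bytes


def classify(pct: float) -> str:
    """Map a usage percentage to the runbook's next action."""
    if pct > TRIGGER_PCT:
        return "mitigate"
    if pct < RECOVERED_PCT:
        return "ok"
    return "watch"  # between recovery target and trigger: keep monitoring


def largest_files(root: str, limit: int = 5) -> list:
    """Return (size_bytes, path) pairs for the largest regular files under root."""
    sizes = []
    for path in Path(root).rglob("*"):
        try:
            if path.is_file() and not path.is_symlink():
                sizes.append((path.stat().st_size, str(path)))
        except OSError:
            continue  # file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:limit]


# "/" is a placeholder; point this at the database volume's mount point.
usage = shutil.disk_usage("/")
print(classify(used_percent(usage.total, usage.used)))
```

Calling `largest_files` on a log or temp directory gives a quick list of candidates to rotate or compress before resizing the volume.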
Example 2 — Kubernetes rollback (trigger: 5xx spikes after deploy)
- Pre-check: confirm deployment timestamp vs error spike; check pod restarts and readiness probe failures.
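- Mitigation: roll the Deployment back to its previous revision (e.g., kubectl rollout undo), or to a specific known-good revision if the previous one was also bad. Scaling up an old ReplicaSet by hand only holds if no Deployment manages it; otherwise the Deployment controller scales it back down.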
- Traffic control: if using blue/green, switch traffic back to blue; verify service endpoints.
- Validation: p95 latency and error rate return to baseline; synthetic checks pass.
- Comms: announce rollback complete and current status.
- Follow-up: freeze further deploys; open bug with commit hash and logs.
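The rollback step in Example 2 might be wrapped in a small helper so that on-call engineers run the same command every time. The sketch below builds `kubectl rollout undo` and `kubectl rollout status` invocations and defaults to a dry run; the deployment name `checkout-api`, the namespace `prod`, and the helper names are assumptions for illustration.

```python
import subprocess
from typing import Optional


def rollout_undo_cmd(deployment: str, namespace: str,
                     to_revision: Optional[int] = None) -> list:
    """Build the kubectl command that rolls a Deployment back.

    Without to_revision, kubectl reverts to the previous revision.
    """
    cmd = ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace]
    if to_revision is not None:
        cmd.append(f"--to-revision={to_revision}")
    return cmd


def rollout_status_cmd(deployment: str, namespace: str, timeout: str = "120s") -> list:
    """Build the kubectl command that waits for the rollout to finish."""
    return ["kubectl", "rollout", "status", f"deployment/{deployment}",
            "-n", namespace, f"--timeout={timeout}"]


def run(cmd: list, dry_run: bool = True) -> int:
    """Print the command in dry-run mode; otherwise execute it and return the exit code."""
    if dry_run:
        print("DRY RUN:", " ".join(cmd))
        return 0
    return subprocess.run(cmd, check=False).returncode


# Hypothetical service and namespace; dry run only prints the command.
run(rollout_undo_cmd("checkout-api", "prod"))
# DRY RUN: kubectl rollout undo deployment/checkout-api -n prod
```

Keeping the dry run as the default is a deliberate choice here: the person following the runbook sees the exact command before anything changes in the cluster.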
Example 3 — Suspected key leak (trigger: security alert)
- Pre-check: identify scope (which key, what access, where observed).
- Mitigation: revoke/rotate the key; invalidate tokens; review access logs for misuse.
- Containment: temporarily restrict high-risk actions or IPs if needed.
- Validation: confirm new keys in use and no new suspicious access.
- Comms: notify stakeholders and on-call security; record incident timeline.
- Follow-up: secret scanning, least-privilege review, add detection to CI.
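The validation step in Example 3 ("new keys in use and no new suspicious access") can be expressed as a check over access-log events. The sketch below assumes a simplified, hypothetical record shape (`time`, `key_id`, `allowed`); real audit logs will differ by provider and need their own parsing.

```python
from datetime import datetime, timezone


def rotation_report(events, old_key_id, new_key_id, rotated_at):
    """Summarize key usage after rotation.

    Each event is a dict with "time" (timezone-aware datetime), "key_id",
    and "allowed" (bool). This record shape is an assumption for illustration.
    """
    after = [e for e in events if e["time"] >= rotated_at]
    old_allowed = [e for e in after if e["key_id"] == old_key_id and e["allowed"]]
    new_in_use = any(e["key_id"] == new_key_id and e["allowed"] for e in after)
    return {
        "old_key_still_accepted": len(old_allowed),  # should be 0 after revocation
        "new_key_in_use": new_in_use,
        "rotation_validated": not old_allowed and new_in_use,
    }


# Hypothetical timeline: the old key is denied after rotation, the new key works.
rotated_at = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
events = [
    {"time": datetime(2024, 5, 1, 11, 50, tzinfo=timezone.utc), "key_id": "old", "allowed": True},
    {"time": datetime(2024, 5, 1, 12, 5, tzinfo=timezone.utc), "key_id": "old", "allowed": False},
    {"time": datetime(2024, 5, 1, 12, 6, tzinfo=timezone.utc), "key_id": "new", "allowed": True},
]
print(rotation_report(events, "old", "new", rotated_at))
# {'old_key_still_accepted': 0, 'new_key_in_use': True, 'rotation_validated': True}
```

Denied attempts with the old key after rotation are still worth reviewing, since they may show where the leaked key is being used.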
Operational readiness checklist
Use this before launches and major changes.
- Alerts cover SLO symptoms (latency, errors, availability), not just CPU.
- Dashboards for key user journeys and dependencies.
- On-call rotation, escalation path, and contact methods defined.
- Runbooks exist for top alerts and risky changes.
- Health checks, readiness/liveness probes configured.
- Rollback plan tested (e.g., rollout undo, blue/green switch).
- Backups and restore procedure tested; RTO/RPO documented.
- Error budget policy agreed; release can be paused if needed.
- Access: least-privilege, break-glass steps documented.
- Dependency map and SLAs known (DB, cache, external APIs).
- Load and failure mode tested (scale test, chaos drills).
- Communications plan: channels, update cadence, status notes.
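The "error budget policy" item in the checklist is easier to agree on with numbers attached. As a rough sketch, assuming a time-based availability SLO measured over a 30-day window (both assumptions, not requirements), the budget and the remaining fraction can be computed like this:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


def releases_allowed(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> bool:
    """Hypothetical policy: pause releases once the budget is fully spent."""
    return budget_remaining(slo_percent, downtime_minutes, window_days) > 0


print(round(error_budget_minutes(99.9), 1))    # 43.2
print(round(budget_remaining(99.9, 30.0), 2))  # 0.31
print(releases_allowed(99.9, 50.0))            # False
```

A 99.9% SLO over 30 days leaves about 43 minutes of downtime; after 30 minutes of incidents, roughly a third of the budget remains.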
How to write your first runbook (20-minute version)
Templates
Runbook template (copy/paste)
Title: [Service] — [Incident/Task]
Owner/Maintainer: [Team/Person]
Last reviewed: [YYYY-MM-DD]
Trigger: [Exact alert name or routine]
Severity guidance: [Page/Notify/Task]
Pre-checks (60–120s):
- Check dashboard: [which]
- Note recent deploys: [how]
- Known issues? [link or note]
Steps:
1) [Action]
2) [Action]
3) [Action]
Validation:
- Success looks like: [metrics/logs]
- Watch for regression for: [time window]
Rollback/Mitigation:
- [How to undo or safe state]
Comms:
- Incident channel: [where]
- Update cadence: [e.g., every 15 minutes]
- Stakeholders: [who]
Post-incident notes:
- Root cause hypothesis
- Runbook gaps to fix
- Follow-ups (owner + date)
Operational readiness review template
Service: [name]
Owner: [team]
Launch/change: [what/when]
SLOs & Alerts:
- SLOs: [availability, latency]
- Alerts: [symptom-based, thresholds]
Observability:
- Dashboards: [paths, dependencies]
- Logs/Traces: [where/how]
Operations:
- On-call & escalation: [who/how]
- Runbooks: [top 5 alerts covered]
- Rollback plan: [tested?]
Reliability:
- Backups & restore tested: [RTO/RPO]
- Capacity & load tested: [evidence]
- Dependencies & SLAs: [list]
Security:
- Secrets & access: [least-privilege, rotation]
- Change control: [process]
Sign-off:
- Risks: [top 3]
- Go/No-Go: [decision, date]
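The Go/No-Go line in the review template can be derived from the filled-in items rather than decided by feel. The sketch below is one possible shape: the `Check` class and the choice of which items are blocking are assumptions that each team would set for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    passed: bool
    blocking: bool = True  # blocking checks must pass before launch
    evidence: str = ""     # link or note proving the check was done


def decide(checks):
    """Return (decision, failed blocking checks, failed non-blocking checks)."""
    blockers = [c.name for c in checks if c.blocking and not c.passed]
    warnings = [c.name for c in checks if not c.blocking and not c.passed]
    return ("NO-GO" if blockers else "GO", blockers, warnings)


# Hypothetical review results for a launch.
review = [
    Check("Symptom-based alerts enabled", passed=True, evidence="alert rules reviewed"),
    Check("Rollback plan tested", passed=False),
    Check("Backups and restore tested", passed=True, evidence="restore drill notes"),
    Check("Chaos drill run", passed=False, blocking=False),
]
print(decide(review))
# ('NO-GO', ['Rollback plan tested'], ['Chaos drill run'])
```

Failed non-blocking checks map naturally onto the "Risks" line of the sign-off section.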
Common mistakes and self-check
- Mistake: Runbooks are too long. Fix: keep steps short; move background to a details section.
- Mistake: Steps list only tool-specific commands. Fix: describe the intent and expected outcome alongside each command.
- Mistake: No validation step. Fix: define concrete success signals.
- Mistake: Never reviewed. Fix: schedule a quarterly review and name the owner responsible for updates.
- Mistake: Readiness gates are vague. Fix: use measurable criteria (alerts enabled, rollback tested).
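The "never reviewed" mistake can be caught automatically if runbooks keep the "Last reviewed: [YYYY-MM-DD]" line from the template. A minimal sketch, assuming that exact line format and treating a quarter as 90 days:

```python
import re
from datetime import date, timedelta

# Matches the "Last reviewed: YYYY-MM-DD" line from the runbook template.
LAST_REVIEWED = re.compile(r"^Last reviewed:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def is_stale(runbook_text: str, today: date, max_age_days: int = 90) -> bool:
    """True if the runbook has no valid review date or it is older than max_age_days."""
    match = LAST_REVIEWED.search(runbook_text)
    if match is None:
        return True  # missing or unfilled placeholder counts as stale
    try:
        reviewed = date.fromisoformat(match.group(1))
    except ValueError:
        return True  # e.g. month 13: not a real date
    return today - reviewed > timedelta(days=max_age_days)


# Hypothetical runbook header.
text = "Title: checkout-api: high 5xx rate\nLast reviewed: 2024-01-15\n"
print(is_stale(text, today=date(2024, 6, 1)))  # True
print(is_stale(text, today=date(2024, 3, 1)))  # False
```

Passing `today` in explicitly keeps the check deterministic, which makes it easy to test and to run in a scheduled job.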
Self-check mini audit
- Can a new on-caller act within 2 minutes using your runbook?
- Is there exactly one clear rollback path?
- Can you prove backup restore works with timestamps?
- Do your alerts map to user pain (not just resource saturation)?
Exercises
Do these to make the concepts stick. The quick test is available to everyone; only logged-in learners get saved progress.
Exercise 1 — Draft a Kubernetes "High Pod Restart" runbook
Create a 1-page runbook for the alert: "Pod restart rate > 5 restarts in 10 minutes" for a web service.
- Include trigger, owner, pre-checks, 5–9 steps, validation, rollback, and comms template.
- Keep it actionable: one action per step.
Exercise 2 — Operational readiness for a new microservice
Fill in the operational readiness review template for a service that depends on a database and an external payment API.
- Define SLOs, alerts, dashboards, rollback, backups, and dependency SLAs.
- Specify on-call ownership and escalation.
Mini challenge
Pick your top paging alert from the last month. In 30 minutes, write or update the runbook and schedule a 15-minute dry-run with a teammate this week.
Practical projects
- Game day: simulate a partial outage (e.g., dependency latency). Follow runbooks, measure time to mitigation, and improve steps.
- Readiness gate in CI: require that each service has a valid runbook and dashboard link before deploy to production.
- Rollback rehearsal: for a non-critical service, practice rollback and validate recovery metrics in under 10 minutes.
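The "readiness gate in CI" project could start as a small script that fails the pipeline when a service lacks a runbook or dashboard link. The sketch below assumes service metadata is already loaded into a dict; the keys `runbook_url` and `dashboard_url`, the service names, and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

REQUIRED_LINKS = ("runbook_url", "dashboard_url")  # hypothetical metadata keys


def is_http_url(value) -> bool:
    """True if value looks like an absolute http(s) URL."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def gate_failures(services: dict) -> list:
    """Return one message per missing or invalid link, sorted by service name."""
    failures = []
    for name, meta in sorted(services.items()):
        for key in REQUIRED_LINKS:
            if not is_http_url(meta.get(key)):
                failures.append(f"{name}: missing or invalid {key}")
    return failures


# Hypothetical metadata, e.g. loaded from a service catalog file.
services = {
    "checkout-api": {
        "runbook_url": "https://wiki.example.com/runbooks/checkout-api",
        "dashboard_url": "https://grafana.example.com/d/checkout",
    },
    "search": {"runbook_url": "TODO"},
}
failures = gate_failures(services)
for line in failures:
    print(line)
# search: missing or invalid runbook_url
# search: missing or invalid dashboard_url

# In a CI job, finish with: raise SystemExit(1 if failures else 0)
```

This only checks that the links are well-formed; confirming that they resolve to a current runbook and dashboard would need an extra step.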
Who this is for
- Platform and backend engineers who support production systems.
- Teams adopting on-call rotations and SRE practices.
- New on-call engineers seeking confidence under pressure.
Prerequisites
- Basic understanding of your deployment platform (e.g., containers, VMs).
- Familiarity with your monitoring/alerting and logging tools.
- Ability to roll back or disable a change safely.
Learning path
- Start with one high-value alert runbook (this lesson).
- Add a production readiness checklist to your next launch.
- Run a monthly mini game day and update runbooks.
- Automate: embed checks in CI and add links to dashboards in alerts.
Next steps
- Adopt the templates across services and assign owners.
- Schedule quarterly reviews and a light game day.
- Do the quick test below to confirm understanding.