Who this is for
- Backend engineers who participate in on-call or own services in production.
- Platform/SRE engineers who standardize incident response.
- Team leads who want consistent, low-stress operations.
Prerequisites
- Basic familiarity with your service architecture and deployment process.
- Ability to read alerts and logs; comfort with CLI operations (e.g., kubectl, systemctl, or cloud CLI).
- Awareness of SLOs/SLIs and how your team defines “healthy.”
Why this matters
Operational runbooks turn stressful, ambiguous incidents into repeatable, safe routines. They reduce MTTR, prevent guesswork, and help new responders act confidently.
- Real tasks: handle high latency alerts, clear consumer lag, rotate TLS certificates, recover from a bad deploy, or fail over a database.
- Outcomes: faster recovery, fewer mistakes, clear hand-offs and escalation, better post-incident learning.
Concept explained simply
A runbook is a short, step-by-step guide that tells an on-call responder exactly what to do when a specific operational situation happens. It is concrete, timeboxed, and safe by default.
Mental model
Think of a runbook as a flight checklist: it doesn’t teach aerodynamics; it gives the exact actions to take, in order, with guardrails. It answers three questions: What triggered? What must I check? What can I safely do now?
Runbook anatomy (use this template)
Copy-ready template
Title: [Short, specific] (e.g., API p95 latency high)
Purpose: Why this runbook exists (what it resolves, business impact).
Scope: What this covers and what it does not.
Triggers: Exact alert names, dashboards/metrics to confirm.
Prerequisites: Access, tools, feature flags, backups.
Safety/Warnings: Irreversible actions, known risks, timebox (e.g., 15 min).
Quick Decision Tree:
- If X, go to Step A.
- If Y, go to Step B.
Diagnostics (read-only checks):
1) ... (commands/queries with expected outputs)
2) ...
Remediation Steps (one safe action at a time):
A) Action name (why, how)
- Command(s):
- Expected result:
- Rollback:
Verification: Success criteria (numbers + duration), what to monitor.
Rollback/Undo: Exact steps to revert changes.
Escalation: Who/where, and when (after X minutes or if Y condition).
Notes/Learnings: Common patterns, links to past incident summaries.
Last Reviewed: [date] Owner: [team/person]
Worked examples
Example 1 — HTTP latency p95 high
Purpose: Restore API responsiveness when p95 > 800 ms for 10+ minutes.
Triggers: Alert "api_latency_p95_above_800ms".
Safety: Timebox 15 minutes before escalation. Avoid scaling database writes without confirming capacity.
Diagnostics:
- Check traffic spikes and errors:
  kubectl top pods -n api
  kubectl logs deploy/api -n api --tail=200 | grep -i timeout
- Check DB saturation (read replicas, connections):
  SELECT state, count(*) FROM pg_stat_activity GROUP BY 1;
- Check external dependencies (cache hit rate):
  GET /metrics   # look for cache_hit_ratio
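If the service exposes a Prometheus-style /metrics endpoint over HTTP, one way to read the hit ratio (the host, port, and metric name here are assumptions; substitute your own):
  curl -s http://api.internal:8080/metrics | grep cache_hit_ratio
  # Expected: a ratio of 0.80 or higher; lower suggests a cold or undersized cache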
Remediations:
- Add one API replica (safe first step). Note that kubectl scale takes an absolute count, so set replicas to current + 1 (see the sketch after this list).
  kubectl scale deploy/api -n api --replicas=<current+1>
  Verify CPU per pod decreases.
- Warm cache if hit ratio < 80% (safe): trigger cache warm job.
- Feature-flag heavy endpoint: disable "report_v2" if it dominates slow calls.
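A minimal sketch of the scale step, assuming a bash shell and a deployment named api in the api namespace:
  # Read the current replica count, then scale to one more (kubectl scale requires an absolute value)
  CURRENT=$(kubectl get deploy/api -n api -o jsonpath='{.spec.replicas}')
  kubectl scale deploy/api -n api --replicas=$((CURRENT + 1))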
Verification: p95 < 800 ms for 15 minutes; error rate < 1%.
Rollback: Re-enable feature flag; scale replicas back if needed.
Escalation: After 15 minutes or if DB CPU > 90% persistently, page on-call DBA.
Example 2 — Disk filling up on node
Purpose: Prevent outage due to full disk (> 90%).
Diagnostics:
- Identify the culprit mount:
  df -h
- Large logs?
  du -sh /var/log/* | sort -h | tail
- Check orphaned Docker layers:
  docker system df
Remediations (do one, verify, then proceed):
- Rotate/compress logs safely.
  logrotate -f /etc/logrotate.d/app
- Prune unused images (safe if no deploy in progress).
  docker image prune -f
- Move large temp artifacts to object storage (document path).
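For the object-storage step, a hedged sketch assuming an AWS S3 bucket and an example artifact path (both placeholders; substitute your own and record the destination):
  # Copy the artifact off the node, confirm success, then remove the local copy
  aws s3 cp /data/tmp/large_artifact.tar.gz s3://<backup-bucket>/node-cleanup/ && \
    rm /data/tmp/large_artifact.tar.gz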
Verification: Free space > 20% and stable for 30 minutes.
Rollback: Restore any mistakenly removed files from backup location.
Escalation: If < 15% free after remediation, notify platform on-call.
Example 3 — Kafka consumer lag high
Purpose: Reduce lag to keep data processing near real-time.
Diagnostics:
- Confirm alert and trend (topic, partition).
- Check consumer errors/retries:
  kubectl logs deploy/ingestor -n data --tail=200 | grep -i error
- Check broker health (ISR, under-replicated partitions).
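To confirm lag per partition, the standard Kafka CLI works; a sketch assuming a consumer group named ingestor-group and a reachable broker (both placeholders):
  kafka-consumer-groups.sh --bootstrap-server <broker:9092> --describe --group ingestor-group
  # Expected: the LAG column trends down after remediation; note which partitions dominate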
Remediations:
- Scale consumers by +1 if CPU > 70% and no throttling:
  kubectl scale deploy/ingestor -n data --replicas=<current+1>   # kubectl scale takes an absolute count
- Increase max in-flight or batch size cautiously (include rollback).
- Throttle noisy producers if downstream cannot keep up (with approval).
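If producer throttling is approved, Kafka client quotas are one option; a hedged sketch assuming the kafka-configs.sh tool and a known producer client id (placeholder):
  kafka-configs.sh --bootstrap-server <broker:9092> --alter \
    --add-config 'producer_byte_rate=1048576' \
    --entity-type clients --entity-name <producer-client-id>
  # Remove the quota afterwards with --delete-config 'producer_byte_rate'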
Verification: Lag trending down to baseline; processing delay < 1 min for 20 minutes.
Rollback: Revert config changes; scale back replicas if overprovisioned.
Escalation: If ISR issues persist, escalate to streaming platform on-call.
Quality checklist
Exercises
Exercise 1 — Structure a High CPU alert runbook
You received these messy notes: "CPU high on api pods, maybe a loop in /report, scale? check logs, users complaining. Could roll back last deploy. DB ok?" Convert them into the provided template sections.
Need a hint?
- Group notes into Diagnostics vs Remediation.
- Add objective triggers and verification (numbers + duration).
- Include rollback and escalation timebox.
Show solution
Title: API CPU usage high
Purpose: Reduce CPU saturation to restore stable response times.
Triggers: Alert "api_pod_cpu_over_80pct_5m".
Diagnostics:
- Check hot endpoints: review logs for /report.
  kubectl logs deploy/api -n api --tail=200 | grep "/report"
- Confirm DB is OK (to avoid blind scaling): CPU < 70%, connections stable.
Remediation:
- Scale api deploy by +1 replica.
- If /report dominates, temporarily disable with feature flag.
- If deploy happened < 30 min ago, roll back last version.
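For the rollback step, a minimal sketch assuming the service is a Kubernetes deployment named api in the api namespace:
  kubectl rollout undo deploy/api -n api
  kubectl rollout status deploy/api -n api   # wait until the previous version is fully rolled out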
Verification: CPU < 70% per pod for 10 minutes; p95 < SLO threshold.
Rollback: Re-enable flag; restore previous replica count.
Escalation: After 15 minutes with no improvement, page on-call lead.
Exercise 2 — Add safety guardrails and SLO checks
Given this runbook excerpt: "If DB is slow, increase max connections and restart app." Improve it by adding explicit safety warnings, rollback, and SLO verification.
Need a hint?
- State the risk of raising max connections.
- Define verification using your API SLO (e.g., p95 latency).
- Document rollback to prior setting.
Show solution
Safety: Increasing DB connections can exhaust memory and worsen thrash. Do not exceed 20% above baseline without DBA approval.
Action: Raise app.max_db_connections by +10% and restart one replica at a time (a command sketch follows this solution).
Rollback: Restore the previous value and apply it with the same rolling restart.
Verification: API p95 < 800 ms for 15 minutes; DB CPU < 80%; connection wait time trending down.
Escalation: If no improvement after 10 minutes or errors spike, revert and page DBA.
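A hedged sketch of the Action on Kubernetes, assuming the app reads its limit from an environment variable (MAX_DB_CONNECTIONS is a hypothetical name) and a baseline of 100 connections:
  # Raise the limit by ~10% (100 -> 110); changing the env var triggers a rolling restart
  kubectl set env deploy/api -n api MAX_DB_CONNECTIONS=110
  kubectl rollout status deploy/api -n api
  # Rollback: kubectl set env deploy/api -n api MAX_DB_CONNECTIONS=100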
Exercise self-check
- Did you include clear triggers and success metrics?
- Does every action have rollback?
- Are timeboxes and escalation objective?
- Are commands specific with placeholders where needed?
Common mistakes
- Vague triggers ("system slow"): Use alert names and thresholds.
- Actions without rollback: Always include exact undo steps.
- Long theory: Keep it short and action-focused; link background internally if needed.
- Unsafe batch changes: One change at a time; verify between steps.
- Stale runbooks: Add "Last Reviewed" and revisit after every incident or major change.
Practical projects
- Runbook Library MVP: Create 5 top incident runbooks using the template; add owner and review dates.
- Dry-Run Drill: Pair with a teammate; simulate alerts and follow the runbook step-by-step; note friction points.
- Guardrail Audit: Find and fix missing rollbacks and safety warnings in 3 runbooks.
- ChatOps Snippets: Turn top 3 steps into slash-command snippets (copy-paste commands with placeholders).
Learning path
- Define SLOs/SLIs for your service and map top 5 alerts to runbooks.
- Write runbooks using the template; keep each under ~2 pages.
- Dry-run with teammates; improve clarity and safety.
- Instrument verification metrics and make them obvious in dashboards.
- Operationalize: add owners, review cadence, and incident-to-runbook updates.
Next steps
- Apply the template to one of your current alerts today.
- Run a 20-minute tabletop: pick an example and practice.
- Take the quick test below to confirm understanding.
Mini challenge
Pick one runbook step that currently takes > 8 minutes (e.g., "identify hot endpoint"). Rewrite it so a new on-call can complete it in < 2 minutes with copy-paste commands, explicit expected output, and a success/failure branch.