Operational Runbooks

Learn Operational Runbooks for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Who this is for

  • ETL Developers who deploy and support scheduled data pipelines.
  • Data/Analytics Engineers preparing handover to operations or on-call.
  • DataOps/Platform engineers standardizing incident response.

Prerequisites

  • Basic understanding of your scheduler (e.g., DAGs, jobs, retries, SLA).
  • Comfort with SQL and bash/CLI for checks and fixes.
  • Awareness of your data sources, targets, and access permissions.

Why this matters

In production, things break at 2 a.m. A clear runbook turns panic into a process. It shortens downtime, reduces data defects, and makes handovers safe.

  • On-call can triage and fix common failures without guessing.
  • New team members can perform safe start/stop, backfills, and rollbacks.
  • Stakeholders get consistent communication and timelines.

Concept explained simply

An operational runbook is a step-by-step guide for running, monitoring, and recovering ETL pipelines. It covers what to watch, what can go wrong, how to respond, and when to escalate.

Mental model

Think of a runbook as three layers:

  • Detect: How you notice an issue (alerts, dashboards, checks).
  • Decide: The triage path (Is this transient? data issue? infra?).
  • Do: The exact commands and steps (retry, backfill, rollback, notify).
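
To make the three layers concrete, here is a minimal sketch of a triage helper in Python. Everything in it (the Alert shape, the error kinds, the action strings) is a hypothetical placeholder rather than any particular orchestrator's API; the point is that Decide/Do becomes an explicit, reviewable mapping instead of tribal knowledge.

```python
from dataclasses import dataclass

# Hypothetical alert shape -- adapt to whatever your monitoring actually emits.
@dataclass
class Alert:
    job: str
    error_kind: str            # e.g. "http_429", "missing_partition", "schema_drift"
    consecutive_failures: int

def triage(alert: Alert) -> str:
    """Decide -> Do: map a detected alert to the runbook action to perform."""
    if alert.error_kind == "http_429":
        return "pause retries 20m, lower concurrency to 1, backoff 10m, single retry"
    if alert.error_kind == "missing_partition":
        return "check upstream status; wait or rerun the loader for that partition only"
    if alert.error_kind == "schema_drift":
        return "apply safe mapping hotfix, re-run failed partition, roll back if it fails"
    if alert.consecutive_failures >= 3:
        return "escalate to the pipeline owner"
    return "safe retry once, then re-evaluate"

# Example: the extract job has failed three times with HTTP 429.
print(triage(Alert(job="daily_sales_extract", error_kind="http_429", consecutive_failures=3)))
```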

Core sections your runbook should include

  • Overview: Purpose, owners, data flow diagram (brief description).
  • Schedules & SLAs: When it runs, expected duration, deadlines.
  • Dependencies: Upstream/downstream jobs, external systems.
  • Monitoring & Alerts: What alerts fire; where to check health.
  • Triage Decision Tree: If X then Y steps, with stop rules.
  • Standard Operations: Start/stop, manual trigger, safe retry.
  • Recovery: Backfill, reprocess, rollback, data repair steps.
  • Data Quality: Critical checks and what to do on failures.
  • Access & Secrets: How to get access; what not to do.
  • Change & Release: How changes are deployed; rollback plan.
  • Communication: Who to notify, templates for updates.
  • Escalation Matrix: Contacts and time-based thresholds.
  • Audit Trail: Where to record incidents and resolutions.

Worked examples

Example 1 — Source API rate limiting (HTTP 429)

Symptom: Extract task fails with 429; retries keep failing.

  • Detect: Alert "extract_task failed 3 times with 429".
  • Decide: Transient vs policy. Check API status page notes (if available internally) and job logs for rate-limit headers.
  • Do:
    • Pause job retries for 20 minutes (avoid thundering herd).
    • Lower concurrency for this job to 1; increase backoff to 10m.
    • Trigger one retry. If success, resume schedule.
    • If still failing after 2 hours: escalate to integration owner.
  • Data repair: If partial day extracted, run targeted backfill for the missed window only.
  • Comms: "Source rate-limited requests; applied backoff; ETA +30m. No downstream impact expected."

Example 2 — Late-arriving partition

Symptom: Daily partition for dt=2026-01-10 missing by 03:00 SLA.

  • Detect: DQ check "expected partition exists" failed.
  • Decide: Upstream delay vs our load issue. Check upstream job status and landing bucket for the partition.
  • Do:
    • If upstream is late: set downstream to wait, extend SLA +2h, notify analytics consumers.
    • If data present but load failed: rerun loader task for that partition only.
  • Data repair: Verify row counts vs control table; run reconciliation query.
  • Comms: "Partition delay; new ETA 05:00. Dashboards for date 2026-01-10 may be stale."

Example 3 — Schema change added a nullable column

Symptom: Transform task fails with "unknown column" in select list.

  • Detect: Alert from transform job and schema drift check.
  • Decide: Backward compatible? If column added and nullable, quick fix may be safe.
  • Do:
    • Hotfix mapping to include the column with default or ignore safely.
    • Re-run transform for the failed partition.
  • Rollback: If hotfix fails, revert mapping to last known version and run.
  • Follow-up: Create change request to formalize schema update and tests.
  • Comms: "Upstream added column; applied safe mapping; data complete."

Example 4 — Orchestrator outage during run

Symptom: Scheduler UI unreachable; jobs mid-flight.

  • Detect: Heartbeat alert; no logs updating.
  • Decide: Treat it as an infrastructure incident. Hold off on manual intervention unless there is a risk of data corruption.
  • Do:
    • Confirm with platform team. Avoid double-triggering tasks.
    • Once restored, mark running tasks as failed and re-run from last safe checkpoint per runbook.
  • Data repair: Backfill missed windows.
  • Comms: "Platform incident; will backfill on recovery; ETA depends on platform."

How to build a solid runbook

Step 1: Draft a one-page overview (purpose, owners, schedule, SLA, data map).
Step 2: List top 5 failure modes with exact triage steps and safe retries.
Step 3: Document recovery flows (backfill, reprocess, rollback) with commands.
Step 4: Add communication templates and escalation thresholds.
Step 5: Run a tabletop drill and fix gaps found.
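
Step 3 is easier to trust when recovery commands carry guardrails. Below is a hedged sketch of a partition-level backfill helper with a date-range limit and a dry-run mode; run_load is a hypothetical callable standing in for your loader, not a real CLI or library call.

```python
from datetime import date, timedelta

MAX_BACKFILL_DAYS = 7   # guardrail: refuse suspiciously wide date ranges

def backfill(start: date, end: date, run_load, dry_run: bool = True) -> None:
    """Re-run the loader for each partition in [start, end], oldest first.
    `run_load` is a hypothetical callable that loads exactly one partition."""
    if end < start:
        raise ValueError("end date is before start date")
    days = (end - start).days + 1
    if days > MAX_BACKFILL_DAYS:
        raise ValueError(f"{days} days exceeds the {MAX_BACKFILL_DAYS}-day guardrail")

    for offset in range(days):
        partition = start + timedelta(days=offset)
        if dry_run:
            print(f"[dry-run] would load partition {partition}")
        else:
            run_load(partition)

# Preview a two-day backfill without touching anything.
backfill(date(2026, 1, 9), date(2026, 1, 10), run_load=print, dry_run=True)
```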

Copy-paste runbook template (short)

Overview
- Name:
- Purpose:
- Owners (primary, backup):
- Schedule & SLA:
- Data flow (short text):

Monitoring & Alerts
- Dashboards:
- Alerts (name -> trigger -> action):

Triage Decision Tree (top 5)
- If [extract 429] -> pause retries 20m -> lower concurrency -> retry -> escalate after 2h.
- If [missing partition] -> check upstream -> wait or rerun loader partition-only.
- If [schema drift] -> apply safe mapping -> re-run -> rollback if fails.

Standard Ops
- Start/Stop:
- Manual trigger:
- Safe retry policy:

Recovery
- Backfill command examples:
- Reprocess scope:
- Rollback steps:

Data Quality
- Critical checks:
- What to do on failure:

Access & Escalation
- How to get access:
- P0/P1/P2 thresholds and contacts:

Communication Templates
- Incident start:
- Update cadence:
- Resolution summary:

Audit
- Where to log incident notes:
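
For the template's Data Quality section, here is a hedged sketch of the three checks most runbooks start with: partition existence, a row-count band, and a critical-metric threshold. The run_query helper, table names, and bounds are all hypothetical; swap in your own query layer and limits.

```python
def run_query(sql: str):
    """Hypothetical helper that executes SQL and returns a single scalar."""
    raise NotImplementedError("wire this to your query engine")

def dq_checks(partition_date: str) -> list:
    """Return a list of failed-check messages; an empty list means all checks passed.
    Table names, bounds, and thresholds below are illustrative only."""
    failures = []

    # 1) Partition existence / row count within an expected band
    rows = run_query(f"SELECT COUNT(*) FROM sales WHERE dt = '{partition_date}'")
    if rows == 0:
        failures.append(f"no rows loaded for partition {partition_date}")
    elif not 50_000 <= rows <= 500_000:
        failures.append(f"row count {rows} is outside the expected band")

    # 2) Critical metric threshold (share of NULL amounts)
    null_share = run_query(
        "SELECT AVG(CASE WHEN amount IS NULL THEN 1.0 ELSE 0.0 END) "
        f"FROM sales WHERE dt = '{partition_date}'"
    )
    if null_share > 0.01:
        failures.append(f"NULL share of amount is {null_share:.2%} (limit 1%)")

    return failures
```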

Common mistakes and self-check

  • Too much theory, not enough commands. Fix: Add exact CLI/UI paths and example parameters.
  • Missing rollback. Fix: Include revert steps for configs and data.
  • Stale contacts. Fix: Add monthly contact check reminder.
  • Ambiguous scope. Fix: Define boundaries (what this runbook covers vs not).
  • Unverified steps. Fix: Run a quarterly tabletop and update notes.
  • Tool-only focus. Fix: Include data checks, not just orchestrator steps.
Self-check checklist
  • Can a new on-call person restore a failed run without help?
  • Are commands copy-pasteable with placeholders clearly marked?
  • Is there a decision tree for top 5 incidents?
  • Does it include comms templates and escalation timers?
  • Have you tested backfill and rollback recently?

Practical projects

  • Project 1: Create a runbook for one production DAG. Include at least three failure modes and a backfill guide.
  • Project 2: Tabletop drill. Simulate a P1 incident; time each step; note gaps; update your runbook.
  • Project 3: Add a partition-level backfill script with guardrails (date range limits, dry-run).
  • Project 4: Build a minimal data quality section (row counts, partition existence, critical metric thresholds).

Learning path

  • Start: Understand pipeline components and SLAs.
  • Then: Draft runbook using the template.
  • Next: Add DQ checks and recovery flows.
  • Finally: Tabletop test, refine, and handover to on-call.

Exercises

Everyone can do the exercises and quick test for free. Logging in lets you save progress.

Exercise 1 — First-hour action plan

Scenario: Your daily sales DAG failed mid-run. Write the first-hour plan in bullet points.

  • Include: detection, triage steps, safe retry, comms, and escalation threshold.
Need a hint?
  • Start with what you check in the UI/logs.
  • Define how many retries, and how much spacing between them, you allow before escalating.

Exercise 2 — Alert-to-action mapping

Create a mapping for three alerts to concrete actions.

  • Alerts: Job SLA breach, Missing partition, Row-count anomaly.
  • For each, specify: source of truth to check, action, and when to escalate.
Need a hint?
  • Keep actions copy-pasteable.
  • Set explicit timers (e.g., escalate after 2 failed retries).
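
If it helps to structure your answer, here is a skeleton of an alert-to-action mapping kept as plain data. The three alert names come from the exercise; every <placeholder> is yours to fill in with your own checks, commands, and timers.

```python
# Skeleton only -- replace every <placeholder> with your own checks, commands, and timers.
ALERT_ACTIONS = {
    "job_sla_breach": {
        "source_of_truth": "<scheduler run history / job logs>",
        "action": "<check last task state; one safe retry if the failure looks transient>",
        "escalate_after": "<e.g. 2 failed retries or 30 minutes>",
    },
    "missing_partition": {
        "source_of_truth": "<landing bucket / upstream job status>",
        "action": "<wait for upstream or rerun the loader for that partition only>",
        "escalate_after": "<e.g. 1 hour before the SLA deadline>",
    },
    "row_count_anomaly": {
        "source_of_truth": "<control table / reconciliation query>",
        "action": "<compare against control counts; quarantine the partition if wrong>",
        "escalate_after": "<e.g. immediately if a published dashboard is affected>",
    },
}
```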
Exercise checklist
  • I stated how the issue is detected.
  • I listed 3–5 triage steps in order.
  • I provided a safe retry or backfill command.
  • I wrote who to notify and when to escalate.
  • I kept steps short and actionable.

Next steps

  • Publish your runbook where on-call can find it.
  • Schedule a 30-minute tabletop drill with a teammate.
  • Add a quarterly reminder to validate contacts, commands, and SLAs.

Mini challenge

Pick one real pipeline. In 25 minutes, fill the template: Overview, top 3 incidents, one backfill example, and an incident start message. Keep it to one page.

Practice Exercises

Instructions

Write a first-hour response plan for a failed daily ETL DAG. Keep it to 8–12 bullets.

  • Include: detection, initial triage checks, safe retry approach, communication, and escalation thresholds.
  • Assume the failure happened at 02:10, SLA is 04:00.
Expected Output
A short, ordered list of actions with exact checks, retry/backoff, and who/when to notify.

Operational Runbooks — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.
