Who this is for
- ETL Developers who deploy and support scheduled data pipelines.
- Data/Analytics Engineers preparing handover to operations or on-call.
- DataOps/Platform engineers standardizing incident response.
Prerequisites
- Basic understanding of your scheduler (e.g., DAGs, jobs, retries, SLAs).
- Comfort with SQL and bash/CLI for checks and fixes.
- Awareness of your data sources, targets, and access permissions.
Why this matters
In production, things break at 2 a.m. A clear runbook turns panic into a process. It shortens downtime, reduces data defects, and makes handovers safe.
- On-call can triage and fix common failures without guessing.
- New team members can perform safe start/stop, backfills, and rollbacks.
- Stakeholders get consistent communication and timelines.
Concept explained simply
An operational runbook is a step-by-step guide for running, monitoring, and recovering ETL pipelines. It covers what to watch, what can go wrong, how to respond, and when to escalate.
Mental model
Think of a runbook as three layers:
- Detect: How you notice an issue (alerts, dashboards, checks).
- Decide: The triage path (Is this transient? data issue? infra?).
- Do: The exact commands and steps (retry, backfill, rollback, notify).
Core sections your runbook should include
- Overview: Purpose, owners, data flow diagram (brief description).
- Schedules & SLAs: When it runs, expected duration, deadlines.
- Dependencies: Upstream/downstream jobs, external systems.
- Monitoring & Alerts: What alerts fire; where to check health.
- Triage Decision Tree: If X then Y steps, with stop rules.
- Standard Operations: Start/stop, manual trigger, safe retry (example commands after this list).
- Recovery: Backfill, reprocess, rollback, data repair steps.
- Data Quality: Critical checks and what to do on failures.
- Access & Secrets: How to get access; what not to do.
- Change & Release: How changes are deployed; rollback plan.
- Communication: Who to notify, templates for updates.
- Escalation Matrix: Contacts and time-based thresholds.
- Audit Trail: Where to record incidents and resolutions.
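To make the Standard Operations entry concrete, here is a minimal sketch of the copy-pasteable commands that section should carry, assuming an Airflow 2.x-style scheduler; `<dag_id>`, `<task_id>`, and the dates are placeholders for your own names:

```bash
# Illustrative standard-ops commands, assuming an Airflow-style scheduler.
# <dag_id>, <task_id>, and dates are placeholders; adapt to your tool.

# Stop: pause scheduling so no new runs start
airflow dags pause <dag_id>

# Start: resume scheduling
airflow dags unpause <dag_id>

# Manual trigger for a specific logical date
airflow dags trigger <dag_id> --exec-date 2026-01-10

# Safe retry: clear only the failed task for one date so the scheduler re-runs it
airflow tasks clear <dag_id> --task-regex <task_id> \
  --start-date 2026-01-10 --end-date 2026-01-10 --only-failed
```

Whatever your orchestrator, the runbook should carry the exact equivalents, not descriptions of them.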
Worked examples
Example 1 — Source API rate limiting (HTTP 429)
Symptom: Extract task fails with 429; retries keep failing.
- Detect: Alert "extract_task failed 3 times with 429".
- Decide: Transient spike or a policy change? Check the API status page (if available internally) and the job logs for rate-limit headers.
- Do (see the command sketch below):
  - Pause job retries for 20 minutes (avoid a thundering herd).
  - Lower this job's concurrency to 1; increase backoff to 10 minutes.
  - Trigger one retry. On success, resume the schedule.
  - If still failing after 2 hours, escalate to the integration owner.
- Data repair: If only part of the day was extracted, run a targeted backfill for the missed window only.
- Comms: "Source rate-limited requests; applied backoff; ETA +30m. No downstream impact expected."
Example 2 — Late-arriving partition
Symptom: Daily partition for dt=2026-01-10 missing by 03:00 SLA.
- Detect: DQ check "expected partition exists" failed.
- Decide: Upstream delay, or our load issue? Check the upstream job status and the landing bucket for the partition.
- Do (see the sketch below):
  - If upstream is late: set downstream to wait, extend the SLA by 2 hours, and notify analytics consumers.
  - If the data is present but the load failed: rerun the loader task for that partition only.
- Data repair: Verify row counts against the control table; run the reconciliation query.
- Comms: "Partition delay; new ETA 05:00. Dashboards for date 2026-01-10 may be stale."
Example 3 — Schema change added a nullable column
Symptom: Transform task fails with an "unknown column" error in the select list.
- Detect: Alert from the transform job and the schema drift check.
- Decide: Is the change backward compatible? If the column is new and nullable, a quick fix may be safe.
- Do (a safe-mapping sketch follows this example):
  - Hotfix the mapping to include the column with a default, or ignore it safely.
  - Re-run the transform for the failed partition.
- Rollback: If the hotfix fails, revert the mapping to the last known good version and re-run.
- Follow-up: Create a change request to formalize the schema update and its tests.
- Comms: "Upstream added a column; applied safe mapping; data complete."
Example 4 — Orchestrator outage during run
Symptom: Scheduler UI unreachable; jobs mid-flight.
- Detect: Heartbeat alert; no logs updating.
- Decide: Treat it as an infrastructure incident. Hold off on manual intervention unless there is a risk of data corruption.
- Do (recovery commands sketched below):
  - Confirm with the platform team. Avoid double-triggering tasks.
  - Once service is restored, mark in-flight tasks as failed and re-run from the last safe checkpoint per this runbook.
- Data repair: Backfill the missed windows.
- Comms: "Platform incident; will backfill on recovery; ETA depends on the platform."
How to build a solid runbook
Copy-paste runbook template (short)
Overview
- Name:
- Purpose:
- Owners (primary, backup):
- Schedule & SLA:
- Data flow (short text):
Monitoring & Alerts
- Dashboards:
- Alerts (name -> trigger -> action):
Triage Decision Tree (top 5)
- If [extract 429] -> pause retries 20m -> lower concurrency -> retry -> escalate after 2h.
- If [missing partition] -> check upstream -> wait or rerun loader partition-only.
- If [schema drift] -> apply safe mapping -> re-run -> rollback if fails.
Standard Ops
- Start/Stop:
- Manual trigger:
- Safe retry policy:
Recovery
- Backfill command examples:
- Reprocess scope:
- Rollback steps:
Data Quality
- Critical checks:
- What to do on failure:
Access & Escalation
- How to get access:
- P0/P1/P2 thresholds and contacts:
Communication Templates
- Incident start:
- Update cadence:
- Resolution summary:
Audit
- Where to log incident notes:
Common mistakes and self-check
- Too much theory, not enough commands. Fix: Add exact CLI/UI paths and example parameters.
- Missing rollback. Fix: Include revert steps for configs and data.
- Stale contacts. Fix: Add monthly contact check reminder.
- Ambiguous scope. Fix: Define boundaries (what this runbook covers vs not).
- Unverified steps. Fix: Run a quarterly tabletop and update notes.
- Tool-only focus. Fix: Include data checks, not just orchestrator steps.
Self-check checklist
- Can a new on-call person restore a failed run without help?
- Are commands copy-pasteable with placeholders clearly marked?
- Is there a decision tree for top 5 incidents?
- Does it include comms templates and escalation timers?
- Have you tested backfill and rollback recently?
Practical projects
- Project 1: Create a runbook for one production DAG. Include at least three failure modes and a backfill guide.
- Project 2: Tabletop drill. Simulate a P1 incident; time each step; note gaps; update your runbook.
- Project 3: Add a partition-level backfill script with guardrails (date-range limits, dry-run); a sketch follows this list.
- Project 4: Build a minimal data quality section (row counts, partition existence, critical metric thresholds).
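For Project 3, a minimal sketch of a guardrailed backfill wrapper; the script name, the MAX_DAYS limit, and the underlying Airflow command are all assumptions to adapt:

```bash
#!/usr/bin/env bash
# backfill.sh: hypothetical guardrailed backfill wrapper (dry-run by default).
# Usage: ./backfill.sh <dag_id> <start YYYY-MM-DD> <end YYYY-MM-DD> [--run]
set -euo pipefail

DAG_ID=$1; START=$2; END=$3; MODE=${4:---dry-run}
MAX_DAYS=7   # guardrail: refuse suspiciously wide windows

# GNU date; on macOS use gdate or adapt
days=$(( ( $(date -d "$END" +%s) - $(date -d "$START" +%s) ) / 86400 ))
if (( days < 0 || days > MAX_DAYS )); then
  echo "Refusing: window is ${days} day(s); limit is ${MAX_DAYS}." >&2
  exit 1
fi

echo "Backfill plan: ${DAG_ID} from ${START} to ${END} ($((days + 1)) day(s))."
if [[ "$MODE" != "--run" ]]; then
  echo "Dry run only. Re-run with --run to execute."
  exit 0
fi

# Airflow-style backfill; swap in your scheduler's equivalent
airflow dags backfill "$DAG_ID" -s "$START" -e "$END"
```

The dry-run default and the hard window limit are the guardrails: a tired on-call engineer has to opt in explicitly before anything executes.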
Learning path
- Start: Understand pipeline components and SLAs.
- Then: Draft runbook using the template.
- Next: Add DQ checks and recovery flows.
- Finally: Tabletop test, refine, and handover to on-call.
Exercises
Exercise 1 — First-hour action plan
Scenario: Your daily sales DAG failed mid-run. Write the first-hour plan in bullet points.
- Include: detection, triage steps, safe retry, comms, and escalation threshold.
Hints:
- Start with what you check in the UI/logs.
- Define how many retries, and how much spacing between them, to allow before escalating.
Exercise 2 — Alert-to-action mapping
Create a mapping for three alerts to concrete actions.
- Alerts: Job SLA breach, Missing partition, Row-count anomaly.
- For each, specify: source of truth to check, action, and when to escalate.
Hints:
- Keep actions copy-pasteable.
- Set explicit timers (e.g., escalate after 2 failed retries).
Exercise checklist
- I stated how the issue is detected.
- I listed 3–5 triage steps in order.
- I provided a safe retry or backfill command.
- I wrote who to notify and when to escalate.
- I kept steps short and actionable.
Next steps
- Publish your runbook where on-call can find it.
- Schedule a 30-minute tabletop drill with a teammate.
- Add a quarterly reminder to validate contacts, commands, and SLAs.
Mini challenge
Pick one real pipeline. In 25 minutes, fill the template: Overview, top 3 incidents, one backfill example, and an incident start message. Keep it to one page.