Why this matters
Schedulers (Airflow, ADF, Prefect, cron) run critical data pipelines. When a job fails at 02:00, people need a clear, safe set of steps to recover quickly. A good runbook turns panic into repeatable action, reduces downtime, and prevents guesswork. As an ETL Developer, you will write and maintain these runbooks so on-call responders can diagnose, fix, and communicate issues confidently.
Concept explained simply
A runbook is a short, structured, step-by-step guide that tells someone: what broke, how to confirm it, how to fix it, and how to prevent it next time. It is not a novel; it’s the minimum practical instructions to restore service safely.
Mental model
- Symptom: What you see (alert, error message, metric spike).
- Impact: Who/what is affected (SLAs, downstream tables, dashboards).
- Immediate actions: Safe things anyone can do now to stabilize (pause, mute noisy alerts, clear stuck runs).
- Diagnosis: Short checks to find the root cause (logs, inputs, credentials, capacity).
- Remediation: The fix (rerun, backfill, redeploy, rotate secret).
- Validation: How to verify success (row counts, data freshness, DAG state).
- Communication: Who to notify and what to say.
- Prevention & follow-up: How to avoid repeat incidents.
- Escalation: When and to whom to hand off.
Runbook essentials (include these sections)
- Title & Last updated
- Scope: Which pipeline(s), environment(s), and scheduler(s)
- Triggers/Symptoms: Common alerts, error text
- Business Impact: Affected data products and SLAs
- Immediate Safe Actions
- Diagnosis Steps
- Remediation Steps
- Rollback/Undo (if applicable)
- Validation Checklist
- Communication Template
- Escalation Matrix
- Prevention/Follow-up tasks
- Access/Permissions needed
Worked examples
Example 1: Airflow task fails with "S3 key not found"
Context: Daily sales pipeline depends on S3 file s3://bucket/sales/2024-05-10.csv
- Immediate actions: Pause downstream tasks in the DAG run to avoid partial loads.
- Diagnosis: Check S3 for today’s file; confirm file size & last modified. Review upstream job that publishes the file. Inspect Airflow task logs for exact missing key.
- Remediation: If the upstream job is late, wait up to 30 minutes, then trigger a recheck task. If the file is permanently missing, coordinate with the upstream team to re-publish or backfill it. Then clear the failed task so the scheduler retries it.
- Validation: Confirm task success, table row count matches last 7-day average ±10%, dashboards refresh without errors.
- Communication: Notify #data-ops: “Sales DAG delayed due to late S3 drop; re-run in progress; ETA 20 min.”
- Prevention: Add an alert on upstream object arrival; use an SLA-aware sensor with a timeout and a clear failure message.
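The validation step above ("row count matches last 7-day average ±10%") can be sketched as a small check. The function name and tolerance default are illustrative, not part of any scheduler API:

```python
# Flag a load whose row count drifts more than ±10% from the
# trailing 7-day average (the validation check in Example 1).

def row_count_within_tolerance(today_count: int, last_7_days: list[int],
                               tolerance: float = 0.10) -> bool:
    """Return True if today's row count is within ±tolerance of the 7-day average."""
    if not last_7_days:
        return False  # no baseline yet; treat as a failed check
    avg = sum(last_7_days) / len(last_7_days)
    return abs(today_count - avg) <= tolerance * avg

# The average of these seven counts is 100, so 8% drift passes and 15% fails.
print(row_count_within_tolerance(108, [100, 102, 98, 101, 99, 100, 100]))  # True
print(row_count_within_tolerance(115, [100, 102, 98, 101, 99, 100, 100]))  # False
```

A check like this can run as the final task of the DAG so a drifting load fails loudly instead of publishing silently.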
Example 2: Credential expired for warehouse connection
- Immediate actions: Stop retries to avoid lockouts. Tag incident as auth-related.
- Diagnosis: Attempt a manual connection with a test user. Check secret manager for expiry. Review last successful connection time.
- Remediation: Rotate credential in the secret manager. Redeploy scheduler connections. Trigger a small test task, then resume the DAG or backfill as needed.
- Validation: Confirm successful login, run a simple SELECT 1, then run a small partition load.
- Communication: “Warehouse credential rotated; resuming runs; monitoring for 30 minutes.”
- Prevention: Set secret rotation reminders and non-breaking pre-expiry alerts.
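The "non-breaking pre-expiry alert" above can be sketched as a daily check that warns before the credential actually breaks runs. The secret-metadata shape and 14-day threshold are assumptions; real secret managers expose expiry differently:

```python
# Warn when a credential is within N days of expiring (or already expired),
# so rotation happens before the scheduler starts failing logins.

from datetime import date, timedelta

def needs_rotation(expires_on: date, today: date, warn_days: int = 14) -> bool:
    """Return True if the credential expires within warn_days or has expired."""
    return expires_on - today <= timedelta(days=warn_days)

print(needs_rotation(date(2024, 5, 20), today=date(2024, 5, 10)))  # True: 10 days left
print(needs_rotation(date(2024, 8, 1), today=date(2024, 5, 10)))   # False
```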
Example 3: Upstream API rate limiting causes timeouts
- Immediate actions: Reduce parallelism for the extractor tasks.
- Diagnosis: Review error codes (429), request counts, and retry headers.
- Remediation: Increase backoff, add jitter, respect Retry-After header; split large pulls into smaller windows; resume runs.
- Validation: Success rate > 99%, run time within 1.5× normal, no 429 spikes.
- Communication: “API limited; throttling applied; new ETA 1 hour.”
- Prevention: Permanent rate-limit config; nightly window staggering.
Example 4: Late partition triggers downstream freshness breach
- Immediate actions: Prevent downstream publishes until the late partition is ready.
- Diagnosis: Identify missing partition (ds=2024-05-10). Check upstream job status and storage for that partition.
- Remediation: Backfill only the missing partition; avoid reprocessing full history.
- Validation: Partition table shows complete partitions for last 3 days; dashboards reflect today’s data.
- Communication: “Backfilled ds=2024-05-10; downstream unblock in 15 minutes.”
- Prevention: Add partition completeness checks before publish stage.
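The partition completeness check above can be sketched as a pre-publish gate. The `ds=YYYY-MM-DD` naming follows the example; in practice the `existing` set would come from your metastore or an object-store listing:

```python
# Block downstream publishes unless every expected daily partition for the
# last N days exists (the prevention step in Example 4).

from datetime import date, timedelta

def missing_partitions(existing: set[str], end: date, days: int = 3) -> list[str]:
    """Return expected ds=YYYY-MM-DD partitions (last `days` days) not in `existing`."""
    expected = [f"ds={end - timedelta(days=i):%Y-%m-%d}" for i in range(days)]
    return [p for p in expected if p not in existing]

have = {"ds=2024-05-08", "ds=2024-05-09"}  # ds=2024-05-10 is late
print(missing_partitions(have, end=date(2024, 5, 10)))  # ['ds=2024-05-10']
```

An empty return value means the publish stage is safe to run; anything else names exactly what to backfill.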
Templates you can copy
Minimal Pipeline Runbook Template
Title: [Pipeline Name] Runbook
Last updated: [YYYY-MM-DD]
Scope: [Env, Scheduler, DAG/Job IDs]
Triggers/Symptoms:
- [Alert name or error text]
- [Metric threshold]
Business Impact:
- [Affected tables/dashboards]
- [SLA and severity]
Immediate Safe Actions (do first):
- [Pause downstream / stop retries / mute noisy alerts]
Diagnosis:
1) Check logs at [task/job]
2) Verify inputs [path/object/row counts]
3) Validate credentials/connectivity
Remediation:
- If [condition A]: do [steps]
- If [condition B]: do [steps]
Rollback/Undo (if any):
- [How to revert changes]
Validation Checklist:
- [Task green / table row count / freshness OK]
Communication Template:
"Incident: [pipeline] [status]. Impact: [what]. ETA: [time]. Owner: [name]."
Escalation:
- After [X mins] or [criteria], page [team/person]
Prevention/Follow-up:
- [Alert improvements, guardrails]
Access Required:
- [Secret manager, scheduler UI, warehouse role]
Ops Note (concise incident message) Template
[Time] [Pipeline/Job]: [Issue summary]
Impact: [Data/products affected]
Cause (suspected): [Cause]
Action: [What you did]
ETA/Next update: [Time]
Owner: [Name or rotation]
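A minimal sketch of filling the Ops Note template programmatically, so incident messages stay consistent across responders. The field names mirror the template; the constant and values are illustrative:

```python
# Fill the Ops Note template from named fields so every incident
# message has the same shape.

OPS_NOTE = (
    "[{time}] {pipeline}: {summary}\n"
    "Impact: {impact}\n"
    "Cause (suspected): {cause}\n"
    "Action: {action}\n"
    "ETA/Next update: {eta}\n"
    "Owner: {owner}"
)

note = OPS_NOTE.format(
    time="02:14", pipeline="sales_daily", summary="task failed: S3 key not found",
    impact="sales dashboards stale", cause="late upstream file drop",
    action="downstream paused; rerun queued", eta="02:45", owner="on-call data-eng",
)
print(note)
```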
How to write a runbook in 10 minutes
- Write the symptom and the one-sentence impact.
- List 3 immediate safe actions anyone can do.
- Add the top 3 diagnosis checks (inputs, credentials, logs).
- Define the two most likely remediation paths with bullet steps.
- Add a 3-point validation checklist.
- Paste the Ops Note template for comms.
- State escalation criteria (time or severity).
Common mistakes and how to self-check
- Too long, not actionable: Keep steps short; use bullets, not paragraphs.
- Missing validation: Always specify how to confirm success.
- No safe first actions: Stabilize before deep diagnosis.
- Assumes expert knowledge: Include exact paths, task names, and where to click.
- Forgets communication: Provide a ready-to-send message.
- No escalation rule: Define a time or symptom threshold to escalate.
Who this is for
- ETL Developers and Data Engineers operating scheduled pipelines.
- On-call responders for data platforms.
- Analyst engineers who publish data products on SLAs.
Prerequisites
- Basic understanding of your scheduler (e.g., DAGs, tasks, retries).
- Access to logs, storage, and warehouse environments.
- Familiarity with your team’s incident channel and escalation policy.
Learning path
- Start: Draft minimal runbooks for your top 3 pipelines.
- Next: Add validation metrics and partition checks.
- Then: Introduce backfill strategies and safe rollback steps.
- Finally: Standardize comms and escalation across pipelines.
Practical projects
- Create a runbook pack: three pipelines, each with diagnostics, remediation, and comms templates.
- Drill: Simulate a failure and run the playbook end-to-end with a teammate.
- Guardrails: Add a pre-publish validation task based on your runbook checks.
Exercises
- Exercise 1: Draft a minimal runbook for a failed daily customer_orders pipeline due to a missing upstream file. Include: Symptoms, Impact, Immediate Actions, Diagnosis, Remediation, Validation, Communication, Escalation.
- Exercise 2: Convert a verbose incident blurb into a concise Ops Note using the template provided above.
- Exercise 3: Write a short decision tree (pseudocode or bullets) for choosing between rerun, backfill, or skip for a late partition.
Next steps
- Add your runbooks to your team wiki or repo. Mark owners and last updated dates.
- Run a 30-minute tabletop incident review monthly to keep them current.
- Automate one validation check directly in your pipeline.
Mini challenge
Pick one critical DAG. In 15 minutes, write the minimal runbook using the template. Ask a teammate outside your team to follow it. If they get stuck, improve the step they stumbled on.