Who this is for
- ETL Developers who deploy and support scheduled data pipelines.
- Data/Analytics Engineers preparing handover to operations or on-call.
- DataOps/Platform engineers standardizing incident response.
Prerequisites
- Basic understanding of your scheduler (e.g., DAGs, jobs, retries, SLAs).
- Comfort with SQL and bash/CLI for checks and fixes.
- Awareness of your data sources, targets, and access permissions.
Why this matters
In production, things break at 2 a.m. A clear runbook turns panic into a process. It shortens downtime, reduces data defects, and makes handovers safe.
- On-call can triage and fix common failures without guessing.
- New team members can perform safe start/stop, backfills, and rollbacks.
- Stakeholders get consistent communication and timelines.
Concept explained simply
An operational runbook is a step-by-step guide for running, monitoring, and recovering ETL pipelines. It covers what to watch, what can go wrong, how to respond, and when to escalate.
Mental model
Think of a runbook as three layers:
- Detect: How you notice an issue (alerts, dashboards, checks).
- Decide: The triage path (Is this transient? data issue? infra?).
- Do: The exact commands and steps (retry, backfill, rollback, notify).
Core sections your runbook should include
- Overview: Purpose, owners, data flow diagram (brief description).
- Schedules & SLAs: When it runs, expected duration, deadlines.
- Dependencies: Upstream/downstream jobs, external systems.
- Monitoring & Alerts: What alerts fire; where to check health.
- Triage Decision Tree: If X then Y steps, with stop rules.
- Standard Operations: Start/stop, manual trigger, safe retry (example commands after this list).
- Recovery: Backfill, reprocess, rollback, data repair steps.
- Data Quality: Critical checks and what to do on failures.
- Access & Secrets: How to get access; what not to do.
- Change & Release: How changes are deployed; rollback plan.
- Communication: Who to notify, templates for updates.
- Escalation Matrix: Contacts and time-based thresholds.
- Audit Trail: Where to record incidents and resolutions.
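To make the Standard Operations entry concrete, here is a minimal sketch of the copy-pasteable commands that section should carry, assuming an Airflow 2.x-style scheduler; `<dag_id>`, `<task_id>`, and the dates are placeholders for your own names:

```bash
# Illustrative standard-ops commands, assuming an Airflow-style scheduler.
# <dag_id>, <task_id>, and dates are placeholders; adapt to your tool.

# Stop: pause scheduling so no new runs start
airflow dags pause <dag_id>

# Start: resume scheduling
airflow dags unpause <dag_id>

# Manual trigger for a specific logical date
airflow dags trigger <dag_id> --exec-date 2026-01-10

# Safe retry: clear only the failed task for one date so the scheduler re-runs it
airflow tasks clear <dag_id> --task-regex <task_id> \
  --start-date 2026-01-10 --end-date 2026-01-10 --only-failed
```

Whatever your orchestrator, the runbook should carry the exact equivalents, not descriptions of them.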
Worked examples
Example 1 — Source API rate limiting (HTTP 429)
Symptom: Extract task fails with 429; retries keep failing.
- Detect: Alert "extract_task failed 3 times with 429".
- Decide: Transient spike or a policy change? Check the API status page (if available internally) and the job logs for rate-limit headers.
- Do (see the command sketch below):
  - Pause job retries for 20 minutes (avoid a thundering herd).
  - Lower this job's concurrency to 1; increase backoff to 10 minutes.
  - Trigger one retry. On success, resume the schedule.
  - If still failing after 2 hours, escalate to the integration owner.
- Data repair: If only part of the day was extracted, run a targeted backfill for the missed window only.
- Comms: "Source rate-limited requests; applied backoff; ETA +30m. No downstream impact expected."
Example 2 — Late-arriving partition
Symptom: Daily partition for dt=2026-01-10 missing by 03:00 SLA.
- Detect: DQ check "expected partition exists" failed.
- Decide: Upstream delay, or our load issue? Check the upstream job status and the landing bucket for the partition.
- Do (see the sketch below):
  - If upstream is late: set downstream to wait, extend the SLA by 2 hours, and notify analytics consumers.
  - If the data is present but the load failed: rerun the loader task for that partition only.
- Data repair: Verify row counts against the control table; run the reconciliation query.
- Comms: "Partition delay; new ETA 05:00. Dashboards for date 2026-01-10 may be stale."
Example 3 — Schema change added a nullable column
Symptom: Transform task fails with an "unknown column" error in the select list.
- Detect: Alert from the transform job and the schema drift check.
- Decide: Is the change backward compatible? If the column is new and nullable, a quick fix may be safe.
- Do (a safe-mapping sketch follows this example):
  - Hotfix the mapping to include the column with a default, or ignore it safely.
  - Re-run the transform for the failed partition.
- Rollback: If the hotfix fails, revert the mapping to the last known good version and re-run.
- Follow-up: Create a change request to formalize the schema update and its tests.
- Comms: "Upstream added a column; applied safe mapping; data complete."
Example 4 — Orchestrator outage during run
Symptom: Scheduler UI unreachable; jobs mid-flight.
- Detect: Heartbeat alert; no logs updating.
- Decide: Treat it as an infrastructure incident. Hold off on manual intervention unless there is a risk of data corruption.
- Do (recovery commands sketched below):
  - Confirm with the platform team. Avoid double-triggering tasks.
  - Once service is restored, mark in-flight tasks as failed and re-run from the last safe checkpoint per this runbook.
- Data repair: Backfill the missed windows.
- Comms: "Platform incident; will backfill on recovery; ETA depends on the platform."
How to build a solid runbook
Copy-paste runbook template (short)
Overview
- Name:
- Purpose:
- Owners (primary, backup):
- Schedule & SLA:
- Data flow (short text):
Monitoring & Alerts
- Dashboards:
- Alerts (name -> trigger -> action):
Triage Decision Tree (top 5)
- If [extract 429] -> pause retries 20m -> lower concurrency -> retry -> escalate after 2h.
- If [missing partition] -> check upstream -> wait or rerun loader partition-only.
- If [schema drift] -> apply safe mapping -> re-run -> rollback if fails.
Standard Ops
- Start/Stop:
- Manual trigger:
- Safe retry policy:
Recovery
- Backfill command examples:
- Reprocess scope:
- Rollback steps:
Data Quality
- Critical checks:
- What to do on failure:
Access & Escalation
- How to get access:
- P0/P1/P2 thresholds and contacts:
Communication Templates
- Incident start:
- Update cadence:
- Resolution summary:
Audit
- Where to log incident notes:
Common mistakes and self-check
- Too much theory, not enough commands. Fix: Add exact CLI/UI paths and example parameters.
- Missing rollback. Fix: Include revert steps for configs and data.
- Stale contacts. Fix: Add monthly contact check reminder.
- Ambiguous scope. Fix: Define boundaries (what this runbook covers vs not).
- Unverified steps. Fix: Run a quarterly tabletop and update notes.
- Tool-only focus. Fix: Include data checks, not just orchestrator steps.
Self-check checklist
- Can a new on-call person restore a failed run without help?
- Are commands copy-pasteable with placeholders clearly marked?
- Is there a decision tree for top 5 incidents?
- Does it include comms templates and escalation timers?
- Have you tested backfill and rollback recently?
Practical projects
- Project 1: Create a runbook for one production DAG. Include at least three failure modes and a backfill guide.
- Project 2: Tabletop drill. Simulate a P1 incident; time each step; note gaps; update your runbook.
- Project 3: Add a partition-level backfill script with guardrails (date-range limits, dry-run); a sketch follows this list.
- Project 4: Build a minimal data quality section (row counts, partition existence, critical metric thresholds).
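For Project 3, a minimal sketch of a guardrailed backfill wrapper; the script name, the MAX_DAYS limit, and the underlying Airflow command are all assumptions to adapt:

```bash
#!/usr/bin/env bash
# backfill.sh: hypothetical guardrailed backfill wrapper (dry-run by default).
# Usage: ./backfill.sh <dag_id> <start YYYY-MM-DD> <end YYYY-MM-DD> [--run]
set -euo pipefail

DAG_ID=$1; START=$2; END=$3; MODE=${4:---dry-run}
MAX_DAYS=7   # guardrail: refuse suspiciously wide windows

# GNU date; on macOS use gdate or adapt
days=$(( ( $(date -d "$END" +%s) - $(date -d "$START" +%s) ) / 86400 ))
if (( days < 0 || days > MAX_DAYS )); then
  echo "Refusing: window is ${days} day(s); limit is ${MAX_DAYS}." >&2
  exit 1
fi

echo "Backfill plan: ${DAG_ID} from ${START} to ${END} ($((days + 1)) day(s))."
if [[ "$MODE" != "--run" ]]; then
  echo "Dry run only. Re-run with --run to execute."
  exit 0
fi

# Airflow-style backfill; swap in your scheduler's equivalent
airflow dags backfill "$DAG_ID" -s "$START" -e "$END"
```

The dry-run default and the hard window limit are the guardrails: a tired on-call engineer has to opt in explicitly before anything executes.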
Learning path
- Start: Understand pipeline components and SLAs.
- Then: Draft runbook using the template.
- Next: Add DQ checks and recovery flows.
- Finally: Tabletop test, refine, and handover to on-call.
Exercises
Exercise 1 — First-hour action plan
Scenario: Your daily sales DAG failed mid-run. Write the first-hour plan in bullet points.
- Include: detection, triage steps, safe retry, comms, and escalation threshold.
Hints:
- Start with what you check in the UI/logs.
- Define how many retries, and how much spacing between them, to allow before escalating.
Exercise 2 — Alert-to-action mapping
Create a mapping for three alerts to concrete actions.
- Alerts: Job SLA breach, Missing partition, Row-count anomaly.
- For each, specify: source of truth to check, action, and when to escalate.
Hints:
- Keep actions copy-pasteable.
- Set explicit timers (e.g., escalate after 2 failed retries).
Exercise checklist
- I stated how the issue is detected.
- I listed 3–5 triage steps in order.
- I provided a safe retry or backfill command.
- I wrote who to notify and when to escalate.
- I kept steps short and actionable.
Next steps
- Publish your runbook where on-call can find it.
- Schedule a 30-minute tabletop drill with a teammate.
- Add a quarterly reminder to validate contacts, commands, and SLAs.
Mini challenge
Pick one real pipeline. In 25 minutes, fill the template: Overview, top 3 incidents, one backfill example, and an incident start message. Keep it to one page.