Who this is for
Analytics Engineers, BI Developers, and Data Ops folks who maintain pipelines, models, and dashboards and want fast, consistent recovery from failures and safe backfills.
Prerequisites
- Basic understanding of scheduled pipelines and data warehouses.
- Comfort with SQL and data model dependencies.
- Familiarity with job logs, alerts, and dataset partitioning.
Why this matters
Real tasks you will face:
- A daily job fails at 6:00 AM and a stakeholder needs accurate numbers by 9:00 AM.
- An upstream schema change breaks downstream models; you must triage, patch, and backfill.
- Late-arriving data leaves gaps; you must reprocess only the missing partitions safely.
Runbooks reduce stress, shorten downtime, and prevent repeated mistakes by giving clear steps anyone on the team can follow.
Concept explained simply
A runbook is a short, actionable guide that tells you exactly what to do when something breaks or when you need to reprocess historical data. It covers detection, triage, fixes, validation, communication, and prevention.
Mental model
Think of a runbook as a rescue checklist for a specific failure type or a controlled replay plan for backfills. It should be easy to find, quick to scan, and safe to execute.
Core components of a failure/backfill runbook
- Symptoms and scope: What broke and which data is affected.
- Impact and urgency: Which reports/users are impacted; any deadlines.
- Owner and escalation: On-call person, when to escalate, who approves risky steps.
- Pre-checks: Verify alerts, last successful run, upstream freshness, warehouse health.
- Fix steps: Recreate, rerun, or patch with clear commands or UI actions.
- Backfill steps: Window to reprocess, partition strategy, idempotency approach.
- Validation: Row counts, sample queries, data tests, reconciliation against sources.
- Communication: Short, copy-paste status messages before/after the fix.
- Prevention: Root cause notes and follow-up tasks.
- Timebox: Expected time to recovery and checkpoints.
Minimal runbook template
Title: [Failure or Backfill Name]
Last updated: [YYYY-MM-DD]
Owner: [Name or team]
1) Trigger
- Symptoms:
- Affected tables/models/dashboards:
- Earliest affected date/partition:
2) Impact & urgency
- Who is impacted:
- SLA or deadline:
3) Pre-checks (run before any fix)
- Confirm alert and error message
- Check upstream freshness
- Check last successful run ID/timestamp
4) Fix steps
- Step 1:
- Step 2:
- Step 3:
5) Backfill plan (if data gap exists)
- Window: [start to end]
- Partitions: [date or id range]
- Idempotency method: [truncate/merge/insert overwrite]
6) Validation
- Data tests to run:
- Row count deltas:
- Spot-check queries:
7) Communication
- Initial notice:
- Recovery notice:
8) Prevention
- Root cause:
- Follow-up actions:
Worked examples
Example 1: Daily transformation failed due to upstream missing partition
- Symptoms: 2025-05-10 partition missing in upstream; downstream model failed at 06:10.
- Impact: Sales dashboard stale for today; finance daily meeting at 09:00.
- Pre-checks: Confirm upstream load delay; last successful upstream run was 2025-05-09.
- Fix steps: Trigger upstream catch-up for 2025-05-10; verify partition appears; rerun the failed transformation.
- Backfill: Not needed if the catch-up run fills the gap; otherwise reprocess only the 2025-05-10 partition.
- Validation: Compare row counts vs typical day ±10%; verify key KPIs for 2025-05-10.
- Communication: Initial note (incident acknowledged, ETA 30–45 min). Recovery note once KPIs validated.
- Prevention: Add freshness alert threshold for upstream delay, document dependency.
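The "row counts vs typical day ±10%" check above can be sketched in Python. The 10% threshold and the choice of baseline (for example, the same weekday last week) are illustrative assumptions, not fixed rules:

```python
def within_tolerance(actual: int, baseline: int, pct: float = 0.10) -> bool:
    """Return True if the reloaded partition's row count is within +/- pct of the baseline."""
    if baseline == 0:
        return actual == 0
    return abs(actual - baseline) / baseline <= pct

# Compare the reloaded 2025-05-10 partition against a typical day's count.
assert within_tolerance(actual=10_450, baseline=10_000)      # +4.5% -> passes
assert not within_tolerance(actual=12_000, baseline=10_000)  # +20%  -> fails
```

Wire a check like this into the rerun so validation happens before the recovery notice goes out.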
Example 2: Schema change broke downstream models
- Symptoms: Upstream renamed column user_id to customer_id; downstream models failing with missing column error.
- Impact: All user-facing metrics for acquisition are stale since change time.
- Pre-checks: Confirm schema diff; check when the change landed.
- Fix steps: Update downstream models to reference customer_id; provide compatibility mapping if both columns exist temporarily.
- Backfill: Rebuild affected models from change timestamp to present.
- Validation: Run data tests for non-null ids, referential integrity; reconcile totals vs source.
- Communication: Notify stakeholders about schema rename and validation completion.
- Prevention: Propose contract checks or backward-compatible deprecation window.
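The temporary compatibility mapping can be sketched by detecting which id column the upstream currently ships and aliasing it to a stable name. SQLite stands in for the warehouse here; the `users_raw` table and the `id_column` helper are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_raw (customer_id TEXT, email TEXT)")  # upstream after the rename
conn.execute("INSERT INTO users_raw VALUES ('c1', 'a@x.com')")

def id_column(conn, table: str) -> str:
    """Pick whichever id column the upstream currently ships (hypothetical helper)."""
    cols = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for candidate in ("customer_id", "user_id"):
        if candidate in cols:
            return candidate
    raise ValueError("no known id column")

# Downstream model aliases the surviving column to one stable name.
col = id_column(conn, "users_raw")
rows = conn.execute(f"SELECT {col} AS customer_id, email FROM users_raw").fetchall()
```

Once the deprecation window closes and only `customer_id` remains, the mapping can be deleted.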
Example 3: Late-arriving events causing gaps
- Symptoms: Event table receives late data up to 3 days after event_time; daily aggregates undercount yesterday.
- Impact: Marketing spend optimization uses undercounted conversions.
- Pre-checks: Measure typical lateness distribution; confirm yesterday’s gap.
- Fix steps: None immediately; plan sliding backfill policy (rebuild last 3 days every morning).
- Backfill: Recompute aggregates for D-3..D-1 each day using deterministic merge.
- Validation: Check cumulative conversions vs last week trend; spot-check event_id distinct counts.
- Communication: Inform stakeholders of sliding window refresh policy and expected stability.
- Prevention: Document lateness SLA; adjust dashboards to display data confidence badges.
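The sliding D-3..D-1 rebuild window can be computed deterministically each morning. This sketch assumes a daily run date and a three-day lateness window:

```python
from datetime import date, timedelta

def sliding_backfill_window(run_date: date, lookback_days: int = 3) -> list:
    """Partitions to rebuild each morning: D-lookback .. D-1 relative to run_date."""
    return [run_date - timedelta(days=d) for d in range(lookback_days, 0, -1)]

# On the 2025-05-10 run, rebuild May 7, 8, and 9, oldest first, with an idempotent merge.
window = sliding_backfill_window(date(2025, 5, 10))
```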
Example 4: Erroneous duplicate loads
- Symptoms: Duplicate rows due to retry without idempotency.
- Impact: Metrics doubled for a 2-hour window.
- Pre-checks: Identify affected time window and tables; confirm duplicates by unique key frequency.
- Fix steps: Apply de-duplication logic (ROW_NUMBER over key and timestamp); replace affected partitions.
- Backfill: Overwrite only impacted partitions to restore correctness.
- Validation: Unique key check; compare to source counts.
- Communication: Brief incident report with corrected window.
- Prevention: Implement insert-overwrite or merge-on-key with retries.
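The keep-latest-row-per-key de-duplication can be sketched with a window function. SQLite (3.25+) stands in for the warehouse, and the `events` table and ordering rule are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, loaded_at TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("e1", "2025-05-10 06:00", 10),
     ("e1", "2025-05-10 08:00", 10),   # duplicate from the retried load
     ("e2", "2025-05-10 06:00", 20)],
)

# Keep one row per key: rank duplicates by load time, keep rank 1.
dedup_sql = """
SELECT event_id, amount FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY event_id ORDER BY loaded_at DESC
  ) AS rn
  FROM events
) WHERE rn = 1
"""
rows = conn.execute(dedup_sql).fetchall()
```

Write the deduplicated result back with an insert-overwrite on the affected partitions only.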
Backfills: principles and patterns
- Smallest safe scope: Reprocess only affected partitions or dates.
- Idempotency first: Use merge-on-key, insert overwrite, or truncate-and-rebuild per partition to make reruns safe.
- Determinism: Calculations should produce the same results for the same inputs.
- Order of operations: Upstream to downstream to avoid partial views.
- Concurrency control: Pause dependent jobs while backfill runs to prevent clashes.
- Resource limits: Batch backfills to avoid warehouse overload; monitor cost and queues.
- Audit trail: Record what was backfilled, who approved, and validation outcomes.
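The "idempotency first" principle can be illustrated with truncate-and-rebuild per partition: delete the partition and reinsert it in one transaction, so running the backfill twice leaves the data unchanged. SQLite stands in for the warehouse, and `fact_sales` is a hypothetical table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (date TEXT, amount INTEGER)")

def overwrite_partition(conn, day: str, rows: list) -> None:
    """Truncate-and-rebuild one date partition so reruns are safe (idempotent)."""
    with conn:  # single transaction: delete and insert commit together
        conn.execute("DELETE FROM fact_sales WHERE date = ?", (day,))
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)

# Running the same backfill twice leaves exactly one copy of the partition.
overwrite_partition(conn, "2025-05-10", [("2025-05-10", 100), ("2025-05-10", 50)])
overwrite_partition(conn, "2025-05-10", [("2025-05-10", 100), ("2025-05-10", 50)])
total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]  # 150, not 300
```

A plain append (INSERT without the DELETE) would double the partition on retry, which is exactly the failure in Example 4.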
Validation and quality gates
- Row counts and deltas vs typical day or source system.
- Uniqueness and not-null tests on keys.
- Freshness checks for critical datasets.
- Reconciliation queries for key KPIs.
- Canary checks: rebuild a small sample before full backfill.
Copy-paste validation snippet ideas
-- Row count delta by day
SELECT date, COUNT(*) AS rows
FROM my_table
WHERE date BETWEEN '2025-05-01' AND '2025-05-10'
GROUP BY date
ORDER BY date;

-- Uniqueness check
SELECT key, COUNT(*) AS c
FROM my_table
WHERE date = '2025-05-10'
GROUP BY key
HAVING COUNT(*) > 1;

-- KPI reconciliation
SELECT SUM(amount) AS kpi
FROM fact_sales
WHERE date = '2025-05-10';
Communication templates
Initial incident notice
Subject: Data incident acknowledged – [Dataset/Model]
We detected a failure affecting [scope]. We are investigating. ETA for next update: [time]. Owner: [name].
Recovery notice
Subject: Incident resolved – [Dataset/Model]
Fix applied and validated (tests passed: [list]). Data is current through [timestamp]. If you see anomalies, reply to this thread.
Planned backfill notice
Subject: Planned backfill – [Dataset/Model]
We will reprocess [window] to correct gaps. Expect intermittent freshness during [time range]. Outcome and validation summary will follow.
Step-by-step: create your first runbook
- Pick a common incident: e.g., daily job fails due to upstream delay.
- Fill the template: symptoms, impact, owner, pre-checks, fix steps, backfill plan, validation, communication, prevention.
- Dry run: simulate the scenario in a sandbox or using a recent day.
- Peer review: ask a teammate to follow the runbook without help; refine unclear steps.
- Publish and tag: store with clear naming and last-updated date.
- Drill quarterly: practice and update based on new tools or changes.
Exercises
Do these to make the skill stick. You can compare with the solutions below.
Exercise 1: Draft a failure runbook
Scenario: The daily customer dimension failed because the upstream source delivered a new column and delayed the latest partition.
- Define symptoms, scope, and impact.
- Write pre-checks and fix steps.
- Add a backfill window if needed.
- List validation checks and two communication messages.
Hints
- Scope to the exact date partition.
- Backfill only if a gap remains after the fix.
Solution
- Symptoms: customer_dim failed at 06:05; upstream missing 2025-06-01; new column added.
- Impact: Dashboards using customer_dim stale for 2025-06-01; CS team needs it by 09:00.
- Pre-checks: Confirm missing partition; confirm column rename or addition; verify last successful run.
- Fix steps: Trigger upstream catch-up for 2025-06-01; update the model to select the new column or map the old name; rerun customer_dim.
- Backfill: Rebuild only 2025-06-01.
- Validation: Row count vs D-7; not-null on customer_id; sample 20 records to confirm the new column is populated.
- Communication: Initial notice with ETA; recovery notice after tests pass.
- Prevention: Add a schema compatibility check; alert on missing partitions.
Exercise 2: Plan a 30-day backfill safely
Scenario: A bug undercounted revenue for the last 30 days. You fixed the logic. Plan a safe backfill.
- Choose idempotency strategy.
- Define partition batches and order.
- List validation after each batch.
- Write the stakeholder update template.
Hints
- Prefer merge-on-key or insert-overwrite per day.
- Batch to avoid warehouse overload.
Solution
- Idempotency: Insert-overwrite per day for fact_revenue partitions.
- Batches: 5-day chunks from oldest to newest (D-30 to D-1).
- Order: Pause downstream aggregates; process upstream, then downstream.
- Validation: Row count delta within expected range; SUM(amount) vs source extract; uniqueness on order_id.
- Communication: Start notice with schedule; after each batch, post progress and any anomalies; final recovery summary.
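The 5-day batching in this solution can be sketched as a chunk generator; the chunk size and 30-day range are the exercise's assumptions:

```python
from datetime import date, timedelta

def date_chunks(start: date, end: date, size: int = 5):
    """Yield (chunk_start, chunk_end) date pairs, oldest to newest, inclusive."""
    cur = start
    while cur <= end:
        chunk_end = min(cur + timedelta(days=size - 1), end)
        yield (cur, chunk_end)
        cur = chunk_end + timedelta(days=1)

# A 30-day window splits into six 5-day batches; run validation after each one.
chunks = list(date_chunks(date(2025, 5, 1), date(2025, 5, 30)))
```

Processing oldest to newest keeps downstream reads consistent once dependent jobs are resumed.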
Self-check checklist
- [ ] Did you specify the smallest safe backfill window?
- [ ] Are steps ordered upstream to downstream?
- [ ] Is the rerun idempotent and repeatable?
- [ ] Do you have at least two validation checks?
- [ ] Did you include clear stakeholder messages?
Common mistakes and how to self-check
- Backfilling too much: Self-check: Can you restrict to partitions or dates? If yes, do it.
- Skipping validation: Self-check: List specific tests before you touch data.
- Breaking idempotency: Self-check: Can you rerun without duplicates or drift?
- Forgetting dependencies: Self-check: Is the upstream fresh and downstream paused as needed?
- Silence during incidents: Self-check: Have you drafted initial and recovery notes?
Practical projects
- Incident library: Create three runbooks for your most common failures (missing partitions, schema changes, late data).
- Backfill playbooks: Write two templates (short window, long window with batching) and test them on sample data.
- Validation pack: Build reusable SQL snippets for counts, uniqueness, and KPI reconciliation.
Learning path
- Learn typical failure types and signals (alerts, logs, tests).
- Draft the minimal runbook template for your stack.
- Create one failure runbook and one backfill runbook.
- Do a dry run in a sandbox.
- Peer review and improve wording and order of steps.
- Schedule quarterly drills; update templates with lessons learned.
Mini challenge
Pick a real job in your environment. Assume yesterday’s partition failed. In 10 minutes, write the pre-checks and fix steps only. Keep it to 7 bullet points. Share with a teammate and ask if they could execute it without your help.
Next steps
- Convert your most recent incident into a documented runbook.
- Add validation queries as copy-paste blocks inside the runbook.
- Schedule a short drill to practice one runbook this week.
Ready to test yourself?
Take the quick test below to check your understanding.