
Backfills And Reruns

Learn Backfills And Reruns for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Backfills and Reruns: What They Are and Why They Matter

Backfills and reruns are how data teams fix gaps, correct failures, and reprocess history safely. Done well, they keep reports, ML features, and downstream systems accurate without breaking SLAs or duplicating data.

Why this matters in real work

  • Recover missing partitions after outages or late upstream deliveries.
  • Rebuild metrics when a bug fix or schema change requires historical recompute.
  • Rerun a failed DAG task (and dependents) without duplicating side effects like emails or loads.
  • Keep SLAs by prioritizing the most impactful backfill windows first.

Concept explained simply

Think of your data pipeline as a daily newspaper printer. A rerun is reprinting today’s paper because the first copy smudged. A backfill is reprinting last week’s papers you never printed due to a power outage. Both require making sure subscribers don’t get two copies or the wrong copy.

Mental model

  • Partitions: Slices of data (e.g., by date) you can process independently.
  • State: What your system believes happened (e.g., latest watermark).
  • Idempotency: Running the same job again should not change the correct final result (a short sketch follows this list).
  • Lineage: Know which upstream inputs created which outputs so you can rebuild safely.
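
To make the idempotency bullet concrete, here is a minimal sketch in Python, assuming a plain dictionary as a stand-in for a warehouse table keyed by (business_id, dt); running the same write twice leaves the final state unchanged.

# Minimal idempotency sketch: an upsert keyed by (business_id, dt).
# The dict stands in for a real table; the point is that reruns overwrite
# rather than blindly insert, so the final state is the same.
table = {}

def upsert(rows):
    for row in rows:
        key = (row["business_id"], row["dt"])
        table[key] = row  # overwrite, never append

batch = [
    {"business_id": "A-1", "dt": "2026-01-03", "amount": 120.0},
    {"business_id": "A-2", "dt": "2026-01-03", "amount": 75.5},
]

upsert(batch)
upsert(batch)  # rerun: no duplicates, identical final state
assert len(table) == 2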

Key concepts and terms

  • Backfill: Running jobs for a historical range (e.g., 2023-12-01 to 2023-12-31) to populate or correct data.
  • Rerun: Running a specific failed run or task again, typically for a single schedule instance.
  • Watermark: The furthest point in time processed (high watermark) or the earliest unprocessed event (low watermark); a small sketch follows this list.
  • Deduplication: Guardrail to avoid double inserts/side-effects (e.g., merge/upsert, constraint keys).
  • Recompute scope: Which layers to rebuild (raw → staged → warehouse → feature store → downstream reports).
  • Direction: Backfill forward (oldest to latest for dependency warm-up) or backward (latest to oldest for quick business impact).
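
As a rough illustration of the watermark idea (the function and field names here are assumptions, not any specific tool's API), a job can persist the latest fully processed date and derive what still needs to run:

from datetime import date, timedelta

def pending_partitions(high_watermark: date, today: date) -> list[date]:
    """Date partitions after the high watermark, up to but not including today."""
    days = (today - high_watermark).days - 1
    return [high_watermark + timedelta(days=i + 1) for i in range(max(days, 0))]

# Watermark says 2026-01-04 is fully processed and today is 2026-01-08,
# so 2026-01-05 through 2026-01-07 still need processing.
print(pending_partitions(date(2026, 1, 4), date(2026, 1, 8)))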

Worked examples

Example 1 — Fill 7 missing daily partitions after API outage
  1. Identify missing dates: 2026-01-01 to 2026-01-07.
  2. Validate upstream availability (API backfill endpoint or archived snapshots).
  3. Plan direction: oldest → newest so rollups have complete history before recent aggregations.
  4. Run extract for each date writing to a dated raw partition (e.g., raw/events/dt=YYYY-MM-DD).
  5. Transform with idempotent writes (merge/upsert on natural key + dt).
  6. Rebuild aggregates only for those dates to limit compute cost.
  7. Verify counts and key invariants vs. source for each date; compare to adjacent days for anomaly detection.

Success signal: partition counts match, no duplicate keys, dashboards reflect corrections for those dates only.
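
One possible shape for the Example 1 loop, oldest to newest; extract_day, merge_partition, and validate_partition are hypothetical placeholders for your own extract, idempotent write, and per-date checks.

from datetime import date, timedelta

def extract_day(dt):            # hypothetical: pull one day from the backfill endpoint or snapshot
    return [{"event_id": f"{dt}-1", "dt": str(dt), "value": 1}]

def merge_partition(dt, rows):  # hypothetical: merge/upsert on (event_id, dt) into raw/events/dt=...
    print(f"merged {len(rows)} rows into raw/events/dt={dt}")

def validate_partition(dt):     # hypothetical: counts, key uniqueness, comparison to adjacent days
    return True

missing = [date(2026, 1, 1) + timedelta(days=i) for i in range(7)]

for dt in missing:  # oldest -> newest so rollups see complete history
    merge_partition(dt, extract_day(dt))
    if not validate_partition(dt):
        raise RuntimeError(f"validation failed for {dt}; stop before later dates are touched")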

Example 2 — Rerun a failed daily DAG without resending emails
  1. Root cause: the transform step failed because temporary storage was full, so the notification step never ran.
  2. Fix storage issue; mark only the failed transform task for rerun.
  3. Ensure the notification task checks an idempotency key (report_date + report_type); if the key is already used, skip the send (a sketch of this guard follows the example).
  4. Trigger rerun of the failed task; let downstream tasks follow.
  5. Verify: exactly one email for that date; audit table shows single idempotency key usage.
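
One way to implement the idempotency-key guard from step 3 is a small audit table with a primary key on (report_date, report_type); this sketch uses an in-memory SQLite database purely as a stand-in for that audit store.

import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in; in practice a warehouse or metadata DB
conn.execute("CREATE TABLE sent_reports (report_date TEXT, report_type TEXT, "
             "PRIMARY KEY (report_date, report_type))")

def send_report_once(report_date: str, report_type: str) -> None:
    """Send the email only if this (report_date, report_type) key is unused."""
    try:
        conn.execute("INSERT INTO sent_reports VALUES (?, ?)", (report_date, report_type))
        conn.commit()
        print(f"sending {report_type} for {report_date}")  # the real send goes here
    except sqlite3.IntegrityError:
        print(f"skip: {report_type} for {report_date} already sent")

send_report_once("2026-01-07", "daily_revenue")  # first run: sends
send_report_once("2026-01-07", "daily_revenue")  # rerun: skipped, exactly one email

Because the key is a primary key, the rerun's duplicate insert fails and the send is skipped, which is what makes the rerun safe.
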
Example 3 — Schema change requires recomputing 90 days
  1. Change: new price normalization logic affects revenue metrics.
  2. Scope: 90 days of staged and marts layers; raw unchanged.
  3. Strategy: write recomputed outputs to side-by-side v2 tables or versioned partitions to avoid breaking readers.
  4. Rollout: backfill oldest → newest in batches of 5 days to control load; run data quality checks per batch.
  5. Cutover: point BI to v2 tables when 100% complete and validated; keep v1 for rollback for 1–2 weeks.
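
A sketch of the Example 3 rollout: oldest to newest, five days per batch, writing to side-by-side v2 tables; recompute_to_v2 and run_quality_checks are hypothetical placeholders for your transform and data-quality logic.

from datetime import date, timedelta

def recompute_to_v2(batch_dates):     # hypothetical: rebuild staged + marts into *_v2 tables
    print(f"recomputed {batch_dates[0]}..{batch_dates[-1]} into revenue_v2")

def run_quality_checks(batch_dates):  # hypothetical: per-batch counts and totals vs v1 and source
    return True

start = date(2025, 10, 10)            # illustrative start of the 90-day window
all_dates = [start + timedelta(days=i) for i in range(90)]

for i in range(0, len(all_dates), 5):  # oldest -> newest, 5 days at a time
    batch = all_dates[i:i + 5]
    recompute_to_v2(batch)
    if not run_quality_checks(batch):
        raise RuntimeError(f"quality check failed for batch starting {batch[0]}")
# Cut BI over to the v2 tables only after every batch passes; keep v1 for rollback.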

Step-by-step: plan and execute a safe backfill

  1. Define goal and scope: dates, datasets, layers, and success criteria.
  2. Choose direction: oldest→newest for dependency warm-up; newest→oldest for immediate business impact.
  3. Freeze inputs: pin source snapshots or queries by date to ensure reproducibility.
  4. Make writes idempotent: use merge/upsert with deterministic keys; avoid blind inserts.
  5. Batch safely: chunk by day/week; monitor resource usage and SLAs.
  6. Validate: counts, null rates, totals, primary-key uniqueness, and sample business checks (see the validation sketch after this list).
  7. Communicate: announce windows, potential delays, and completion status to stakeholders.
  8. Document: record commands, ranges, checks, and outcomes for future audits.
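
A small sketch of the technical checks behind step 6; the input counts are assumed to come from your own warehouse queries (a COUNT(*) per date, a GROUP BY key HAVING COUNT(*) > 1 for duplicates, a null check on the business column), and the names are illustrative.

def validate_partition(dt: str, loaded_count: int, source_count: int,
                       duplicate_keys: int, null_revenue: int,
                       tolerance: float = 0.01) -> list[str]:
    """Return human-readable failures for one date partition; an empty list means pass."""
    failures = []
    if abs(loaded_count - source_count) > tolerance * max(source_count, 1):
        failures.append(f"{dt}: row count {loaded_count} vs source {source_count}")
    if duplicate_keys:
        failures.append(f"{dt}: {duplicate_keys} duplicate primary keys")
    if null_revenue:
        failures.append(f"{dt}: {null_revenue} rows with null revenue")
    return failures

# Example: within a 1% tolerance and with clean keys, the partition passes.
print(validate_partition("2026-01-03", loaded_count=9980, source_count=10000,
                         duplicate_keys=0, null_revenue=0))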

Pre-flight checklist

  • [ ] Exact date/time range defined and agreed.
  • [ ] Upstream data availability confirmed or snapshot pinned.
  • [ ] Idempotent write path (merge/upsert) implemented.
  • [ ] Side-effects guarded by idempotency keys (e.g., emails, external loads).
  • [ ] Resource limits considered; batch size chosen.
  • [ ] Validation rules written (counts, keys, business totals).
  • [ ] Rollback or cutover plan documented.

Exercises

Complete these to internalize the concepts. Your answers are not auto-graded here; use the quick test at the end to check yourself.

  1. Exercise 1: Design a safe 14-day backfill after a source outage. See the exercise block below for details.
  2. Exercise 2: Rerun a DAG with a side-effectful step while avoiding duplicates. See the exercise block below for details.

Hints
  • Always state your direction (oldest→newest or newest→oldest) and why.
  • List validations per batch and the success threshold (e.g., within 1% of source counts).
  • Describe how you will prevent duplicate side-effects.

Common mistakes and self-check

  • Blind inserts during reprocessing: leads to duplicates. Self-check: do you have a merge/upsert keyed by business ID + partition?
  • Recomputing everything unnecessarily: wastes compute. Self-check: can you narrow to affected partitions/layers?
  • No isolation for schema changes: breaks readers. Self-check: are you writing to versioned tables/paths and planning a cutover?
  • Skipping validations: silent data drift. Self-check: do you have both technical (PK uniqueness) and business checks (revenue totals)?
  • Uncontrolled side-effects: duplicate notifications/loads. Self-check: does every external action use an idempotency key?

Practical projects

  • Build a mini pipeline that ingests daily CSVs to a warehouse with partitioned tables. Add a command to backfill any date range idempotently.
  • Create a metric aggregation job with watermarks and a validation report that compares source vs. transformed counts per day.
  • Implement a notification step guarded by an idempotency table, then demonstrate safe reruns by triggering the job twice.

Who this is for

  • Data Engineers owning scheduled ETL/ELT pipelines.
  • Analytics Engineers maintaining marts and BI refreshes.
  • Platform Engineers supporting orchestration reliability.

Prerequisites

  • Basic orchestration knowledge (DAGs, tasks, dependencies, schedules).
  • SQL proficiency and familiarity with partitioned data.
  • Understanding of merges/upserts and primary keys.

Learning path

  1. Learn partitioning and scheduling basics.
  2. Implement idempotent writes and validation checks.
  3. Practice small backfills and reruns in a sandbox.
  4. Handle schema/versioned cutovers.

Next steps

  • Automate a backfill runner that takes start/end dates and a batch size (a minimal sketch follows this list).
  • Add dashboards for backfill progress and data quality results.
  • Document a standard operating procedure (SOP) for backfills and reruns.
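
A minimal sketch of such a runner, assuming argparse for the command line and a hypothetical process_partition(dt) that does the idempotent per-date work.

import argparse
from datetime import date, timedelta

def process_partition(dt: date) -> None:
    """Hypothetical: extract, idempotently merge, and validate one date partition."""
    print(f"processed dt={dt}")

def date_range(start: date, end: date):
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

def main() -> None:
    parser = argparse.ArgumentParser(description="Backfill a date range in batches.")
    parser.add_argument("--start", required=True, type=date.fromisoformat)
    parser.add_argument("--end", required=True, type=date.fromisoformat)
    parser.add_argument("--batch-size", type=int, default=3)
    args = parser.parse_args()

    dates = list(date_range(args.start, args.end))
    for i in range(0, len(dates), args.batch_size):
        batch = dates[i:i + args.batch_size]
        print(f"batch {i // args.batch_size + 1}: {batch[0]}..{batch[-1]}")
        for dt in batch:
            process_partition(dt)

if __name__ == "__main__":
    main()

Invoked as, for example, python backfill.py --start 2026-01-01 --end 2026-01-07 --batch-size 3 (the script name is an assumption), it works through the range oldest to newest in batches of three days.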

Mini challenge

Your product team found a bug in currency conversion affecting the last 45 days. Propose a backfill plan in 5 bullet points covering scope, direction, idempotency, validation, and cutover. Keep it to 5 minutes.

Quick Test

Take the short test below to check your understanding.

Practice Exercises

2 exercises to complete

Exercise 1: Design a safe 14-day backfill

Your sales_events pipeline missed data for 14 consecutive days. You have frozen CSV snapshots per day. Downstream marts aggregate daily revenue by region.

  1. Choose a backfill direction and justify it.
  2. Describe your write strategy to avoid duplicates.
  3. Define 3 validation checks per day.
  4. Propose batching (how many days per batch) and why.
  5. State your success criteria to declare backfill complete.

Expected Output
A clear plan listing direction (with rationale), merge/upsert strategy keyed by (event_id, event_date), 3+ validations (counts, PK uniqueness, revenue within tolerance), batching choice (e.g., 3 days/batch), and success criteria.

Backfills And Reruns — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

