
Backfills And Reruns

Learn Backfills And Reruns for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

In real ETL work, data arrives late, code changes, and failures happen. Backfills and reruns let you safely reprocess past data without breaking dashboards or duplicating rows. You will use them to fix bugs retroactively, load new historical sources, correct late events, and recover from outages.

Who this is for

  • ETL Developers and Data Engineers responsible for production pipelines
  • Analysts/Analytics Engineers who maintain scheduled data models
  • Anyone preparing to operate DAGs and partitioned data in production

Prerequisites

  • Comfort with SQL (SELECT, INSERT, DELETE, MERGE/UPSERT, partitions)
  • Understanding of batch scheduling (daily/hourly runs) and DAG concepts
  • Basic data warehouse or lake experience (tables, partitions, file paths)

Concept explained simply

Backfill = reprocess a range of past dates/partitions to fix or load historical data. Example: rebuild daily data for the last 30 days after a bug fix.

Rerun = re-execute a specific job run (e.g., a single day) that failed or produced wrong output.

Both require idempotency (safe to run multiple times) and determinism (same input yields same output). That keeps your datasets clean even when you retry.
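
The difference shows up directly in the write pattern. A minimal contrast in Hive/Spark-style SQL; the table names are illustrative, and '{{ ds }}' is the orchestrator-supplied run date (Airflow-style template):

-- Not idempotent: every retry appends the same rows again
INSERT INTO mart.daily_sales
SELECT sale_date, SUM(amount) FROM staging.sales GROUP BY sale_date;

-- Idempotent: each rerun replaces exactly its own partition
INSERT OVERWRITE TABLE mart.daily_sales PARTITION (sale_date = '{{ ds }}')
SELECT SUM(amount) FROM staging.sales WHERE sale_date = '{{ ds }}';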

Mental model

  • Think in partitions (often by date). Each run writes to exactly one partition.
  • Maintain a watermark or state to know what range was processed.
  • Use overwrite or merge for each partition so repeats do not duplicate rows.
  • Orchestrator triggers many partitioned runs; storage enforces correctness.

Key terms
  • Idempotent write: MERGE/INSERT OVERWRITE partition, not raw append.
  • Deterministic job: no randomness, no "now()" unless parameterized by run date.
  • Event-time vs processing-time: choose based on business logic; late data implies event-time backfills.
  • Watermark: last processed event-time or date; used to resume work (sketch below).
  • Blast radius: the scope of data/jobs affected by a change. Keep it small when backfilling.
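
A watermark can be as small as one row per pipeline. A minimal sketch; the table and pipeline names (etl.watermarks, fact_orders_daily) are assumptions:

CREATE TABLE IF NOT EXISTS etl.watermarks (
  pipeline_name  VARCHAR(100) PRIMARY KEY,
  last_processed DATE NOT NULL,
  updated_at     TIMESTAMP NOT NULL
);

-- Advance the watermark only after the partition passes validation;
-- CURRENT_TIMESTAMP here is audit metadata, not transformation logic.
UPDATE etl.watermarks
SET last_processed = DATE '{{ ds }}', updated_at = CURRENT_TIMESTAMP
WHERE pipeline_name = 'fact_orders_daily';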

Worked examples

Example 1: 14-day partition backfill after a bug fix

  1. Freeze new deployments to the affected jobs.
  2. Prepare idempotent writes (e.g., MERGE into daily partition by date).
  3. Run backfill for dates D-14 to D-1 with limited concurrency (e.g., 3 at a time).
  4. Validate row counts and key metrics for each day before moving on (comparison query after the SQL pattern).
  5. Unfreeze deployments and resume the regular schedule.
SQL pattern
-- Scope the source to one partition so each run touches only that day.
-- Column names (order_total, status) are illustrative; adapt to your schema.
MERGE INTO mart.fact_orders t
USING (SELECT * FROM staging.orders WHERE order_date = '{{ ds }}') s
ON t.order_id = s.order_id AND t.order_date = s.order_date
WHEN MATCHED THEN UPDATE SET t.order_total = s.order_total, t.status = s.status
WHEN NOT MATCHED THEN INSERT (order_id, order_date, order_total, status)
VALUES (s.order_id, s.order_date, s.order_total, s.status);
-- Run once per partition (order_date = '{{ ds }}') to keep reruns safe.
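
For step 4, comparing each rebuilt day against a pre-backfill snapshot works well. A sketch, assuming audit.backfill_baseline holds per-day row counts and revenue captured before the backfill ('{{ start_date }}' and '{{ end_date }}' are run parameters):

SELECT b.order_date,
       b.row_count        AS rows_before,
       COUNT(f.order_id)  AS rows_after,
       b.revenue          AS revenue_before,
       SUM(f.order_total) AS revenue_after
FROM audit.backfill_baseline b
LEFT JOIN mart.fact_orders f ON f.order_date = b.order_date
WHERE b.order_date BETWEEN '{{ start_date }}' AND '{{ end_date }}'
GROUP BY b.order_date, b.row_count, b.revenue
ORDER BY b.order_date;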

Example 2: Rerun a failed hourly job safely

  1. Identify the exact run window (e.g., 2025-03-05 10:00–10:59 UTC).
  2. Delete or overwrite the target partition for that hour, not the whole table.
  3. Rerun the job for that hour only; confirm idempotent behavior.
Overwrite strategy
-- For a file-based lakehouse: write /table/date=2025-03-05/hour=10/ in overwrite mode.
-- For a warehouse: delete-then-insert inside one transaction
-- (staging.events is an illustrative source table).
BEGIN;
DELETE FROM mart.hourly_events WHERE date = '2025-03-05' AND hour = 10;
INSERT INTO mart.hourly_events
SELECT * FROM staging.events
WHERE event_ts >= '2025-03-05T10:00:00'
  AND event_ts < '2025-03-05T11:00:00'; -- half-open window avoids boundary gaps
COMMIT;

Example 3: Late-arriving data and event-time backfill

  1. Detect late events in a landing table that fall into historical dates.
  2. Compute the smallest and largest affected event_date (range query after this list).
  3. Backfill those specific event_date partitions using MERGE logic.
  4. Rebuild downstream aggregates that depend on those dates only.
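
Steps 2 and 3 in SQL. The loaded_at column and '{{ last_watermark }}' parameter are assumptions about how the landing table records arrival time:

-- Step 2: find the span of historical partitions touched by late events
SELECT MIN(event_date) AS first_affected,
       MAX(event_date) AS last_affected
FROM landing.events
WHERE loaded_at > '{{ last_watermark }}'
  AND event_date < '{{ ds }}';
-- Step 3: run the per-partition MERGE from Example 1 for each date in that span.
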
Downstream refresh tip

Track dependencies by partition. If the date=2025-03-01 partition of an upstream table changed, rebuild only the aggregate partitions that reference that date. This reduces cost and time.
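
One way to make that concrete is a partition-level mapping; etl.partition_deps is a hypothetical table recording which aggregate partitions read which source partitions:

-- Rebuild only the aggregate partitions that reference the changed date
SELECT DISTINCT agg_table, agg_partition
FROM etl.partition_deps
WHERE source_table = 'mart.fact_orders'
  AND source_partition = '2025-03-01';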

How to run backfills and reruns reliably

  • Always write idempotently (MERGE or INSERT OVERWRITE partition).
  • Parameterize runs by the partition key (e.g., run_date) and avoid using current timestamps within transformations.
  • Throttle concurrency to protect your warehouse and source systems.
  • Add guardrails: row-count checks, primary-key uniqueness checks, and before/after metric comparisons (example below).
  • Log what you changed: date ranges, code version, and validation outcomes.
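
The row-count and uniqueness guardrails can be two small queries per rebuilt partition; here order_id is assumed unique within a day:

-- Row count for the rebuilt partition (compare against the expected baseline)
SELECT COUNT(*) AS row_count
FROM mart.fact_orders
WHERE order_date = '{{ ds }}';

-- Primary-key uniqueness: this must return zero rows
SELECT order_id, COUNT(*) AS dup_count
FROM mart.fact_orders
WHERE order_date = '{{ ds }}'
GROUP BY order_id
HAVING COUNT(*) > 1;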

Hands-on exercises


Exercise 1: Plan a 90-day backfill with minimal risk

Draft an execution plan for backfilling a daily partitioned fact table for the last 90 days after fixing a logic bug in joins.

What to include
  • The exact date range and partition key
  • Concurrency limits and daily batch size
  • Validation metrics per day
  • Rollback approach if metrics fail
  • Communication note to analysts

Expected output: a concise plan covering scope, execution steps, validation metrics, rollback, and communication.

Exercise 2: Make a rerun idempotent with SQL MERGE

Write a SQL MERGE that safely reruns a single daily partition of a dimension table using business-key deduplication.

Hints
  • Use the natural key (e.g., customer_id) for matching
  • Limit the scope to the partition date
  • Prefer MERGE to avoid duplicate rows

Self-check checklist

  • Your plan or SQL never appends blindly to a historical partition.
  • You can explain how you would verify row counts and key uniqueness.
  • You can limit reruns/backfills to the exact affected partitions.
  • You throttle concurrency to avoid resource contention.

Common mistakes and how to self-check

  • Appending into historical partitions: leads to duplicates. Use MERGE or overwrite. Self-check: rerun the same day twice and confirm identical results.
  • Using current timestamps in logic: makes runs non-deterministic. Self-check: all time references must come from the run parameters (see the sketch after this list).
  • Backfilling too broadly: unnecessary cost and risk. Self-check: list the exact partitions you must rebuild and why.
  • No validation: errors slip through. Self-check: define row-count deltas, uniqueness checks, and key metrics before you run.
  • Overloading systems: running everything at once. Self-check: set max concurrency and monitor warehouse load.
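
The timestamp rule in practice: both queries below target "yesterday", but only the second returns the same answer no matter when it reruns:

-- Non-deterministic: depends on the wall clock at execution time
SELECT * FROM staging.orders
WHERE order_date = CURRENT_DATE - INTERVAL '1' DAY;

-- Deterministic: depends only on the orchestrator-supplied run date
SELECT * FROM staging.orders
WHERE order_date = '{{ ds }}';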

Practical projects

  • Build a partitioned daily ETL for orders and implement a backfill CLI that runs a date range with MERGE writes.
  • Create a data quality notebook that compares metrics pre/post backfill and prints pass/fail per partition.
  • Implement a watermark table that stores last processed event_date and supports resumes after failures.

Learning path

  • Start: Partitioning and idempotent writes
  • Then: Orchestrator parameters and concurrency control
  • Next: Data quality checks and audit logging
  • Finally: Dependency-aware downstream refreshes

Next steps

  • Complete the exercises above.
  • Take the Quick Test below to check your understanding.
  • Apply the patterns to one of your production DAGs in a safe environment.

Mini challenge

Your aggregate table undercounted revenue for the last 7 days due to a join bug. Show the minimum set of partitions and downstream tables you will rebuild, in what order, and how you will validate.

Hint
  • Fix code, pause schedules, backfill source partitions for the 7 days, then rebuild only dependent aggregate partitions.
  • Validate with revenue totals per day and key uniqueness checks before resuming schedules.

Backfills And Reruns — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
