
Backfills And Reruns

Learn Backfills And Reruns for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

In real ETL work, data arrives late, code changes, and failures happen. Backfills and reruns let you safely reprocess past data without breaking dashboards or duplicating rows. You will use them to fix bugs retroactively, load new historical sources, correct late events, and recover from outages.

Who this is for

  • ETL Developers and Data Engineers responsible for production pipelines
  • Analysts/Analytics Engineers who maintain scheduled data models
  • Anyone preparing to operate DAGs and partitioned data in production

Prerequisites

  • Comfort with SQL (SELECT, INSERT, DELETE, MERGE/UPSERT, partitions)
  • Understanding of batch scheduling (daily/hourly runs) and DAG concepts
  • Basic data warehouse or lake experience (tables, partitions, file paths)

Concept explained simply

Backfill = reprocess a range of past dates/partitions to fix or load historical data. Example: rebuild daily data for the last 30 days after a bug fix.

Rerun = re-execute a specific job run (e.g., a single day) that failed or produced wrong output.

Both require idempotency (safe to run multiple times) and determinism (same input yields same output). That keeps your datasets clean even when you retry.
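
The difference shows up directly in the write pattern. A minimal contrast in Hive/Spark-style SQL; the table names are illustrative, and '{{ ds }}' is the orchestrator-supplied run date (Airflow-style template):

-- Not idempotent: every retry appends the same rows again
INSERT INTO mart.daily_sales
SELECT sale_date, SUM(amount) FROM staging.sales GROUP BY sale_date;

-- Idempotent: each rerun replaces exactly its own partition
INSERT OVERWRITE TABLE mart.daily_sales PARTITION (sale_date = '{{ ds }}')
SELECT SUM(amount) FROM staging.sales WHERE sale_date = '{{ ds }}';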

Mental model

  • Think in partitions (often by date). Each run writes to exactly one partition.
  • Maintain a watermark or state to know what range was processed.
  • Use overwrite or merge for each partition so repeats do not duplicate rows.
  • Orchestrator triggers many partitioned runs; storage enforces correctness.

Key terms
  • Idempotent write: MERGE/INSERT OVERWRITE partition, not raw append.
  • Deterministic job: no randomness, no "now()" unless parameterized by run date.
  • Event-time vs processing-time: choose based on business logic; late data implies event-time backfills.
  • Watermark: last processed event-time or date; used to resume work (sketch below).
  • Blast radius: the scope of data/jobs affected by a change. Keep it small when backfilling.
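
A watermark can be as small as one row per pipeline. A minimal sketch; the table and pipeline names (etl.watermarks, fact_orders_daily) are assumptions:

CREATE TABLE IF NOT EXISTS etl.watermarks (
  pipeline_name  VARCHAR(100) PRIMARY KEY,
  last_processed DATE NOT NULL,
  updated_at     TIMESTAMP NOT NULL
);

-- Advance the watermark only after the partition passes validation;
-- CURRENT_TIMESTAMP here is audit metadata, not transformation logic.
UPDATE etl.watermarks
SET last_processed = DATE '{{ ds }}', updated_at = CURRENT_TIMESTAMP
WHERE pipeline_name = 'fact_orders_daily';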

Worked examples

Example 1: 14-day partition backfill after a bug fix

  1. Freeze new deployments to the affected jobs.
  2. Prepare idempotent writes (e.g., MERGE into daily partition by date).
  3. Run backfill for dates D-14 to D-1 with limited concurrency (e.g., 3 at a time).
  4. Validate row counts and key metrics for each day before moving on (comparison query after the SQL pattern).
  5. Unfreeze deployments and resume the regular schedule.
SQL pattern
-- Scope the source to one partition so each run touches only that day.
-- Column names (order_total, status) are illustrative; adapt to your schema.
MERGE INTO mart.fact_orders t
USING (SELECT * FROM staging.orders WHERE order_date = '{{ ds }}') s
ON t.order_id = s.order_id AND t.order_date = s.order_date
WHEN MATCHED THEN UPDATE SET t.order_total = s.order_total, t.status = s.status
WHEN NOT MATCHED THEN INSERT (order_id, order_date, order_total, status)
VALUES (s.order_id, s.order_date, s.order_total, s.status);
-- Run once per partition (order_date = '{{ ds }}') to keep reruns safe.
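
For step 4, comparing each rebuilt day against a pre-backfill snapshot works well. A sketch, assuming audit.backfill_baseline holds per-day row counts and revenue captured before the backfill ('{{ start_date }}' and '{{ end_date }}' are run parameters):

SELECT b.order_date,
       b.row_count        AS rows_before,
       COUNT(f.order_id)  AS rows_after,
       b.revenue          AS revenue_before,
       SUM(f.order_total) AS revenue_after
FROM audit.backfill_baseline b
LEFT JOIN mart.fact_orders f ON f.order_date = b.order_date
WHERE b.order_date BETWEEN '{{ start_date }}' AND '{{ end_date }}'
GROUP BY b.order_date, b.row_count, b.revenue
ORDER BY b.order_date;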

Example 2: Rerun a failed hourly job safely

  1. Identify the exact run window (e.g., 2025-03-05 10:00–10:59 UTC).
  2. Delete or overwrite the target partition for that hour, not the whole table.
  3. Rerun the job for that hour only; confirm idempotent behavior.
Overwrite strategy
-- For a file-based lakehouse: write /table/date=2025-03-05/hour=10/ in overwrite mode.
-- For a warehouse: delete-then-insert inside one transaction
-- (staging.events is an illustrative source table).
BEGIN;
DELETE FROM mart.hourly_events WHERE date = '2025-03-05' AND hour = 10;
INSERT INTO mart.hourly_events
SELECT * FROM staging.events
WHERE event_ts >= '2025-03-05T10:00:00'
  AND event_ts < '2025-03-05T11:00:00'; -- half-open window avoids boundary gaps
COMMIT;

Example 3: Late-arriving data and event-time backfill

  1. Detect late events in a landing table that fall into historical dates.
  2. Compute the smallest and largest affected event_date (range query after this list).
  3. Backfill those specific event_date partitions using MERGE logic.
  4. Rebuild downstream aggregates that depend on those dates only.
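
Steps 2 and 3 in SQL. The loaded_at column and '{{ last_watermark }}' parameter are assumptions about how the landing table records arrival time:

-- Step 2: find the span of historical partitions touched by late events
SELECT MIN(event_date) AS first_affected,
       MAX(event_date) AS last_affected
FROM landing.events
WHERE loaded_at > '{{ last_watermark }}'
  AND event_date < '{{ ds }}';
-- Step 3: run the per-partition MERGE from Example 1 for each date in that span.
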
Downstream refresh tip

Track dependencies by partition. If the date=2025-03-01 partition of an upstream table changed, rebuild only the aggregate partitions that reference that date. This reduces cost and time.
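
One way to make that concrete is a partition-level mapping; etl.partition_deps is a hypothetical table recording which aggregate partitions read which source partitions:

-- Rebuild only the aggregate partitions that reference the changed date
SELECT DISTINCT agg_table, agg_partition
FROM etl.partition_deps
WHERE source_table = 'mart.fact_orders'
  AND source_partition = '2025-03-01';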

How to run backfills and reruns reliably

  • Always write idempotently (MERGE or INSERT OVERWRITE partition).
  • Parameterize runs by the partition key (e.g., run_date) and avoid using current timestamps within transformations.
  • Throttle concurrency to protect your warehouse and source systems.
  • Add guardrails: row-count checks, primary-key uniqueness checks, and before/after metric comparisons (example below).
  • Log what you changed: date ranges, code version, and validation outcomes.
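
The row-count and uniqueness guardrails can be two small queries per rebuilt partition; here order_id is assumed unique within a day:

-- Row count for the rebuilt partition (compare against the expected baseline)
SELECT COUNT(*) AS row_count
FROM mart.fact_orders
WHERE order_date = '{{ ds }}';

-- Primary-key uniqueness: this must return zero rows
SELECT order_id, COUNT(*) AS dup_count
FROM mart.fact_orders
WHERE order_date = '{{ ds }}'
GROUP BY order_id
HAVING COUNT(*) > 1;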

Hands-on exercises


Exercise 1: Plan a 90-day backfill with minimal risk

Draft an execution plan for backfilling a daily partitioned fact table for the last 90 days after fixing a logic bug in joins.

What to include
  • The exact date range and partition key
  • Concurrency limits and daily batch size
  • Validation metrics per day
  • Rollback approach if metrics fail
  • Communication note to analysts

Expected output: a concise plan covering scope, execution steps, validation metrics, rollback, and communication.

Exercise 2: Make a rerun idempotent with SQL MERGE

Write a SQL MERGE that safely reruns a single daily partition of a dimension table using business-key deduplication.

Hints
  • Use the natural key (e.g., customer_id) for matching
  • Limit the scope to the partition date
  • Prefer MERGE to avoid duplicate rows

Self-check checklist

  • Your plan or SQL never appends blindly to a historical partition.
  • You can explain how you would verify row counts and key uniqueness.
  • You can limit reruns/backfills to the exact affected partitions.
  • You throttle concurrency to avoid resource contention.

Common mistakes and how to self-check

  • Appending into historical partitions: leads to duplicates. Use MERGE or overwrite. Self-check: rerun the same day twice and confirm identical results.
  • Using current timestamps in logic: makes runs non-deterministic. Self-check: all time references must come from the run parameters (see the sketch after this list).
  • Backfilling too broadly: unnecessary cost and risk. Self-check: list the exact partitions you must rebuild and why.
  • No validation: errors slip through. Self-check: define row-count deltas, uniqueness checks, and key metrics before you run.
  • Overloading systems: running everything at once. Self-check: set max concurrency and monitor warehouse load.
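
The timestamp rule in practice: both queries below target "yesterday", but only the second returns the same answer no matter when it reruns:

-- Non-deterministic: depends on the wall clock at execution time
SELECT * FROM staging.orders
WHERE order_date = CURRENT_DATE - INTERVAL '1' DAY;

-- Deterministic: depends only on the orchestrator-supplied run date
SELECT * FROM staging.orders
WHERE order_date = '{{ ds }}';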

Practical projects

  • Build a partitioned daily ETL for orders and implement a backfill CLI that runs a date range with MERGE writes.
  • Create a data quality notebook that compares metrics pre/post backfill and prints pass/fail per partition.
  • Implement a watermark table that stores last processed event_date and supports resumes after failures.

Learning path

  • Start: Partitioning and idempotent writes
  • Then: Orchestrator parameters and concurrency control
  • Next: Data quality checks and audit logging
  • Finally: Dependency-aware downstream refreshes

Next steps

  • Complete the exercises above.
  • Take the Quick Test below to check your understanding.
  • Apply the patterns to one of your production DAGs in a safe environment.

Mini challenge

Your aggregate table undercounted revenue for the last 7 days due to a join bug. Show the minimum set of partitions and downstream tables you will rebuild, in what order, and how you will validate.

Hint
  • Fix code, pause schedules, backfill source partitions for the 7 days, then rebuild only dependent aggregate partitions.
  • Validate with revenue totals per day and key uniqueness checks before resuming schedules.

Backfills And Reruns — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
