Why this matters
Backfills are how you fix historical data, migrate models, or recompute metrics after a bug. Done poorly, they can overload warehouses, duplicate rows, break dashboards, or corrupt downstream models. Done well, they are boring and safe.
- Real task: Recompute 90 days of revenue after a pricing logic change.
- Real task: Fill missing event days due to an ingestion outage.
- Real task: Backfill a new dimension while keeping dashboards online.
Who this is for
- Analytics Engineers and BI Developers who run or schedule data pipelines.
- Data Engineers adding safe reruns and historical loads to orchestration.
- Team leads who need a repeatable backfill playbook.
Prerequisites
- Comfort with SQL (SELECT, JOIN, aggregation, window functions).
- Basic orchestration knowledge (DAGs, tasks, dependencies, retries).
- Understanding of incremental modeling, partitions, and idempotent writes.
Concept explained simply
Backfill = reprocessing historical data for a defined time window to repair or update results.
Safe backfill = scoped, idempotent, controlled, observable, and reversible.
Mental model
Imagine a warehouse corridor with many rooms (partitions by day). A safe backfill is a cart rolling room-by-room:
- It opens one or a few rooms at a time (partitioned, rate-limited).
- It carries a checklist to verify each room (data tests & row counts).
- It leaves a marker in each room (audit columns, watermark) so rerunning is safe (idempotent upserts).
- It works in a mock corridor first (shadow tables), then swaps when satisfied (WAP: write-audit-publish).
Safety checklist (use before any backfill)
- Define scope: exact time window and affected tables/models.
- Partitioned runs only; set max concurrency (e.g., 1–3 partitions at a time).
- Idempotent writes: use upserts/merges keyed by a stable primary key.
- Dedupe strategy: unique keys, checksums, or windowed ROW_NUMBER filtering (see the sketch after this checklist).
- Shadow outputs (WAP): write to new tables/schemas, validate, then swap atomically.
- Data quality tests: null/unique checks, referential integrity, rowcount deltas.
- Resource controls: task-level timeouts, retries with backoff, warehouse slot/cluster sizing.
- Isolation: freeze DAG version; disable or pin downstream consumers during backfill.
- Resume safely: per-partition checkpoints; reruns should not duplicate.
- Rollback plan: revert swap or restore from snapshot if metrics drift.
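The dedupe item above, as a minimal sketch: keep one row per key with a window function before merging. Table and key names follow Example 1 below; loaded_at is an assumed audit column used to pick the latest version of each row.
-- Windowed dedupe (sketch): keep the most recently loaded row per event_id within one partition
SELECT *
FROM (
  SELECT s.*,
         ROW_NUMBER() OVER (
           PARTITION BY event_id
           ORDER BY loaded_at DESC
         ) AS rn
  FROM staging.events s
  WHERE event_date = :day
) ranked
WHERE rn = 1   -- drop the rn helper column before writing downstream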
Worked examples
Example 1 — Backfill last 7 days of events to add a derived column
- Scope: the last 7 days of events; new derived column session_length_sec.
- Create shadow table events_v2_shadow with new column and same keys (event_id, event_date).
- Run per-day tasks (concurrency=2). For each day: MERGE from staging into events_v2_shadow where event_date = :day.
- Tests: unique(event_id), not null(event_time), rowcount within ±2% of current.
- Compare aggregates between current and shadow (e.g., median session length); sketches of these audits follow the merge example below.
- Atomic swap: rename events -> events_v1_archive, events_v2_shadow -> events.
- Unpause downstream tasks; monitor dashboards for 24 hours.
-- Idempotent daily merge (pseudo-SQL): scope the source to one partition so reruns touch only that day
MERGE INTO analytics.events_v2_shadow t
USING (SELECT * FROM staging.events WHERE event_date = :day) s
  ON t.event_id = s.event_id AND t.event_date = s.event_date
WHEN MATCHED AND t.session_length_sec IS DISTINCT FROM s.session_length_sec THEN
  UPDATE SET session_length_sec = s.session_length_sec,
             updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (event_id, event_date, event_time, session_length_sec, updated_at)
  VALUES (s.event_id, s.event_date, s.event_time, s.session_length_sec, CURRENT_TIMESTAMP)
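The test and comparison steps above, as hedged sketches; table and column names follow this example, and the ±2% threshold mirrors the test list. Audits 1 and 3 should return zero rows/zero; audit 2 returns two counts to compare against the threshold. The median session length comparison can follow the same pattern using your engine's median or percentile function.
-- Audit 1: primary key stays unique in the shadow table (expect zero rows)
SELECT event_id, COUNT(*) AS n
FROM analytics.events_v2_shadow
WHERE event_date = :day
GROUP BY event_id
HAVING COUNT(*) > 1;
-- Audit 2: rowcount delta vs current production (expect shadow within +/-2% of current)
SELECT
  (SELECT COUNT(*) FROM analytics.events           WHERE event_date = :day) AS current_rows,
  (SELECT COUNT(*) FROM analytics.events_v2_shadow WHERE event_date = :day) AS shadow_rows;
-- Audit 3: event_time must not be null (expect zero)
SELECT COUNT(*) AS null_event_time
FROM analytics.events_v2_shadow
WHERE event_date = :day AND event_time IS NULL;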
Example 2 — Recompute gross margin for 3 months with SCD costs
- Scope: 3 months of fact_orders_daily, needs SCD cost snapshot at order_date.
- Backfill staging: join orders to costs valid at order_date using effective_from/to window.
- Write to fact_orders_v2_shadow by partition (order_date), MERGE on order_id.
- Tests: unique(order_id), revenue >= cost >= 0, margin% in [0, 1.5] (tolerant).
- Performance: limit to 1 partition at a time; scale the warehouse up only during off-peak hours.
- Publish: swap shadow to production; archive the old table (see the swap sketch after the join example below).
-- Join with SCD window: pick the cost row that was valid at order_date, one partition per run
SELECT o.order_id,
       o.order_date,
       c.unit_cost
FROM staging.orders o
JOIN dim_costs_scd c
  ON o.sku = c.sku
 AND o.order_date >= c.effective_from
 AND (c.effective_to IS NULL OR o.order_date < c.effective_to)
WHERE o.order_date = :day
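The publish step in both examples can be a metadata-only swap. A minimal sketch using renames, with Example 2's table names and an assumed analytics schema; wrap it in a transaction only where the engine supports transactional DDL, otherwise use its native swap command (for example, Snowflake's ALTER TABLE ... SWAP WITH) or run it in a quiet window.
-- Publish (WAP): atomically move the validated shadow into place and archive the old table
BEGIN;
ALTER TABLE analytics.fact_orders_daily RENAME TO fact_orders_daily_v1_archive;
ALTER TABLE analytics.fact_orders_v2_shadow RENAME TO fact_orders_daily;
COMMIT;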
Example 3 — Fix a metric bug without breaking dashboards
- Pin dashboards to the current tables so they do not automatically pick up new table versions.
- Clone prod dataset to sandbox; run full backfill there to validate logic and performance.
- In production, run WAP backfill into v3_shadow per partition, with tight tests.
- Compare aggregates: totals, distinct entities, and key percentiles day-by-day (a comparison sketch follows this list).
- During a quiet period, swap v3_shadow to v3; update dashboard sources to v3.
- Keep v2 archived for 14–30 days for rollback.
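A sketch of the day-by-day comparison between the current (v2) and candidate (v3_shadow) tables. The table and column names here (metric_v2, metric_v3_shadow, metric_date, entity_id, amount) are illustrative only; add percentiles with your engine's percentile or median function.
-- Compare daily totals and distinct entities between current and shadow versions
SELECT cur.metric_date,
       cur.total_amount       AS v2_total,
       cand.total_amount      AS v3_total,
       cur.distinct_entities  AS v2_entities,
       cand.distinct_entities AS v3_entities
FROM (
  SELECT metric_date, SUM(amount) AS total_amount, COUNT(DISTINCT entity_id) AS distinct_entities
  FROM analytics.metric_v2
  GROUP BY metric_date
) cur
JOIN (
  SELECT metric_date, SUM(amount) AS total_amount, COUNT(DISTINCT entity_id) AS distinct_entities
  FROM analytics.metric_v3_shadow
  GROUP BY metric_date
) cand
  ON cur.metric_date = cand.metric_date
ORDER BY cur.metric_date;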
Exercises
Try these hands-on tasks. Then check your answers in the solution panels. Tip: write your plan before touching data.
Exercise 1 — Plan a safe 30-day backfill (orders table)
Full instructions and solutions are provided in the solution panel for this exercise.
Exercise 2 — Idempotent SQL for a partitioned backfill
Write a MERGE-based upsert that only updates changed rows for one date partition.
Pre-run checklist
- Scope is a list of specific dates/partitions.
- Concurrency limited; warehouse capacity set.
- Writes are MERGE/UPSERT, not INSERT-only.
- Shadow tables in place; downstream paused/pinned.
- Tests defined with pass/fail thresholds.
- Rollback plan written and accessible.
Common mistakes and self-checks
Mistake: INSERT-only backfill creates duplicates
Fix: Use MERGE/UPSERT keyed by a stable primary key. Add unique constraints or post-merge dedup logic.
Mistake: Overloading the warehouse
Fix: Limit concurrency, run off-peak, and batch by small partitions (day/hour). Use task-level timeouts and retries with backoff.
Mistake: Publishing before validation
Fix: Always WAP: write to shadow, audit with tests, compare aggregates, then publish via atomic swap.
Mistake: No rollback
Fix: Keep prior version for 14–30 days; predefine the revert command (swap back or restore snapshot).
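For the "predefine the revert command" part, a minimal sketch assuming the publish was done with renames as in Example 1 (events_v2_rejected is just a parking name; keep this in the runbook, not ad hoc):
-- Rollback: put the archived version back if post-publish checks fail (dialect-dependent, as with the publish swap)
BEGIN;
ALTER TABLE analytics.events RENAME TO events_v2_rejected;
ALTER TABLE analytics.events_v1_archive RENAME TO events;
COMMIT;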
Mistake: Unclear ownership and comms
Fix: Assign a backfill owner, post a maintenance notice, and define success criteria before starting.
Practical projects
- Build a backfill orchestrator DAG that runs daily partitions for a month with concurrency=2 and a resume capability (a checkpoint-table sketch follows this list).
- Implement WAP: write to shadow tables, run tests, and perform a metadata-only swap when tests pass.
- Create a data quality suite for a fact table (unique keys, null checks, rowcount deltas, referential checks) and wire it into the backfill flow.
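For the resume capability in the first project, one common pattern is a small checkpoint table that the orchestrator consults before launching each partition, skipping partitions already marked done. A minimal sketch with hypothetical names (ops.backfill_progress, run_id):
-- Checkpoint table: one row per (backfill run, partition)
CREATE TABLE IF NOT EXISTS ops.backfill_progress (
  run_id        VARCHAR,
  partition_day DATE,
  status        VARCHAR,      -- 'pending' | 'done' | 'failed'
  updated_at    TIMESTAMP
);
-- After a partition's merge and tests pass, record it (idempotent: safe to rerun)
MERGE INTO ops.backfill_progress p
USING (SELECT :run_id AS run_id, :day AS partition_day) s
  ON p.run_id = s.run_id AND p.partition_day = s.partition_day
WHEN MATCHED THEN
  UPDATE SET status = 'done', updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (run_id, partition_day, status, updated_at)
  VALUES (s.run_id, s.partition_day, 'done', CURRENT_TIMESTAMP);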
Learning path
- Before: Incremental modeling, partitioning, and data quality testing basics.
- Now: Handling Backfills Safely (this lesson).
- Next: Versioned deployments, blue/green publishing, and change data capture (CDC) integration.
Next steps
- Draft your team’s standard backfill runbook using the Safety checklist.
- Dry-run on a sandbox for 2–3 partitions; measure performance and correctness.
- Take the Quick Test to confirm understanding.
Mini challenge
Scenario: A bug doubled quantities for 4 days in fact_order_items. Design a safe backfill to fix those days without breaking dashboards.
Sample approach
- Scope: the 4 affected dates; freeze DAG version.
- Create fact_order_items_v2_shadow with same schema and unique key (order_id, item_id).
- For each date, MERGE corrected rows from staging where date = :d; ensure the quantity fix is applied in staging logic (a staging sketch follows this list).
- Run tests: unique keys, non-negative quantity, rowcount within ±1% vs original, total quantity within expected bounds.
- Swap shadow to production; archive old table; monitor key dashboards.
- Rollback plan: swap back to previous version or restore snapshot.
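One possible sketch of the corrective staging step, assuming the bug exactly and uniformly doubled quantity and the affected production table is the fix source (if raw sources are available, recompute from them instead). order_date is assumed as the partition column; :d1..:d4 stand for the four affected dates; CREATE OR REPLACE is dialect-dependent.
-- Build corrected staging rows for the affected dates; the doubled quantities are halved
CREATE OR REPLACE TABLE staging.fact_order_items_fixed AS
SELECT order_id,
       item_id,
       order_date,
       quantity / 2 AS quantity
FROM analytics.fact_order_items
WHERE order_date IN (:d1, :d2, :d3, :d4);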