Who this is for
Analytics Engineers and BI developers who maintain incremental data models and need confidence when backfilling historical data without breaking downstream dashboards.
Why this matters
- Most production models are incremental to keep jobs fast and affordable.
- Backfills are needed when logic changes, late-arriving data appears, or you migrate historical tables.
- Badly tested backfills can double-count metrics, create gaps, or overload the warehouse.
Real tasks you will face
- Introduce a new business rule and backfill the last 12 months safely.
- Fix a bug and reprocess a late-arriving window without duplicating rows.
- Switch an incremental key or watermark and prove idempotence.
Concept explained simply
Incremental models process only a slice of data each run (for example, rows with load_time from the last day). Backfills re-run a larger historical window to correct or recompute data. Testing ensures the slice logic and backfill plan produce the same correct table you would get from a full rebuild—without excess cost or risk.
Mental model
Imagine your target table as a ledger. Each run updates only the latest pages. A backfill lets you revisit older pages. Your tests should prove three things: you add only what is missing, you update only what changed, and you can repeat the operation without creating duplicates.
Key principles for testing incremental models & backfills
- Idempotence: multiple runs with the same inputs should produce the same final table.
- Unique keys: assert a stable primary key that stays unique post-merge.
- Watermarks: test your filter boundary (>= vs >) to avoid gaps or overlaps; see the boundary probe after this list.
- Late-arriving data: incorporate a small rolling backfill window (e.g., last 7 days).
- Partition awareness: validate counts and invariants per partition/date.
- Shadow runs: write to a temporary table and diff before swapping.
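The watermark boundary is the easiest of these to get wrong silently. A quick probe is to count rows that sit exactly on the current watermark; if that count is non-zero, a > versus >= mistake will either drop or double-load exactly those rows. A minimal sketch, assuming an events model with a load_time watermark (table and column names are illustrative):
-- Rows sitting exactly on the watermark; a > vs >= mistake drops or double-loads these
SELECT COUNT(*) AS rows_on_boundary
FROM staging.events
WHERE load_time = (SELECT MAX(load_time) FROM analytics.events);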
Worked examples
Example 1: Append-only incremental with rolling window
Scenario: You load events incrementally using event_timestamp >= max(event_timestamp) from target, plus a 2-day cushion for late data.
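A minimal sketch of the load itself, assuming staging.events is the source and analytics.events is the target; the anti-join on event_id is one way to keep the 2-day cushion from re-inserting rows that were already loaded (names and the DATEADD syntax follow the examples below and may differ in your warehouse):
-- Incremental insert with a 2-day cushion for late-arriving events
INSERT INTO analytics.events
SELECT s.*
FROM staging.events s
WHERE s.event_timestamp >= (
  SELECT DATEADD(day, -2, MAX(event_timestamp)) FROM analytics.events
)
  AND NOT EXISTS (
    SELECT 1 FROM analytics.events t WHERE t.event_id = s.event_id
  );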
-- Safety check: duplicates by primary key after incremental insert
SELECT event_id, COUNT(*) AS c
FROM analytics.events
GROUP BY event_id
HAVING COUNT(*) > 1;
-- Compare target vs full recompute (sample window)
WITH recomputed AS (
  SELECT * FROM staging.events WHERE event_timestamp >= DATEADD(day, -7, CURRENT_DATE)
), inc AS (
  SELECT * FROM analytics.events WHERE event_timestamp >= DATEADD(day, -7, CURRENT_DATE)
)
SELECT 'missing_in_target' AS issue, COUNT(*) AS row_count
FROM recomputed f
LEFT JOIN inc i ON i.event_id = f.event_id
WHERE i.event_id IS NULL
UNION ALL
SELECT 'extra_in_target', COUNT(*)
FROM inc i
LEFT JOIN recomputed f ON f.event_id = i.event_id
WHERE f.event_id IS NULL;
Expected: both counts zero; duplicates query returns no rows.
Example 2: Upsert (MERGE) incremental model
Scenario: Orders table with unique key order_id. You update quantities and statuses over time.
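A minimal sketch of the MERGE itself, assuming staging.orders carries recent changes and mart.orders is keyed by order_id (MERGE syntax varies slightly across warehouses):
-- Upsert the last 3 days of changes; rerunning with the same source batch should change nothing
MERGE INTO mart.orders t
USING (
  SELECT order_id, qty, status, updated_at
  FROM staging.orders
  WHERE updated_at >= DATEADD(day, -3, CURRENT_DATE)
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET qty = s.qty, status = s.status, updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, qty, status, updated_at)
  VALUES (s.order_id, s.qty, s.status, s.updated_at);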
-- Detect rows that should be updated in target
WITH src AS (
SELECT order_id, qty, status FROM staging.orders WHERE updated_at >= DATEADD(day, -3, CURRENT_DATE)
), tgt AS (
SELECT order_id, qty, status FROM mart.orders WHERE updated_at >= DATEADD(day, -3, CURRENT_DATE)
)
SELECT COUNT(*) AS to_update
FROM src s
JOIN tgt t ON t.order_id = s.order_id
WHERE (s.qty != t.qty OR s.status != t.status);
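Note that != treats two NULLs as "not different", so changes to or from NULL are missed by the query above. A NULL-safe variant using IS DISTINCT FROM (available in several warehouses; where it is not, COALESCE-based comparisons are a common substitute):
-- NULL-safe change detection for the same window
SELECT COUNT(*) AS to_update_null_safe
FROM staging.orders s
JOIN mart.orders t ON t.order_id = s.order_id
WHERE s.updated_at >= DATEADD(day, -3, CURRENT_DATE)
  AND (s.qty IS DISTINCT FROM t.qty OR s.status IS DISTINCT FROM t.status);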
-- Idempotence probe: MERGE into a shadow table, snapshot it, MERGE the same batch again, then diff
-- (mart.orders_shadow_run1 is a copy of the shadow table taken after the first MERGE)
SELECT COUNT(*) AS diffs
FROM mart.orders_shadow_run1 a
FULL OUTER JOIN mart.orders_shadow b
  ON a.order_id = b.order_id
WHERE a.order_id IS NULL
   OR b.order_id IS NULL
   OR a.qty != b.qty
   OR a.status != b.status;
Expected: to_update reflects real change volume; diffs = 0 after second run.
Example 3: Backfill by partition with swap
Scenario: You changed revenue logic and must reprocess the last 90 days partition-by-partition.
-- Build the shadow table for the backfill window (can also be built one partition at a time to throttle)
CREATE OR REPLACE TABLE mart.revenue_shadow AS
SELECT /* new logic */ *
FROM staging.revenue
WHERE ds BETWEEN '2024-10-01' AND '2024-12-30';
-- Quality checks on shadow
SELECT ds, COUNT(*) AS row_count, COUNT(DISTINCT order_id) AS unique_orders
FROM mart.revenue_shadow
GROUP BY ds
ORDER BY ds;
-- Diff per partition vs current
WITH cur AS (
  SELECT ds, order_id, amount
  FROM mart.revenue
  WHERE ds BETWEEN '2024-10-01' AND '2024-12-30'
), shadow AS (
  SELECT ds, order_id, amount
  FROM mart.revenue_shadow
)
SELECT ds,
       SUM(CASE WHEN c.order_id IS NULL THEN 1 ELSE 0 END) AS missing,
       SUM(CASE WHEN n.order_id IS NULL THEN 1 ELSE 0 END) AS extra,
       SUM(CASE WHEN c.amount != n.amount THEN 1 ELSE 0 END) AS changed
FROM shadow n
FULL OUTER JOIN cur c USING (ds, order_id)
GROUP BY ds
ORDER BY ds;
-- Swap (after validation)
-- TRUNCATE partitions in mart.revenue, then INSERT from mart.revenue_shadow by ds.
Expected: Validate row counts, uniqueness, and acceptable changes before the swap.
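If the diffs are acceptable, the swap itself can be done in one transaction. A minimal sketch, assuming the warehouse supports multi-statement transactions and deleting by the partition column (exact transaction and partition-replacement syntax varies by platform):
-- Replace the backfilled window atomically; run only after the diffs above are signed off
BEGIN;
DELETE FROM mart.revenue
WHERE ds BETWEEN '2024-10-01' AND '2024-12-30';
INSERT INTO mart.revenue
SELECT * FROM mart.revenue_shadow
WHERE ds BETWEEN '2024-10-01' AND '2024-12-30';
COMMIT;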
How to test backfills safely (step-by-step)
- Scope: Choose partitions/dates to reprocess and define success metrics (no duplicates, row counts within tolerance).
- Shadow build: Write output to a temporary or shadow table.
- Validate: Run uniqueness, not-null, referential integrity, and invariants per partition (a sketch follows this list).
- Diff: Compare shadow vs current for missing, extra, and changed rows.
- Dry run performance: Estimate row volume and warehouse cost; throttle if needed.
- Swap safely: Replace partitions atomically (truncate-then-insert), or rename tables.
- Post-verify: Re-run tests; monitor downstream metrics.
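A minimal sketch of the validation step, reusing the revenue example from above; the mart.dim_orders table in the referential-integrity check is illustrative:
-- Not-null and uniqueness per partition on the shadow table
SELECT ds,
       SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_keys,
       COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_keys
FROM mart.revenue_shadow
GROUP BY ds
ORDER BY ds;
-- Referential integrity: shadow rows whose order_id is missing from the dimension
SELECT COUNT(*) AS orphan_rows
FROM mart.revenue_shadow r
LEFT JOIN mart.dim_orders d ON d.order_id = r.order_id
WHERE d.order_id IS NULL;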
Checklists
Before backfill
- Defined primary key and uniqueness test
- Clear watermark logic and boundary test
- Shadow table path and storage capacity confirmed
- Diff queries prepared (counts, duplicates, invariants)
- Throttling plan (batch size, concurrency) approved
After backfill
- Uniqueness and not-null checks pass
- Partition totals within expected tolerance
- No duplicate primary keys
- Metrics spot-check against trusted reports
- Runbook updated with learnings
Exercises
Complete the two exercises below. The quick test is available to everyone; only logged-in users will have their progress saved.
Exercise 1 — Compute new vs updated rows for an incremental MERGE
You have staging.orders and mart.orders with primary key order_id. Write a single SQL query that returns two numbers: new_rows (in staging but not in mart) and updated_rows (same key but different qty or status) for the last 7 days of staging.updated_at.
Exercise 2 — Idempotence check for a partition backfill
The table mart.events is partitioned by ds and keyed by event_id. After running a backfill for ds >= '2024-01-01', write SQL that proves no duplicates exist and that a second run would not change results (idempotence).
Common mistakes and how to self-check
- Off-by-one watermark: Using > instead of >= (or vice versa). Self-check by diffing a 3–7 day window around the boundary.
- No late-data cushion: Missing recent updates. Add a small rolling window and test for updated rows in that window.
- Missing unique key test: Duplicates creep in. Always run uniqueness checks on the target.
- Backfill too wide: Concurrency or quota failures. Throttle by partitions and monitor row counts.
- Skipping shadow diffs: Swapping without a diff. Always perform a shadow build and compare.
Practical projects
- Convert a daily full-refresh fact table to an incremental MERGE with tests and a 14-day rolling cushion.
- Design a 90-day backfill plan using a shadow table and partition-by-partition swap with diffs and invariants.
- Implement an idempotence test suite that runs after every incremental job.
Mini challenge
In one paragraph, define exactly how you will prove idempotence for your most critical incremental model. Include the key, window, and the two queries you will run to show zero duplicates and zero unexpected diffs.
Learning path
- Data quality foundations: uniqueness, not-null, referential checks.
- Incremental strategies: append-only vs upsert (MERGE).
- Backfill patterns: shadow tables, partition swaps, throttling.
- Idempotence and boundary testing.
- Operationalization: runbooks, monitors, and alerts.
Prerequisites
- Comfortable writing SQL (JOINs, aggregations, windowing helpful).
- Basic understanding of incremental processing and partitions.
- Familiarity with your warehouse MERGE/INSERT semantics.
Next steps
- Finish the exercises and run the quick test below.
- Turn one worked example into a real shadow diff in your environment.
- Add automated uniqueness and boundary checks to your CI pipeline.
Progress & saving
The quick test is available to everyone. If you are logged in, your progress and results will be saved automatically.