Who this is for
Analytics Engineers and BI developers who maintain incremental data models and need confidence when backfilling historical data without breaking downstream dashboards.
Why this matters
- Most production models are incremental to keep jobs fast and affordable.
- Backfills are needed when logic changes, late-arriving data appears, or you migrate historical tables.
- Badly tested backfills can double-count metrics, create gaps, or overload the warehouse.
Real tasks you will face
- Introduce a new business rule and backfill the last 12 months safely.
- Fix a bug and reprocess a late-arriving window without duplicating rows.
- Switch an incremental key or watermark and prove idempotence.
Concept explained simply
Incremental models process only a slice of data each run (for example, rows with load_time from the last day). Backfills re-run a larger historical window to correct or recompute data. Testing ensures the slice logic and backfill plan produce the same correct table you would get from a full rebuild—without excess cost or risk.
Mental model
Imagine your target table as a ledger. Each run updates only the latest pages. A backfill lets you revisit older pages. Your tests should prove three things: you add only what is missing, you update only what changed, and you can repeat the operation without creating duplicates.
Key principles for testing incremental models & backfills
- Idempotence: multiple runs with the same inputs should produce the same final table.
- Unique keys: assert a stable primary key that stays unique post-merge.
- Watermarks: test your filter boundary (>= vs >) to avoid gaps or overlaps; see the boundary probe after this list.
- Late-arriving data: incorporate a small rolling backfill window (e.g., last 7 days).
- Partition awareness: validate counts and invariants per partition/date.
- Shadow runs: write to a temporary table and diff before swapping.
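The watermark boundary is the easiest of these to get wrong silently. A quick probe is to count rows that sit exactly on the current watermark; if that count is non-zero, a > versus >= mistake will either drop or double-load exactly those rows. A minimal sketch, assuming an events model with a load_time watermark (table and column names are illustrative):
-- Rows sitting exactly on the watermark; a > vs >= mistake drops or double-loads these
SELECT COUNT(*) AS rows_on_boundary
FROM staging.events
WHERE load_time = (SELECT MAX(load_time) FROM analytics.events);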
Worked examples
Example 1: Append-only incremental with rolling window
Scenario: You load events incrementally using event_timestamp >= max(event_timestamp) from target, plus a 2-day cushion for late data.
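A minimal sketch of the load itself, assuming staging.events is the source and analytics.events is the target; the anti-join on event_id is one way to keep the 2-day cushion from re-inserting rows that were already loaded (names and the DATEADD syntax follow the examples below and may differ in your warehouse):
-- Incremental insert with a 2-day cushion for late-arriving events
INSERT INTO analytics.events
SELECT s.*
FROM staging.events s
WHERE s.event_timestamp >= (
  SELECT DATEADD(day, -2, MAX(event_timestamp)) FROM analytics.events
)
  AND NOT EXISTS (
    SELECT 1 FROM analytics.events t WHERE t.event_id = s.event_id
  );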
-- Safety check: duplicates by primary key after incremental insert
SELECT event_id, COUNT(*) AS c
FROM analytics.events
GROUP BY event_id
HAVING COUNT(*) > 1;
-- Compare target vs full recompute (sample window)
WITH recomputed AS (
  SELECT * FROM staging.events WHERE event_timestamp >= DATEADD(day, -7, CURRENT_DATE)
), inc AS (
  SELECT * FROM analytics.events WHERE event_timestamp >= DATEADD(day, -7, CURRENT_DATE)
)
SELECT 'missing_in_target' AS issue, COUNT(*) AS row_count
FROM recomputed f
LEFT JOIN inc i ON i.event_id = f.event_id
WHERE i.event_id IS NULL
UNION ALL
SELECT 'extra_in_target', COUNT(*)
FROM inc i
LEFT JOIN recomputed f ON f.event_id = i.event_id
WHERE f.event_id IS NULL;
Expected: both counts zero; duplicates query returns no rows.
Example 2: Upsert (MERGE) incremental model
Scenario: Orders table with unique key order_id. You update quantities and statuses over time.
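A minimal sketch of the MERGE itself, assuming staging.orders carries recent changes and mart.orders is keyed by order_id (MERGE syntax varies slightly across warehouses):
-- Upsert the last 3 days of changes; rerunning with the same source batch should change nothing
MERGE INTO mart.orders t
USING (
  SELECT order_id, qty, status, updated_at
  FROM staging.orders
  WHERE updated_at >= DATEADD(day, -3, CURRENT_DATE)
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET qty = s.qty, status = s.status, updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, qty, status, updated_at)
  VALUES (s.order_id, s.qty, s.status, s.updated_at);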
-- Detect rows that should be updated in target
WITH src AS (
SELECT order_id, qty, status FROM staging.orders WHERE updated_at >= DATEADD(day, -3, CURRENT_DATE)
), tgt AS (
SELECT order_id, qty, status FROM mart.orders WHERE updated_at >= DATEADD(day, -3, CURRENT_DATE)
)
SELECT COUNT(*) AS to_update
FROM src s
JOIN tgt t ON t.order_id = s.order_id
WHERE (s.qty != t.qty OR s.status != t.status);
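Note that != treats two NULLs as "not different", so changes to or from NULL are missed by the query above. A NULL-safe variant using IS DISTINCT FROM (available in several warehouses; where it is not, COALESCE-based comparisons are a common substitute):
-- NULL-safe change detection for the same window
SELECT COUNT(*) AS to_update_null_safe
FROM staging.orders s
JOIN mart.orders t ON t.order_id = s.order_id
WHERE s.updated_at >= DATEADD(day, -3, CURRENT_DATE)
  AND (s.qty IS DISTINCT FROM t.qty OR s.status IS DISTINCT FROM t.status);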
-- Idempotence probe: MERGE into a shadow table, snapshot it, MERGE the same batch again, then diff
-- (mart.orders_shadow_run1 is a copy of the shadow table taken after the first MERGE)
SELECT COUNT(*) AS diffs
FROM mart.orders_shadow_run1 a
FULL OUTER JOIN mart.orders_shadow b
  ON a.order_id = b.order_id
WHERE a.order_id IS NULL
   OR b.order_id IS NULL
   OR a.qty != b.qty
   OR a.status != b.status;
Expected: to_update reflects real change volume; diffs = 0 after second run.
Example 3: Backfill by partition with swap
Scenario: You changed revenue logic and must reprocess the last 90 days partition-by-partition.
-- Build the shadow table for the backfill window (can also be built one partition at a time to throttle)
CREATE OR REPLACE TABLE mart.revenue_shadow AS
SELECT /* new logic */ *
FROM staging.revenue
WHERE ds BETWEEN '2024-10-01' AND '2024-12-30';
-- Quality checks on shadow
SELECT ds, COUNT(*) AS row_count, COUNT(DISTINCT order_id) AS unique_orders
FROM mart.revenue_shadow
GROUP BY ds
ORDER BY ds;
-- Diff per partition vs current
WITH cur AS (
  SELECT ds, order_id, amount
  FROM mart.revenue
  WHERE ds BETWEEN '2024-10-01' AND '2024-12-30'
), shadow AS (
  SELECT ds, order_id, amount
  FROM mart.revenue_shadow
)
SELECT ds,
       SUM(CASE WHEN c.order_id IS NULL THEN 1 ELSE 0 END) AS missing,
       SUM(CASE WHEN n.order_id IS NULL THEN 1 ELSE 0 END) AS extra,
       SUM(CASE WHEN c.amount != n.amount THEN 1 ELSE 0 END) AS changed
FROM shadow n
FULL OUTER JOIN cur c USING (ds, order_id)
GROUP BY ds
ORDER BY ds;
-- Swap (after validation)
-- TRUNCATE partitions in mart.revenue, then INSERT from mart.revenue_shadow by ds.
Expected: Validate row counts, uniqueness, and acceptable changes before the swap.
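If the diffs are acceptable, the swap itself can be done in one transaction. A minimal sketch, assuming the warehouse supports multi-statement transactions and deleting by the partition column (exact transaction and partition-replacement syntax varies by platform):
-- Replace the backfilled window atomically; run only after the diffs above are signed off
BEGIN;
DELETE FROM mart.revenue
WHERE ds BETWEEN '2024-10-01' AND '2024-12-30';
INSERT INTO mart.revenue
SELECT * FROM mart.revenue_shadow
WHERE ds BETWEEN '2024-10-01' AND '2024-12-30';
COMMIT;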
How to test backfills safely (step-by-step)
- Scope: Choose partitions/dates to reprocess and define success metrics (no duplicates, row counts within tolerance).
- Shadow build: Write output to a temporary or shadow table.
- Validate: Run uniqueness, not-null, referential integrity, and invariants per partition (a sketch follows this list).
- Diff: Compare shadow vs current for missing, extra, and changed rows.
- Dry run performance: Estimate row volume and warehouse cost; throttle if needed.
- Swap safely: Replace partitions atomically (truncate-then-insert), or rename tables.
- Post-verify: Re-run tests; monitor downstream metrics.
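A minimal sketch of the validation step, reusing the revenue example from above; the mart.dim_orders table in the referential-integrity check is illustrative:
-- Not-null and uniqueness per partition on the shadow table
SELECT ds,
       SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_keys,
       COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_keys
FROM mart.revenue_shadow
GROUP BY ds
ORDER BY ds;
-- Referential integrity: shadow rows whose order_id is missing from the dimension
SELECT COUNT(*) AS orphan_rows
FROM mart.revenue_shadow r
LEFT JOIN mart.dim_orders d ON d.order_id = r.order_id
WHERE d.order_id IS NULL;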
Checklists
Before backfill
- Defined primary key and uniqueness test
- Clear watermark logic and boundary test
- Shadow table path and storage capacity confirmed
- Diff queries prepared (counts, duplicates, invariants)
- Throttling plan (batch size, concurrency) approved
After backfill
- Uniqueness and not-null checks pass
- Partition totals within expected tolerance
- No duplicate primary keys
- Metrics spot-check against trusted reports
- Runbook updated with learnings
Exercises
Complete the two exercises below. The quick test is available to everyone; only logged-in users will have their progress saved.
Exercise 1 — Compute new vs updated rows for an incremental MERGE
You have staging.orders and mart.orders with primary key order_id. Write a single SQL query that returns two numbers: new_rows (in staging but not in mart) and updated_rows (same key but different qty or status) for the last 7 days of staging.updated_at.
Exercise 2 — Idempotence check for a partition backfill
The table mart.events is partitioned by ds and keyed by event_id. After running a backfill for ds >= '2024-01-01', write SQL that proves no duplicates exist and that a second run would not change results (idempotence).
Common mistakes and how to self-check
- Off-by-one watermark: Using > instead of >= (or vice versa). Self-check by diffing a 3–7 day window around the boundary.
- No late-data cushion: Missing recent updates. Add a small rolling window and test for updated rows in that window.
- Missing unique key test: Duplicates creep in. Always run uniqueness checks on the target.
- Backfill too wide: Concurrency or quota failures. Throttle by partitions and monitor row counts.
- Skipping shadow diffs: Swapping without a diff. Always perform a shadow build and compare.
Practical projects
- Convert a daily full-refresh fact table to an incremental MERGE with tests and a 14-day rolling cushion.
- Design a 90-day backfill plan using a shadow table and partition-by-partition swap with diffs and invariants.
- Implement an idempotence test suite that runs after every incremental job.
Mini challenge
In one paragraph, define exactly how you will prove idempotence for your most critical incremental model. Include the key, window, and the two queries you will run to show zero duplicates and zero unexpected diffs.
Learning path
- Data quality foundations: uniqueness, not-null, referential checks.
- Incremental strategies: append-only vs upsert (MERGE).
- Backfill patterns: shadow tables, partition swaps, throttling.
- Idempotence and boundary testing.
- Operationalization: runbooks, monitors, and alerts.
Prerequisites
- Comfortable writing SQL (JOINs, aggregations, windowing helpful).
- Basic understanding of incremental processing and partitions.
- Familiarity with your warehouse MERGE/INSERT semantics.
Next steps
- Finish the exercises and run the quick test below.
- Turn one worked example into a real shadow diff in your environment.
- Add automated uniqueness and boundary checks to your CI pipeline.
Progress & saving
The quick test is available to everyone. If you are logged in, your progress and results will be saved automatically.