Why this matters
Analytics Engineers frequently re-run pipelines: to fix bugs, load late-arriving data, or rebuild models after logic changes. Without idempotent loads, you risk duplicate rows, broken reports, and hard-to-debug side effects. Backfills must be safe, repeatable, and predictable so business metrics remain trustworthy.
- Reprocess historical orders after a pricing bug without double-counting revenue.
- Load late events into a fact table while keeping unique keys intact.
- Migrate logic for dimensions (SCD1/SCD2) across months of data safely.
Concept explained simply
Idempotent load means running the same load multiple times produces the same final table. No extra duplicates, no drift. A backfill is reprocessing historical data (a day, month, or entire history) to correct or add data.
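As a minimal illustration (toy table t keyed by k, source src, both hypothetical names): a blind INSERT grows the table on every re-run, while a MERGE converges to the same final state no matter how many times it runs.
-- Not idempotent: each re-run appends the same rows again
INSERT INTO t (k, v) SELECT k, v FROM src;
-- Idempotent: once src has been applied, re-running changes nothing
MERGE INTO t USING src ON t.k = src.k
WHEN MATCHED THEN UPDATE SET v = src.v
WHEN NOT MATCHED THEN INSERT (k, v) VALUES (src.k, src.v);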
Mental model
Think of the target table as a ledger with strict rules: each record is identified by a stable key, and every run either updates the existing entry or inserts a new one only when it truly does not exist. Backfills play the role of calm, methodical auditors: they revisit specific periods, recompute truth with the same rules, and swap in clean results atomically.
Key ingredients of idempotency
- Deterministic keys: Business or surrogate keys that uniquely identify a row (e.g., order_id).
- Upsert semantics: Use MERGE/UPSERT instead of blind INSERT.
- Deduplication: Remove source duplicates in staging with window functions.
- Audit columns: loaded_at, updated_at, and a source_hash for change detection (see the sketch after this list).
- Watermarks: Track processed ranges (e.g., max updated_at) with an overlap buffer.
- Partitioning: Backfill by day/month to constrain blast radius.
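A minimal sketch of the source_hash ingredient, assuming a hypothetical src_customers feed with name and email as the tracked columns (MD5 and CONCAT availability varies by warehouse): the hash lets the MERGE skip rows whose tracked attributes have not actually changed, so no-op re-runs touch nothing.
-- Compute a hash of the tracked columns in staging (hypothetical table and columns)
CREATE OR REPLACE TEMP TABLE stg_customers AS
SELECT
  customer_id,
  name,
  email,
  updated_at,
  MD5(CONCAT(COALESCE(name, ''), '|', COALESCE(email, ''))) AS source_hash
FROM src_customers;
-- Update only when the hash differs
MERGE INTO dw.customers AS tgt
USING stg_customers AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED AND tgt.source_hash <> src.source_hash THEN UPDATE SET
  name = src.name,
  email = src.email,
  source_hash = src.source_hash,
  updated_at = src.updated_at,
  loaded_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (customer_id, name, email, source_hash, updated_at, loaded_at)
VALUES (src.customer_id, src.name, src.email, src.source_hash, src.updated_at, CURRENT_TIMESTAMP);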
Core patterns
- Staging with dedup: Stage source rows and deduplicate using ROW_NUMBER() over a business key and updated_at.
- Idempotent upsert: Use MERGE ON key. WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT.
- SCD handling:
- SCD1 (overwrite): Update fields in place via MERGE.
- SCD2 (history): Close the old record (valid_to) and insert a new record (valid_from), using a deterministic natural key plus version logic; see the sketch after this list.
- Partitioned backfills: Process a bounded date range, validate, then atomically swap into production tables.
- Shadow tables: Write backfill results to a shadow table, validate, then swap/rename to minimize downtime and risk.
- Watermarks with overlap: Store last_processed_at but always reprocess a small overlap window (e.g., last 2 days) to capture late updates. This remains idempotent because you MERGE.
- Idempotent deletes: For hard deletes, prefer MERGE with conditional delete or rebuild partition + swap, not unbounded DELETE without guardrails.
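A minimal SCD2 sketch, assuming a hypothetical dim_product keyed by product_id with valid_from/valid_to, an is_current flag, and a source_hash for change detection: first close the current version when tracked attributes change, then insert the new version. Re-running it against the same staged data changes nothing further.
-- Step 1: close the current row when tracked attributes changed
MERGE INTO dw.dim_product AS tgt
USING stg_products AS src
ON tgt.product_id = src.product_id AND tgt.is_current = TRUE
WHEN MATCHED AND tgt.source_hash <> src.source_hash THEN UPDATE SET
  valid_to = src.updated_at,
  is_current = FALSE;
-- Step 2: insert a new current version for changed or brand-new keys
INSERT INTO dw.dim_product (product_id, name, price, source_hash, valid_from, valid_to, is_current)
SELECT src.product_id, src.name, src.price, src.source_hash, src.updated_at, NULL, TRUE
FROM stg_products AS src
LEFT JOIN dw.dim_product AS tgt
  ON tgt.product_id = src.product_id AND tgt.is_current = TRUE
WHERE tgt.product_id IS NULL              -- new key
   OR tgt.source_hash <> src.source_hash; -- changed attributes (only if step 1 did not run)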
Worked examples
Example 1: Idempotent incremental load (orders)
Goal: Upsert orders from a source where rows can change later (status or amount updated).
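-- Note: the syntax in these examples is illustrative; EXCEPT(rn), DATEADD, and @parameters are warehouse-specific (e.g., BigQuery vs. Snowflake/SQL Server), so adapt them to your dialect.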
-- 1) Stage and deduplicate by order_id, choose the latest updated_at
CREATE OR REPLACE TEMP TABLE stg_orders AS
SELECT * EXCEPT(rn)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
FROM src_orders
WHERE updated_at >= DATEADD(day, -2, (SELECT COALESCE(MAX(last_processed_at), '1900-01-01') FROM etl_watermarks WHERE job='orders'))
) t
WHERE rn = 1;
-- 2) Idempotent upsert into target
MERGE INTO dw.orders AS tgt
USING stg_orders AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
status = src.status,
amount = src.amount,
updated_at = src.updated_at,
loaded_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (
order_id, status, amount, updated_at, loaded_at
) VALUES (
src.order_id, src.status, src.amount, src.updated_at, CURRENT_TIMESTAMP
);
-- 3) Advance watermark
MERGE INTO etl_watermarks w
USING (SELECT 'orders' AS job, MAX(updated_at) AS ts FROM stg_orders) s
ON w.job = s.job
WHEN MATCHED THEN UPDATE SET last_processed_at = GREATEST(w.last_processed_at, COALESCE(s.ts, w.last_processed_at)) -- COALESCE guards against an empty staging batch
WHEN NOT MATCHED THEN INSERT (job, last_processed_at) VALUES (s.job, s.ts);
Example 2: Event fact with unique event_id and dedup
Goal: Ensure each event_id appears once even if the source sends duplicates.
-- Stage: keep only the first occurrence by event_id, prefer the earliest event_timestamp
CREATE OR REPLACE TEMP TABLE stg_events AS
SELECT * EXCEPT(rn)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_timestamp ASC) AS rn
FROM src_events
WHERE event_timestamp BETWEEN @start_ts AND @end_ts
) t
WHERE rn = 1;
-- Upsert by event_id
MERGE INTO dw.fact_events tgt
USING stg_events src
ON tgt.event_id = src.event_id
WHEN MATCHED THEN UPDATE SET
event_timestamp = src.event_timestamp,
user_id = src.user_id,
event_type = src.event_type,
loaded_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (event_id, event_timestamp, user_id, event_type, loaded_at)
VALUES (src.event_id, src.event_timestamp, src.user_id, src.event_type, CURRENT_TIMESTAMP);
Example 3: Partitioned backfill with shadow swap
Goal: Recompute last 30 days for a dimension, validate, then swap.
-- Rebuild into a shadow table: keep untouched history and recompute only the backfill range,
-- so the swap below does not drop rows outside the rebuilt window
CREATE OR REPLACE TABLE dw.dim_customer_shadow AS
SELECT *
FROM dw.dim_customer
WHERE date_key NOT BETWEEN @start_date AND @end_date
UNION ALL
SELECT *
FROM dim_customer_rebuild_logic
WHERE date_key BETWEEN @start_date AND @end_date;
-- Validate rows and uniqueness
-- (Example checks; adapt to your warehouse)
-- 1) count comparison vs expected
-- 2) no duplicates on natural key
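-- A minimal sketch of the two checks above, assuming customer_id is the natural key and
-- the dimension has one row per customer per date_key (hypothetical grain; adapt to your schema)
-- Count comparison for the rebuilt range: shadow vs. current production
SELECT
  (SELECT COUNT(*) FROM dw.dim_customer_shadow WHERE date_key BETWEEN @start_date AND @end_date) AS shadow_rows,
  (SELECT COUNT(*) FROM dw.dim_customer WHERE date_key BETWEEN @start_date AND @end_date) AS current_rows;
-- Duplicate check on the natural key: should return zero rows before you swap
SELECT customer_id, date_key, COUNT(*) AS cnt
FROM dw.dim_customer_shadow
GROUP BY customer_id, date_key
HAVING COUNT(*) > 1;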
-- Swap (syntax varies by warehouse; the rename-and-drop sequence below is not strictly atomic --
-- some warehouses offer a single-statement swap, e.g., ALTER TABLE ... SWAP WITH)
ALTER TABLE dw.dim_customer RENAME TO dim_customer_old;
ALTER TABLE dw.dim_customer_shadow RENAME TO dim_customer;
DROP TABLE dw.dim_customer_old;
How to design an idempotent pipeline (checklist)
Pre-flight
- Identify a stable key per entity (order_id, event_id, business natural key).
- Add audit columns (loaded_at, updated_at) and consider a source_hash for change detection.
- Decide on SCD1 vs SCD2 behavior and encode rules.
- Define the watermark and overlap window (e.g., 48 hours); a minimal watermark table sketch follows this list.
- Choose partition strategy for backfills (daily preferred).
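For reference, a minimal watermark table sketch matching the etl_watermarks usage in Example 1 (type names and constraint enforcement vary by warehouse):
CREATE TABLE IF NOT EXISTS etl_watermarks (
  job VARCHAR NOT NULL,        -- one row per pipeline/job, e.g., 'orders'
  last_processed_at TIMESTAMP, -- advanced only after a successful load
  PRIMARY KEY (job)            -- may be informational only in some warehouses
);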
During runs
- Stage and deduplicate with window functions before MERGE.
- Use MERGE/UPSERT, never blind INSERT for incrementals.
- Validate uniqueness of target key after load.
- Update watermark only after successful commit.
Backfills
- Backfill in bounded partitions (day/week); see the delete-and-reload sketch after this list.
- Use shadow tables and atomic swaps.
- Compare counts and key uniqueness before swap.
- Keep runbooks and be ready to re-run safely.
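A minimal sketch of one bounded partition run against the Example 2 tables, assuming a hypothetical @backfill_date parameter and a warehouse that supports explicit transactions: deleting exactly the bounded range and reloading it from deduplicated staging is idempotent because every re-run rebuilds the same slice from the same inputs.
-- Rebuild exactly one day: remove the bounded range, then reload it from staging
BEGIN TRANSACTION;
DELETE FROM dw.fact_events
WHERE CAST(event_timestamp AS DATE) = @backfill_date;
INSERT INTO dw.fact_events (event_id, event_timestamp, user_id, event_type, loaded_at)
SELECT event_id, event_timestamp, user_id, event_type, CURRENT_TIMESTAMP
FROM stg_events
WHERE CAST(event_timestamp AS DATE) = @backfill_date;
COMMIT;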
Exercises
Try these hands-on tasks. They mirror the exercises listed below this lesson:
- Exercise 1: Build an idempotent daily order load using a staging dedup and a MERGE into dw.orders. Include a watermark with overlap.
- Exercise 2: Backfill the last 90 days of events safely using partitioned runs and a shadow table swap. Add validation checks.
Need a nudge?
- Use ROW_NUMBER() over your business key to pick the latest row.
- MERGE by the unique key; UPDATE on match, INSERT on no match.
- Reprocess an overlap window to capture late changes.
- Validate: row counts, uniqueness, and a small sample of business metrics.
Common mistakes and self-check
- Blind INSERTs into incrementals cause duplicates. Self-check: I use MERGE/UPSERT for incrementals.
- No dedup in staging leads to conflicting updates. Self-check: I apply ROW_NUMBER() to pick one winner per key.
- Advancing the watermark too early breaks future loads. Self-check: I update the watermark only after a successful commit.
- No overlap window misses late-arriving updates. Self-check: I always include a safe overlap (e.g., 1–2 days).
- Monolithic backfills rewrite entire tables at once. Self-check: I backfill by partition with validation per batch.
Practical projects
- Retail orders pipeline: Build an idempotent incremental load for orders and order_items with a daily backfill job. Acceptance: zero duplicate order_id, stable revenue totals after re-runs.
- Event analytics: Create a fact_events table with unique event_id, overlap reprocessing, and partitioned backfills. Acceptance: unique(event_id) holds, 90-day backfill validated with counts.
- Customer dimension (SCD2): Implement history changes with valid_from/valid_to and a partitioned rebuild. Acceptance: no overlapping validity per customer key, atomic swap on backfill.
Mini challenge
Your incremental job for payments uses updated_at > last_watermark, but late updates are common. Redesign it to be idempotent and resilient.
A possible plan:
- Stage payments where updated_at >= last_watermark - 2 days.
- Deduplicate by payment_id keeping the latest updated_at.
- MERGE into dw.payments by payment_id (update on match, insert on no match).
- Validate unique(payment_id) and sample totals.
- Advance watermark to max(updated_at) from staged data.
Who this is for
- Analytics Engineers building robust ELT/ETL models.
- Data Engineers maintaining reliable incremental pipelines.
- BI Developers who need backfills that won’t corrupt metrics.
Prerequisites
- Comfort with SQL (JOINs, window functions, MERGE/UPSERT).
- Basic understanding of warehouse partitioning and staging tables.
- Familiarity with SCD1/SCD2 concepts.
Learning path
- Master staging and dedup with window functions.
- Practice MERGE patterns for incrementals (SCD1 and SCD2).
- Implement watermarks with overlap.
- Run a partitioned backfill with shadow swap and validation.
- Automate checks (uniqueness, counts) and re-run safely.
Next steps
- Complete the two exercises below.
- Take the quick test to check your understanding.
- Apply these patterns to a real table in your environment using a small date range.