Idempotent Loads And Backfills

Learn Idempotent Loads And Backfills for free with explanations, exercises, and a quick test (for Analytics Engineers).

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

Analytics Engineers frequently re-run pipelines: to fix bugs, load late-arriving data, or rebuild models after logic changes. Without idempotent loads, you risk duplicate rows, broken reports, and hard-to-debug side effects. Backfills must be safe, repeatable, and predictable so business metrics remain trustworthy.

  • Reprocess historical orders after a pricing bug without double-counting revenue.
  • Load late events into a fact table while keeping unique keys intact.
  • Migrate logic for dimensions (SCD1/SCD2) across months of data safely.

Concept explained simply

An idempotent load is one where running the same load multiple times produces the same final table: no extra duplicates, no drift. A backfill is the reprocessing of historical data (a day, a month, or the entire history) to correct or add records.
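
A tiny illustration of the difference, using the dw.orders and stg_orders tables from the worked examples below (column list trimmed for brevity):

-- NOT idempotent: running this twice inserts every staged row twice
INSERT INTO dw.orders (order_id, status, amount)
SELECT order_id, status, amount FROM stg_orders;

-- Idempotent: running this twice leaves dw.orders in the same state as after the first run
MERGE INTO dw.orders AS tgt
USING stg_orders AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET status = src.status, amount = src.amount
WHEN NOT MATCHED THEN INSERT (order_id, status, amount)
VALUES (src.order_id, src.status, src.amount);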

Mental model

Think of the target table as a ledger with strict rules: each record is identified by a stable key, and every run either updates the existing entry or inserts a new one only when it truly does not exist. Backfills play the role of calm, methodical auditors: they revisit specific periods, recompute truth with the same rules, and swap in clean results atomically.

Key ingredients of idempotency
  • Deterministic keys: Business or surrogate keys that uniquely identify a row (e.g., order_id).
  • Upsert semantics: Use MERGE/UPSERT instead of blind INSERT.
  • Deduplication: Remove source duplicates in staging with window functions.
  • Audit columns: loaded_at, updated_at, source_hash for change detection (see the sketch after this list).
  • Watermarks: Track processed ranges (e.g., max updated_at) with an overlap buffer.
  • Partitioning: Backfill by day/month to constrain blast radius.
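
For the audit-column ingredient, a source_hash lets the load detect real changes and skip no-op updates. A minimal sketch, assuming an MD5-style hash function and the order columns used in the worked examples (the hash expression itself is illustrative):

-- Hash the business payload in staging
SELECT
  order_id,
  status,
  amount,
  updated_at,
  MD5(CONCAT(COALESCE(status, ''), '|', CAST(amount AS VARCHAR))) AS source_hash
FROM src_orders;

-- Later, in the MERGE:
-- WHEN MATCHED AND tgt.source_hash <> src.source_hash THEN UPDATE ...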

Core patterns

  • Staging with dedup: Stage source rows and deduplicate using ROW_NUMBER() over a business key and updated_at.
  • Idempotent upsert: Use MERGE ON key. WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT.
  • SCD handling:
    • SCD1 (overwrite): Update fields in place via MERGE.
    • SCD2 (history): Close old record (valid_to), insert new record (valid_from), using deterministic natural key plus version logic (see the sketch below this list).
  • Partitioned backfills: Process a bounded date range, validate, then atomically swap into production tables.
  • Shadow tables: Write backfill results to a shadow table, validate, then swap/rename to minimize downtime and risk.
  • Watermarks with overlap: Store last_processed_at but always reprocess a small overlap window (e.g., last 2 days) to capture late updates. This remains idempotent because you MERGE.
  • Idempotent deletes: For hard deletes, prefer MERGE with conditional delete or rebuild partition + swap, not unbounded DELETE without guardrails.
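
For the SCD2 pattern, one common shape is a two-step load: close the changed current rows, then insert new versions. A minimal sketch, assuming a dw.dim_customer table keyed by customer_id with valid_from, valid_to, and is_current, and a deduplicated stg_customers staging table carrying updated_at (attribute columns like email and segment are illustrative):

-- 1) Close the current row when tracked attributes changed
MERGE INTO dw.dim_customer AS tgt
USING stg_customers AS src
ON tgt.customer_id = src.customer_id AND tgt.is_current = TRUE
WHEN MATCHED AND (tgt.email <> src.email OR tgt.segment <> src.segment)
THEN UPDATE SET
  valid_to   = src.updated_at,
  is_current = FALSE;

-- 2) Open a new current row for new keys and for keys whose current row was just closed
INSERT INTO dw.dim_customer (customer_id, email, segment, valid_from, valid_to, is_current)
SELECT src.customer_id, src.email, src.segment, src.updated_at, NULL, TRUE
FROM stg_customers AS src
LEFT JOIN dw.dim_customer AS tgt
  ON tgt.customer_id = src.customer_id AND tgt.is_current = TRUE
WHERE tgt.customer_id IS NULL;

-- Re-running with the same staging data changes nothing: step 1 finds no attribute
-- differences, and step 2 finds a current row for every staged key. Comparing a
-- source_hash instead of individual columns also handles NULLs more cleanly.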

Worked examples

Example 1: Idempotent incremental load (orders)

Goal: Upsert orders from a source where rows can change later (status or amount updated). The SQL in these examples is illustrative; functions such as EXCEPT and DATEADD vary by warehouse.

-- 1) Stage and deduplicate by order_id, choose the latest updated_at
CREATE OR REPLACE TEMP TABLE stg_orders AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM src_orders
  WHERE updated_at >= DATEADD(day, -2, (SELECT COALESCE(MAX(last_processed_at), '1900-01-01') FROM etl_watermarks WHERE job='orders'))
) t
WHERE rn = 1;

-- 2) Idempotent upsert into target
MERGE INTO dw.orders AS tgt
USING stg_orders AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
  status       = src.status,
  amount       = src.amount,
  updated_at   = src.updated_at,
  loaded_at    = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (
  order_id, status, amount, updated_at, loaded_at
) VALUES (
  src.order_id, src.status, src.amount, src.updated_at, CURRENT_TIMESTAMP
);

-- 3) Advance watermark
MERGE INTO etl_watermarks w
USING (SELECT 'orders' AS job, MAX(updated_at) AS ts FROM stg_orders) s
ON w.job = s.job
WHEN MATCHED THEN UPDATE SET last_processed_at = GREATEST(w.last_processed_at, s.ts)
WHEN NOT MATCHED THEN INSERT (job, last_processed_at) VALUES (s.job, s.ts);

Example 2: Event fact with unique event_id and dedup

Goal: Ensure each event_id appears once even if the source sends duplicates.

-- Stage: keep only the first occurrence by event_id, prefer the earliest event_timestamp
CREATE OR REPLACE TEMP TABLE stg_events AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_timestamp ASC) AS rn
  FROM src_events
  WHERE event_timestamp BETWEEN @start_ts AND @end_ts
) t
WHERE rn = 1;

-- Upsert by event_id
MERGE INTO dw.fact_events tgt
USING stg_events src
ON tgt.event_id = src.event_id
WHEN MATCHED THEN UPDATE SET
  event_timestamp = src.event_timestamp,
  user_id         = src.user_id,
  event_type      = src.event_type,
  loaded_at       = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (event_id, event_timestamp, user_id, event_type, loaded_at)
VALUES (src.event_id, src.event_timestamp, src.user_id, src.event_type, CURRENT_TIMESTAMP);

Example 3: Partitioned backfill with shadow swap

Goal: Recompute last 30 days for a dimension, validate, then swap.

-- Rebuild into a shadow table: recomputed range plus untouched history,
-- so the full-table swap below does not drop rows outside the backfill window
CREATE OR REPLACE TABLE dw.dim_customer_shadow AS
SELECT *
FROM dim_customer_rebuild_logic
WHERE date_key BETWEEN @start_date AND @end_date
UNION ALL
SELECT *
FROM dw.dim_customer
WHERE date_key NOT BETWEEN @start_date AND @end_date;

-- Validate row counts and key uniqueness before swapping
-- (example checks below; adapt to your warehouse)
-- 1) count comparison vs expected
-- 2) no duplicates on natural key
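
-- A minimal validation sketch (assumes the natural key is customer_id + date_key;
-- the expected count would come from the rebuild source):
SELECT COUNT(*) AS shadow_row_count
FROM dw.dim_customer_shadow;            -- compare against the expected count

SELECT customer_id, date_key, COUNT(*) AS dup_count
FROM dw.dim_customer_shadow
GROUP BY customer_id, date_key
HAVING COUNT(*) > 1;                    -- must return zero rows before the swap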

-- Atomic swap (warehouse-specific syntax may vary)
ALTER TABLE dw.dim_customer RENAME TO dim_customer_old;
ALTER TABLE dw.dim_customer_shadow RENAME TO dim_customer;
DROP TABLE dw.dim_customer_old;

How to design an idempotent pipeline (checklist)

Pre-flight

  • Identify a stable key per entity (order_id, event_id, business natural key).
  • Add audit columns (loaded_at, updated_at) and consider a source_hash for change detection.
  • Decide on SCD1 vs SCD2 behavior and encode rules.
  • Define watermark and overlap window (e.g., 48 hours); a minimal watermark-table sketch follows this list.
  • Choose partition strategy for backfills (daily preferred).
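
The examples in this lesson read and write a small control table for watermarks. A minimal sketch of that table, assuming one row per job (types and constraints vary by warehouse):

-- One row per job; last_processed_at drives the incremental filter and overlap window
CREATE TABLE IF NOT EXISTS etl_watermarks (
  job                VARCHAR NOT NULL,
  last_processed_at  TIMESTAMP NOT NULL,
  PRIMARY KEY (job)
);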

During runs

  • Stage and deduplicate with window functions before MERGE.
  • Use MERGE/UPSERT, never blind INSERT for incrementals.
  • Validate uniqueness of target key after load (see the check after this list).
  • Update watermark only after successful commit.
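
One way to back the last two items: run a duplicate check on the target key and advance the watermark only if it returns zero rows. A minimal sketch against dw.orders from Example 1:

-- Must return zero rows; otherwise fail the run before touching the watermark
SELECT order_id, COUNT(*) AS dup_count
FROM dw.orders
GROUP BY order_id
HAVING COUNT(*) > 1;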

Backfills

  • Backfill in bounded partitions (day/week).
  • Use shadow tables and atomic swaps.
  • Compare counts and key uniqueness before swap.
  • Keep runbooks and be ready to re-run safely.

Exercises

Try these hands-on tasks. They mirror the exercises at the end of this lesson:

  1. Exercise 1: Build an idempotent daily order load using a staging dedup and a MERGE into dw.orders. Include a watermark with overlap.
  2. Exercise 2: Backfill the last 90 days of events safely using partitioned runs and a shadow table swap. Add validation checks.

Need a nudge?
  • Use ROW_NUMBER() over your business key to pick the latest row.
  • MERGE by the unique key; UPDATE on match, INSERT on no match.
  • Reprocess an overlap window to capture late changes.
  • Validate: row counts, uniqueness, and a small sample of business metrics.

Common mistakes and self-check

  • Blind INSERTs into incrementals cause duplicates.
    • I use MERGE/UPSERT for incrementals.
  • No dedup in staging leads to conflicting updates.
    • I apply ROW_NUMBER() to pick one winner per key.
  • Advancing watermark too early breaks future loads.
    • I update the watermark only after successful commit.
  • No overlap window misses late-arriving updates.
    • I always include a safe overlap (e.g., 1–2 days).
  • Monolithic backfills that rewrite entire tables at once.
    • I backfill by partition with validation per batch.

Practical projects

  • Retail orders pipeline: Build an idempotent incremental load for orders and order_items with a daily backfill job. Acceptance: zero duplicate order_id, stable revenue totals after re-runs.
  • Event analytics: Create a fact_events table with unique event_id, overlap reprocessing, and partitioned backfills. Acceptance: unique(event_id) holds, 90-day backfill validated with counts.
  • Customer dimension (SCD2): Implement history changes with valid_from/valid_to and a partitioned rebuild. Acceptance: no overlapping validity per customer key, atomic swap on backfill.

Mini challenge

Your incremental job for payments uses updated_at > last_watermark, but late updates are common. Redesign it to be idempotent and resilient.

A possible plan:
  1. Stage payments where updated_at >= last_watermark - 2 days.
  2. Deduplicate by payment_id keeping the latest updated_at.
  3. MERGE into dw.payments by payment_id (update on match, insert on no match).
  4. Validate unique(payment_id) and sample totals.
  5. Advance watermark to max(updated_at) from staged data.

Who this is for

  • Analytics Engineers building robust ELT/ETL models.
  • Data Engineers maintaining reliable incremental pipelines.
  • BI Developers who need backfills that won’t corrupt metrics.

Prerequisites

  • Comfort with SQL (JOINs, window functions, MERGE/UPSERT).
  • Basic understanding of warehouse partitioning and staging tables.
  • Familiarity with SCD1/SCD2 concepts.

Learning path

  1. Master staging and dedup with window functions.
  2. Practice MERGE patterns for incrementals (SCD1 and SCD2).
  3. Implement watermarks with overlap.
  4. Run a partitioned backfill with shadow swap and validation.
  5. Automate checks (uniqueness, counts) and re-run safely.

Next steps

  • Complete the two exercises below.
  • Take the quick test to check your understanding.
  • Apply these patterns to a real table in your environment using a small date range.

Practice Exercises

2 exercises to complete

Exercise 1: Instructions

You have src_orders(order_id, status, amount, updated_at). Target is dw.orders with the same columns plus loaded_at. Create an idempotent daily incremental load that:

  • Stages and deduplicates by order_id keeping the row with the latest updated_at.
  • Processes rows where updated_at >= last_watermark - 2 days (overlap).
  • MERGEs into dw.orders (update on match, insert on no match).
  • Updates a table etl_watermarks(job, last_processed_at) for job='orders' after success.

Provide SQL or pseudo-SQL that would run safely multiple times without creating duplicates.

Expected Output
Running the job repeatedly yields the same dw.orders contents for the processed period; unique(order_id) holds; last_processed_at advances to the max updated_at from staged data.

Idempotent Loads And Backfills — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.
