Why this matters
Analytics Engineers frequently re-run pipelines: to fix bugs, load late-arriving data, or rebuild models after logic changes. Without idempotent loads, you risk duplicate rows, broken reports, and hard-to-debug side effects. Backfills must be safe, repeatable, and predictable so business metrics remain trustworthy.
- Reprocess historical orders after a pricing bug without double-counting revenue.
- Load late events into a fact table while keeping unique keys intact.
- Migrate logic for dimensions (SCD1/SCD2) across months of data safely.
Concept explained simply
Idempotent load means running the same load multiple times produces the same final table. No extra duplicates, no drift. A backfill is reprocessing historical data (a day, month, or entire history) to correct or add data.
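As a minimal illustration (toy table t keyed by k, source src, both hypothetical names): a blind INSERT grows the table on every re-run, while a MERGE converges to the same final state no matter how many times it runs.
-- Not idempotent: each re-run appends the same rows again
INSERT INTO t (k, v) SELECT k, v FROM src;
-- Idempotent: once src has been applied, re-running changes nothing
MERGE INTO t USING src ON t.k = src.k
WHEN MATCHED THEN UPDATE SET v = src.v
WHEN NOT MATCHED THEN INSERT (k, v) VALUES (src.k, src.v);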
Mental model
Think of the target table as a ledger with strict rules: each record is identified by a stable key, and every run either updates the existing entry or inserts a new one only when it truly does not exist. Backfills play the role of calm, methodical auditors: they revisit specific periods, recompute truth with the same rules, and swap in clean results atomically.
Key ingredients of idempotency
- Deterministic keys: Business or surrogate keys that uniquely identify a row (e.g., order_id).
- Upsert semantics: Use MERGE/UPSERT instead of blind INSERT.
- Deduplication: Remove source duplicates in staging with window functions.
- Audit columns: loaded_at, updated_at, and a source_hash for change detection (see the sketch after this list).
- Watermarks: Track processed ranges (e.g., max updated_at) with an overlap buffer.
- Partitioning: Backfill by day/month to constrain blast radius.
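A minimal sketch of the source_hash ingredient, assuming a hypothetical src_customers feed with name and email as the tracked columns (MD5 and CONCAT availability varies by warehouse): the hash lets the MERGE skip rows whose tracked attributes have not actually changed, so no-op re-runs touch nothing.
-- Compute a hash of the tracked columns in staging (hypothetical table and columns)
CREATE OR REPLACE TEMP TABLE stg_customers AS
SELECT
  customer_id,
  name,
  email,
  updated_at,
  MD5(CONCAT(COALESCE(name, ''), '|', COALESCE(email, ''))) AS source_hash
FROM src_customers;
-- Update only when the hash differs
MERGE INTO dw.customers AS tgt
USING stg_customers AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED AND tgt.source_hash <> src.source_hash THEN UPDATE SET
  name = src.name,
  email = src.email,
  source_hash = src.source_hash,
  updated_at = src.updated_at,
  loaded_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (customer_id, name, email, source_hash, updated_at, loaded_at)
VALUES (src.customer_id, src.name, src.email, src.source_hash, src.updated_at, CURRENT_TIMESTAMP);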
Core patterns
- Staging with dedup: Stage source rows and deduplicate using ROW_NUMBER() over a business key and updated_at.
- Idempotent upsert: Use MERGE ON key. WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT.
- SCD handling:
- SCD1 (overwrite): Update fields in place via MERGE.
- SCD2 (history): Close the old record (valid_to) and insert a new record (valid_from), using a deterministic natural key plus version logic; see the sketch after this list.
- Partitioned backfills: Process a bounded date range, validate, then atomically swap into production tables.
- Shadow tables: Write backfill results to a shadow table, validate, then swap/rename to minimize downtime and risk.
- Watermarks with overlap: Store last_processed_at but always reprocess a small overlap window (e.g., last 2 days) to capture late updates. This remains idempotent because you MERGE.
- Idempotent deletes: For hard deletes, prefer MERGE with conditional delete or rebuild partition + swap, not unbounded DELETE without guardrails.
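A minimal SCD2 sketch, assuming a hypothetical dim_product keyed by product_id with valid_from/valid_to, an is_current flag, and a source_hash for change detection: first close the current version when tracked attributes change, then insert the new version. Re-running it against the same staged data changes nothing further.
-- Step 1: close the current row when tracked attributes changed
MERGE INTO dw.dim_product AS tgt
USING stg_products AS src
ON tgt.product_id = src.product_id AND tgt.is_current = TRUE
WHEN MATCHED AND tgt.source_hash <> src.source_hash THEN UPDATE SET
  valid_to = src.updated_at,
  is_current = FALSE;
-- Step 2: insert a new current version for changed or brand-new keys
INSERT INTO dw.dim_product (product_id, name, price, source_hash, valid_from, valid_to, is_current)
SELECT src.product_id, src.name, src.price, src.source_hash, src.updated_at, NULL, TRUE
FROM stg_products AS src
LEFT JOIN dw.dim_product AS tgt
  ON tgt.product_id = src.product_id AND tgt.is_current = TRUE
WHERE tgt.product_id IS NULL              -- new key
   OR tgt.source_hash <> src.source_hash; -- changed attributes (only if step 1 did not run)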
Worked examples
Example 1: Idempotent incremental load (orders)
Goal: Upsert orders from a source where rows can change later (status or amount updated).
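-- Note: the syntax in these examples is illustrative; EXCEPT(rn), DATEADD, and @parameters are warehouse-specific (e.g., BigQuery vs. Snowflake/SQL Server), so adapt them to your dialect.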
-- 1) Stage and deduplicate by order_id, choose the latest updated_at
CREATE OR REPLACE TEMP TABLE stg_orders AS
SELECT * EXCEPT(rn)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
FROM src_orders
WHERE updated_at >= DATEADD(day, -2, (SELECT COALESCE(MAX(last_processed_at), '1900-01-01') FROM etl_watermarks WHERE job='orders'))
) t
WHERE rn = 1;
-- 2) Idempotent upsert into target
MERGE INTO dw.orders AS tgt
USING stg_orders AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
status = src.status,
amount = src.amount,
updated_at = src.updated_at,
loaded_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (
order_id, status, amount, updated_at, loaded_at
) VALUES (
src.order_id, src.status, src.amount, src.updated_at, CURRENT_TIMESTAMP
);
-- 3) Advance watermark
MERGE INTO etl_watermarks w
USING (SELECT 'orders' AS job, MAX(updated_at) AS ts FROM stg_orders) s
ON w.job = s.job
WHEN MATCHED THEN UPDATE SET last_processed_at = GREATEST(w.last_processed_at, COALESCE(s.ts, w.last_processed_at)) -- COALESCE guards against an empty staging batch
WHEN NOT MATCHED THEN INSERT (job, last_processed_at) VALUES (s.job, s.ts);
Example 2: Event fact with unique event_id and dedup
Goal: Ensure each event_id appears once even if the source sends duplicates.
-- Stage: keep only the first occurrence by event_id, prefer the earliest event_timestamp
CREATE OR REPLACE TEMP TABLE stg_events AS
SELECT * EXCEPT(rn)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_timestamp ASC) AS rn
FROM src_events
WHERE event_timestamp BETWEEN @start_ts AND @end_ts
) t
WHERE rn = 1;
-- Upsert by event_id
MERGE INTO dw.fact_events tgt
USING stg_events src
ON tgt.event_id = src.event_id
WHEN MATCHED THEN UPDATE SET
event_timestamp = src.event_timestamp,
user_id = src.user_id,
event_type = src.event_type,
loaded_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (event_id, event_timestamp, user_id, event_type, loaded_at)
VALUES (src.event_id, src.event_timestamp, src.user_id, src.event_type, CURRENT_TIMESTAMP);
Example 3: Partitioned backfill with shadow swap
Goal: Recompute last 30 days for a dimension, validate, then swap.
-- Rebuild into a shadow table: keep untouched history and recompute only the backfill range,
-- so the swap below does not drop rows outside the rebuilt window
CREATE OR REPLACE TABLE dw.dim_customer_shadow AS
SELECT *
FROM dw.dim_customer
WHERE date_key NOT BETWEEN @start_date AND @end_date
UNION ALL
SELECT *
FROM dim_customer_rebuild_logic
WHERE date_key BETWEEN @start_date AND @end_date;
-- Validate rows and uniqueness
-- (Example checks; adapt to your warehouse)
-- 1) count comparison vs expected
-- 2) no duplicates on natural key
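-- A minimal sketch of the two checks above, assuming customer_id is the natural key and
-- the dimension has one row per customer per date_key (hypothetical grain; adapt to your schema)
-- Count comparison for the rebuilt range: shadow vs. current production
SELECT
  (SELECT COUNT(*) FROM dw.dim_customer_shadow WHERE date_key BETWEEN @start_date AND @end_date) AS shadow_rows,
  (SELECT COUNT(*) FROM dw.dim_customer WHERE date_key BETWEEN @start_date AND @end_date) AS current_rows;
-- Duplicate check on the natural key: should return zero rows before you swap
SELECT customer_id, date_key, COUNT(*) AS cnt
FROM dw.dim_customer_shadow
GROUP BY customer_id, date_key
HAVING COUNT(*) > 1;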
-- Swap (syntax varies by warehouse; the rename-and-drop sequence below is not strictly atomic --
-- some warehouses offer a single-statement swap, e.g., ALTER TABLE ... SWAP WITH)
ALTER TABLE dw.dim_customer RENAME TO dim_customer_old;
ALTER TABLE dw.dim_customer_shadow RENAME TO dim_customer;
DROP TABLE dw.dim_customer_old;
How to design an idempotent pipeline (checklist)
Pre-flight
- Identify a stable key per entity (order_id, event_id, business natural key).
- Add audit columns (loaded_at, updated_at) and consider a source_hash for change detection.
- Decide on SCD1 vs SCD2 behavior and encode rules.
- Define the watermark and overlap window (e.g., 48 hours); a minimal watermark table sketch follows this list.
- Choose partition strategy for backfills (daily preferred).
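For reference, a minimal watermark table sketch matching the etl_watermarks usage in Example 1 (type names and constraint enforcement vary by warehouse):
CREATE TABLE IF NOT EXISTS etl_watermarks (
  job VARCHAR NOT NULL,        -- one row per pipeline/job, e.g., 'orders'
  last_processed_at TIMESTAMP, -- advanced only after a successful load
  PRIMARY KEY (job)            -- may be informational only in some warehouses
);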
During runs
- Stage and deduplicate with window functions before MERGE.
- Use MERGE/UPSERT, never blind INSERT for incrementals.
- Validate uniqueness of target key after load.
- Update watermark only after successful commit.
Backfills
- Backfill in bounded partitions (day/week); see the delete-and-reload sketch after this list.
- Use shadow tables and atomic swaps.
- Compare counts and key uniqueness before swap.
- Keep runbooks and be ready to re-run safely.
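A minimal sketch of one bounded partition run against the Example 2 tables, assuming a hypothetical @backfill_date parameter and a warehouse that supports explicit transactions: deleting exactly the bounded range and reloading it from deduplicated staging is idempotent because every re-run rebuilds the same slice from the same inputs.
-- Rebuild exactly one day: remove the bounded range, then reload it from staging
BEGIN TRANSACTION;
DELETE FROM dw.fact_events
WHERE CAST(event_timestamp AS DATE) = @backfill_date;
INSERT INTO dw.fact_events (event_id, event_timestamp, user_id, event_type, loaded_at)
SELECT event_id, event_timestamp, user_id, event_type, CURRENT_TIMESTAMP
FROM stg_events
WHERE CAST(event_timestamp AS DATE) = @backfill_date;
COMMIT;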
Exercises
Try these hands-on tasks. They mirror the exercises listed below this lesson:
- Exercise 1: Build an idempotent daily order load using a staging dedup and a MERGE into dw.orders. Include a watermark with overlap.
- Exercise 2: Backfill the last 90 days of events safely using partitioned runs and a shadow table swap. Add validation checks.
Need a nudge?
- Use ROW_NUMBER() over your business key to pick the latest row.
- MERGE by the unique key; UPDATE on match, INSERT on no match.
- Reprocess an overlap window to capture late changes.
- Validate: row counts, uniqueness, and a small sample of business metrics.
Common mistakes and self-check
- Blind INSERTs into incrementals cause duplicates. Self-check: I use MERGE/UPSERT for incrementals.
- No dedup in staging leads to conflicting updates. Self-check: I apply ROW_NUMBER() to pick one winner per key.
- Advancing the watermark too early breaks future loads. Self-check: I update the watermark only after a successful commit.
- No overlap window misses late-arriving updates. Self-check: I always include a safe overlap (e.g., 1–2 days).
- Monolithic backfills rewrite entire tables at once. Self-check: I backfill by partition with validation per batch.
Practical projects
- Retail orders pipeline: Build an idempotent incremental load for orders and order_items with a daily backfill job. Acceptance: zero duplicate order_id, stable revenue totals after re-runs.
- Event analytics: Create a fact_events table with unique event_id, overlap reprocessing, and partitioned backfills. Acceptance: unique(event_id) holds, 90-day backfill validated with counts.
- Customer dimension (SCD2): Implement history changes with valid_from/valid_to and a partitioned rebuild. Acceptance: no overlapping validity per customer key, atomic swap on backfill.
Mini challenge
Your incremental job for payments uses updated_at > last_watermark, but late updates are common. Redesign it to be idempotent and resilient.
A possible plan:
- Stage payments where updated_at >= last_watermark - 2 days.
- Deduplicate by payment_id keeping the latest updated_at.
- MERGE into dw.payments by payment_id (update on match, insert on no match).
- Validate unique(payment_id) and sample totals.
- Advance watermark to max(updated_at) from staged data.
Who this is for
- Analytics Engineers building robust ELT/ETL models.
- Data Engineers maintaining reliable incremental pipelines.
- BI Developers who need backfills that won’t corrupt metrics.
Prerequisites
- Comfort with SQL (JOINs, window functions, MERGE/UPSERT).
- Basic understanding of warehouse partitioning and staging tables.
- Familiarity with SCD1/SCD2 concepts.
Learning path
- Master staging and dedup with window functions.
- Practice MERGE patterns for incrementals (SCD1 and SCD2).
- Implement watermarks with overlap.
- Run a partitioned backfill with shadow swap and validation.
- Automate checks (uniqueness, counts) and re-run safely.
Next steps
- Complete the two exercises below.
- Take the quick test to check your understanding.
- Apply these patterns to a real table in your environment using a small date range.