Who this is for
- Data Engineers building batch or streaming pipelines.
- Analytics Engineers managing incremental models.
- Platform Engineers maintaining ingestion frameworks.
Prerequisites
- Comfort with SQL (INSERT, MERGE/UPSERT, DELETE).
- Basic understanding of batch files, partitions, and/or CDC streams.
- Familiarity with your warehouse or lake write semantics (atomic rename, transactions).
Learning path
- Grasp idempotency: same input, same end state.
- Learn core patterns: staging+MERGE, watermarks, dedup, atomic file moves.
- Apply to batch and streaming examples.
- Design a safe backfill plan with checkpoints and validation.
- Practice with exercises; take the quick test.
Why this matters
Real pipelines fail and retry. Without idempotency, retries create duplicates, gaps, or corrupt metrics. Backfills are routine: schema bugs, logic fixes, or new dimensions require rebuilding history without breaking dashboards. Employers expect you to run pipelines repeatedly with the same result, and to restore historical truth safely and predictably.
- Daily production task: re-run a failed job without duplicating yesterday’s orders.
- Incident response: reload a month after fixing a bug in mapping logic.
- Migration: move from one ingestion tool to another, validating parity.
Concept explained simply
Idempotency means: applying the same load multiple times results in the same final dataset. Think of it like setting a light switch to "on": setting it to on twice leaves the light in exactly the same state as setting it once. The end state matters, not the number of actions.
Mental model
- Desired end state, not number of writes: The system should converge to the correct dataset regardless of retries.
- Write intent, then reconcile: Land data safely, then reconcile with a deterministic rule (e.g., primary key+timestamp).
- Checkpoints as promises: Watermarks, run_ids, and version columns express what has been processed and how conflicts are resolved.
Core patterns
1) Stage + MERGE (UPSERT)
Land inputs in a staging area (write-once). Reconcile into the target using keys.
MERGE INTO target t
USING staged s
ON t.id = s.id
WHEN MATCHED AND s.updated_at >= t.updated_at THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...)
- Idempotent because the same staged rows MERGE to the same end state.
- Prefer last-write-wins using an event time or version.
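The stage + MERGE pattern can be sketched locally with SQLite's INSERT ... ON CONFLICT as a stand-in for warehouse MERGE. This is a minimal sketch, not a production loader; the table and column names (orders, id, amount, updated_at) are hypothetical.

```python
# Stage + upsert with last-write-wins on updated_at.
# SQLite's ON CONFLICT plays the role of MERGE here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)

def upsert(staged_rows):
    # Re-applying the same staged rows converges to the same end state:
    # newer versions win, older or duplicate versions are ignored.
    conn.executemany(
        """INSERT INTO orders (id, amount, updated_at) VALUES (?, ?, ?)
           ON CONFLICT(id) DO UPDATE SET
             amount = excluded.amount,
             updated_at = excluded.updated_at
           WHERE excluded.updated_at >= orders.updated_at""",
        staged_rows,
    )
    conn.commit()

staged = [(1, 10.0, "2025-05-01"), (2, 20.0, "2025-05-01")]
upsert(staged)
upsert(staged)  # retry: no duplicates, identical final state
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
```

Running the load twice leaves exactly two rows, which is the whole point: the retry is invisible in the target.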
2) Atomic file moves + deterministic names
- Write temp files, then atomically rename to final path (prevents partial reads).
- Use unique, deterministic file names per source batch (e.g., partition_date=2025-05-01/run_id=20250501_01).
- Readers ignore temp files; retries overwrite the same path or are skipped.
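A sketch of the write-temp-then-rename step, assuming a POSIX filesystem where os.replace is atomic within one mount point. The directory layout and run_id scheme are illustrative.

```python
# Write to a temp file, then atomically publish under a deterministic name.
import os
import tempfile

def publish(data: bytes, final_path: str) -> None:
    os.makedirs(os.path.dirname(final_path), exist_ok=True)
    # Temp file lives in the same directory as the target so the rename
    # stays on one filesystem and is therefore atomic.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(final_path), prefix="_tmp_")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, final_path)  # readers never observe a partial file

base = tempfile.mkdtemp()
final = os.path.join(base, "partition_date=2025-05-01", "run_id=20250501_01.snappy")
publish(b"batch-1", final)
publish(b"batch-1", final)  # retry overwrites the same path: one file, same bytes
```

Because the final name is deterministic per batch, a retry replaces the file in place instead of accumulating near-duplicates.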
3) Watermarks and checkpoints
- High-watermark: last processed offset/timestamp.
- Low-watermark: reprocess window to catch late events (e.g., re-read last 2 days).
- Store in durable state (table, log, or metadata file).
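The watermark idea can be sketched as a small piece of durable state plus a safety window. The JSON file path and the 2-day window are assumptions for illustration; a metadata table works equally well.

```python
# High-watermark with a re-read safety window, persisted atomically.
import json
import os
from datetime import datetime, timedelta

STATE = "state/watermark.json"  # durable checkpoint state

def read_watermark(default="1970-01-01T00:00:00"):
    if os.path.exists(STATE):
        with open(STATE) as f:
            return json.load(f)["high_watermark"]
    return default

def window_start(safety_days=2):
    # Each run re-reads from hw - safety window to catch late events.
    hw = datetime.fromisoformat(read_watermark())
    return hw - timedelta(days=safety_days)

def advance(max_seen: str):
    os.makedirs(os.path.dirname(STATE), exist_ok=True)
    tmp = STATE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"high_watermark": max_seen}, f)
    os.replace(tmp, STATE)  # atomic checkpoint update

advance("2025-05-01T00:00:00")
print(window_start())  # → 2025-04-29 00:00:00
```

Writing the checkpoint via temp file + atomic rename means a crash mid-checkpoint leaves the old watermark intact, so the next run simply reprocesses the window.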
4) Deduplication
- Use primary key + event_time or a unique event_id.
- In batch: SELECT ... QUALIFY ROW_NUMBER() OVER (PARTITION BY key ORDER BY event_time DESC)=1.
- In streaming: stateful dedup within a TTL window or via target MERGE.
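The batch dedup rule can be sketched in plain Python as the analogue of ROW_NUMBER() OVER (PARTITION BY key ORDER BY event_time DESC) = 1. Field names here are illustrative.

```python
# Keep the latest row per key; re-running on its own output changes nothing.
def dedup_latest(rows, key="id", ts="event_time"):
    latest = {}
    for r in rows:
        cur = latest.get(r[key])
        if cur is None or r[ts] >= cur[ts]:
            latest[r[key]] = r  # last-write-wins on event_time
    return list(latest.values())

events = [
    {"id": 1, "event_time": "10:00", "v": "a"},
    {"id": 1, "event_time": "10:05", "v": "b"},  # later version wins
    {"id": 2, "event_time": "10:01", "v": "c"},
]
print(dedup_latest(events))  # id 1 keeps v="b"
```

Note that dedup is itself idempotent: applying it to already-deduped rows is a no-op, which is exactly the property retries rely on.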
5) Snapshot vs incremental
- Snapshot (full replace): INSERT OVERWRITE partition or CREATE OR REPLACE table. Simple, costly, but idempotent.
- Incremental: only new/changed records; requires keys, dedup, and watermarks.
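The snapshot/full-replace variant is worth seeing in miniature: the end state depends only on the input, never on how many times the job ran. This toy model uses a dict as a stand-in for partitioned storage.

```python
# Full-replace per partition: overwrite wholesale, never append.
target = {}  # partition_date -> rows

def overwrite_partition(partition_date, rows):
    target[partition_date] = list(rows)  # replaces any prior contents

src = [{"id": 1}, {"id": 2}]
overwrite_partition("2025-05-01", src)
overwrite_partition("2025-05-01", src)  # retry: still exactly 2 rows
print(len(target["2025-05-01"]))  # → 2
```

An append-based load would hold 4 rows after the retry; replace semantics are what make the snapshot pattern trivially idempotent.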
6) CDC and exactly-once
- Exactly-once delivery is rare; prefer at-least-once + idempotent apply.
- CDC MERGE by primary key + commit version to prevent double apply.
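The CDC apply rule can be sketched as a MERGE keyed by primary key and guarded by commit_version, so a replayed event can never move the table backwards. The op codes (I/U/D) and event shape are assumptions mirroring the text above.

```python
# Idempotent CDC apply: at-least-once delivery + version-guarded merge.
table = {}  # id -> {"commit_version": int, "data": ...}

def apply(event):
    pk, ver = event["id"], event["commit_version"]
    cur = table.get(pk)
    if cur is not None and ver <= cur["commit_version"]:
        return  # duplicate or stale replay: already applied, skip
    if event["op"] == "D":
        table.pop(pk, None)
    else:  # I or U
        table[pk] = {"commit_version": ver, "data": event["data"]}

stream = [
    {"id": 1, "op": "I", "commit_version": 1, "data": "a"},
    {"id": 1, "op": "U", "commit_version": 2, "data": "b"},
    {"id": 1, "op": "U", "commit_version": 2, "data": "b"},  # replay: ignored
]
for e in stream:
    apply(e)
print(table[1]["data"])  # → "b"
```

One caveat this sketch glosses over: a production apply keeps a tombstone for deletes, so a replayed older event cannot resurrect a deleted row.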
Worked examples
Example 1: Daily batch to warehouse with retries
- Write raw file to temp: tmp/orders/2025-05-01/_run_1.snappy
- Atomic rename: raw/orders/partition_date=2025-05-01/run_id=20250501_01.snappy
- Load staged table from that file (append-only).
- MERGE staged into dim_orders by order_id with last-write-wins on updated_at.
Retry writes the same file path and the same staged rows; MERGE produces the same target state.
Example 2: Streaming CDC to warehouse
- Source: CDC events with op (I/U/D), commit_version, pk id.
- Apply with MERGE on (id) and choose the highest commit_version per id.
- Delete events: WHEN MATCHED AND s.op='D' THEN DELETE.
Replayed messages are ignored because commit_version is not greater than the applied one.
Example 3: Backfill a new column with minimal downtime
- Create shadow table target_v2 with the new column.
- Backfill historical data into target_v2 by partitions with MERGE (id + updated_at).
- Dual-write new increments to both tables for a window.
- Cutover: atomically swap the tables (or repoint a view/alias).
Idempotent because each partition MERGE can be retried; cutover is a single atomic step.
Example 4: Late-arriving events
- Process window: every run reads data newer than high_watermark - 2 days.
- Dedup by event_id or (key, event_time).
- Advance high_watermark to max seen time; on retry, same rows converge.
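The late-arrival example above can be sketched end to end: re-read a safety window behind the watermark, write keyed by event_id, and advance the watermark to the max time seen. Event shapes and the hardcoded 2-day window are illustrative.

```python
# Late-arriving events: safety window + event_id dedup => retries converge.
target = {}  # event_id -> event (idempotent write keyed by event_id)
high_watermark = "2025-05-01"

def run(source_events):
    global high_watermark
    # Re-read everything inside high_watermark - 2 days (hardcoded here).
    window = [e for e in source_events if e["time"] >= "2025-04-29"]
    for e in window:
        target[e["event_id"]] = e  # same event_id overwrites, never duplicates
    if window:
        high_watermark = max(e["time"] for e in window)

events = [
    {"event_id": "a", "time": "2025-04-30"},  # late arrival inside the window
    {"event_id": "b", "time": "2025-05-02"},
]
run(events)
run(events)  # retry: same rows, same watermark
print(len(target), high_watermark)  # → 2 2025-05-02
```

The retry reprocesses the same window, but because writes are keyed by event_id the target converges rather than doubling.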
Exercises
Do these before the test. Tip: keep outputs concise and deterministic.
Exercise 1 — Design an idempotent daily load
- Define file naming, staging, and target MERGE keys.
- Explain how a retry yields the same end state.
- Add a simple data quality check you can re-run safely.
Exercise 2 — Plan a safe backfill for a year
- Pick partitioning strategy and window size.
- Describe validation and cutover steps.
- Explain rollback if a partition fails.
Exercise checklist
- Uses deterministic keys (pk + version/time).
- Includes a write-once staging or atomic file move.
- Explains retry behavior clearly.
- Defines watermarks and validation queries.
- Specifies rollback for backfills.
Common mistakes and self-check
- Mistake: Blind INSERT into target on every retry. Fix: use MERGE or INSERT OVERWRITE by partition.
- Mistake: Random file names per retry. Fix: deterministic paths or run_id, plus atomic rename.
- Mistake: No dedup key. Fix: enforce primary key + event_time or unique event_id.
- Mistake: Advancing watermark without a safety window. Fix: re-read N days to capture late data.
- Mistake: Backfill writes into live table without validation. Fix: use shadow tables or partitions + validation, then swap.
Self-check prompts
- If you re-run the last job 3 times, what exactly changes in the target?
- Which single column(s) decide who wins on conflict?
- Where is your watermark stored, and how do you recompute it?
- How would you delete or correct bad rows idempotently?
Practical projects
- Build a mini pipeline: local files to a SQLite/Postgres warehouse with a staging table and a MERGE/UPSERT. Include a simple retry script.
- Implement a backfill: compute a new metric for 90 days by daily partitions. Validate row counts and aggregates per day; add a cutover script.
- CDC playground: simulate inserts/updates/deletes with commit_version; apply idempotently and prove duplicates do not change results.
Quick Test availability
The quick test is available to everyone. Only logged-in users will have their progress saved.
Mini challenge
Your orders table has duplicates because a job retried mid-run. Without truncating the table, write a single SQL statement (or sequence) that removes duplicates using (order_id, updated_at) with last-write-wins, and make the process re-runnable without side effects.
Hint
Think staging + QUALIFY/ROW_NUMBER or MERGE with a deduped subquery.
Next steps
- Harden your pipelines: add a small safety window to all incremental reads.
- Create a run book for backfills: checklist, validation queries, and rollback steps.
- Add audits: counts per partition and max(updated_at) to detect drift quickly.