Who this is for
ETL Developers, Data Engineers, and Analysts who need to load only changed data reliably from source systems into data warehouses or lakes.
Prerequisites
- Comfort with SQL (JOIN, WHERE, GROUP BY, window functions)
- Basic ETL/ELT concepts (extract, stage, transform, load)
- Familiarity with tables, primary keys, and timestamps
Why this matters
In real teams you will:
- Load millions of rows nightly without overwhelming sources or targets
- Refresh dashboards hourly with only changed records
- Handle late-arriving data and deletes correctly
- Backfill safely after outages without duplicates
Incremental logic turns these into fast, reliable, and cost-effective pipelines.
Concept explained simply
Incremental loading means processing only what changed since the last run. You remember a checkpoint (watermark), fetch new/updated rows after that point, and merge them into the target. If done right, running the job twice produces the same result (idempotency).
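In SQL terms, every run follows the same four steps. Here is a minimal sketch; the etl.watermarks checkpoint table, the src.orders source, and the :last_watermark / :current_cutoff parameters are illustrative names, not a fixed API.
-- 1. Read the stored watermark for this pipeline
SELECT watermark_value FROM etl.watermarks WHERE pipeline_name = 'orders_incremental';
-- 2. Pull only rows that changed after it, up to a fixed cutoff
SELECT * FROM src.orders
WHERE updated_at > :last_watermark AND updated_at <= :current_cutoff;
-- 3. MERGE those rows into the target (see the worked examples below)
-- 4. Advance the watermark only after the merge commits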
Mental model: Bookmarks and pages
Imagine reading a long book every day. You place a bookmark (watermark) on the last page you finished. Next day, you start from the bookmark and continue. If you misplace a page, you re-read the last few pages (overlap window) to be safe. When a page is removed (delete), you mark it missing in your notes.
Key components of incremental logic
- Watermark: A stored value marking the last processed point (e.g., max updated_at, last event_id, or CDC log position); a storage sketch follows this list.
- Change detection: Using updated_at, version numbers, or Change Data Capture (CDC) logs with operation codes (I/U/D).
- Merge/Upsert: Applying changes via SQL MERGE or equivalent logic to insert, update, or mark deletes.
- Idempotency: Re-running a job results in the same target state. Typically achieved via deterministic merges, deduping, and careful watermark updates.
- Deduplication: Remove duplicates using primary keys + latest timestamp or event order (row_number).
- Late-arriving data: Allow a small reprocess window (e.g., last 1–3 days) or use event-time windows to catch late events.
- Delete handling: Apply deletes from CDC delete events, or detect them via an anti-join against a full source snapshot, typically as soft deletes.
- Backfills and retries: Support reprocessing specific date ranges and safe retries without double-counting.
- Observability: Record counts read/written, inserts/updates/deletes, and watermark values for traceability.
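For the watermark itself, a single checkpoint table is usually enough. A minimal sketch, assuming one row per pipeline (all names here are illustrative):
CREATE TABLE etl.watermarks (
    pipeline_name   VARCHAR(100) PRIMARY KEY,
    watermark_value TIMESTAMP NOT NULL,
    updated_at      TIMESTAMP NOT NULL
);
-- Read at the start of a run
SELECT watermark_value FROM etl.watermarks WHERE pipeline_name = 'orders_incremental';
-- Advance only after the batch has been merged and committed
UPDATE etl.watermarks
SET watermark_value = :current_cutoff, updated_at = CURRENT_TIMESTAMP
WHERE pipeline_name = 'orders_incremental';
Whether you advance to the batch cutoff or to the maximum timestamp actually seen is a design choice; either works as long as the update happens only after a successful commit.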
Worked examples
Example 1: Timestamp watermark (OLTP → DWH)
Source: orders(id, status, amount, updated_at). Target: dwh.orders keyed by id.
-- Choose watermark as source updated_at (indexed, monotonic per row)
WITH inc AS (
    SELECT *
    FROM src.orders
    WHERE updated_at > :last_watermark
      AND updated_at <= :current_cutoff
), latest AS (
    SELECT * FROM (
        SELECT o.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM inc o
    ) x WHERE rn = 1
)
MERGE INTO dwh.orders t
USING latest s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
    status = s.status,
    amount = s.amount,
    updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (id, status, amount, updated_at)
    VALUES (s.id, s.status, s.amount, s.updated_at);
-- Advance watermark after successful commit
-- :last_watermark := (SELECT COALESCE(MAX(updated_at), :last_watermark) FROM latest);
Notes: Dedup with row_number, merge deterministically, and update watermark only after success.
Example 2: CDC with operation codes
Source CDC stream: orders_cdc(id, op, updated_at, ...), where op in ('I','U','D').
WITH seq AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM src.orders_cdc
    WHERE updated_at > :last_watermark AND updated_at <= :current_cutoff
), final_events AS (
    SELECT * FROM seq WHERE rn = 1   -- keep only the last event per id within the window
)
MERGE INTO dwh.orders t
USING final_events s ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN UPDATE SET
    is_deleted = TRUE, updated_at = s.updated_at
WHEN MATCHED AND s.op IN ('I','U') THEN UPDATE SET
    col1 = s.col1, col2 = s.col2, is_deleted = FALSE, updated_at = s.updated_at
WHEN NOT MATCHED AND s.op IN ('I','U') THEN INSERT (..., is_deleted, updated_at)
    VALUES (..., FALSE, s.updated_at);
Notes: Keep only the latest event per id within the window (order by the CDC log sequence number rather than updated_at when one is available); mark deletes as soft deletes rather than physically removing rows unless that is required.
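When the source is a full snapshot rather than a CDC stream, there is no explicit delete event; one common option is an anti-join that soft-deletes target rows missing from the latest snapshot. A sketch (table names are illustrative, and the exact UPDATE syntax varies by dialect):
UPDATE dwh.orders t
SET is_deleted = TRUE, updated_at = CURRENT_TIMESTAMP
WHERE t.is_deleted = FALSE
  AND NOT EXISTS (SELECT 1 FROM src.orders_snapshot s WHERE s.id = t.id);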
Example 3: File-based loads (daily partitions)
Landing zone: /raw/sales/dt=YYYY-MM-DD/*.parquet. Reprocess recent days to catch late files.
-- For run date R, process range [R - overlap_days, R]
-- Maintain processed_partitions table to track dt processed and row counts
-- Pseudocode:
for dt in dates_between(R - overlap_days, R):
    df = read("/raw/sales/dt=" + dt)
    # keep only the latest row per order_id within the partition
    df_dedup = df.with_row_number(partition_by=["order_id"], order_by="event_time DESC").where("rn == 1")
    merge_into_target(df_dedup, key=["order_id"], event_time_col="event_time")
    upsert_processed_partition(dt, rows_written=df_dedup.count())
Notes: Track processed partitions, deduplicate, and keep a small overlap window.
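The processed_partitions tracking table referenced above can be very small; a sketch with illustrative names:
CREATE TABLE etl.processed_partitions (
    dataset_name VARCHAR(100) NOT NULL,
    partition_dt DATE NOT NULL,
    rows_written BIGINT NOT NULL,
    processed_at TIMESTAMP NOT NULL,
    PRIMARY KEY (dataset_name, partition_dt)
);
-- Upsert one row per (dataset, partition) on every run, so reprocessing a partition
-- overwrites its previous entry instead of adding a duplicate.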
When to choose which method
- Timestamp watermark: Simple tables with reliable updated_at.
- CDC: High-change tables, need accurate deletes and ordering.
- File partitions: Batch feeds by date, S3/data lake style inputs.
Step-by-step: Design safe incremental logic
- Clarify business keys and required operations (insert/update/delete).
- Pick a change signal: updated_at, sequence, or CDC log.
- Define the watermark store and how it advances (max seen in committed batch).
- Add dedup strategy (primary key + latest event/time).
- Design merge/upsert logic and delete handling (soft vs hard delete).
- Plan for late arrivals (reprocess recent window, or use event-time).
- Add observability: counts in/out, rows inserted/updated/deleted, and watermark values (see the run-log sketch after this list).
- Document backfill and retry procedures (date ranges, idempotency).
- Test with edge cases: duplicates, out-of-order, null timestamps, clock skew.
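For the observability step, a single run-log table often suffices; a minimal sketch (column names are illustrative):
CREATE TABLE etl.load_runs (
    run_id         BIGINT PRIMARY KEY,
    pipeline_name  VARCHAR(100) NOT NULL,
    watermark_from TIMESTAMP,
    watermark_to   TIMESTAMP,
    rows_read      BIGINT,
    rows_inserted  BIGINT,
    rows_updated   BIGINT,
    rows_deleted   BIGINT,
    started_at     TIMESTAMP,
    finished_at    TIMESTAMP,
    status         VARCHAR(20)    -- e.g. 'succeeded', 'failed'
);
-- One row per run; comparing rows_read to rows_inserted + rows_updated + rows_deleted is a quick
-- sanity check, and the watermark columns make backfills and reruns traceable.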
Self-check questions
- Is the load idempotent?
- What happens if the job runs twice?
- How are deletes represented?
- How is the watermark advanced and persisted?
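A simple way to answer the first two questions with evidence is a rerun test: snapshot a few aggregates, rerun the same window, and compare. A sketch against the dwh.orders target from the examples above:
-- Before the rerun
SELECT COUNT(*) AS row_count, SUM(amount) AS total_amount, MAX(updated_at) AS max_updated
FROM dwh.orders;
-- ...rerun the MERGE with the same :last_watermark and :current_cutoff...
-- After the rerun: all three values should be unchanged if the load is idempotent
SELECT COUNT(*) AS row_count, SUM(amount) AS total_amount, MAX(updated_at) AS max_updated
FROM dwh.orders;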
Exercises
Complete the exercise below. It is open to everyone; logged-in users can save their progress.
Exercise 1: Design and write the merge for an incremental load
See the exercise card below for the data and tasks. When you are done, check your solution against the checklist here and in the exercise card.
- [ ] Chosen a robust watermark column
- [ ] Included dedup for multiple changes per key
- [ ] Implemented INSERT, UPDATE, and DELETE correctly
- [ ] Advanced watermark only after success
Common mistakes and how to self-check
- Using created_at as watermark: Misses updates. Use updated_at or CDC.
- Advancing watermark too early: If commit fails, you lose changes. Advance after successful write.
- No dedup: Multiple updates per key within the window cause churn. Keep only the latest change per key.
- Ignoring deletes: Targets drift from source. Handle soft deletes or anti-join deletes.
- No overlap window: Late files/events get lost. Reprocess the last N hours/days (see the sketch after this list).
- Non-idempotent logic: Appending duplicates instead of merging. Always use MERGE/UPSERT semantics.
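The overlap-window fix is usually a one-line change to the extraction filter. A sketch with a 2-day buffer (the width and the interval syntax are illustrative and dialect-dependent):
SELECT *
FROM src.orders
WHERE updated_at > :last_watermark - INTERVAL '2' DAY   -- re-read a small buffer to catch late rows
  AND updated_at <= :current_cutoff;
-- Per-key dedup and the deterministic MERGE keep the re-read rows from creating duplicates.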
Quick self-audit checklist
- [ ] Watermark defined and persisted
- [ ] Idempotent merge verified by rerun test
- [ ] Late-arrival policy documented
- [ ] Metrics emitted: read, insert, update, delete, rejected
- [ ] Backfill procedure tested
Practical projects
- Build an incremental loader for a users table using updated_at watermark and a 2-day overlap window. Include metrics.
- Implement CDC-based upsert for orders with delete handling and out-of-order dedup.
- Create a file-based ingestion that reprocesses the last 3 partitions and updates a checkpoint table.
Learning path
- Review source schemas and pick change signals per table.
- Prototype a MERGE-based upsert on a small sample.
- Add dedup + late-arrival window.
- Introduce delete handling and observability.
- Test backfills and failure/retry behavior.
Mini challenge
Your source has an unreliable updated_at column (it is sometimes not updated on change). Propose a safer incremental strategy in 3–5 sentences using CDC or hash-diff snapshots. Explain how you will ensure idempotency and handle deletions.
Next steps
- Apply today’s logic to one real table in your environment.
- Run it twice deliberately to confirm idempotency.
- Add a reprocess window and observe metrics for a week.
About the test and saving progress
The quick test below is available to everyone. If you sign in, your results and progress will be saved.