
Defining Incremental Logic

Learn Defining Incremental Logic for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Who this is for

ETL Developers, Data Engineers, and Analysts who need to load only changed data reliably from source systems into data warehouses or lakes.

Prerequisites

  • Comfort with SQL (JOIN, WHERE, GROUP BY, window functions)
  • Basic ETL/ELT concepts (extract, stage, transform, load)
  • Familiarity with tables, primary keys, and timestamps

Why this matters

In real teams you will:

  • Load millions of rows nightly without overwhelming sources or targets
  • Refresh dashboards hourly with only changed records
  • Handle late-arriving data and deletes correctly
  • Backfill safely after outages without duplicates

Incremental logic turns these into fast, reliable, and cost-effective pipelines.

Concept explained simply

Incremental loading means processing only what changed since the last run. You remember a checkpoint (watermark), fetch new/updated rows after that point, and merge them into the target. If done right, running the job twice produces the same result (idempotency).
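
In practice the checkpoint lives in a small control table that the job reads at the start of a run and advances at the end. A minimal sketch, assuming a hypothetical etl_watermarks table (names and types vary by warehouse):

-- Hypothetical checkpoint store: one row per pipeline
CREATE TABLE etl_watermarks (
  pipeline_name  VARCHAR(100) PRIMARY KEY,
  last_watermark TIMESTAMP NOT NULL,
  updated_at     TIMESTAMP NOT NULL
);

-- Read at the start of a run
SELECT last_watermark FROM etl_watermarks WHERE pipeline_name = 'orders_load';

-- Advance only after the batch has committed successfully
UPDATE etl_watermarks
SET last_watermark = :new_watermark, updated_at = CURRENT_TIMESTAMP
WHERE pipeline_name = 'orders_load';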

Mental model: Bookmarks and pages

Imagine reading a long book every day. You place a bookmark (watermark) on the last page you finished. Next day, you start from the bookmark and continue. If you misplace a page, you re-read the last few pages (overlap window) to be safe. When a page is removed (delete), you mark it missing in your notes.

Key components of incremental logic

  • Watermark: A stored value marking the last processed point (e.g., max updated_at, last event_id, or CDC log position).
  • Change detection: Using updated_at, version numbers, or Change Data Capture (CDC) logs with operation codes (I/U/D).
  • Merge/Upsert: Applying changes via SQL MERGE or equivalent logic to insert, update, or mark deletes.
  • Idempotency: Re-running a job results in the same target state. Typically achieved via deterministic merges, deduping, and careful watermark updates.
  • Deduplication: Remove duplicates using primary keys + latest timestamp or event order (row_number).
  • Late-arriving data: Allow a small reprocess window (e.g., last 1–3 days) or use event-time windows to catch late events.
  • Delete handling: Use CDC delete events, or an anti-join against a full source snapshot, to soft-delete rows (see the sketch after this list).
  • Backfills and retries: Support reprocessing specific date ranges and safe retries without double-counting.
  • Observability: Record counts read/written, inserts/updates/deletes, and watermark values for traceability.
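
When the source delivers full snapshots rather than change events, deletes show up only as absence. A minimal anti-join sketch for soft deletes, assuming a hypothetical staged snapshot table stg.orders_snapshot:

-- Soft-delete target rows that disappeared from the latest full snapshot
UPDATE dwh.orders t
SET is_deleted = TRUE
WHERE t.is_deleted = FALSE
  AND NOT EXISTS (
    SELECT 1 FROM stg.orders_snapshot s WHERE s.id = t.id
  );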

Worked examples

Example 1: Timestamp watermark (OLTP → DWH)

Source: orders(id, status, amount, updated_at). Target: dwh.orders keyed by id.

-- Choose watermark as source updated_at (indexed, monotonic per row)
WITH inc AS (
  SELECT *
  FROM src.orders
  WHERE updated_at > :last_watermark
    AND updated_at <= :current_cutoff
), latest AS (
  SELECT * FROM (
    SELECT o.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM inc o
  ) x WHERE rn = 1
)
MERGE INTO dwh.orders t
USING latest s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
  status = s.status,
  amount = s.amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (id, status, amount, updated_at)
VALUES (s.id, s.status, s.amount, s.updated_at);

-- Advance watermark after successful commit
-- :last_watermark := (SELECT COALESCE(MAX(updated_at), :last_watermark) FROM latest);

Notes: Dedup with row_number, merge deterministically, and update watermark only after success.

Example 2: CDC with operation codes

Source CDC stream: orders_cdc(id, op, updated_at, ...), where op in ('I','U','D').

WITH seq AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM src.orders_cdc
  WHERE updated_at > :last_watermark AND updated_at <= :current_cutoff
), final_events AS (
  SELECT * FROM seq WHERE rn = 1  -- latest event per id in the window
)
MERGE INTO dwh.orders t
USING final_events s ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN UPDATE SET is_deleted = TRUE, updated_at = s.updated_at
WHEN MATCHED AND s.op IN ('I','U') THEN UPDATE SET col1 = s.col1, col2 = s.col2, is_deleted = FALSE, updated_at = s.updated_at
WHEN NOT MATCHED AND s.op IN ('I','U') THEN INSERT (..., is_deleted, updated_at) VALUES (..., FALSE, s.updated_at);

Notes: Keep the latest event per id within the window; mark deletes without physically removing unless required.
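
If consumers need rows physically removed rather than flagged, only the delete branch of the MERGE changes (exact syntax varies by engine):

-- Hard-delete variant of the delete branch
WHEN MATCHED AND s.op = 'D' THEN DELETE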

Example 3: File-based loads (daily partitions)

Landing zone: /raw/sales/dt=YYYY-MM-DD/*.parquet. Reprocess recent days to catch late files.

# For run date R, process the range [R - overlap_days, R].
# PySpark-style sketch; dates_between, merge_into_target, and
# upsert_processed_partition are placeholder helpers, and progress is
# tracked in a processed_partitions checkpoint table (see sketch below).
from pyspark.sql import Window
import pyspark.sql.functions as F

for dt in dates_between(R - overlap_days, R):
    df = spark.read.parquet(f"/raw/sales/dt={dt}")
    # Keep only the latest event per order_id within the partition
    w = Window.partitionBy("order_id").orderBy(F.col("event_time").desc())
    df_dedup = (df.withColumn("rn", F.row_number().over(w))
                  .where(F.col("rn") == 1)
                  .drop("rn"))
    merge_into_target(df_dedup, key=["order_id"], event_time_col="event_time")
    upsert_processed_partition(dt, rows_written=df_dedup.count())

Notes: Track processed partitions, deduplicate, and keep a small overlap window.
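
The processed_partitions checkpoint used above can be a small control table. A minimal sketch with hypothetical names:

-- One row per (pipeline, partition date); reprocessing a partition
-- upserts its row, so retries stay idempotent
CREATE TABLE processed_partitions (
  pipeline_name VARCHAR(100) NOT NULL,
  dt            DATE         NOT NULL,
  rows_written  BIGINT       NOT NULL,
  processed_at  TIMESTAMP    NOT NULL,
  PRIMARY KEY (pipeline_name, dt)
);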

When to choose which method

  • Timestamp watermark: Simple tables with reliable updated_at.
  • CDC: High-change tables, need accurate deletes and ordering.
  • File partitions: Batch feeds by date, S3/data lake style inputs.

Step-by-step: Design safe incremental logic

  1. Clarify business keys and required operations (insert/update/delete).
  2. Pick a change signal: updated_at, sequence, or CDC log.
  3. Define the watermark store and how it advances (max seen in committed batch).
  4. Add dedup strategy (primary key + latest event/time).
  5. Design merge/upsert logic and delete handling (soft vs hard delete).
  6. Plan for late arrivals (reprocess recent window, or use event-time).
  7. Add observability: counts in/out, rows inserted/updated/deleted, watermark values.
  8. Document backfill and retry procedures (date ranges, idempotency).
  9. Test with edge cases: duplicates, out-of-order, null timestamps, clock skew.
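
A quick way to verify idempotency (step 9) is to rerun the same window and compare coarse aggregates of the target before and after. A minimal sketch against the orders target:

-- Capture a fingerprint, rerun the load with the same window,
-- then confirm the values are unchanged
SELECT COUNT(*)           AS rows_total,
       COUNT(DISTINCT id) AS distinct_ids,
       MAX(updated_at)    AS max_updated_at,
       SUM(CASE WHEN is_deleted THEN 1 ELSE 0 END) AS soft_deleted
FROM dwh.orders;
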
Self-check questions

  • Is the load idempotent?
  • What happens if the job runs twice?
  • How are deletes represented?
  • How is the watermark advanced and persisted?

Exercises

Complete the exercise below. Anyone can take it; progress is saved for logged-in users.

Exercise 1: Design and write the merge for an incremental load

See the exercise card below for data and tasks. When done, check your solution against this list:

  • [ ] Chosen a robust watermark column
  • [ ] Included dedup for multiple changes per key
  • [ ] Implemented INSERT, UPDATE, and DELETE correctly
  • [ ] Advanced watermark only after success

Common mistakes and how to self-check

  • Using created_at as watermark: Misses updates. Use updated_at or CDC.
  • Advancing watermark too early: If commit fails, you lose changes. Advance after successful write.
  • No dedup: Multiple updates per key in window lead to churn. Keep only the latest per key.
  • Ignoring deletes: Targets drift from source. Handle soft deletes or anti-join deletes.
  • No overlap window: Late files/events get lost. Reprocess last N hours/days.
  • Non-idempotent logic: Appending duplicates instead of merging. Always use MERGE/UPSERT semantics.

Quick self-audit checklist

  • [ ] Watermark defined and persisted
  • [ ] Idempotent merge verified by rerun test
  • [ ] Late-arrival policy documented
  • [ ] Metrics emitted: read, insert, update, delete, rejected
  • [ ] Backfill procedure tested
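
Emitting those metrics can be as simple as writing one audit row per run. A minimal sketch, assuming a hypothetical etl_run_log table populated from your engine's merge statistics:

-- Hypothetical run log: one row per pipeline run
INSERT INTO etl_run_log
  (pipeline_name, run_at, watermark_from, watermark_to,
   rows_read, rows_inserted, rows_updated, rows_deleted, rows_rejected)
VALUES
  ('orders_load', CURRENT_TIMESTAMP, :last_watermark, :current_cutoff,
   :rows_read, :rows_inserted, :rows_updated, :rows_deleted, :rows_rejected);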

Practical projects

  • Build an incremental loader for a users table using updated_at watermark and a 2-day overlap window. Include metrics.
  • Implement CDC-based upsert for orders with delete handling and out-of-order dedup.
  • Create a file-based ingestion that reprocesses the last 3 partitions and updates a checkpoint table.

Learning path

  1. Review source schemas and pick change signals per table.
  2. Prototype a MERGE-based upsert on a small sample.
  3. Add dedup + late-arrival window.
  4. Introduce delete handling and observability.
  5. Test backfills and failure/retry behavior.

Mini challenge

Your source has unreliable updated_at (sometimes not updated on change). Propose a safer incremental strategy in 3–5 sentences using CDC or hash-diff snapshots. Explain how you will ensure idempotency and deletions.

Next steps

  • Apply today’s logic to one real table in your environment.
  • Run it twice deliberately to confirm idempotency.
  • Add a reprocess window and observe metrics for a week.

About the test and saving progress

The quick test below is available to everyone. If you sign in, your results and progress will be saved.

Practice Exercises

1 exercise to complete

Instructions

Scenario: You ingest src.orders(id PK, status, amount, is_deleted BOOL, updated_at TIMESTAMP). Target is dwh.orders with the same columns plus created_at (set on first insert).

Given variables:

  • :last_watermark (TIMESTAMP) — stored from previous successful run
  • :current_cutoff (TIMESTAMP) — end of this run’s window

Tasks:

  1. Select the change window and deduplicate so only the latest change per id in the window is used.
  2. Write a single MERGE (or equivalent UPSERT) that:
  • Inserts new rows when not matched and is_deleted = false
  • Updates existing rows when matched and is_deleted = false
  • Soft-deletes existing rows when matched and is_deleted = true
  3. Advance the watermark safely.

Edge cases to consider: multiple updates for same id within the window; rows with is_deleted = true that were never in target; null status (should be written as null if present).

Expected Output

A correct incremental query that processes only rows with updated_at within the window, keeps the latest per id, merges deterministically (insert/update/soft-delete), and advances the watermark to the max(updated_at) of processed rows after a successful commit.

Defining Incremental Logic — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

