Who this is for
ETL Developers, Data Engineers, and Analysts who need to load only changed data reliably from source systems into data warehouses or lakes.
Prerequisites
- Comfort with SQL (JOIN, WHERE, GROUP BY, window functions)
- Basic ETL/ELT concepts (extract, stage, transform, load)
- Familiarity with tables, primary keys, and timestamps
Why this matters
In real teams you will:
- Load millions of rows nightly without overwhelming sources or targets
- Refresh dashboards hourly with only changed records
- Handle late-arriving data and deletes correctly
- Backfill safely after outages without duplicates
Incremental logic turns these into fast, reliable, and cost-effective pipelines.
Concept explained simply
Incremental loading means processing only what changed since the last run. You remember a checkpoint (watermark), fetch new/updated rows after that point, and merge them into the target. If done right, running the job twice produces the same result (idempotency).
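In SQL terms, every run follows the same four steps. Here is a minimal sketch; the etl.watermarks checkpoint table, the src.orders source, and the :last_watermark / :current_cutoff parameters are illustrative names, not a fixed API.
-- 1. Read the stored watermark for this pipeline
SELECT watermark_value FROM etl.watermarks WHERE pipeline_name = 'orders_incremental';
-- 2. Pull only rows that changed after it, up to a fixed cutoff
SELECT * FROM src.orders
WHERE updated_at > :last_watermark AND updated_at <= :current_cutoff;
-- 3. MERGE those rows into the target (see the worked examples below)
-- 4. Advance the watermark only after the merge commits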
Mental model: Bookmarks and pages
Imagine reading a long book every day. You place a bookmark (watermark) on the last page you finished. Next day, you start from the bookmark and continue. If you misplace a page, you re-read the last few pages (overlap window) to be safe. When a page is removed (delete), you mark it missing in your notes.
Key components of incremental logic
- Watermark: A stored value marking the last processed point (e.g., max updated_at, last event_id, or CDC log position); a storage sketch follows this list.
- Change detection: Using updated_at, version numbers, or Change Data Capture (CDC) logs with operation codes (I/U/D).
- Merge/Upsert: Applying changes via SQL MERGE or equivalent logic to insert, update, or mark deletes.
- Idempotency: Re-running a job results in the same target state. Typically achieved via deterministic merges, deduping, and careful watermark updates.
- Deduplication: Remove duplicates using primary keys + latest timestamp or event order (row_number).
- Late-arriving data: Allow a small reprocess window (e.g., last 1–3 days) or use event-time windows to catch late events.
- Delete handling: Apply deletes from CDC delete events, or detect them via an anti-join against a full source snapshot, typically as soft deletes.
- Backfills and retries: Support reprocessing specific date ranges and safe retries without double-counting.
- Observability: Record counts read/written, inserts/updates/deletes, and watermark values for traceability.
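For the watermark itself, a single checkpoint table is usually enough. A minimal sketch, assuming one row per pipeline (all names here are illustrative):
CREATE TABLE etl.watermarks (
    pipeline_name   VARCHAR(100) PRIMARY KEY,
    watermark_value TIMESTAMP NOT NULL,
    updated_at      TIMESTAMP NOT NULL
);
-- Read at the start of a run
SELECT watermark_value FROM etl.watermarks WHERE pipeline_name = 'orders_incremental';
-- Advance only after the batch has been merged and committed
UPDATE etl.watermarks
SET watermark_value = :current_cutoff, updated_at = CURRENT_TIMESTAMP
WHERE pipeline_name = 'orders_incremental';
Whether you advance to the batch cutoff or to the maximum timestamp actually seen is a design choice; either works as long as the update happens only after a successful commit.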
Worked examples
Example 1: Timestamp watermark (OLTP → DWH)
Source: orders(id, status, amount, updated_at). Target: dwh.orders keyed by id.
-- Choose watermark as source updated_at (indexed, monotonic per row)
WITH inc AS (
    SELECT *
    FROM src.orders
    WHERE updated_at > :last_watermark
      AND updated_at <= :current_cutoff
), latest AS (
    SELECT * FROM (
        SELECT o.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM inc o
    ) x WHERE rn = 1
)
MERGE INTO dwh.orders t
USING latest s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
    status = s.status,
    amount = s.amount,
    updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (id, status, amount, updated_at)
    VALUES (s.id, s.status, s.amount, s.updated_at);
-- Advance watermark after successful commit
-- :last_watermark := (SELECT COALESCE(MAX(updated_at), :last_watermark) FROM latest);
Notes: Dedup with row_number, merge deterministically, and update watermark only after success.
Example 2: CDC with operation codes
Source CDC stream: orders_cdc(id, op, updated_at, ...), where op in ('I','U','D').
WITH seq AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM src.orders_cdc
    WHERE updated_at > :last_watermark AND updated_at <= :current_cutoff
), final_events AS (
    SELECT * FROM seq WHERE rn = 1   -- keep only the last event per id within the window
)
MERGE INTO dwh.orders t
USING final_events s ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN UPDATE SET
    is_deleted = TRUE, updated_at = s.updated_at
WHEN MATCHED AND s.op IN ('I','U') THEN UPDATE SET
    col1 = s.col1, col2 = s.col2, is_deleted = FALSE, updated_at = s.updated_at
WHEN NOT MATCHED AND s.op IN ('I','U') THEN INSERT (..., is_deleted, updated_at)
    VALUES (..., FALSE, s.updated_at);
Notes: Keep only the latest event per id within the window (order by the CDC log sequence number rather than updated_at when one is available); mark deletes as soft deletes rather than physically removing rows unless that is required.
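When the source is a full snapshot rather than a CDC stream, there is no explicit delete event; one common option is an anti-join that soft-deletes target rows missing from the latest snapshot. A sketch (table names are illustrative, and the exact UPDATE syntax varies by dialect):
UPDATE dwh.orders t
SET is_deleted = TRUE, updated_at = CURRENT_TIMESTAMP
WHERE t.is_deleted = FALSE
  AND NOT EXISTS (SELECT 1 FROM src.orders_snapshot s WHERE s.id = t.id);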
Example 3: File-based loads (daily partitions)
Landing zone: /raw/sales/dt=YYYY-MM-DD/*.parquet. Reprocess recent days to catch late files.
-- For run date R, process range [R - overlap_days, R]
-- Maintain processed_partitions table to track dt processed and row counts
-- Pseudocode:
for dt in dates_between(R - overlap_days, R):
    df = read("/raw/sales/dt=" + dt)
    # keep only the latest row per order_id within the partition
    df_dedup = df.with_row_number(partition_by=["order_id"], order_by="event_time DESC").where("rn == 1")
    merge_into_target(df_dedup, key=["order_id"], event_time_col="event_time")
    upsert_processed_partition(dt, rows_written=df_dedup.count())
Notes: Track processed partitions, deduplicate, and keep a small overlap window.
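The processed_partitions tracking table referenced above can be very small; a sketch with illustrative names:
CREATE TABLE etl.processed_partitions (
    dataset_name VARCHAR(100) NOT NULL,
    partition_dt DATE NOT NULL,
    rows_written BIGINT NOT NULL,
    processed_at TIMESTAMP NOT NULL,
    PRIMARY KEY (dataset_name, partition_dt)
);
-- Upsert one row per (dataset, partition) on every run, so reprocessing a partition
-- overwrites its previous entry instead of adding a duplicate.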
When to choose which method
- Timestamp watermark: Simple tables with reliable updated_at.
- CDC: High-change tables, need accurate deletes and ordering.
- File partitions: Batch feeds by date, S3/data lake style inputs.
Step-by-step: Design safe incremental logic
- Clarify business keys and required operations (insert/update/delete).
- Pick a change signal: updated_at, sequence, or CDC log.
- Define the watermark store and how it advances (max seen in committed batch).
- Add dedup strategy (primary key + latest event/time).
- Design merge/upsert logic and delete handling (soft vs hard delete).
- Plan for late arrivals (reprocess recent window, or use event-time).
- Add observability: counts in/out, rows inserted/updated/deleted, and watermark values (see the run-log sketch after this list).
- Document backfill and retry procedures (date ranges, idempotency).
- Test with edge cases: duplicates, out-of-order, null timestamps, clock skew.
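For the observability step, a single run-log table often suffices; a minimal sketch (column names are illustrative):
CREATE TABLE etl.load_runs (
    run_id         BIGINT PRIMARY KEY,
    pipeline_name  VARCHAR(100) NOT NULL,
    watermark_from TIMESTAMP,
    watermark_to   TIMESTAMP,
    rows_read      BIGINT,
    rows_inserted  BIGINT,
    rows_updated   BIGINT,
    rows_deleted   BIGINT,
    started_at     TIMESTAMP,
    finished_at    TIMESTAMP,
    status         VARCHAR(20)    -- e.g. 'succeeded', 'failed'
);
-- One row per run; comparing rows_read to rows_inserted + rows_updated + rows_deleted is a quick
-- sanity check, and the watermark columns make backfills and reruns traceable.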
Self-check questions
- Is the load idempotent?
- What happens if the job runs twice?
- How are deletes represented?
- How is the watermark advanced and persisted?
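A simple way to answer the first two questions with evidence is a rerun test: snapshot a few aggregates, rerun the same window, and compare. A sketch against the dwh.orders target from the examples above:
-- Before the rerun
SELECT COUNT(*) AS row_count, SUM(amount) AS total_amount, MAX(updated_at) AS max_updated
FROM dwh.orders;
-- ...rerun the MERGE with the same :last_watermark and :current_cutoff...
-- After the rerun: all three values should be unchanged if the load is idempotent
SELECT COUNT(*) AS row_count, SUM(amount) AS total_amount, MAX(updated_at) AS max_updated
FROM dwh.orders;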
Exercises
Complete the exercise below. It is open to everyone; logged-in users can save their progress.
Exercise 1: Design and write the merge for an incremental load
See the exercise card below for the data and tasks. When you are done, check your solution against the checklist here and in the exercise card.
- [ ] Chosen a robust watermark column
- [ ] Included dedup for multiple changes per key
- [ ] Implemented INSERT, UPDATE, and DELETE correctly
- [ ] Advanced watermark only after success
Common mistakes and how to self-check
- Using created_at as watermark: Misses updates. Use updated_at or CDC.
- Advancing watermark too early: If commit fails, you lose changes. Advance after successful write.
- No dedup: Multiple updates per key within the window cause churn. Keep only the latest change per key.
- Ignoring deletes: Targets drift from source. Handle soft deletes or anti-join deletes.
- No overlap window: Late files/events get lost. Reprocess the last N hours/days (see the sketch after this list).
- Non-idempotent logic: Appending duplicates instead of merging. Always use MERGE/UPSERT semantics.
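The overlap-window fix is usually a one-line change to the extraction filter. A sketch with a 2-day buffer (the width and the interval syntax are illustrative and dialect-dependent):
SELECT *
FROM src.orders
WHERE updated_at > :last_watermark - INTERVAL '2' DAY   -- re-read a small buffer to catch late rows
  AND updated_at <= :current_cutoff;
-- Per-key dedup and the deterministic MERGE keep the re-read rows from creating duplicates.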
Quick self-audit checklist
- [ ] Watermark defined and persisted
- [ ] Idempotent merge verified by rerun test
- [ ] Late-arrival policy documented
- [ ] Metrics emitted: read, insert, update, delete, rejected
- [ ] Backfill procedure tested
Practical projects
- Build an incremental loader for a users table using updated_at watermark and a 2-day overlap window. Include metrics.
- Implement CDC-based upsert for orders with delete handling and out-of-order dedup.
- Create a file-based ingestion that reprocesses the last 3 partitions and updates a checkpoint table.
Learning path
- Review source schemas and pick change signals per table.
- Prototype a MERGE-based upsert on a small sample.
- Add dedup + late-arrival window.
- Introduce delete handling and observability.
- Test backfills and failure/retry behavior.
Mini challenge
Your source has an unreliable updated_at column (it is sometimes not updated on change). Propose a safer incremental strategy in 3–5 sentences using CDC or hash-diff snapshots. Explain how you will ensure idempotency and handle deletions.
Next steps
- Apply today’s logic to one real table in your environment.
- Run it twice deliberately to confirm idempotency.
- Add a reprocess window and observe metrics for a week.
About the test and saving progress
The quick test below is available to everyone. If you sign in, your results and progress will be saved.