
Incremental Loading Strategies

Learn Incremental Loading Strategies for free with explanations, exercises, and a quick test (for Analytics Engineers).

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

As an Analytics Engineer, you rarely re-load entire tables. Incremental loading lets you move only new and changed data, cutting compute costs, reducing load time, and meeting SLAs. Real tasks include:

  • Designing incremental dbt models for large fact tables
  • Building MERGE/upsert logic to reflect updates and deletes from source systems
  • Maintaining Slowly Changing Dimensions (SCD) for historical reporting
  • Handling late-arriving records and deduplication safely

Concept explained simply

Incremental loading moves only the rows you haven’t processed yet (new, changed, or deleted), instead of scanning everything.

Mental model

Think of a bookmark in a long book. Each time you read, you start from your bookmark (watermark) and move forward. After reading, you place the bookmark at the new spot. If a page was edited behind your bookmark, you have a rule for revisiting a small window behind it to catch late changes.
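
In SQL terms, the bookmark is just a small state table. A minimal sketch, matching the state_incremental table the worked examples below assume (the TEXT/TIMESTAMP types are illustrative):

-- One row per pipeline module; watermark_ts is the bookmark position
CREATE TABLE IF NOT EXISTS state_incremental (
  module       TEXT PRIMARY KEY,
  watermark_ts TIMESTAMP NOT NULL
);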

Core strategies and when to use them

1) Timestamp-based incrementals (updated_at watermark)

Use when source tables have reliable updated_at. Load rows where updated_at > last_watermark. After success, set watermark = max(updated_at) processed.

  • Pros: Simple, fast
  • Cons: Requires trustworthy timestamps; misses deletes unless tracked separately

2) Change Data Capture (CDC) with Upserts/Deletes

Use CDC streams or logs that include operation types (I/U/D) and commit ordering. Apply MERGE to insert/update/delete.

  • Pros: Accurate, captures deletes and out-of-order changes
  • Cons: More moving parts; requires CDC infrastructure

3) Append-only + Deduplicate

If the source only appends, load the new slice (e.g., by ingestion_time), then deduplicate in the warehouse using a business key and last-write-wins rule.
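
A minimal sketch of this pattern, reusing the state_incremental table from the worked examples below; the source events_raw(event_id, payload, ingestion_time) and target fact_events are illustrative names:

WITH slice AS (
  SELECT *
  FROM events_raw
  WHERE ingestion_time > (SELECT COALESCE(MAX(watermark_ts), TIMESTAMP '1970-01-01')
                          FROM state_incremental WHERE module = 'events')
), dedup AS (
  SELECT e.*,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingestion_time DESC) AS rn
  FROM slice e
)
MERGE INTO fact_events t
USING (SELECT * FROM dedup WHERE rn = 1) s
ON t.event_id = s.event_id
-- Last-write-wins: only overwrite when the incoming row is newer
WHEN MATCHED AND s.ingestion_time > t.ingestion_time THEN UPDATE SET
  payload = s.payload,
  ingestion_time = s.ingestion_time
WHEN NOT MATCHED THEN INSERT (event_id, payload, ingestion_time)
VALUES (s.event_id, s.payload, s.ingestion_time);

Afterwards, advance the watermark exactly as in the worked examples below.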

4) Slowly Changing Dimension (SCD Type 2)

For dimensions where history matters. Keep multiple versions per key with effective_from/effective_to and a current flag. Use MERGE with date logic.

5) Partition/windowed backfill

Process data in daily/hourly windows. Good for catch-up backfills or reprocessing late data without scanning entire history.
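
For instance, rebuilding one day is a delete-then-insert over that window. A minimal sketch, assuming both tables carry an event_date partition column (illustrative names):

-- Reprocess a single daily partition; reruns are idempotent because the
-- window is fully replaced each time.
DELETE FROM fact_events WHERE event_date = DATE '2025-01-15';

INSERT INTO fact_events (event_id, payload, event_date)
SELECT event_id, payload, event_date
FROM staging_events
WHERE event_date = DATE '2025-01-15';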

6) Idempotency and late data

Design loads so reruns produce the same result (idempotent). Keep a small overlap window (e.g., 1–24 hours) behind the watermark to capture late-arriving updates, then deduplicate.

Worked examples

Example 1: Timestamp watermark with MERGE

Scenario: Load orders incrementally using updated_at. State table stores last watermark.

-- Assume state table: state_incremental(module STRING PRIMARY KEY, watermark_ts TIMESTAMP)
-- 1) Pull incremental slice with a small safety overlap (e.g., 2 hours)
WITH last AS (
  SELECT COALESCE(MAX(watermark_ts), TIMESTAMP '1970-01-01') AS w
  FROM state_incremental WHERE module = 'orders'
), slice AS (
  SELECT s.*
  FROM staging_orders s, last l
  WHERE s.updated_at > l.w - INTERVAL '2' HOUR
)
, dedup AS (
  SELECT s.*,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM slice s
)
MERGE INTO fact_orders t
USING (SELECT * FROM dedup WHERE rn = 1) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  amount = s.amount,
  status = s.status,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, amount, status, updated_at)
VALUES (s.order_id, s.amount, s.status, s.updated_at);

-- 2) Advance watermark to the max processed updated_at.
-- CTEs are statement-scoped, so dedup is not visible here; read the max back
-- from the target table (assuming this loader is its only writer).
INSERT INTO state_incremental(module, watermark_ts)
SELECT 'orders', MAX(updated_at) FROM fact_orders
ON CONFLICT (module) DO UPDATE SET watermark_ts = EXCLUDED.watermark_ts;

Why it works: The overlap window plus deduplication handles late-arriving updates, and the MERGE makes re-runs idempotent. Advance the watermark only after the MERGE succeeds (ideally in the same transaction) so a failed load never moves the bookmark forward.

Example 2: CDC upsert and delete

Scenario: CDC stream has columns: key, payload..., op (I/U/D), commit_ts.

WITH ranked AS (
  SELECT c.*, ROW_NUMBER() OVER (PARTITION BY key ORDER BY commit_ts DESC) rn
  FROM cdc_orders c
  WHERE commit_ts > (SELECT COALESCE(MAX(watermark_ts), TIMESTAMP '1970-01-01') FROM state_incremental WHERE module='orders_cdc')
)
MERGE INTO fact_orders t
USING (SELECT * FROM ranked WHERE rn = 1) s
ON (t.order_id = s.key)
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED AND s.op IN ('U','I') THEN UPDATE SET
  -- payload columns assumed to match fact_orders from Example 1
  amount = s.amount, status = s.status, updated_at = s.commit_ts
WHEN NOT MATCHED AND s.op IN ('I','U') THEN INSERT (order_id, amount, status, updated_at)
VALUES (s.key, s.amount, s.status, s.commit_ts);

-- Advance CDC watermark (ranked is statement-scoped; recompute from the stream,
-- assuming it retains recent commits)
INSERT INTO state_incremental(module, watermark_ts)
SELECT 'orders_cdc', MAX(commit_ts) FROM cdc_orders
ON CONFLICT (module) DO UPDATE SET watermark_ts = EXCLUDED.watermark_ts;

Notes: CDC operation order matters; ranking by commit_ts ensures only the latest operation per key is applied, so a key that is inserted and then deleted within one batch correctly ends up absent from the target.

Example 3: SCD Type 2 dimension

Scenario: products dimension needs history on price changes.

-- Staging has one row per product_id with latest attributes
WITH incoming AS (
  SELECT product_id, name, price, updated_at FROM staging_products
)
-- Close current rows where something changed
UPDATE dim_product d
SET effective_to = i.updated_at - INTERVAL '1' SECOND,
    is_current = FALSE
FROM incoming i
WHERE d.product_id = i.product_id
  AND d.is_current = TRUE
  AND (d.name != i.name OR d.price != i.price);

-- Insert new current rows (incoming is statement-scoped, so re-derive it)
WITH incoming AS (
  SELECT product_id, name, price, updated_at FROM staging_products
)
INSERT INTO dim_product(product_id, name, price, effective_from, effective_to, is_current)
SELECT i.product_id, i.name, i.price, i.updated_at, TIMESTAMP '9999-12-31', TRUE
FROM incoming i
LEFT JOIN dim_product d
  ON d.product_id = i.product_id AND d.is_current = TRUE
WHERE d.product_id IS NULL
   OR (d.name != i.name OR d.price != i.price);

This handles late arrivals only if updated_at is trusted and you re-run the affected windows with idempotent logic.

Step-by-step setup checklist

  • Pick a reliable key and change signal (updated_at, commit_ts, or CDC op)
  • Create a state table to store watermarks per module
  • Add a safety overlap window to catch late data
  • Deduplicate on business key with last-write-wins
  • Use MERGE for idempotent upserts (and deletes if applicable)
  • Advance watermark only after successful load
  • Write tests: no duplicates after merge; row counts stable; SCD2 integrity (see the sketch after this list)
  • Monitor runtime, row counts, and watermark lag
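
A minimal sketch of those tests as SQL assertions (each query should return zero rows), using the table names from this topic's examples:

-- No duplicate business keys after the merge
SELECT order_id, COUNT(*) AS n
FROM fact_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- SCD2 integrity: at most one current row per product
SELECT product_id, COUNT(*) AS n_current
FROM dim_product
WHERE is_current = TRUE
GROUP BY product_id
HAVING COUNT(*) > 1;

-- SCD2 integrity: no overlapping effective ranges per product
-- (assumes two versions of a key never share the same effective_from)
SELECT a.product_id
FROM dim_product a
JOIN dim_product b
  ON a.product_id = b.product_id
 AND a.effective_from <> b.effective_from
 AND a.effective_from < b.effective_to
 AND b.effective_from < a.effective_to;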

Exercises

Try these. Then compare with the solutions.

Exercise 1 — Timestamp-based incremental MERGE

You have:

  • staging_orders(order_id, amount, status, updated_at)
  • fact_orders(order_id PRIMARY KEY, amount, status, updated_at)
  • state_incremental(module PRIMARY KEY, watermark_ts)

Task: Load only new/changed orders using updated_at with a 2-hour safety overlap. Deduplicate by order_id keeping the latest updated_at. Then update the watermark.

Edge cases: multiple updates to the same order_id; late updates arriving within the overlap; re-runs must be idempotent.

Expected output

  • fact_orders contains one row per order_id with the latest amount, status, updated_at
  • state_incremental.watermark_ts for module='orders' equals the max updated_at processed
  • Re-running does not change results

Hints

  • Compute last watermark; subtract 2 hours for the slice
  • Use ROW_NUMBER over order_id ordered by updated_at DESC
  • MERGE matched: UPDATE; not matched: INSERT

Solution
WITH last AS (
  SELECT COALESCE(MAX(watermark_ts), TIMESTAMP '1970-01-01') AS w
  FROM state_incremental WHERE module='orders'
), slice AS (
  SELECT s.* FROM staging_orders s, last l
  WHERE s.updated_at > l.w - INTERVAL '2' HOUR
), dedup AS (
  SELECT s.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) rn
  FROM slice s
)
MERGE INTO fact_orders t
USING (SELECT * FROM dedup WHERE rn=1) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount=s.amount, status=s.status, updated_at=s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, amount, status, updated_at)
VALUES(s.order_id, s.amount, s.status, s.updated_at);

-- dedup is statement-scoped; read the max processed timestamp back from the target
INSERT INTO state_incremental(module, watermark_ts)
SELECT 'orders', MAX(updated_at) FROM fact_orders
ON CONFLICT (module) DO UPDATE SET watermark_ts=EXCLUDED.watermark_ts;

Exercise 2 — Build SCD2 for products

You have:

  • staging_products(product_id, name, price, updated_at)
  • dim_product(product_id, name, price, effective_from, effective_to, is_current)

Task: Implement SCD2 so that when name or price changes, the current record is end-dated and a new current record is inserted starting at updated_at.

Edge cases: Late-arriving change (older updated_at than current effective_from). Ensure no overlapping effective ranges for the same product_id.

Expected output

  • Only one is_current=TRUE row per product_id
  • No overlapping effective ranges
  • Historical rows preserved with correct effective_to

Hints

  • Close current rows where attributes differ
  • Insert a new row with effective_to = 9999-12-31
  • Consider subtracting 1 second (or the smallest time unit) when setting effective_to

Solution
WITH incoming AS (
  SELECT product_id, name, price, updated_at FROM staging_products
)
UPDATE dim_product d
SET effective_to = i.updated_at - INTERVAL '1' SECOND,
    is_current = FALSE
FROM incoming i
WHERE d.product_id = i.product_id
  AND d.is_current = TRUE
  AND (d.name != i.name OR d.price != i.price)
  AND d.effective_from <= i.updated_at
  AND d.effective_to >= i.updated_at;

-- incoming is statement-scoped, so re-derive it for the insert
WITH incoming AS (
  SELECT product_id, name, price, updated_at FROM staging_products
)
INSERT INTO dim_product(product_id, name, price, effective_from, effective_to, is_current)
SELECT i.product_id, i.name, i.price, i.updated_at, TIMESTAMP '9999-12-31', TRUE
FROM incoming i
LEFT JOIN dim_product d
  ON d.product_id = i.product_id AND d.is_current = TRUE
WHERE d.product_id IS NULL
   OR (d.name != i.name OR d.price != i.price);

Common mistakes and self-check

  • Relying on unreliable timestamps: Self-check by comparing counts of changed rows against CDC or source logs.
  • No overlap window: Add a 1–24 hour overlap and verify that late updates appear in downstream reports.
  • Non-idempotent scripts: Re-run your job twice; results must be identical.
  • Forgetting deletes: If source can delete, use CDC or a deleted flag and handle it in MERGE.
  • Duplicate rows after merge: Add ROW_NUMBER dedup logic and a uniqueness test on keys.
  • Over-advancing watermark: Advance only after successful load and use the max processed timestamp.

Practical projects

  • Build an incremental pipeline for a sales fact table using updated_at with a 6-hour overlap and deduplication.
  • Create a SCD2 customer dimension with tests ensuring one current row per customer_id.
  • Prototype a CDC-based MERGE for orders with delete handling and a state table.

Who this is for

Aspiring and practicing Analytics Engineers, BI Developers, and Data Engineers who need reliable and cost-effective warehouse loads.

Prerequisites

  • SQL joins, window functions, and MERGE/UPSERT basics
  • Understanding of primary keys and constraints
  • Familiarity with batch scheduling and warehouse tables

Learning path

  • Start with timestamp-based incrementals
  • Learn CDC upserts and delete handling
  • Implement SCD2 for one dimension
  • Add idempotency tests and monitoring
  • Handle late data and backfills safely

Next steps

Practice with the exercises above, then take the quick test below to confirm your understanding.

Mini challenge

Your source ERP table has unreliable updated_at but guarantees monotonically increasing surrogate id per change. Design an incremental pattern to load changes and handle deletes. Outline the state you will store, your MERGE keys, and how you will detect deletes. Keep it idempotent.

Incremental Loading Strategies — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
