
Incremental Loading Strategies

Learn Incremental Loading Strategies for free with explanations, exercises, and a quick test (for Analytics Engineers).

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

As an Analytics Engineer, you rarely re-load entire tables. Incremental loading lets you move only new and changed data, cutting compute costs, reducing load time, and meeting SLAs. Real tasks include:

  • Designing incremental dbt models for large fact tables
  • Building MERGE/upsert logic to reflect updates and deletes from source systems
  • Maintaining Slowly Changing Dimensions (SCD) for historical reporting
  • Handling late-arriving records and deduplication safely

Concept explained simply

Incremental loading moves only the rows you haven’t processed yet (new, changed, or deleted), instead of scanning everything.

Mental model

Think of a bookmark in a long book. Each time you read, you start from your bookmark (watermark) and move forward. After reading, you place the bookmark at the new spot. If a page was edited behind your bookmark, you have a rule for revisiting a small window behind it to catch late changes.
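
In SQL terms, the bookmark is just a small state table. A minimal sketch, matching the state_incremental table the worked examples below assume (the TEXT/TIMESTAMP types are illustrative):

-- One row per pipeline module; watermark_ts is the bookmark position
CREATE TABLE IF NOT EXISTS state_incremental (
  module       TEXT PRIMARY KEY,
  watermark_ts TIMESTAMP NOT NULL
);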

Core strategies and when to use them

1) Timestamp-based incrementals (updated_at watermark)

Use when source tables have reliable updated_at. Load rows where updated_at > last_watermark. After success, set watermark = max(updated_at) processed.

  • Pros: Simple, fast
  • Cons: Requires trustworthy timestamps; misses deletes unless tracked separately

2) Change Data Capture (CDC) with Upserts/Deletes

Use CDC streams or logs that include operation types (I/U/D) and commit ordering. Apply MERGE to insert/update/delete.

  • Pros: Accurate, captures deletes and out-of-order changes
  • Cons: More moving parts; requires CDC infrastructure

3) Append-only + Deduplicate

If the source only appends, load the new slice (e.g., by ingestion_time), then deduplicate in the warehouse using a business key and last-write-wins rule.
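
A minimal sketch of this pattern, reusing the state_incremental table from the worked examples below; the source events_raw(event_id, payload, ingestion_time) and target fact_events are illustrative names:

WITH slice AS (
  SELECT *
  FROM events_raw
  WHERE ingestion_time > (SELECT COALESCE(MAX(watermark_ts), TIMESTAMP '1970-01-01')
                          FROM state_incremental WHERE module = 'events')
), dedup AS (
  SELECT e.*,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingestion_time DESC) AS rn
  FROM slice e
)
MERGE INTO fact_events t
USING (SELECT * FROM dedup WHERE rn = 1) s
ON t.event_id = s.event_id
-- Last-write-wins: only overwrite when the incoming row is newer
WHEN MATCHED AND s.ingestion_time > t.ingestion_time THEN UPDATE SET
  payload = s.payload,
  ingestion_time = s.ingestion_time
WHEN NOT MATCHED THEN INSERT (event_id, payload, ingestion_time)
VALUES (s.event_id, s.payload, s.ingestion_time);

Afterwards, advance the watermark exactly as in the worked examples below.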

4) Slowly Changing Dimension (SCD Type 2)

For dimensions where history matters. Keep multiple versions per key with effective_from/effective_to and a current flag. Use MERGE with date logic.

5) Partition/windowed backfill

Process data in daily/hourly windows. Good for catch-up backfills or reprocessing late data without scanning entire history.
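
For instance, rebuilding one day is a delete-then-insert over that window. A minimal sketch, assuming both tables carry an event_date partition column (illustrative names):

-- Reprocess a single daily partition; reruns are idempotent because the
-- window is fully replaced each time.
DELETE FROM fact_events WHERE event_date = DATE '2025-01-15';

INSERT INTO fact_events (event_id, payload, event_date)
SELECT event_id, payload, event_date
FROM staging_events
WHERE event_date = DATE '2025-01-15';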

6) Idempotency and late data

Design loads so reruns produce the same result (idempotent). Keep a small overlap window (e.g., 1–24 hours) behind the watermark to capture late-arriving updates, then deduplicate.

Worked examples

Example 1: Timestamp watermark with MERGE

Scenario: Load orders incrementally using updated_at. State table stores last watermark.

-- Assume state table: state_incremental(module STRING PRIMARY KEY, watermark_ts TIMESTAMP)
-- 1) Pull incremental slice with a small safety overlap (e.g., 2 hours)
WITH last AS (
  SELECT COALESCE(MAX(watermark_ts), TIMESTAMP '1970-01-01') AS w
  FROM state_incremental WHERE module = 'orders'
), slice AS (
  SELECT s.*
  FROM staging_orders s, last l
  WHERE s.updated_at > l.w - INTERVAL '2' HOUR
)
, dedup AS (
  SELECT s.*,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM slice s
)
MERGE INTO fact_orders t
USING (SELECT * FROM dedup WHERE rn = 1) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  amount = s.amount,
  status = s.status,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, amount, status, updated_at)
VALUES (s.order_id, s.amount, s.status, s.updated_at);

-- 2) Advance watermark to the max processed updated_at.
-- CTEs are statement-scoped, so dedup is not visible here; read the max back
-- from the target table (assuming this loader is its only writer).
INSERT INTO state_incremental(module, watermark_ts)
SELECT 'orders', MAX(updated_at) FROM fact_orders
ON CONFLICT (module) DO UPDATE SET watermark_ts = EXCLUDED.watermark_ts;

Why it works: The overlap window plus deduplication handles late-arriving updates, and the MERGE makes re-runs idempotent. Advance the watermark only after the MERGE succeeds (ideally in the same transaction) so a failed load never moves the bookmark forward.

Example 2: CDC upsert and delete

Scenario: CDC stream has columns: key, payload..., op (I/U/D), commit_ts.

WITH ranked AS (
  SELECT c.*, ROW_NUMBER() OVER (PARTITION BY key ORDER BY commit_ts DESC) rn
  FROM cdc_orders c
  WHERE commit_ts > (SELECT COALESCE(MAX(watermark_ts), TIMESTAMP '1970-01-01') FROM state_incremental WHERE module='orders_cdc')
)
MERGE INTO fact_orders t
USING (SELECT * FROM ranked WHERE rn = 1) s
ON (t.order_id = s.key)
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED AND s.op IN ('U','I') THEN UPDATE SET
  -- payload columns assumed to match fact_orders from Example 1
  amount = s.amount, status = s.status, updated_at = s.commit_ts
WHEN NOT MATCHED AND s.op IN ('I','U') THEN INSERT (order_id, amount, status, updated_at)
VALUES (s.key, s.amount, s.status, s.commit_ts);

-- Advance CDC watermark (ranked is statement-scoped; recompute from the stream,
-- assuming it retains recent commits)
INSERT INTO state_incremental(module, watermark_ts)
SELECT 'orders_cdc', MAX(commit_ts) FROM cdc_orders
ON CONFLICT (module) DO UPDATE SET watermark_ts = EXCLUDED.watermark_ts;

Notes: CDC operation order matters; ranking by commit_ts ensures only the latest operation per key is applied, so a key that is inserted and then deleted within one batch correctly ends up absent from the target.

Example 3: SCD Type 2 dimension

Scenario: products dimension needs history on price changes.

-- Staging has one row per product_id with latest attributes
WITH incoming AS (
  SELECT product_id, name, price, updated_at FROM staging_products
)
-- Close current rows where something changed
UPDATE dim_product d
SET effective_to = i.updated_at - INTERVAL '1' SECOND,
    is_current = FALSE
FROM incoming i
WHERE d.product_id = i.product_id
  AND d.is_current = TRUE
  AND (d.name != i.name OR d.price != i.price);

-- Insert new current rows (incoming is statement-scoped, so re-derive it)
WITH incoming AS (
  SELECT product_id, name, price, updated_at FROM staging_products
)
INSERT INTO dim_product(product_id, name, price, effective_from, effective_to, is_current)
SELECT i.product_id, i.name, i.price, i.updated_at, TIMESTAMP '9999-12-31', TRUE
FROM incoming i
LEFT JOIN dim_product d
  ON d.product_id = i.product_id AND d.is_current = TRUE
WHERE d.product_id IS NULL
   OR (d.name != i.name OR d.price != i.price);

This handles late arrivals only if updated_at is trusted and you re-run the affected windows with idempotent logic.

Step-by-step setup checklist

  • Pick a reliable key and change signal (updated_at, commit_ts, or CDC op)
  • Create a state table to store watermarks per module
  • Add a safety overlap window to catch late data
  • Deduplicate on business key with last-write-wins
  • Use MERGE for idempotent upserts (and deletes if applicable)
  • Advance watermark only after successful load
  • Write tests: no duplicates after merge; row counts stable; SCD2 integrity (see the sketch after this list)
  • Monitor runtime, row counts, and watermark lag
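
A minimal sketch of those tests as SQL assertions (each query should return zero rows), using the table names from this topic's examples:

-- No duplicate business keys after the merge
SELECT order_id, COUNT(*) AS n
FROM fact_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- SCD2 integrity: at most one current row per product
SELECT product_id, COUNT(*) AS n_current
FROM dim_product
WHERE is_current = TRUE
GROUP BY product_id
HAVING COUNT(*) > 1;

-- SCD2 integrity: no overlapping effective ranges per product
-- (assumes two versions of a key never share the same effective_from)
SELECT a.product_id
FROM dim_product a
JOIN dim_product b
  ON a.product_id = b.product_id
 AND a.effective_from <> b.effective_from
 AND a.effective_from < b.effective_to
 AND b.effective_from < a.effective_to;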

Exercises

Try these. Then compare with the solutions.

Exercise 1 — Timestamp-based incremental MERGE

You have:

  • staging_orders(order_id, amount, status, updated_at)
  • fact_orders(order_id PRIMARY KEY, amount, status, updated_at)
  • state_incremental(module PRIMARY KEY, watermark_ts)

Task: Load only new/changed orders using updated_at with a 2-hour safety overlap. Deduplicate by order_id keeping the latest updated_at. Then update the watermark.

Edge cases: multiple updates to the same order_id; late updates arriving within the overlap; re-runs must be idempotent.

Expected output

  • fact_orders contains one row per order_id with the latest amount, status, updated_at
  • state_incremental.watermark_ts for module='orders' equals the max updated_at processed
  • Re-running does not change results

Hints

  • Compute last watermark; subtract 2 hours for the slice
  • Use ROW_NUMBER over order_id ordered by updated_at DESC
  • MERGE matched: UPDATE; not matched: INSERT

Solution
WITH last AS (
  SELECT COALESCE(MAX(watermark_ts), TIMESTAMP '1970-01-01') AS w
  FROM state_incremental WHERE module='orders'
), slice AS (
  SELECT s.* FROM staging_orders s, last l
  WHERE s.updated_at > l.w - INTERVAL '2' HOUR
), dedup AS (
  SELECT s.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) rn
  FROM slice s
)
MERGE INTO fact_orders t
USING (SELECT * FROM dedup WHERE rn=1) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount=s.amount, status=s.status, updated_at=s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, amount, status, updated_at)
VALUES(s.order_id, s.amount, s.status, s.updated_at);

-- dedup is statement-scoped; read the max processed timestamp back from the target
INSERT INTO state_incremental(module, watermark_ts)
SELECT 'orders', MAX(updated_at) FROM fact_orders
ON CONFLICT (module) DO UPDATE SET watermark_ts=EXCLUDED.watermark_ts;

Exercise 2 — Build SCD2 for products

You have:

  • staging_products(product_id, name, price, updated_at)
  • dim_product(product_id, name, price, effective_from, effective_to, is_current)

Task: Implement SCD2 so that when name or price changes, the current record is end-dated and a new current record is inserted starting at updated_at.

Edge cases: Late-arriving change (older updated_at than current effective_from). Ensure no overlapping effective ranges for the same product_id.

Expected output

  • Only one is_current=TRUE row per product_id
  • No overlapping effective ranges
  • Historical rows preserved with correct effective_to

Hints

  • Close current rows where attributes differ
  • Insert a new row with effective_to = 9999-12-31
  • Consider subtracting 1 second (or the smallest time unit) when setting effective_to

Solution
WITH incoming AS (
  SELECT product_id, name, price, updated_at FROM staging_products
)
UPDATE dim_product d
SET effective_to = i.updated_at - INTERVAL '1' SECOND,
    is_current = FALSE
FROM incoming i
WHERE d.product_id = i.product_id
  AND d.is_current = TRUE
  AND (d.name != i.name OR d.price != i.price)
  AND d.effective_from <= i.updated_at
  AND d.effective_to >= i.updated_at;

-- incoming is statement-scoped, so re-derive it for the insert
WITH incoming AS (
  SELECT product_id, name, price, updated_at FROM staging_products
)
INSERT INTO dim_product(product_id, name, price, effective_from, effective_to, is_current)
SELECT i.product_id, i.name, i.price, i.updated_at, TIMESTAMP '9999-12-31', TRUE
FROM incoming i
LEFT JOIN dim_product d
  ON d.product_id = i.product_id AND d.is_current = TRUE
WHERE d.product_id IS NULL
   OR (d.name != i.name OR d.price != i.price);

Common mistakes and self-check

  • Relying on unreliable timestamps: Self-check by comparing counts of changed rows against CDC or source logs.
  • No overlap window: Add a 1–24 hour overlap and verify that late updates appear in downstream reports.
  • Non-idempotent scripts: Re-run your job twice; results must be identical.
  • Forgetting deletes: If source can delete, use CDC or a deleted flag and handle it in MERGE.
  • Duplicate rows after merge: Add ROW_NUMBER dedup logic and a uniqueness test on keys.
  • Over-advancing watermark: Advance only after successful load and use the max processed timestamp.

Practical projects

  • Build an incremental pipeline for a sales fact table using updated_at with a 6-hour overlap and deduplication.
  • Create a SCD2 customer dimension with tests ensuring one current row per customer_id.
  • Prototype a CDC-based MERGE for orders with delete handling and a state table.

Who this is for

Aspiring and practicing Analytics Engineers, BI Developers, and Data Engineers who need reliable and cost-effective warehouse loads.

Prerequisites

  • SQL joins, window functions, and MERGE/UPSERT basics
  • Understanding of primary keys and constraints
  • Familiarity with batch scheduling and warehouse tables

Learning path

  • Start with timestamp-based incrementals
  • Learn CDC upserts and delete handling
  • Implement SCD2 for one dimension
  • Add idempotency tests and monitoring
  • Handle late data and backfills safely

Next steps

Practice with the exercises above, then take the quick test below to confirm your understanding.

Mini challenge

Your source ERP table has unreliable updated_at but guarantees monotonically increasing surrogate id per change. Design an incremental pattern to load changes and handle deletes. Outline the state you will store, your MERGE keys, and how you will detect deletes. Keep it idempotent.

Incremental Loading Strategies — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
