Why this matters
Incremental models let you process only new or changed data instead of rebuilding entire tables. This keeps pipelines fast, costs low, and datasets fresh. As a Data Engineer, you will:
- Load large facts (orders, events) without full refreshes.
- Upsert corrections and late-arriving data safely.
- Maintain Slowly Changing Dimensions (SCD) for accurate history.
- Choose the right materialization (view, table, incremental) to balance freshness, cost, and reliability.
Who this is for
- Data Engineers and Analytics Engineers working with SQL warehouses.
- Anyone using tools that support incremental builds (e.g., SQL MERGE-based pipelines).
Prerequisites
- Comfortable with SQL (JOINs, window functions, INSERT/UPDATE/MERGE).
- Basic data modeling: facts, dimensions, keys.
- Familiar with your warehouse’s partitioning/clustering and transaction behavior.
Concept explained simply
An incremental model stores its previous results and, on each run, only processes a slice (new/updated rows). That slice is merged into the target table. This makes runs faster and cheaper than full rebuilds.
Mental model
- Think of a library that only shelves new books and replaces corrected copies—no need to re-shelve the whole library each day.
- Two pillars: a trustworthy key (to match rows) and a trustworthy change signal (timestamp or version) to know what to process.
- Idempotency: re-running the same slice shouldn’t duplicate or corrupt data.
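In SQL terms, those pillars usually reduce to three steps: read a watermark from the target, slice and deduplicate the source, and upsert the slice. A minimal sketch with illustrative names (my_schema.target, my_schema.source, key column id); the worked examples below flesh this out:
-- 1) Watermark: the latest change already loaded
-- 2) Slice: source rows at or after the watermark (minus a small lookback)
-- 3) Upsert: merge the slice into the target
MERGE INTO my_schema.target t
USING (
  SELECT *
  FROM my_schema.source
  WHERE updated_at >= (
    SELECT COALESCE(MAX(updated_at), TIMESTAMP '1900-01-01') - INTERVAL '1' DAY
    FROM my_schema.target
  )
) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET updated_at = s.updated_at   -- plus the other tracked columns
WHEN NOT MATCHED THEN INSERT (id, updated_at) VALUES (s.id, s.updated_at);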
Materializations at a glance
- View: always computed on read. Great for small, simple transformations or reference data; not ideal for heavy aggregations.
- Table: computed once per run and stored. Great when stability and speed matter and the data isn't huge or doesn't need up-to-the-minute freshness.
- Incremental table: stored like a table, but only new/changed rows are processed on each run. Ideal for large, regularly updated datasets (a quick SQL sketch of the first two follows this list).
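Object names here are illustrative, and CREATE OR REPLACE TABLE ... AS is Snowflake/BigQuery-style syntax; other engines use CREATE TABLE ... AS plus a drop:
-- View: nothing stored; the query runs every time the view is read
CREATE OR REPLACE VIEW analytics.v_daily_revenue AS
SELECT CAST(created_at AS DATE) AS order_date, SUM(amount) AS revenue
FROM analytics.fact_orders
GROUP BY CAST(created_at AS DATE);

-- Table: the result is computed and stored; a full refresh re-runs the whole query
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT CAST(created_at AS DATE) AS order_date, SUM(amount) AS revenue
FROM analytics.fact_orders
GROUP BY CAST(created_at AS DATE);

-- Incremental table: stored like a table, then maintained with MERGE/INSERT on new slices
-- (see the worked examples below)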
Designing a safe incremental model
- Pick keys: a unique identifier per row (e.g., order_id, event_id).
- Pick a change detector: updated_at or event_time. If absent, derive a hash/version (a hash sketch follows this list).
- Stage and clean: deduplicate the slice using window functions or hashes.
- Slice safely: filter the source to a time window using a watermark + a small lookback (e.g., 1–3 days) to catch late events.
- Merge strategy: MERGE (upsert) where supported; else transaction with UPDATE then INSERT.
- Partitions/clustering: align with time columns used in slicing to prune reads.
- Idempotency: ensure re-running the same slice yields the same result.
- Validation: row counts per day, uniqueness, and change rates.
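The hash-based change detector mentioned above might look like this; a minimal sketch assuming MD5 and || string concatenation are available (hash functions, casts, and NULL handling vary by warehouse):
-- A stable fingerprint of the tracked columns acts as the change signal:
-- if the stored hash differs from the incoming hash, the row has changed.
SELECT
  order_id,
  MD5(
    COALESCE(CAST(customer_id AS VARCHAR), '') || '|' ||
    COALESCE(status, '') || '|' ||
    COALESCE(CAST(amount AS VARCHAR), '')
  ) AS row_hash
FROM staging.orders;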
Worked examples
Example 1: Incremental fact_orders with MERGE
Goal: Upsert orders by order_id using updated_at with a 2-day lookback.
WITH watermark AS (
  SELECT COALESCE(MAX(updated_at), TIMESTAMP '1900-01-01') AS last_ts
  FROM analytics.fact_orders
), raw_slice AS (
  SELECT *
  FROM staging.orders
  WHERE updated_at >= (SELECT last_ts - INTERVAL '2' DAY FROM watermark)
), deduped AS (
  SELECT *
  FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
    FROM raw_slice o
  ) x
  WHERE rn = 1
)
MERGE INTO analytics.fact_orders t
USING deduped s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET
  customer_id = s.customer_id,
  status = s.status,
  amount = s.amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, status, amount, created_at, updated_at)
  VALUES (s.order_id, s.customer_id, s.status, s.amount, s.created_at, s.updated_at);
Why it works:
- Lookback captures late updates.
- Dedup ensures one row per order_id from the slice.
- MERGE updates changed rows and inserts new ones.
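Where MERGE is not available, the checklist's fallback (UPDATE then INSERT inside one transaction) looks like this; a sketch, assuming the deduped slice has been staged as a table named tmp_orders_slice (an illustrative name) and that the engine supports PostgreSQL/Snowflake-style UPDATE ... FROM:
BEGIN;

-- Update rows that already exist and have a newer version in the slice
UPDATE analytics.fact_orders t
SET customer_id = s.customer_id,
    status      = s.status,
    amount      = s.amount,
    updated_at  = s.updated_at
FROM tmp_orders_slice s
WHERE t.order_id = s.order_id
  AND s.updated_at > t.updated_at;

-- Insert rows that are not yet in the target
INSERT INTO analytics.fact_orders (order_id, customer_id, status, amount, created_at, updated_at)
SELECT s.order_id, s.customer_id, s.status, s.amount, s.created_at, s.updated_at
FROM tmp_orders_slice s
LEFT JOIN analytics.fact_orders t
  ON t.order_id = s.order_id
WHERE t.order_id IS NULL;

COMMIT;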
Example 2: Append-only events, dedupe + insert
Goal: Ingest clickstream events by event_id, append only, avoid duplicates.
WITH watermark AS (
  SELECT COALESCE(MAX(event_time), TIMESTAMP '1900-01-01') AS last_ts
  FROM analytics.events
), raw_slice AS (
  SELECT *
  FROM landing.events
  WHERE event_time >= (SELECT last_ts - INTERVAL '1' DAY FROM watermark)
), deduped AS (
  SELECT * FROM (
    SELECT e.*,
           ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time DESC) AS rn
    FROM raw_slice e
  ) z WHERE rn = 1
), new_only AS (
  SELECT d.*
  FROM deduped d
  LEFT JOIN analytics.events t
    ON t.event_id = d.event_id
  WHERE t.event_id IS NULL
)
-- Partitioning on DATE(event_time) is declared in the target table's DDL (see the sketch
-- after this example), not on the INSERT. Listing columns explicitly also keeps the
-- helper rn column out of the target; the column names here are illustrative.
INSERT INTO analytics.events (event_id, user_id, event_time, payload)
SELECT event_id, user_id, event_time, payload
FROM new_only;
Why it works:
- Append-only: existing rows are not updated.
- Dedup by event_id prevents duplicates.
- The target table is partitioned by event date, so the watermark filter prunes reads.
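The partitioning itself belongs in the target table's DDL. A sketch in BigQuery-style syntax as one concrete assumption (column names are illustrative; other warehouses express this with their own PARTITION BY/CLUSTER BY or sort-key clauses):
-- Daily partitions on event_time let the watermark filter prune most of the table;
-- clustering on event_id speeds up the anti-join/dedup lookups.
CREATE TABLE IF NOT EXISTS analytics.events (
  event_id   STRING,
  user_id    STRING,
  event_time TIMESTAMP,
  payload    STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY event_id;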
Example 3: SCD Type 2 for customer dimension
Goal: Track changes to customer_address over time with effective_from, effective_to, is_current.
-- Source has possibly changed attributes with updated_at
WITH src AS (
  SELECT customer_id, name, address, updated_at AS change_ts
  FROM staging.customers
), latest_src AS (
  -- One row per customer_id from the latest change
  SELECT * FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY change_ts DESC) AS rn
    FROM src s
  ) x WHERE rn = 1
)
-- 1) Close out current rows where attributes changed
--    (UPDATE ... FROM is PostgreSQL/Snowflake-style syntax; adapt or use MERGE elsewhere)
UPDATE analytics.dim_customer t
SET effective_to = l.change_ts, is_current = FALSE
FROM latest_src l
WHERE t.customer_id = l.customer_id
  AND t.is_current = TRUE
  AND (t.name IS DISTINCT FROM l.name
       OR t.address IS DISTINCT FROM l.address);

-- 2) Insert new current rows for changed or brand-new customers.
--    CTEs are scoped to a single statement, so the WITH clause is repeated here.
WITH src AS (
  SELECT customer_id, name, address, updated_at AS change_ts
  FROM staging.customers
), latest_src AS (
  SELECT * FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY change_ts DESC) AS rn
    FROM src s
  ) x WHERE rn = 1
)
INSERT INTO analytics.dim_customer (
  customer_id, name, address, effective_from, effective_to, is_current
)
SELECT l.customer_id, l.name, l.address, l.change_ts, TIMESTAMP '9999-12-31', TRUE
FROM latest_src l
LEFT JOIN analytics.dim_customer t
  ON t.customer_id = l.customer_id AND t.is_current = TRUE
WHERE t.customer_id IS NULL -- new customer
   OR (t.name IS DISTINCT FROM l.name OR t.address IS DISTINCT FROM l.address);
Why it works:
- Update closes the old row’s validity window.
- Insert adds a new current row.
- Idempotent if the same change_ts is processed again.
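Two checks worth running after each SCD2 load; a sketch against the dim_customer table above, assuming the example's convention that effective_to is an exclusive upper bound and current rows carry 9999-12-31 (the probe date is just an example):
-- Every customer should have exactly one current row
SELECT customer_id, COUNT(*) AS current_rows
FROM analytics.dim_customer
WHERE is_current = TRUE
GROUP BY customer_id
HAVING COUNT(*) <> 1;

-- Point-in-time lookup: which address each customer had on a given date
SELECT customer_id, name, address
FROM analytics.dim_customer
WHERE TIMESTAMP '2025-02-03' >= effective_from
  AND TIMESTAMP '2025-02-03' < effective_to;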
Performance and reliability tips
- Prune scans: filter slices by time and use partitions/clustering on that column.
- Reduce rows early: deduplicate in staging CTEs.
- Keep lookback small but safe: typical 1–3 days; adjust to your source latency.
- Batch keys: for very late data, combine the time-based slice with a list of affected keys (sketched after this list).
- Add uniqueness checks on keys to catch accidental duplicates.
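The batch-keys tip combines the normal watermark window with an explicit key list. A sketch, assuming a helper table ops.late_order_ids (an illustrative name) populated by the source system or a reconciliation job:
-- Take everything inside the usual lookback window, plus any explicitly flagged keys,
-- no matter how old their updates are.
SELECT o.*
FROM staging.orders o
WHERE o.updated_at >= (
        SELECT COALESCE(MAX(updated_at), TIMESTAMP '1900-01-01') - INTERVAL '2' DAY
        FROM analytics.fact_orders
      )
   OR o.order_id IN (SELECT order_id FROM ops.late_order_ids);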
Common mistakes and self-checks
- Mistake: Using created_at as watermark when updates happen later. Self-check: Compare max(updated_at) in source vs target; ensure late updates appear.
- Mistake: No dedup in the slice. Self-check: Count duplicates by key in the slice with GROUP BY ... HAVING COUNT(*) > 1 (a query sketch follows this list).
- Mistake: No lookback window. Self-check: Sample late-arriving records and verify they load.
- Mistake: Non-idempotent logic (e.g., summing partial aggregates). Self-check: Re-run the same date slice twice; table should not change.
- Mistake: Missing WHERE rn=1 after windowing. Self-check: Ensure window function filters to a single winner per key.
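Two of those self-checks written as queries; a sketch against the orders example, assuming the staged slice is materialized as tmp_orders_slice (an illustrative name):
-- Duplicate keys in the slice (should return zero rows)
SELECT order_id, COUNT(*) AS copies
FROM tmp_orders_slice
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Idempotency probe: capture simple aggregates, re-run the same slice, then compare
SELECT COUNT(*) AS row_count,
       SUM(amount) AS total_amount,
       MAX(updated_at) AS max_updated_at
FROM analytics.fact_orders;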
Exercises
Do these in any SQL warehouse (or reason through in a notebook).
- Exercise 1 — Orders incremental upsert
Design a MERGE-based incremental load for orders using a 2-day lookback and updated_at. Sample data (staging.orders):
order_id | customer_id | status  | amount | created_at          | updated_at
1        | 10          | placed  | 25.00  | 2025-01-01 10:00:00 | 2025-01-01 10:00:00
2        | 11          | placed  | 18.00  | 2025-01-01 10:05:00 | 2025-01-01 10:05:00
2        | 11          | shipped | 18.00  | 2025-01-01 10:05:00 | 2025-01-02 06:00:00
3        | 12          | placed  | 40.00  | 2025-01-02 07:00:00 | 2025-01-02 07:00:00
- Exercise 2 — SCD2 for customers
Implement an SCD2 dimension that tracks name/address changes using effective_from, effective_to, is_current. Sample data (staging.customers):
customer_id | name     | address     | updated_at
7           | Dana Lee | 12 Pine St  | 2025-02-01 08:00:00
7           | Dana Lee | 99 Oak Ave  | 2025-02-05 09:00:00
8           | Kyle Roy | 77 Elm Blvd | 2025-02-03 12:00:00
Exercise checklist
- [ ] Uses a unique key per row.
- [ ] Has a safe watermark and lookback.
- [ ] Deduplicates within the slice.
- [ ] Upserts (MERGE or UPDATE+INSERT) idempotently.
- [ ] Validates with counts and uniqueness tests.
Mini challenge
Pick one scenario and outline an incremental plan:
- High-volume pageviews with occasional duplicate event_ids.
- Payments where status may change from pending to settled days later.
- Product catalog where price changes must maintain history (SCD2).
For each: define the key, change signal, lookback, dedup logic, and merge/insert approach.
Learning path
- Before: Staging and data cleaning patterns.
- Now: Incremental models and materializations (this page).
- Next: Partitioning & clustering strategies; SCD types; Orchestration & scheduling; Data quality tests.
Practical projects
- Build an incremental orders fact with a daily schedule and backfill script for the last 30 days.
- Create an SCD2 customer dimension and validate with change audits.
- Ingest clickstream events with partitioning and deduplication; measure cost savings vs full refresh.
Next steps
- Harden your models with uniqueness and freshness checks.
- Parameterize lookback windows and add guardrails for backfills.
- Run the quick test below to check your understanding.