Why this matters
Incremental models let you process only new or changed data instead of rebuilding entire tables. This keeps pipelines fast, costs low, and datasets fresh. As a Data Engineer, you will:
- Load large facts (orders, events) without full refreshes.
- Upsert corrections and late-arriving data safely.
- Maintain Slowly Changing Dimensions (SCD) for accurate history.
- Choose the right materialization (view, table, incremental) to balance freshness, cost, and reliability.
Who this is for
- Data Engineers and Analytics Engineers working with SQL warehouses.
- Anyone using tools that support incremental builds (e.g., SQL MERGE-based pipelines).
Prerequisites
- Comfortable with SQL (JOINs, window functions, INSERT/UPDATE/MERGE).
- Basic data modeling: facts, dimensions, keys.
- Familiar with your warehouse’s partitioning/clustering and transaction behavior.
Concept explained simply
An incremental model stores its previous results and, on each run, only processes a slice (new/updated rows). That slice is merged into the target table. This makes runs faster and cheaper than full rebuilds.
Mental model
- Think of a library that only shelves new books and replaces corrected copies—no need to re-shelve the whole library each day.
- Two pillars: a trustworthy key (to match rows) and a trustworthy change signal (timestamp or version) to know what to process.
- Idempotency: re-running the same slice shouldn’t duplicate or corrupt data.
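In SQL terms, those pillars usually reduce to three steps: read a watermark from the target, slice and deduplicate the source, and upsert the slice. A minimal sketch with illustrative names (my_schema.target, my_schema.source, key column id); the worked examples below flesh this out:
-- 1) Watermark: the latest change already loaded
-- 2) Slice: source rows at or after the watermark (minus a small lookback)
-- 3) Upsert: merge the slice into the target
MERGE INTO my_schema.target t
USING (
  SELECT *
  FROM my_schema.source
  WHERE updated_at >= (
    SELECT COALESCE(MAX(updated_at), TIMESTAMP '1900-01-01') - INTERVAL '1' DAY
    FROM my_schema.target
  )
) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET updated_at = s.updated_at   -- plus the other tracked columns
WHEN NOT MATCHED THEN INSERT (id, updated_at) VALUES (s.id, s.updated_at);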
Materializations at a glance
- View: always computed on read. Great for small, simple transformations or reference data; not ideal for heavy aggregations.
- Table: computed once per run and stored. Great when stability and speed matter and the data isn't huge or doesn't need up-to-the-minute freshness.
- Incremental table: stored like a table, but only new/changed rows are processed on each run. Ideal for large, regularly updated datasets (a quick SQL sketch of the first two follows this list).
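Object names here are illustrative, and CREATE OR REPLACE TABLE ... AS is Snowflake/BigQuery-style syntax; other engines use CREATE TABLE ... AS plus a drop:
-- View: nothing stored; the query runs every time the view is read
CREATE OR REPLACE VIEW analytics.v_daily_revenue AS
SELECT CAST(created_at AS DATE) AS order_date, SUM(amount) AS revenue
FROM analytics.fact_orders
GROUP BY CAST(created_at AS DATE);

-- Table: the result is computed and stored; a full refresh re-runs the whole query
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT CAST(created_at AS DATE) AS order_date, SUM(amount) AS revenue
FROM analytics.fact_orders
GROUP BY CAST(created_at AS DATE);

-- Incremental table: stored like a table, then maintained with MERGE/INSERT on new slices
-- (see the worked examples below)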
Designing a safe incremental model
- Pick keys: a unique identifier per row (e.g., order_id, event_id).
- Pick a change detector: updated_at or event_time. If absent, derive a hash/version (a hash sketch follows this list).
- Stage and clean: deduplicate the slice using window functions or hashes.
- Slice safely: filter the source to a time window using a watermark + a small lookback (e.g., 1–3 days) to catch late events.
- Merge strategy: MERGE (upsert) where supported; else transaction with UPDATE then INSERT.
- Partitions/clustering: align with time columns used in slicing to prune reads.
- Idempotency: ensure re-running the same slice yields the same result.
- Validation: row counts per day, uniqueness, and change rates.
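The hash-based change detector mentioned above might look like this; a minimal sketch assuming MD5 and || string concatenation are available (hash functions, casts, and NULL handling vary by warehouse):
-- A stable fingerprint of the tracked columns acts as the change signal:
-- if the stored hash differs from the incoming hash, the row has changed.
SELECT
  order_id,
  MD5(
    COALESCE(CAST(customer_id AS VARCHAR), '') || '|' ||
    COALESCE(status, '') || '|' ||
    COALESCE(CAST(amount AS VARCHAR), '')
  ) AS row_hash
FROM staging.orders;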
Worked examples
Example 1: Incremental fact_orders with MERGE
Goal: Upsert orders by order_id using updated_at with a 2-day lookback.
WITH watermark AS (
  SELECT COALESCE(MAX(updated_at), TIMESTAMP '1900-01-01') AS last_ts
  FROM analytics.fact_orders
), raw_slice AS (
  SELECT *
  FROM staging.orders
  WHERE updated_at >= (SELECT last_ts - INTERVAL '2' DAY FROM watermark)
), deduped AS (
  SELECT *
  FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
    FROM raw_slice o
  ) x
  WHERE rn = 1
)
MERGE INTO analytics.fact_orders t
USING deduped s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET
  customer_id = s.customer_id,
  status = s.status,
  amount = s.amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, status, amount, created_at, updated_at)
  VALUES (s.order_id, s.customer_id, s.status, s.amount, s.created_at, s.updated_at);
Why it works:
- Lookback captures late updates.
- Dedup ensures one row per order_id from the slice.
- MERGE updates changed rows and inserts new ones.
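Where MERGE is not available, the checklist's fallback (UPDATE then INSERT inside one transaction) looks like this; a sketch, assuming the deduped slice has been staged as a table named tmp_orders_slice (an illustrative name) and that the engine supports PostgreSQL/Snowflake-style UPDATE ... FROM:
BEGIN;

-- Update rows that already exist and have a newer version in the slice
UPDATE analytics.fact_orders t
SET customer_id = s.customer_id,
    status      = s.status,
    amount      = s.amount,
    updated_at  = s.updated_at
FROM tmp_orders_slice s
WHERE t.order_id = s.order_id
  AND s.updated_at > t.updated_at;

-- Insert rows that are not yet in the target
INSERT INTO analytics.fact_orders (order_id, customer_id, status, amount, created_at, updated_at)
SELECT s.order_id, s.customer_id, s.status, s.amount, s.created_at, s.updated_at
FROM tmp_orders_slice s
LEFT JOIN analytics.fact_orders t
  ON t.order_id = s.order_id
WHERE t.order_id IS NULL;

COMMIT;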
Example 2: Append-only events, dedupe + insert
Goal: Ingest clickstream events by event_id, append only, avoid duplicates.
WITH watermark AS (
  SELECT COALESCE(MAX(event_time), TIMESTAMP '1900-01-01') AS last_ts
  FROM analytics.events
), raw_slice AS (
  SELECT *
  FROM landing.events
  WHERE event_time >= (SELECT last_ts - INTERVAL '1' DAY FROM watermark)
), deduped AS (
  SELECT * FROM (
    SELECT e.*,
           ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time DESC) AS rn
    FROM raw_slice e
  ) z WHERE rn = 1
), new_only AS (
  SELECT d.*
  FROM deduped d
  LEFT JOIN analytics.events t
    ON t.event_id = d.event_id
  WHERE t.event_id IS NULL
)
-- Partitioning on DATE(event_time) is declared in the target table's DDL (see the sketch
-- after this example), not on the INSERT. Listing columns explicitly also keeps the
-- helper rn column out of the target; the column names here are illustrative.
INSERT INTO analytics.events (event_id, user_id, event_time, payload)
SELECT event_id, user_id, event_time, payload
FROM new_only;
Why it works:
- Append-only: existing rows are not updated.
- Dedup by event_id prevents duplicates.
- The target table is partitioned by event date, so the watermark filter prunes reads.
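The partitioning itself belongs in the target table's DDL. A sketch in BigQuery-style syntax as one concrete assumption (column names are illustrative; other warehouses express this with their own PARTITION BY/CLUSTER BY or sort-key clauses):
-- Daily partitions on event_time let the watermark filter prune most of the table;
-- clustering on event_id speeds up the anti-join/dedup lookups.
CREATE TABLE IF NOT EXISTS analytics.events (
  event_id   STRING,
  user_id    STRING,
  event_time TIMESTAMP,
  payload    STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY event_id;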
Example 3: SCD Type 2 for customer dimension
Goal: Track changes to customer_address over time with effective_from, effective_to, is_current.
-- Source has possibly changed attributes with updated_at
WITH src AS (
  SELECT customer_id, name, address, updated_at AS change_ts
  FROM staging.customers
), latest_src AS (
  -- One row per customer_id from the latest change
  SELECT * FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY change_ts DESC) AS rn
    FROM src s
  ) x WHERE rn = 1
)
-- 1) Close out current rows where attributes changed
--    (UPDATE ... FROM is PostgreSQL/Snowflake-style syntax; adapt or use MERGE elsewhere)
UPDATE analytics.dim_customer t
SET effective_to = l.change_ts, is_current = FALSE
FROM latest_src l
WHERE t.customer_id = l.customer_id
  AND t.is_current = TRUE
  AND (t.name IS DISTINCT FROM l.name
       OR t.address IS DISTINCT FROM l.address);

-- 2) Insert new current rows for changed or brand-new customers.
--    CTEs are scoped to a single statement, so the WITH clause is repeated here.
WITH src AS (
  SELECT customer_id, name, address, updated_at AS change_ts
  FROM staging.customers
), latest_src AS (
  SELECT * FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY change_ts DESC) AS rn
    FROM src s
  ) x WHERE rn = 1
)
INSERT INTO analytics.dim_customer (
  customer_id, name, address, effective_from, effective_to, is_current
)
SELECT l.customer_id, l.name, l.address, l.change_ts, TIMESTAMP '9999-12-31', TRUE
FROM latest_src l
LEFT JOIN analytics.dim_customer t
  ON t.customer_id = l.customer_id AND t.is_current = TRUE
WHERE t.customer_id IS NULL -- new customer
   OR (t.name IS DISTINCT FROM l.name OR t.address IS DISTINCT FROM l.address);
Why it works:
- Update closes the old row’s validity window.
- Insert adds a new current row.
- Idempotent if the same change_ts is processed again.
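Two checks worth running after each SCD2 load; a sketch against the dim_customer table above, assuming the example's convention that effective_to is an exclusive upper bound and current rows carry 9999-12-31 (the probe date is just an example):
-- Every customer should have exactly one current row
SELECT customer_id, COUNT(*) AS current_rows
FROM analytics.dim_customer
WHERE is_current = TRUE
GROUP BY customer_id
HAVING COUNT(*) <> 1;

-- Point-in-time lookup: which address each customer had on a given date
SELECT customer_id, name, address
FROM analytics.dim_customer
WHERE TIMESTAMP '2025-02-03' >= effective_from
  AND TIMESTAMP '2025-02-03' < effective_to;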
Performance and reliability tips
- Prune scans: filter slices by time and use partitions/clustering on that column.
- Reduce rows early: deduplicate in staging CTEs.
- Keep lookback small but safe: typical 1–3 days; adjust to your source latency.
- Batch keys: for very late data, combine the time-based slice with a list of affected keys (sketched after this list).
- Add uniqueness checks on keys to catch accidental duplicates.
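The batch-keys tip combines the normal watermark window with an explicit key list. A sketch, assuming a helper table ops.late_order_ids (an illustrative name) populated by the source system or a reconciliation job:
-- Take everything inside the usual lookback window, plus any explicitly flagged keys,
-- no matter how old their updates are.
SELECT o.*
FROM staging.orders o
WHERE o.updated_at >= (
        SELECT COALESCE(MAX(updated_at), TIMESTAMP '1900-01-01') - INTERVAL '2' DAY
        FROM analytics.fact_orders
      )
   OR o.order_id IN (SELECT order_id FROM ops.late_order_ids);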
Common mistakes and self-checks
- Mistake: Using created_at as watermark when updates happen later. Self-check: Compare max(updated_at) in source vs target; ensure late updates appear.
- Mistake: No dedup in the slice. Self-check: Count duplicates by key in the slice with GROUP BY ... HAVING COUNT(*) > 1 (a query sketch follows this list).
- Mistake: No lookback window. Self-check: Sample late-arriving records and verify they load.
- Mistake: Non-idempotent logic (e.g., summing partial aggregates). Self-check: Re-run the same date slice twice; table should not change.
- Mistake: Missing WHERE rn=1 after windowing. Self-check: Ensure window function filters to a single winner per key.
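Two of those self-checks written as queries; a sketch against the orders example, assuming the staged slice is materialized as tmp_orders_slice (an illustrative name):
-- Duplicate keys in the slice (should return zero rows)
SELECT order_id, COUNT(*) AS copies
FROM tmp_orders_slice
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Idempotency probe: capture simple aggregates, re-run the same slice, then compare
SELECT COUNT(*) AS row_count,
       SUM(amount) AS total_amount,
       MAX(updated_at) AS max_updated_at
FROM analytics.fact_orders;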
Exercises
Do these in any SQL warehouse (or reason through in a notebook).
- Exercise 1 — Orders incremental upsert
Design a MERGE-based incremental load for orders using a 2-day lookback and updated_at. Sample data (staging.orders):
order_id | customer_id | status  | amount | created_at          | updated_at
1        | 10          | placed  | 25.00  | 2025-01-01 10:00:00 | 2025-01-01 10:00:00
2        | 11          | placed  | 18.00  | 2025-01-01 10:05:00 | 2025-01-01 10:05:00
2        | 11          | shipped | 18.00  | 2025-01-01 10:05:00 | 2025-01-02 06:00:00
3        | 12          | placed  | 40.00  | 2025-01-02 07:00:00 | 2025-01-02 07:00:00
- Exercise 2 — SCD2 for customers
Implement an SCD2 dimension that tracks name/address changes using effective_from, effective_to, is_current. Sample data (staging.customers):
customer_id | name     | address     | updated_at
7           | Dana Lee | 12 Pine St  | 2025-02-01 08:00:00
7           | Dana Lee | 99 Oak Ave  | 2025-02-05 09:00:00
8           | Kyle Roy | 77 Elm Blvd | 2025-02-03 12:00:00
Exercise checklist
- [ ] Uses a unique key per row.
- [ ] Has a safe watermark and lookback.
- [ ] Deduplicates within the slice.
- [ ] Upserts (MERGE or UPDATE+INSERT) idempotently.
- [ ] Validates with counts and uniqueness tests.
Mini challenge
Pick one scenario and outline an incremental plan:
- High-volume pageviews with occasional duplicate event_ids.
- Payments where status may change from pending to settled days later.
- Product catalog where price changes must maintain history (SCD2).
For each: define the key, change signal, lookback, dedup logic, and merge/insert approach.
Learning path
- Before: Staging and data cleaning patterns.
- Now: Incremental models and materializations (this page).
- Next: Partitioning & clustering strategies; SCD types; Orchestration & scheduling; Data quality tests.
Practical projects
- Build an incremental orders fact with a daily schedule and backfill script for the last 30 days.
- Create an SCD2 customer dimension and validate with change audits.
- Ingest clickstream events with partitioning and deduplication; measure cost savings vs full refresh.
Next steps
- Harden your models with uniqueness and freshness checks.
- Parameterize lookback windows and add guardrails for backfills.
- Run the quick test below to check your understanding.