Why this matters
Incremental models let you process only new or changed data instead of rebuilding everything. Pruning ensures your queries scan only the relevant partitions or files. Together, they slash compute costs and runtime while keeping pipelines reliable.
- Real tasks you will do: load daily events into a fact table, upsert product attributes, handle late-arriving records, backfill a specific date range, and keep queries fast by filtering on partitions.
- Results: faster transformations, predictable SLAs, and lower warehouse bills.
Concept explained simply
Think of your data like a library. Incremental models shelve only the new or edited books; pruning tells the librarian exactly which shelf and section to search so you don’t scan the whole building.
Mental model
- Incremental = process delta only: new rows or changed rows since a "watermark" (highest timestamp or ID you have processed).
- Pruning = limit scan early: filter by partitioned columns (like event_date) and avoid functions that hide the filter from the optimizer.
Key terms
- High-water mark (HWM): the last processed value (e.g., max(updated_at) in your target). Used to pick the next batch.
- Grace window: small overlap (e.g., 24 hours) to catch late-arriving rows.
- Partition pruning: scanning only partitions that match a simple predicate (e.g., event_date between '2025-01-01' and '2025-01-02').
- File/cluster pruning: skipping files or clusters that cannot contain matching rows due to statistics or ordering.
- Upsert: insert new rows and update changed rows based on a unique key.
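To tie these terms together, here is a minimal sketch of a watermark query with a grace window and a pruning-friendly filter (target_table, source_table, and the column names are illustrative; adjust functions to your warehouse):
-- HWM = the latest value already loaded into the target
WITH meta AS (
  SELECT COALESCE(MAX(event_time), TIMESTAMP '1970-01-01 00:00:00') AS hwm
  FROM target_table
)
SELECT s.*
FROM source_table s
CROSS JOIN meta m
WHERE s.event_time >= m.hwm - INTERVAL '1' DAY           -- HWM minus a 24h grace window
  AND s.event_date >= DATE(m.hwm - INTERVAL '1' DAY);    -- simple predicate on the partition column enables pruning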
Core patterns you will use
1) Incremental for append-only events
- Identify a partition column (event_date) and a watermark column (event_time).
- Each run: load rows where event_time >= HWM - grace_window and event_date in that timeframe.
- Deduplicate by unique key if duplicates are possible.
2) Incremental upsert (dimensions)
- Stage changed rows (updated since HWM - grace_window).
- MERGE into the target using a stable unique key (e.g., product_id).
- Partition by updated_date if the table is large for better pruning.
3) Snapshot + dedup for facts with repeats
- Union new rows with existing, then choose the latest per unique key via window functions.
- Ideal when source can resend the same event with later corrections.
4) Late-arriving data
- Use a rolling grace window (e.g., 1–3 days) around HWM.
- For very late rows, support targeted backfills for affected dates.
5) Backfills
- Do partial rebuilds by date range instead of a full refresh.
- Invalidate only affected partitions; keep the rest intact.
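A minimal backfill sketch, assuming the same layout as the worked examples below (fact_events partitioned by event_date, staging_events as the corrected source; the dates and column handling are illustrative, and some engines offer a partition-overwrite statement instead of DELETE plus INSERT):
-- 1) Drop only the affected partitions
DELETE FROM fact_events
WHERE event_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-03';
-- 2) Reload just that date range, deduplicated by event_id
INSERT INTO fact_events
SELECT * EXCEPT(rn)
FROM (
  SELECT s.*, ROW_NUMBER() OVER (PARTITION BY s.event_id ORDER BY s.event_time DESC) AS rn
  FROM staging_events s
  WHERE DATE(s.event_time) BETWEEN DATE '2025-01-01' AND DATE '2025-01-03'  -- prunes the source scan
)
WHERE rn = 1;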
Worked examples (3)
Example 1 — Append-only events with watermark and pruning
Goal: Load only new events into fact_events and keep scans small.
-- Assume target fact_events is partitioned by event_date (DATE(event_time));
-- BigQuery-style syntax, adjust functions to your warehouse
INSERT INTO fact_events
-- 1) Get current high-water mark from target
WITH meta AS (
  SELECT COALESCE(MAX(event_time), TIMESTAMP '1970-01-01 00:00:00') AS hwm
  FROM fact_events
),
-- 2) Define a grace window (e.g., 24h) and prune the source scan
windowed_source AS (
  SELECT s.*
  FROM staging_events s
  CROSS JOIN meta m
  WHERE s.event_time >= m.hwm - INTERVAL '1' DAY
    AND DATE(s.event_time) BETWEEN DATE(m.hwm - INTERVAL '1' DAY) AND CURRENT_DATE
    -- 3) Skip rows already loaded during the overlap so re-runs stay idempotent
    AND NOT EXISTS (
      SELECT 1
      FROM fact_events t
      WHERE t.event_date >= DATE(m.hwm - INTERVAL '1' DAY)  -- restrict the lookup to recent dates
        AND t.event_id = s.event_id
    )
)
-- 4) Deduplicate within the batch by event_id and take the latest
SELECT * EXCEPT(rn)
FROM (
  SELECT s.*, ROW_NUMBER() OVER (PARTITION BY s.event_id ORDER BY s.event_time DESC) AS rn
  FROM windowed_source s
)
WHERE rn = 1;
Why it works: The date filters prune the source scan to recent partitions, the grace window captures late rows without rescanning history, and the anti-join plus in-batch dedup keep re-runs from inserting duplicates.
Example 2 — Dimension upsert with MERGE and rolling window
Goal: Keep dim_product up to date with changed attributes.
-- 1) Compute the HWM on updated_at, stage the changed rows, and dedup them;
--    the CTEs sit inside USING so the whole statement is a single MERGE
MERGE INTO dim_product t
USING (
  WITH meta AS (
    SELECT COALESCE(MAX(updated_at), TIMESTAMP '1970-01-01 00:00:00') AS hwm
    FROM dim_product
  ),
  changes AS (
    SELECT s.*
    FROM staging_product s
    CROSS JOIN meta m
    WHERE s.updated_at >= m.hwm - INTERVAL '1' DAY
      AND DATE(s.updated_at) BETWEEN DATE(m.hwm - INTERVAL '1' DAY) AND CURRENT_DATE
  )
  -- 2) Keep only the latest change per product within the batch
  SELECT d.* EXCEPT(rn)
  FROM (
    SELECT c.*, ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY updated_at DESC) AS rn
    FROM changes c
  ) d
  WHERE rn = 1
) s
-- 3) Upsert using a stable unique key
ON t.product_id = s.product_id
WHEN MATCHED THEN UPDATE SET
  name = s.name,
  category = s.category,
  price = s.price,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (
  product_id, name, category, price, updated_at
) VALUES (
  s.product_id, s.name, s.category, s.price, s.updated_at
);
Why it works: Only source partitions that overlap the rolling window are scanned, and because the batch is deduplicated first, the MERGE applies exactly one update or insert per product.
Example 3 — Fact table dedup + clustering/file pruning
Goal: Build fact_orders incrementally where source can resend corrections.
-- Partition fact_orders by order_date; optionally cluster/sort by customer_id, order_id
MERGE INTO fact_orders t
USING (
  WITH meta AS (
    SELECT COALESCE(MAX(order_ts), TIMESTAMP '1970-01-01 00:00:00') AS hwm
    FROM fact_orders
  ),
  win_src AS (
    SELECT s.*
    FROM staging_orders s
    CROSS JOIN meta m
    WHERE s.order_ts >= m.hwm - INTERVAL '2' DAY
      AND DATE(s.order_ts) BETWEEN DATE(m.hwm - INTERVAL '2' DAY) AND CURRENT_DATE
  ),
  -- Deduplicate by order_id, taking the most recent order_ts
  ranked AS (
    SELECT w.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_ts DESC) AS rn
    FROM win_src w
  )
  SELECT * FROM ranked WHERE rn = 1
) s
-- Upsert so resent corrections overwrite earlier rows and re-runs stay idempotent
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  order_ts = s.order_ts,
  order_date = DATE(s.order_ts),
  amount = s.amount,
  status = s.status
WHEN NOT MATCHED THEN INSERT (
  order_id, order_ts, order_date, amount, status
) VALUES (
  s.order_id, s.order_ts, DATE(s.order_ts), s.amount, s.status
);
Why it works: Partition filters prune old dates. Ranking ensures one row per order_id. MERGE keeps the process idempotent.
How to implement reliably
- Pick a stable unique key per table (e.g., event_id, product_id). If none exists, create a hash key from stable fields (see the sketch after this list).
- Define HWM on a column that truly moves forward (updated_at is better than created_at for mutable records).
- Always include a grace window to catch late data.
- Write predicates on partition columns without wrapping them in functions that hide the column (use DATE(col) at write time to populate a partition column, then filter on that partition column).
- Validate with row counts, min/max timestamps, duplicate checks, and sample spot checks.
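Where no natural key exists (first bullet above), a hash of stable fields can serve as the unique key. A minimal sketch, assuming BigQuery-style functions (MD5, TO_HEX, CONCAT) and illustrative column names; other warehouses expose similar hashing functions:
SELECT
  TO_HEX(MD5(CONCAT(
    COALESCE(customer_id, ''), '|',                -- only immutable fields go into the key
    COALESCE(order_number, ''), '|',
    COALESCE(CAST(order_ts AS STRING), '')
  ))) AS surrogate_key,
  s.*
FROM staging_orders s;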
Self-check checklist
- Do queries filter on partition columns directly (e.g., event_date) with simple comparisons?
- Does the HWM move forward each run and never go backward?
- Is there a grace window, and is it large enough to catch typical late arrivals?
- Can the job be re-run without duplicating rows (idempotent)?
- Is the unique key truly unique across the batch and history?
Common mistakes and how to self-check
- Using MAX(created_at) as HWM for mutable data. Fix: use updated_at or a change version column.
- No grace window, missing late-arriving rows. Fix: subtract 1–3 days from HWM depending on source behavior.
- Breaking partition pruning by wrapping partition columns in functions in the WHERE clause. Fix: filter directly on the partition column with simple comparisons (see the sketch after this list).
- Duplicates after incremental loads. Fix: deduplicate within each batch using ROW_NUMBER() over the unique key.
- Non-deterministic merges (no stable key). Fix: define a stable unique key or hash of immutable fields.
- Overly broad backfills (full refresh). Fix: rebuild only affected partitions or dates.
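To make the pruning mistake concrete, a before/after sketch on a table partitioned by event_date (BigQuery-style DATE_TRUNC argument order assumed; whether an engine can still prune through a wrapped column varies, so the simple range form is the safe one):
-- Pruning-hostile: the partition column is wrapped in a function
SELECT * FROM fact_events
WHERE DATE_TRUNC(event_date, MONTH) = DATE '2025-01-01';
-- Pruning-friendly: simple comparisons directly on the partition column
SELECT * FROM fact_events
WHERE event_date >= DATE '2025-01-01'
  AND event_date < DATE '2025-02-01';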
Quick self-audit snippet
-- Count duplicates by unique key in the target
SELECT unique_key, COUNT(*) c
FROM target
GROUP BY unique_key
HAVING COUNT(*) > 1;
-- Validate pruning: limit to recent partitions
EXPLAIN SELECT * FROM target WHERE event_date BETWEEN CURRENT_DATE - 2 AND CURRENT_DATE;
-- Ensure the plan shows partition pruning / restricted scan.
Exercises
Complete these in a sandbox schema or with small sample tables. They mirror the worked examples above so you can compare your work.
Exercise 1 — Append-only incremental with watermark
- Create a staging_events table with columns: event_id (string), event_time (timestamp), user_id (string), amount (numeric).
- Create a target fact_events partitioned by event_date (DATE).
- Load only rows where event_time >= HWM - 1 day, deduplicate by event_id, and insert.
- Verify pruning by filtering on event_date in your INSERT SELECT.
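Starter DDL for these tables, sketched with BigQuery-style types and partitioning (adjust the types and the PARTITION BY clause to your warehouse):
CREATE TABLE staging_events (
  event_id   STRING,
  event_time TIMESTAMP,
  user_id    STRING,
  amount     NUMERIC
);
-- Target carries an explicit partition column populated at load time
CREATE TABLE fact_events (
  event_id   STRING,
  event_time TIMESTAMP,
  user_id    STRING,
  amount     NUMERIC,
  event_date DATE   -- DATE(event_time), set during the INSERT
)
PARTITION BY event_date;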
Checklist to verify
- Target row count increases only by new events.
- No duplicate event_id exists.
- Scan bytes/rows drop when date filters are applied.
Exercise 2 — Rolling-window MERGE upsert
- Create staging_product with product_id, name, category, price, updated_at.
- Create dim_product partitioned by updated_date.
- Implement MERGE using updated_at HWM - 1 day and dedup by product_id within the batch.
- Re-run with changed rows and confirm only changed rows are updated.
Checklist to verify
- Only products changed since the last run are scanned.
- Idempotent re-run produces no extra changes.
- Plan shows partition pruning on updated_date.
Practical projects
- Project A: Build an incremental events pipeline ingesting 30 days of clickstream data with a 2-day grace window, including a backfill job for a known outage date.
- Project B: Create an incremental dim_customer with a MERGE that updates loyalty_status and price_segment. Add data quality checks for duplicate customer_id and backward HWM movement.
- Project C: Optimize a large fact_sales table by adding a partition column, clustering key, and refactoring queries to ensure pruning.
Learning path
- First: Understand partitioning and basic SQL filtering.
- Next: Implement incremental loads on append-only tables.
- Then: Add MERGE-based upserts for dimension tables.
- Finally: Add backfill workflows and monitor performance (scanned bytes/partitions).
Who this is for
- Analytics Engineers and BI Developers who maintain ELT pipelines.
- Data Engineers improving warehouse performance.
- Analysts owning productionized models.
Prerequisites
- Comfort with SQL (JOINs, window functions, CASE).
- Basic understanding of partitions/clustered tables.
- Familiarity with INSERT/SELECT and MERGE semantics.
Next steps
- Implement one incremental model in your environment today.
- Add a grace window and test a backfill for a single day.
- Run the quick test below to check your mastery.
Mini challenge
You discover a vendor resend for the past 3 days. Update your incremental logic to rebuild only those three partitions, remain idempotent, and ensure no duplicates, all while keeping pruning effective.