Why this matters
As an Analytics Engineer, you often move data from raw sources into clean, analytics-ready tables. Full refreshes can be slow, expensive, and risky. Incremental loads let you process only new or changed rows, keep tables fresh within minutes, and support upserts and deletes—all while staying cost-effective.
- Daily job: Upsert yesterday's order changes into a fact table.
- Near-real-time need: Load the last 15 minutes of events with overlap to handle late data.
- Governance: Track a high-watermark so reruns are safe and idempotent.
Concept explained simply
Incremental loading means you only pull the slice of data that changed since the last successful run (the watermark), deduplicate it, and merge it into the target table.
Mental model
Think of a bookmark in a long book. On each run, you start reading at your bookmark minus a small overlap, re-read a few lines to be safe, then move the bookmark forward. The overlap prevents you from missing words that were printed late.
- Bookmark = high-watermark (e.g., max updated_at processed).
- Overlap window = safety margin (e.g., 10 minutes) to capture late-arriving rows.
- Re-reading = deduping so duplicates don't accumulate.
Core patterns
Timestamp-based incremental (most common)
Filter the source by updated_at >= watermark - overlap. In staging, dedupe by business key, keeping the latest row. Then MERGE into the target.
-- Stage slice with overlap and dedupe
with generate_slice as (
    select *
    from source
    where updated_at >= :last_watermark - interval '10 minute'
),
latest_per_id as (
    select s.*,
           row_number() over (partition by id order by updated_at desc) as rn
    from generate_slice s
)
select * from latest_per_id where rn = 1;
Key-based incremental when timestamps are unreliable
Use a hash/checksum over meaningful columns. If the hash changed for a key, treat it as an update.
select *,
       md5(coalesce(status, '') || '|' || coalesce(amount::text, '')) as row_hash
from daily_extract;
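To act on the hash, compare the freshly computed value against the one stored on the target. A minimal sketch of the change-detection join, assuming the target keeps a row_hash column and the query above is staged as staged_with_hash (a hypothetical name):
-- New keys, or keys whose content changed
select s.id
from staged_with_hash s
left join dim_customers t on t.id = s.id
where t.id is null               -- new key
   or t.row_hash <> s.row_hash;  -- content changed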
CDC (Change Data Capture)
Some sources provide inserts, updates, and deletes (tombstones). Use MERGE to apply them. Ensure ordering by commit timestamp and idempotency.
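A key can change several times within one slice, so collapse the feed to the final state per key before merging. A minimal sketch, assuming a cdc.products feed with a commit_ts column (the same shape as the worked example below):
-- Keep only the last change per key within the slice
with changes as (
    select c.*,
           row_number() over (partition by product_id order by commit_ts desc) as rn
    from cdc.products c
    where commit_ts >= :last_watermark - interval '10 minute'
)
select * from changes where rn = 1;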
Handling deletes
Implement soft deletes (is_deleted flag) or hard deletes via MERGE WHEN MATCHED AND source.is_deleted THEN DELETE. Keep audit columns.
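For the soft-delete variant, mark rows instead of removing them. A minimal sketch, assuming the target carries is_deleted and deleted_at audit columns:
-- Soft delete: flag the row, keep it queryable for audit
merge into analytics.dim_products t
using (
    select product_id, is_deleted, updated_at
    from cdc.products
    where commit_ts >= :last_watermark - interval '10 minute'
) s
on t.product_id = s.product_id
when matched and s.is_deleted = true then
    update set is_deleted = true,
               deleted_at = s.updated_at;  -- deleted_at is an assumed audit column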
Idempotency & reruns
- Use deterministic filters and dedupe logic so rerunning the same slice produces the same result.
- Advance the watermark only after a successful MERGE and validation.
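Both points are easier with a small bookkeeping table. A minimal sketch, assuming one watermark row per pipeline (etl_watermarks is a hypothetical name):
create table if not exists etl_watermarks (
    pipeline_name  text primary key,
    last_watermark timestamp not null
);

-- Advance only after the MERGE and validations succeed
update etl_watermarks
set last_watermark = :new_watermark  -- max change timestamp from the processed slice
where pipeline_name = 'orders_upsert';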
Late-arriving data
- Apply an overlap (e.g., 5–30 minutes) in filters.
- Dedupe with window functions to avoid duplicates from overlap.
Partitioning & performance
- Partition/cluster target tables by date or key to speed MERGE.
- Use narrow slices; avoid scanning entire history.
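A partition predicate in the MERGE join lets the engine skip untouched partitions. A minimal sketch, assuming dim_orders is partitioned by a hypothetical order_date column and staged_orders is the deduped slice; the extra predicate is safe only if every possible match falls inside the pruned range:
merge into analytics.dim_orders t
using (select * from staged_orders) s
on t.id = s.id
   and t.order_date >= :slice_start_date  -- prunes partitions; must cover all matches
when matched then
    update set status = s.status, updated_at = s.updated_at
when not matched then
    insert (id, status, order_date, updated_at)
    values (s.id, s.status, s.order_date, s.updated_at);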
Worked examples
Example 1: Timestamp-based upsert with MERGE
Goal: Upsert orders by id using updated_at as the watermark.
-- Inputs: source.orders_raw(id, status, amount, updated_at)
-- Target: analytics.dim_orders(id primary key, status, amount, updated_at)
merge into analytics.dim_orders t
using (
    -- Most warehouses do not accept a leading WITH before MERGE,
    -- so the staging logic lives inside USING
    select *
    from (
        select r.*,
               row_number() over (partition by id order by updated_at desc) as rn
        from source.orders_raw r
        where updated_at >= :last_watermark - interval '10 minute'
    ) ranked
    where rn = 1
) s
on t.id = s.id
when matched and s.updated_at > t.updated_at then
    update set status = s.status,
               amount = s.amount,
               updated_at = s.updated_at
when not matched then
    insert (id, status, amount, updated_at)
    values (s.id, s.status, s.amount, s.updated_at);
-- After a successful run, set :last_watermark to max(updated_at) from the processed slice
Example 2: Deduplicate by business key
Goal: Keep latest customer record per email, in case of multiple rows in the slice.
with slice as (
select * from landing.customers_daily
where file_date = :run_date
), latest as (
select *, row_number() over(
partition by lower(trim(email))
order by coalesce(updated_at, ingested_at) desc
) rn
from slice
)
select * from latest where rn = 1;
Example 3: Applying deletes via tombstones
Goal: Remove rows marked deleted in CDC.
merge into analytics.dim_products t
using (
select product_id, is_deleted, updated_at, name
from cdc.products
where commit_ts >= :last_watermark - interval '10 minute'
) s
on t.product_id = s.product_id
when matched and s.is_deleted = true then delete
when matched and s.updated_at > t.updated_at and coalesce(s.is_deleted,false) = false then
update set name = s.name, updated_at = s.updated_at
when not matched and coalesce(s.is_deleted,false) = false then
insert (product_id, name, updated_at)
values (s.product_id, s.name, s.updated_at);
Step-by-step implementation
- Choose the change detector: updated_at, CDC commit_ts, or a row hash.
- Define overlap: e.g., 10 minutes to absorb late-arriving and out-of-order rows.
- Stage slice: Filter by watermark - overlap.
- Dedupe: Use row_number() over business key ordered by freshness.
- MERGE: Upsert and delete as needed.
- Validate: Compare counts, spot-check a few keys (see the sketch after this list).
- Advance watermark: Set to max(change detector) from processed slice.
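A minimal post-merge validation sketch to run before advancing the watermark; staged_orders and the expected values are illustrative:
-- 1) The slice was non-empty (expect > 0 on active days)
select count(*) as slice_rows from staged_orders;

-- 2) No duplicate keys leaked into the target (expect 0)
select count(*) as dup_keys
from (select id from analytics.dim_orders group by id having count(*) > 1) d;

-- 3) Freshness moved forward (expect close to now())
select max(updated_at) as target_freshness from analytics.dim_orders;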
Exercises
Exercise 1 — Build a timestamp-based incremental upsert
Source: raw_orders(id INT, status TEXT, amount NUMERIC, updated_at TIMESTAMP)
Target: dim_orders(id INT, status TEXT, amount NUMERIC, updated_at TIMESTAMP)
Given last_watermark, write SQL to:
- Stage a 10-minute-overlap slice.
- Dedupe to the latest row per id.
- MERGE into dim_orders, updating only when source is newer.
Hints
- Use row_number() for dedupe.
- Use WHEN MATCHED and a timestamp comparison in MERGE.
Exercise 2 — Key-based incremental with hash
Source: customers_daily(id INT, email TEXT, name TEXT, city TEXT, ingested_at TIMESTAMP)
Target: dim_customers(id INT, email TEXT, name TEXT, city TEXT, row_hash TEXT)
Timestamps may be unreliable. Write SQL that:
- Computes row_hash from email, name, city.
- Builds a list of changed keys by comparing hashes to target.
- MERGEs only changed rows.
Hints
- Use a null-safe, normalized hash, e.g., md5(coalesce(lower(trim(email)),'') || '|' || coalesce(lower(name),'') || '|' || coalesce(lower(city),'')).
- Left join slice to target on id to detect changes.
Self-check checklist
- Dedupe keeps only the latest row per key in the slice.
- MERGE only updates when source is newer or hash changed.
- Watermark advances to the max processed value.
- Overlap is applied to prevent missed late rows.
Common mistakes and how to self-check
- No overlap: Risk of missing late rows. Fix: use 5–30 minute overlap.
- Updating regardless of recency: Can overwrite with stale data. Fix: compare timestamps or hashes.
- Skipping dedupe: Duplicates from multiple files or overlap. Fix: window function row_number().
- Advancing watermark too early: If MERGE fails, you can lose data. Fix: advance only after validation success.
- Timezone mismatch: Source and warehouse differ. Fix: normalize to UTC before comparisons (see the sketch after this list).
- Large MERGE scans: Slow and costly. Fix: partition/cluster and prefilter source keys.
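For the timezone fix, normalize both sides before comparing. A minimal sketch in Postgres-style syntax, assuming updated_at is a timestamptz and the watermark is stored in UTC (:last_watermark_utc is a hypothetical parameter):
-- Compare in UTC to avoid watermark drift across zones
select *
from source.orders_raw
where (updated_at at time zone 'UTC')
      >= :last_watermark_utc - interval '10 minute';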
Practical projects
- Orders pipeline: Build a daily incremental upsert for orders with 10-minute overlap and a watermark table.
- Customer SCD1: Upsert customer attributes using a row hash and detect changes without reliable updated_at.
- CDC deletes: Apply delete tombstones from a CDC feed to a product dimension.
Who this is for
- Analytics Engineers implementing data models in a warehouse.
- BI Developers needing timely, cost-effective refreshes.
- Data Engineers building reliable ELT workflows.
Prerequisites
- Comfort with SQL joins, window functions, and basic DDL.
- Understanding of primary keys and uniqueness.
- Basic knowledge of your warehouse MERGE syntax.
Learning path
- Window functions for dedupe.
- MERGE patterns for upserts and deletes.
- Watermarks and idempotent scheduling.
- Optimization: partitioning, clustering, and partition pruning.
Next steps
- Parameterize your SQL (watermark and overlap as inputs).
- Add row-count and freshness validations post-merge.
- Document the watermark table and recovery steps.
Mini challenge
Design an incremental load for a web_events table that arrives out-of-order up to 20 minutes and sometimes emits duplicate event_ids. Describe your overlap window, dedupe key(s), and how you would update the watermark.