Why this matters
As an Analytics Engineer, you often move data from raw sources into clean, analytics-ready tables. Full refreshes can be slow, expensive, and risky. Incremental loads let you process only new or changed rows, keep tables fresh within minutes, and support upserts and deletes—all while staying cost-effective.
- Daily job: Upsert yesterday's order changes into a fact table.
- Near-real-time need: Load the last 15 minutes of events with overlap to handle late data.
- Governance: Track a high-watermark so reruns are safe and idempotent.
Concept explained simply
Incremental loading means you only pull the slice of data that changed since the last successful run (the watermark), deduplicate it, and merge it into the target table.
Mental model
Think of a bookmark in a long book. On each run, you start reading at your bookmark minus a small overlap, re-read a few lines to be safe, then move the bookmark forward. The overlap prevents you from missing words that were printed late.
- Bookmark = high-watermark (e.g., max updated_at processed).
- Overlap window = safety margin (e.g., 10 minutes) to capture late-arriving rows.
- Re-reading = deduping so duplicates don't accumulate.
Core patterns
Timestamp-based incremental (most common)
Filter the source by updated_at >= watermark - overlap. In staging, dedupe by business key, keeping the latest row. Then MERGE into the target.
-- Stage slice with overlap and dedupe
with generate_slice as (
    select *
    from source
    where updated_at >= :last_watermark - interval '10 minute'
),
latest_per_id as (
    select s.*,
           row_number() over (partition by id order by updated_at desc) as rn
    from generate_slice s
)
select * from latest_per_id where rn = 1;
Key-based incremental when timestamps are unreliable
Use a hash/checksum over meaningful columns. If the hash changed for a key, treat it as an update.
select *,
       md5(coalesce(status, '') || '|' || coalesce(amount::text, '')) as row_hash
from daily_extract;
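To act on the hash, compare the freshly computed value against the one stored on the target. A minimal sketch of the change-detection join, assuming the target keeps a row_hash column and the query above is staged as staged_with_hash (a hypothetical name):
-- New keys, or keys whose content changed
select s.id
from staged_with_hash s
left join dim_customers t on t.id = s.id
where t.id is null               -- new key
   or t.row_hash <> s.row_hash;  -- content changed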
CDC (Change Data Capture)
Some sources provide inserts, updates, and deletes (tombstones). Use MERGE to apply them. Ensure ordering by commit timestamp and idempotency.
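A key can change several times within one slice, so collapse the feed to the final state per key before merging. A minimal sketch, assuming a cdc.products feed with a commit_ts column (the same shape as the worked example below):
-- Keep only the last change per key within the slice
with changes as (
    select c.*,
           row_number() over (partition by product_id order by commit_ts desc) as rn
    from cdc.products c
    where commit_ts >= :last_watermark - interval '10 minute'
)
select * from changes where rn = 1;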
Handling deletes
Implement soft deletes (is_deleted flag) or hard deletes via MERGE WHEN MATCHED AND source.is_deleted THEN DELETE. Keep audit columns.
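For the soft-delete variant, mark rows instead of removing them. A minimal sketch, assuming the target carries is_deleted and deleted_at audit columns:
-- Soft delete: flag the row, keep it queryable for audit
merge into analytics.dim_products t
using (
    select product_id, is_deleted, updated_at
    from cdc.products
    where commit_ts >= :last_watermark - interval '10 minute'
) s
on t.product_id = s.product_id
when matched and s.is_deleted = true then
    update set is_deleted = true,
               deleted_at = s.updated_at;  -- deleted_at is an assumed audit column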
Idempotency & reruns
- Use deterministic filters and dedupe logic so rerunning the same slice produces the same result.
- Advance the watermark only after a successful MERGE and validation.
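Both points are easier with a small bookkeeping table. A minimal sketch, assuming one watermark row per pipeline (etl_watermarks is a hypothetical name):
create table if not exists etl_watermarks (
    pipeline_name  text primary key,
    last_watermark timestamp not null
);

-- Advance only after the MERGE and validations succeed
update etl_watermarks
set last_watermark = :new_watermark  -- max change timestamp from the processed slice
where pipeline_name = 'orders_upsert';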
Late-arriving data
- Apply an overlap (e.g., 5–30 minutes) in filters.
- Dedupe with window functions to avoid duplicates from overlap.
Partitioning & performance
- Partition/cluster target tables by date or key to speed MERGE.
- Use narrow slices; avoid scanning entire history.
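A partition predicate in the MERGE join lets the engine skip untouched partitions. A minimal sketch, assuming dim_orders is partitioned by a hypothetical order_date column and staged_orders is the deduped slice; the extra predicate is safe only if every possible match falls inside the pruned range:
merge into analytics.dim_orders t
using (select * from staged_orders) s
on t.id = s.id
   and t.order_date >= :slice_start_date  -- prunes partitions; must cover all matches
when matched then
    update set status = s.status, updated_at = s.updated_at
when not matched then
    insert (id, status, order_date, updated_at)
    values (s.id, s.status, s.order_date, s.updated_at);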
Worked examples
Example 1: Timestamp-based upsert with MERGE
Goal: Upsert orders by id using updated_at as the watermark.
-- Inputs: source.orders_raw(id, status, amount, updated_at)
-- Target: analytics.dim_orders(id primary key, status, amount, updated_at)
merge into analytics.dim_orders t
using (
    -- Most warehouses do not accept a leading WITH before MERGE,
    -- so the staging logic lives inside USING
    select *
    from (
        select r.*,
               row_number() over (partition by id order by updated_at desc) as rn
        from source.orders_raw r
        where updated_at >= :last_watermark - interval '10 minute'
    ) ranked
    where rn = 1
) s
on t.id = s.id
when matched and s.updated_at > t.updated_at then
    update set status = s.status,
               amount = s.amount,
               updated_at = s.updated_at
when not matched then
    insert (id, status, amount, updated_at)
    values (s.id, s.status, s.amount, s.updated_at);
-- After a successful run, set :last_watermark to max(updated_at) from the processed slice
Example 2: Deduplicate by business key
Goal: Keep latest customer record per email, in case of multiple rows in the slice.
with slice as (
select * from landing.customers_daily
where file_date = :run_date
), latest as (
select *, row_number() over(
partition by lower(trim(email))
order by coalesce(updated_at, ingested_at) desc
) rn
from slice
)
select * from latest where rn = 1;
Example 3: Applying deletes via tombstones
Goal: Remove rows marked deleted in CDC.
merge into analytics.dim_products t
using (
select product_id, is_deleted, updated_at, name
from cdc.products
where commit_ts >= :last_watermark - interval '10 minute'
) s
on t.product_id = s.product_id
when matched and s.is_deleted = true then delete
when matched and s.updated_at > t.updated_at and coalesce(s.is_deleted,false) = false then
update set name = s.name, updated_at = s.updated_at
when not matched and coalesce(s.is_deleted,false) = false then
insert (product_id, name, updated_at)
values (s.product_id, s.name, s.updated_at);
Step-by-step implementation
- Choose the change detector: updated_at, CDC commit_ts, or a row hash.
- Define overlap: e.g., 10 minutes to absorb late-arriving and out-of-order rows.
- Stage slice: Filter by watermark - overlap.
- Dedupe: Use row_number() over business key ordered by freshness.
- MERGE: Upsert and delete as needed.
- Validate: Compare counts, spot-check a few keys (see the sketch after this list).
- Advance watermark: Set to max(change detector) from processed slice.
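A minimal post-merge validation sketch to run before advancing the watermark; staged_orders and the expected values are illustrative:
-- 1) The slice was non-empty (expect > 0 on active days)
select count(*) as slice_rows from staged_orders;

-- 2) No duplicate keys leaked into the target (expect 0)
select count(*) as dup_keys
from (select id from analytics.dim_orders group by id having count(*) > 1) d;

-- 3) Freshness moved forward (expect close to now())
select max(updated_at) as target_freshness from analytics.dim_orders;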
Exercises
Exercise 1 — Build a timestamp-based incremental upsert
Source: raw_orders(id INT, status TEXT, amount NUMERIC, updated_at TIMESTAMP)
Target: dim_orders(id INT, status TEXT, amount NUMERIC, updated_at TIMESTAMP)
Given last_watermark, write SQL to:
- Stage a 10-minute-overlap slice.
- Dedupe to the latest row per id.
- MERGE into dim_orders, updating only when source is newer.
Hints
- Use row_number() for dedupe.
- Use WHEN MATCHED and a timestamp comparison in MERGE.
Exercise 2 — Key-based incremental with hash
Source: customers_daily(id INT, email TEXT, name TEXT, city TEXT, ingested_at TIMESTAMP)
Target: dim_customers(id INT, email TEXT, name TEXT, city TEXT, row_hash TEXT)
Timestamps may be unreliable. Write SQL that:
- Computes row_hash from email, name, city.
- Builds a list of changed keys by comparing hashes to target.
- MERGEs only changed rows.
Hints
- Use a null-safe, normalized hash, e.g., md5(coalesce(lower(trim(email)),'') || '|' || coalesce(lower(name),'') || '|' || coalesce(lower(city),'')).
- Left join slice to target on id to detect changes.
Self-check checklist
- Dedupe keeps only the latest row per key in the slice.
- MERGE only updates when source is newer or hash changed.
- Watermark advances to the max processed value.
- Overlap is applied to prevent missed late rows.
Common mistakes and how to self-check
- No overlap: Risk of missing late rows. Fix: use 5–30 minute overlap.
- Updating regardless of recency: Can overwrite with stale data. Fix: compare timestamps or hashes.
- Skipping dedupe: Duplicates from multiple files or overlap. Fix: window function row_number().
- Advancing watermark too early: If MERGE fails, you can lose data. Fix: advance only after validation success.
- Timezone mismatch: Source and warehouse differ. Fix: normalize to UTC before comparisons (see the sketch after this list).
- Large MERGE scans: Slow and costly. Fix: partition/cluster and prefilter source keys.
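For the timezone fix, normalize both sides before comparing. A minimal sketch in Postgres-style syntax, assuming updated_at is a timestamptz and the watermark is stored in UTC (:last_watermark_utc is a hypothetical parameter):
-- Compare in UTC to avoid watermark drift across zones
select *
from source.orders_raw
where (updated_at at time zone 'UTC')
      >= :last_watermark_utc - interval '10 minute';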
Practical projects
- Orders pipeline: Build a daily incremental upsert for orders with 10-minute overlap and a watermark table.
- Customer SCD1: Upsert customer attributes using a row hash and detect changes without reliable updated_at.
- CDC deletes: Apply delete tombstones from a CDC feed to a product dimension.
Who this is for
- Analytics Engineers implementing data models in a warehouse.
- BI Developers needing timely, cost-effective refreshes.
- Data Engineers building reliable ELT workflows.
Prerequisites
- Comfort with SQL joins, window functions, and basic DDL.
- Understanding of primary keys and uniqueness.
- Basic knowledge of your warehouse MERGE syntax.
Learning path
- Window functions for dedupe.
- MERGE patterns for upserts and deletes.
- Watermarks and idempotent scheduling.
- Optimization: partitioning, clustering, and partition pruning.
Next steps
- Parameterize your SQL (watermark and overlap as inputs).
- Add row-count and freshness validations post-merge.
- Document the watermark table and recovery steps.
Mini challenge
Design an incremental load for a web_events table that arrives out-of-order up to 20 minutes and sometimes emits duplicate event_ids. Describe your overlap window, dedupe key(s), and how you would update the watermark.