Why this matters
Incremental loads let you move only what changed since the last run. That makes pipelines faster, cheaper, and more reliable. Watermarks are the simple, durable way to remember how far you got.
- Keep data fresh on tight SLAs without reloading everything.
- Reduce cloud costs and API rate usage.
- Recover gracefully from failures by resuming from the last checkpoint.
Concept explained simply
Incremental load = process only new or changed records since the last successful run. A watermark is a stored checkpoint, like the maximum updated_at, last processed ID, or last page token.
Key definition
Watermark: a persisted value representing the highest change you have safely processed (e.g., max event time, max sequence number, or a durable token). After each successful run, you update it.
Mental model
Imagine a river rising. Your watermark is the paint mark on a pier showing the highest level reached. On the next inspection, you only need to look above that mark. To be safe, you might glance a little below the mark (a lookback window) to catch any late splashes.
Core patterns you will use
- Timestamp-based: query WHERE updated_at > last_watermark (often with a small overlap/lookback).
- Sequence/ID-based: use monotonically increasing IDs or change versions.
- CDC-based: read change streams (insert/update/delete) and commit offsets.
- File-based: track last processed partition/key and reconcile late files with a lookback over recent partitions.
- Upserts (MERGE): idempotent writes so reprocessing doesn’t duplicate data.
- Deduplication: use a stable key (e.g., source primary key + change version).
- Deletes: handle tombstones or soft-delete flags; if not available, occasional full sync + diff.
- Watermark storage: durable place (metadata table, checkpoint file) updated only after a successful load.
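To make the watermark-storage pattern concrete, here is a minimal sketch using SQLite as the metadata store. The control_watermarks table and both function names are illustrative assumptions, not part of any particular framework; the key point is that save_watermark is called only after the load fully succeeds.

```python
import sqlite3

def get_watermark(conn, pipeline, default=None):
    # Read the last committed watermark for a pipeline, if any.
    row = conn.execute(
        "SELECT value FROM control_watermarks WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    return row[0] if row else default

def save_watermark(conn, pipeline, value):
    # Upsert the watermark; call this only after the load has fully succeeded.
    conn.execute(
        "INSERT INTO control_watermarks (pipeline, value) VALUES (?, ?) "
        "ON CONFLICT(pipeline) DO UPDATE SET value = excluded.value",
        (pipeline, value),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE control_watermarks (pipeline TEXT PRIMARY KEY, value TEXT)")
print(get_watermark(conn, "orders", default="1970-01-01 00:00:00"))
save_watermark(conn, "orders", "2025-06-01 10:00:00")
print(get_watermark(conn, "orders"))
```

In production the same shape works with any transactional store; what matters is that reads fall back to a safe default on the very first run and that writes are atomic upserts.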
Worked examples
1) SQL source with updated_at
Goal: Load only rows changed since the last run, with idempotent upserts.
-- Assume the watermark is stored in a control.watermarks table
-- last_watermark = '2025-06-01 10:00:00'
-- Use a 5-minute lookback to catch late updates
WITH src AS (
  SELECT *
  FROM app.public.orders
  WHERE updated_at > (CAST(:last_watermark AS TIMESTAMP) - INTERVAL '5 minutes')
),
ranked AS (
  -- Deduplicate by primary key, keeping the latest version
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM src
)
MERGE INTO dw.orders t
USING (
  SELECT * FROM ranked WHERE rn = 1
) s
ON (t.order_id = s.order_id)
WHEN MATCHED THEN UPDATE SET
  status = s.status,
  amount = s.amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, amount, updated_at)
VALUES (s.order_id, s.status, s.amount, s.updated_at);
-- After a fully successful load, compute new_watermark
-- new_watermark = MAX(s.updated_at) over the processed set
-- Persist it to control.watermarks
Notes:
- Lookback (overlap) protects against clock skew or slight delays.
- Dedup + MERGE ensures idempotency.
- Watermark is only advanced after success.
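The lookback and advance-after-success logic around the MERGE can be sketched in Python. The helper names effective_lower_bound and new_watermark, and the dict row shape, are illustrative assumptions; the rule they encode is from the notes above: query with an overlap, and never move the watermark backwards or on an empty run.

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(minutes=5)

def effective_lower_bound(last_watermark: datetime) -> datetime:
    # Query rows with updated_at strictly greater than this bound.
    return last_watermark - LOOKBACK

def new_watermark(processed_rows, last_watermark: datetime) -> datetime:
    # Advance only if we actually processed newer rows; never move backwards.
    times = [r["updated_at"] for r in processed_rows]
    return max(times, default=last_watermark)

last = datetime(2025, 6, 1, 10, 0, 0)
rows = [
    {"order_id": 1, "updated_at": datetime(2025, 6, 1, 10, 3)},
    {"order_id": 2, "updated_at": datetime(2025, 6, 1, 9, 58)},  # caught by lookback
]
print(effective_lower_bound(last))  # 2025-06-01 09:55:00
print(new_watermark(rows, last))    # 2025-06-01 10:03:00
```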
2) API with page tokens and event IDs
Goal: Pull new events incrementally without missing or duplicating.
# Pseudocode
state = load_state()  # contains last_event_id and/or page_token
collected = []
while True:
    resp = api.list_events(since_id=state.last_event_id, page_token=state.page_token)
    collected.extend(resp.items)
    if resp.next_page_token is None:
        break
    state.page_token = resp.next_page_token
    save_state(state)  # persist the in-progress page token so a crash can resume
# Deduplicate by event_id and keep the latest version if re-delivered
upsert(collected, key='event_id', version_field='updated_at')
# Advance the watermark only after the upsert succeeds;
# keep the old ID if this run returned nothing
new_last_id = max_event_id(collected) if collected else state.last_event_id
save_state(last_event_id=new_last_id, page_token=None)
- Prefer stable IDs over timestamps if available.
- Store both last_event_id and last successful page_token during pagination.
- Advance only after successful write.
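The last-write-wins deduplication step above can be sketched as a small Python helper. The function name, field names, and integer versions are illustrative assumptions; any monotonically increasing version field (such as updated_at) works the same way.

```python
def deduplicate(events, key="event_id", version="updated_at"):
    # Last-write-wins: keep the highest version per key so replays are harmless.
    latest = {}
    for e in events:
        k = e[key]
        if k not in latest or e[version] > latest[k][version]:
            latest[k] = e
    return list(latest.values())

events = [
    {"event_id": 7, "updated_at": 1, "status": "created"},
    {"event_id": 7, "updated_at": 2, "status": "paid"},    # re-delivered, newer
    {"event_id": 8, "updated_at": 1, "status": "created"},
]
result = deduplicate(events)
print(sorted((e["event_id"], e["status"]) for e in result))
# [(7, 'paid'), (8, 'created')]
```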
3) Files in object storage (daily partitions)
Goal: Incrementally process daily folders, handling late-arriving files.
# Example layout: s3://bucket/sales/dt=2025-06-01/part-000.parquet
state = load_state()  # last_processed_partition = '2025-05-29'
lookback_days = 2
partitions = list_partitions(prefix='sales/')  # e.g., sorted dates
# Resume at the first unprocessed partition, but never later than
# today - lookback, so late files in recent partitions are re-scanned
start_from = min(next_day(state.last_processed_partition), today - lookback_days)
for p in [p for p in partitions if p >= start_from]:
    for f in list_files(p):
        df = read_parquet(f)
        df = deduplicate(df, keys=['order_id'], version='updated_at')
        upsert_into_dw(df, key='order_id', version='updated_at')
# Advance the watermark only up to the last fully processed partition
save_state(last_processed_partition=max_processed_partition)
- Use a small partition lookback to ingest late files safely.
- Upserts make reprocessing safe.
- Consider a manifest of processed files to avoid re-reading large partitions.
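The start-partition computation in this example can be sketched concretely. The start_partition helper is a hypothetical name, and it assumes one partition per day; the rule is to resume at the first unprocessed day but never later than today minus the lookback.

```python
from datetime import date, timedelta

def start_partition(last_processed: date, today: date, lookback_days: int = 2) -> date:
    # Resume at the first unprocessed partition, but never later than
    # today - lookback, so late files landing in recent partitions
    # are re-scanned (upserts make the re-read safe).
    next_unprocessed = last_processed + timedelta(days=1)
    return min(next_unprocessed, today - timedelta(days=lookback_days))

# Last processed 2025-05-29, today 2025-06-01, lookback 2 days:
print(start_partition(date(2025, 5, 29), date(2025, 6, 1)))  # 2025-05-30
```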
Exercises
Do these now.
1) Design an incremental SQL load with watermarks.
Task
Source table: customers(id, email, status, updated_at). Destination: dw.customers. Use a 10-minute lookback, deduplicate by id, and write an idempotent MERGE. Describe how and when you store/update the watermark.
2) File-based lookback plan.
Task
You receive hourly CSVs in /landing/events/dt=YYYY-MM-DD/hr=HH/. Create a step-by-step plan to process incrementally with a 3-hour lookback, deduplicate by event_id, and safely advance the watermark.
Self-check checklist
- [ ] Watermark is durable and updated only after a successful run.
- [ ] There is a lookback/overlap strategy to capture late data.
- [ ] Writes are idempotent (MERGE/upsert) and deduplicated.
- [ ] Deletes are considered (tombstones, soft-deletes, or periodic full compare).
- [ ] Failure recovery resumes from the last saved watermark.
Common mistakes and how to self-check
- No overlap window. Fix: add small lookback by time or partitions, then deduplicate.
- Advancing watermark too early. Fix: update checkpoint only after successful write/commit.
- Assuming timestamps are monotonic. Fix: prefer sequence IDs or add dedup + lookback for clock skew.
- No deduplication. Fix: keep stable keys and last-write-wins logic.
- Ignoring deletes. Fix: process CDC tombstones or schedule periodic reconciliation.
- Single point of failure for state. Fix: store watermarks in reliable metadata storage with concurrency control.
Quick self-audit
- [ ] Can I rerun yesterday’s job without creating duplicates?
- [ ] If the job fails mid-run, do I know exactly where to resume?
- [ ] Can I explain how late-arriving events are captured?
Practical projects
- Build an incremental pipeline from a sample SQL DB to a warehouse using MERGE and a watermark table.
- Create a file-based ingestion flow with a 2-day lookback and a processed-file manifest.
- Prototype a CDC consumer that commits offsets and supports replay without duplication.
Mini challenge
Your API sometimes replays the last 100 events on each call. In one paragraph, describe how you will guarantee no duplicates and never miss an event. Mention your watermark, dedup logic, and when you advance the watermark.
Hint
Use the last processed event ID as the watermark, always request since last_id minus a small buffer (overlap), dedup by event_id, and advance the watermark only after a successful upsert.
Who this is for
- Aspiring and practicing Data Engineers building reliable ingestion pipelines.
- Analytics Engineers who need faster, cheaper data loads.
- Developers designing data movement between systems.
Prerequisites
- Comfort with SQL (SELECT, JOIN, and MERGE/UPSERT).
- Basic understanding of batch vs. streaming loads.
- Familiarity with file formats (CSV/Parquet) and object storage.
Learning path
- Review incremental vs. full loads and why watermarks matter.
- Practice timestamp- and ID-based patterns with lookbacks.
- Add idempotent writes (MERGE/upsert) and dedup.
- Handle deletes and late-arriving data.
- Harden state management and recovery.
Next steps
- Apply incremental loads to one real source today (even a small table).
- Introduce a lookback in your current pipeline and measure correctness.
- Document where your watermark lives and how recovery works.
Ready for the Quick Test?
Take the Quick Test below.