
Incremental Loads And Watermarks

Learn Incremental Loads And Watermarks for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

Incremental loads move only new or changed data instead of the entire dataset. This saves time and money and reduces risk. Watermarks track how far you have processed, so your jobs run fast, resume safely after failures, and avoid duplicate records.

  • Daily jobs: Move only rows changed since yesterday.
  • Streaming and CDC: Resume from the last processed offset/sequence.
  • File lakes: Pick up only new or late-arriving files.

Who this is for

  • ETL/ELT developers setting up reliable batch or near-real-time pipelines.
  • Data engineers maintaining data lakes/warehouses.
  • Analysts/engineers optimizing refresh times for dashboards.

Prerequisites

  • Comfort with SQL (SELECT, JOIN, GROUP BY, MERGE/UPSERT).
  • Basic ETL/ELT concepts: staging, target schemas, idempotency.
  • Familiarity with timestamps, sequences, or offsets in data sources.

Concept explained simply

A watermark is a marker of how far you have processed. After each load, you store the highest processed point (like the latest updated_at timestamp, a max ID, or a CDC offset). Next time, you only read data greater than this stored value. If a job fails, you still know the last confirmed point and can resume safely.
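
In SQL terms, the next run is a single filtered read past the stored value (a minimal sketch; the table, column, and :last_hwm parameter are illustrative):
-- Read only rows past the stored watermark
SELECT *
FROM src.events
WHERE updated_at > :last_hwm;  -- :last_hwm comes from the control table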

Mental model

Think of data as a conveyor belt. The watermark is a sticky note that says "we inspected up to here." During the next run, you start just after that sticky note, inspect a new section, and move the sticky note forward to the furthest confirmed position.

Key terms

  • High-water mark (HWM): The maximum processed value (timestamp, sequence, offset).
  • Low-water mark: The minimum value in a processing window (used less commonly).
  • Idempotency: Re-running a job produces the same final result.
  • Late-arriving data: Records that belong to earlier periods but arrive after the initial load.
  • Lookback window: A small overlap period to capture late arrivals safely.

Design choices

Choosing a watermark field
  • Timestamps (e.g., updated_at): Easy and common; ensure timezone consistency (normalize to UTC).
  • Monotonic IDs/sequences: Reliable if values are strictly increasing; gaps are harmless, but out-of-order commits can cause missed rows.
  • CDC offsets or log positions: Best for change streams; track per-partition offsets.
  • File systems: Use folder date, file modified time, or a manifest index.
Where to store watermarks
  • Control table (preferred): e.g., watermark(state_key, value, updated_on).
  • Metadata store: Same database as target warehouse or a durable key-value store.
  • Store per source + per partition if needed (e.g., Kafka partitions).
Handling late data and duplicates
  • Add a small lookback window (e.g., 5–60 minutes) when reading new data.
  • Use deduplication keys (natural key + last_update) in staging before MERGE (see the sketch after this list).
  • Design MERGE/UPSERT to be idempotent.
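A common staging dedup keeps only the newest version per business key using ROW_NUMBER (a sketch; stage_orders and its columns are illustrative):
-- Keep only the newest row per business key before the MERGE
WITH ranked AS (
  SELECT s.*,
         ROW_NUMBER() OVER (PARTITION BY s.order_id
                            ORDER BY s.updated_at DESC) AS rn
  FROM stage_orders s
)
SELECT * FROM ranked WHERE rn = 1;  -- drop rn before or during the MERGE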
Timezones and clock skew
  • Convert all times to UTC at ingestion (sketched below).
  • If sources have skew, rely on server-time or CDC sequence instead of client-time.
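For example, a local-time column can be normalized during extraction (a sketch using AT TIME ZONE; the exact syntax and the source zone vary by engine and source):
-- Interpret local time in its source zone, then express it in UTC
SELECT order_id,
       updated_at AT TIME ZONE 'America/New_York' AT TIME ZONE 'UTC' AS updated_at_utc
FROM src.orders;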
Deletes
  • Soft deletes: propagate a deleted flag and update targets via MERGE.
  • Hard deletes: require CDC with tombstones or periodic anti-join reconciliation (sketched below).
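For hard deletes without CDC, a periodic reconciliation can anti-join current source keys against the target (a sketch; src_keys is an assumed staged extract of the live source keys):
-- Soft-flag target rows whose keys have disappeared from the source
UPDATE dim_orders t
SET is_deleted = TRUE
WHERE NOT EXISTS (
  SELECT 1
  FROM src_keys s
  WHERE s.order_id = t.order_id
);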

Worked examples

Example 1: Database table using updated_at

  1. Control table stores last_hwm = 2025-03-01T00:00:00Z.
  2. Extract WHERE updated_at >= (last_hwm - lookback 10m).
  3. Stage deduplicated rows by (id, updated_at) choosing the latest.
  4. MERGE into target by business key; insert new, update changed.
  5. Set new_hwm to the MAX(updated_at) actually loaded; commit both the MERGE and the watermark update in one transaction when possible (see the sketch after the SQL below).
-- Read delta with lookback
SELECT *
FROM src.orders
WHERE updated_at >= TIMESTAMP '2025-02-28 23:50:00+00';

-- Update watermark after success
UPDATE ctl.watermark
SET value = '2025-03-01T10:15:00Z'
WHERE state_key = 'orders.updated_at';
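
Step 5's "one transaction" can look like this where the warehouse supports multi-statement transactions (a minimal sketch with an abbreviated MERGE; otherwise wrap both steps in the orchestrator's unit of work):
BEGIN;

-- 1) Apply the idempotent MERGE (full template under Patterns below)
MERGE INTO dim_orders AS tgt
USING stage_orders AS src
  ON tgt.order_id = src.order_id
WHEN MATCHED THEN
  UPDATE SET updated_at = src.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, updated_at) VALUES (src.order_id, src.updated_at);

-- 2) Advance the watermark inside the same transaction
UPDATE ctl.watermark
SET value = (SELECT CAST(MAX(updated_at) AS VARCHAR) FROM stage_orders)
WHERE state_key = 'orders.updated_at';

COMMIT;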

Example 2: Object storage files by partition date

  1. Watermark holds last processed partition_date: 2025-02-15.
  2. Discover files in partitions >= 2025-02-14 (the watermark minus a 1-day lookback).
  3. Load new files; dedup by file name or internal record key to avoid repeats.
  4. Advance watermark to the max loaded partition_date and persist a processed manifest.
# Pseudocode (Python-style; 'start' replaces 'from', which is a reserved word)
candidates = list_partitions(base='s3://lake/sales/', start=wm_date - timedelta(days=1))
for p in sorted(candidates):
  for f in list_files(p):
    if f not in processed_manifest:
      load_file_to_stage(f)
      processed_manifest.add(f)
wm_date = max_partition_loaded  # highest partition_date actually loaded
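
One possible shape for the processed-file manifest from step 3, assuming it lives beside the watermark control table:
CREATE TABLE IF NOT EXISTS ctl_processed_files (
  file_path      VARCHAR PRIMARY KEY,   -- full object key under s3://lake/sales/
  partition_date DATE NOT NULL,
  loaded_on      TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);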

Example 3: Kafka CDC with per-partition offsets

  1. Watermark table stores offsets per partition: topic=customers_cdc, partition=0..N.
  2. On start, seek each partition to stored offset.
  3. Process events; for upserts, keep latest by primary key and commit in micro-batches.
  4. After each micro-batch, atomically persist the new offsets for each partition (sketched after the loop below).
# Offsets table (state keys follow the control-table convention below)
# state_key: 'kafka.customers_cdc.p{partition}' -> last_committed_offset

# Processing loop
for batch in read_stream():
  apply_merge(batch)
  commit_offsets(batch.max_offsets)
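
Persisting offsets atomically (step 4) can be a single transaction over the control table (a sketch; the offset values are illustrative):
BEGIN;
UPDATE ctl_watermark SET value_str = '123456790' WHERE state_key = 'kafka.customers_cdc.p0';
UPDATE ctl_watermark SET value_str = '987654322' WHERE state_key = 'kafka.customers_cdc.p1';
COMMIT;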

Patterns and practical snippets

Idempotent MERGE template (SQL)
MERGE INTO dim_orders AS tgt
USING (SELECT * FROM stage_orders) AS src
  ON tgt.order_id = src.order_id
WHEN MATCHED AND (tgt.hash != src.hash OR tgt.is_deleted != src.is_deleted) THEN
  UPDATE SET amount = src.amount,
             status = src.status,
             is_deleted = src.is_deleted,
             updated_at = src.updated_at,
             hash = src.hash
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, status, is_deleted, created_at, updated_at, hash)
  VALUES (src.order_id, src.amount, src.status, src.is_deleted, src.created_at, src.updated_at, src.hash);
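The hash column is typically computed in staging over the change-relevant fields (a sketch assuming the engine provides MD5; any stable hash works):
-- Populate a change-detection hash before the MERGE;
-- wrap nullable fields in COALESCE so a NULL does not null the whole hash
UPDATE stage_orders
SET hash = MD5(CAST(amount AS VARCHAR) || '|' || status || '|' || CAST(is_deleted AS VARCHAR));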
Watermark control table
CREATE TABLE IF NOT EXISTS ctl_watermark (
  state_key VARCHAR PRIMARY KEY,
  value_str VARCHAR NOT NULL,
  updated_on TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Examples:
-- 'orders.updated_at' -> '2025-03-01T10:15:00Z'
-- 'kafka.customers_cdc.p0' -> '123456789'
Late-arriving lookback
-- For timestamp-based HWM, overlap by lookback_minutes
WITH params AS (
  SELECT (SELECT value_str FROM ctl_watermark WHERE state_key = 'orders.updated_at') AS last_hwm
)
SELECT *
FROM src.orders o, params p
WHERE o.updated_at >= (CAST(p.last_hwm AS TIMESTAMP) - INTERVAL '10 minutes');

Build-right checklist

  • [ ] Choose a reliable watermark field or offset.
  • [ ] Normalize timestamps to UTC.
  • [ ] Add a lookback window appropriate for source latency.
  • [ ] Deduplicate in staging before MERGE.
  • [ ] Make MERGE idempotent and key-driven.
  • [ ] Store and update watermarks atomically with load success.
  • [ ] Track per-partition offsets for streaming.
  • [ ] Monitor for staleness and late-arrival rates.

Exercises

Do these hands-on tasks. The same exercises are available below with solution reveals. Quick Test is available to everyone; only logged-in users get saved progress.

  1. Exercise 1 — Database incremental MERGE with watermark
    Design an incremental load for table orders using updated_at as the watermark. Create: (a) control table DDL, (b) SELECT to read delta with a 10-minute lookback, (c) idempotent MERGE, and (d) statement to advance the watermark to the max processed updated_at. Include a short note on transaction boundaries.
  2. Exercise 2 — File-based watermark and manifest
    Given daily partitions in a data lake (sales/partition_date=YYYY-MM-DD/...), outline pseudocode to: (a) read last processed partition from a control table, (b) discover new partitions with a 1-day lookback, (c) load only unprocessed files using a manifest, (d) update the watermark.

Common mistakes and self-check

  • Mistake: Using local time instead of UTC. Self-check: Do your queries convert to UTC before comparison?
  • Mistake: No lookback; missing late data. Self-check: Can a record arriving a few minutes late still be captured?
  • Mistake: Advancing watermark before commit. Self-check: Are MERGE and watermark update in the same atomic unit?
  • Mistake: Dedup after MERGE. Self-check: Do you deduplicate in staging by natural key + last_update?
  • Mistake: Single global watermark for multi-partition streams. Self-check: Do you track offsets per partition?
  • Mistake: Ignoring deletes. Self-check: Do you propagate deleted flags or handle tombstones?

Practical projects

  • Implement an orders incremental pipeline (DB to warehouse) with a 15-minute lookback and idempotent MERGE. Add monitoring on rows inserted/updated and watermark value.
  • Build a file-lake loader that reads partitioned sales data, keeps a processed-file manifest, and updates a partition watermark.
  • Create a CDC consumer that tracks per-partition offsets and performs upsert + soft-delete handling into a dimension table.

Learning path

  • Before this: Data modeling basics and staging patterns.
  • Now: Incremental loads and watermarks (this page).
  • Next: Slowly Changing Dimensions (SCD) and change propagation; CDC ingestion patterns; Retry and backoff strategies.

Next steps

  • Complete the two exercises and verify with the solutions.
  • Take the Quick Test to confirm understanding (progress saved for logged-in users).
  • Apply the pattern in a small project with real timings and monitoring.

Mini challenge

Your source has updated_at in local time and sometimes arrives late by 20 minutes. Propose exact SQL (or pseudocode) to normalize to UTC, apply a 30-minute lookback, deduplicate, MERGE, and update the watermark. State how you ensure idempotency and atomicity.

Practice Exercises

2 exercises to complete

Instructions

Design an incremental load for table orders using updated_at as the watermark.

  1. Create a control table DDL to store the watermark (state_key, value, updated_on).
  2. Write a SELECT that reads only changed rows since the last watermark, using a 10-minute lookback.
  3. Write an idempotent MERGE into dim_orders keyed by order_id with a hash or field comparison to detect changes; handle soft-deletes if present.
  4. Advance the watermark to the maximum processed updated_at. Ensure watermark update is atomic with the MERGE (transaction or unit-of-work).
Expected Output
A runnable set of SQL statements: CREATE TABLE for watermark; SELECT with lookback; MERGE that inserts/updates correctly; UPDATE to watermark with the new max updated_at value.

Incremental Loads And Watermarks — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
