Why this matters
Incremental loads let you move only what changed since the last run. That makes pipelines faster, cheaper, and more reliable. Watermarks are the simple, durable way to remember how far you got.
- Keep data fresh on tight SLAs without reloading everything.
- Reduce cloud costs and API rate usage.
- Recover gracefully from failures by resuming from the last checkpoint.
Concept explained simply
Incremental load = process only new or changed records since the last successful run. A watermark is a stored checkpoint, like the maximum updated_at, last processed ID, or last page token.
Key definition
Watermark: a persisted value representing the highest change you have safely processed (e.g., max event time, max sequence number, or a durable token). After each successful run, you update it.
Mental model
Imagine a river rising. Your watermark is the paint mark on a pier showing the highest level reached. On the next inspection, you only need to look above that mark. To be safe, you might glance a little below the mark (a lookback window) to catch any late splashes.
Core patterns you will use
- Timestamp-based: query WHERE updated_at > last_watermark (often with a small overlap/lookback).
- Sequence/ID-based: use monotonically increasing IDs or change versions.
- CDC-based: read change streams (insert/update/delete) and commit offsets.
- File-based: track last processed partition/key and reconcile late files with a lookback over recent partitions.
- Upserts (MERGE): idempotent writes so reprocessing doesn’t duplicate data.
- Deduplication: use a stable key (e.g., source primary key + change version).
- Deletes: handle tombstones or soft-delete flags; if not available, occasional full sync + diff.
- Watermark storage: durable place (metadata table, checkpoint file) updated only after a successful load.
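To make the watermark-storage pattern concrete, here is a minimal sketch using SQLite as the metadata store. The control_watermarks table and both function names are illustrative assumptions, not part of any particular framework; the key point is that save_watermark is called only after the load fully succeeds.

```python
import sqlite3

def get_watermark(conn, pipeline, default=None):
    # Read the last committed watermark for a pipeline, if any.
    row = conn.execute(
        "SELECT value FROM control_watermarks WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    return row[0] if row else default

def save_watermark(conn, pipeline, value):
    # Upsert the watermark; call this only after the load has fully succeeded.
    conn.execute(
        "INSERT INTO control_watermarks (pipeline, value) VALUES (?, ?) "
        "ON CONFLICT(pipeline) DO UPDATE SET value = excluded.value",
        (pipeline, value),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE control_watermarks (pipeline TEXT PRIMARY KEY, value TEXT)")
print(get_watermark(conn, "orders", default="1970-01-01 00:00:00"))
save_watermark(conn, "orders", "2025-06-01 10:00:00")
print(get_watermark(conn, "orders"))
```

In production the same shape works with any transactional store; what matters is that reads fall back to a safe default on the very first run and that writes are atomic upserts.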
Worked examples
1) SQL source with updated_at
Goal: Load only rows changed since the last run, with idempotent upserts.
-- Assume the watermark is stored in a control.watermarks table
-- last_watermark = '2025-06-01 10:00:00'
-- Use a 5-minute lookback to catch late updates
WITH src AS (
  SELECT *
  FROM app.public.orders
  WHERE updated_at > (CAST(:last_watermark AS TIMESTAMP) - INTERVAL '5 minutes')
),
ranked AS (
  -- Deduplicate by primary key, keeping the latest version
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM src
)
MERGE INTO dw.orders t
USING (
  SELECT * FROM ranked WHERE rn = 1
) s
ON (t.order_id = s.order_id)
WHEN MATCHED THEN UPDATE SET
  status = s.status,
  amount = s.amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, amount, updated_at)
VALUES (s.order_id, s.status, s.amount, s.updated_at);
-- After a fully successful load, compute new_watermark
-- new_watermark = MAX(s.updated_at) over the processed set
-- Persist it to control.watermarks
Notes:
- Lookback (overlap) protects against clock skew or slight delays.
- Dedup + MERGE ensures idempotency.
- Watermark is only advanced after success.
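The lookback and advance-after-success logic around the MERGE can be sketched in Python. The helper names effective_lower_bound and new_watermark, and the dict row shape, are illustrative assumptions; the rule they encode is from the notes above: query with an overlap, and never move the watermark backwards or on an empty run.

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(minutes=5)

def effective_lower_bound(last_watermark: datetime) -> datetime:
    # Query rows with updated_at strictly greater than this bound.
    return last_watermark - LOOKBACK

def new_watermark(processed_rows, last_watermark: datetime) -> datetime:
    # Advance only if we actually processed newer rows; never move backwards.
    times = [r["updated_at"] for r in processed_rows]
    return max(times, default=last_watermark)

last = datetime(2025, 6, 1, 10, 0, 0)
rows = [
    {"order_id": 1, "updated_at": datetime(2025, 6, 1, 10, 3)},
    {"order_id": 2, "updated_at": datetime(2025, 6, 1, 9, 58)},  # caught by lookback
]
print(effective_lower_bound(last))  # 2025-06-01 09:55:00
print(new_watermark(rows, last))    # 2025-06-01 10:03:00
```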
2) API with page tokens and event IDs
Goal: Pull new events incrementally without missing or duplicating.
# Pseudocode
state = load_state()  # contains last_event_id and/or page_token
collected = []
while True:
    resp = api.list_events(since_id=state.last_event_id, page_token=state.page_token)
    collected.extend(resp.items)
    if resp.next_page_token is None:
        break
    state.page_token = resp.next_page_token
    save_state(state)  # persist the in-progress page token so a crash can resume
# Deduplicate by event_id and keep the latest version if re-delivered
upsert(collected, key='event_id', version_field='updated_at')
# Advance the watermark only after the upsert succeeds;
# keep the old ID if this run returned nothing
new_last_id = max_event_id(collected) if collected else state.last_event_id
save_state(last_event_id=new_last_id, page_token=None)
- Prefer stable IDs over timestamps if available.
- Store both last_event_id and last successful page_token during pagination.
- Advance only after successful write.
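The last-write-wins deduplication step above can be sketched as a small Python helper. The function name, field names, and integer versions are illustrative assumptions; any monotonically increasing version field (such as updated_at) works the same way.

```python
def deduplicate(events, key="event_id", version="updated_at"):
    # Last-write-wins: keep the highest version per key so replays are harmless.
    latest = {}
    for e in events:
        k = e[key]
        if k not in latest or e[version] > latest[k][version]:
            latest[k] = e
    return list(latest.values())

events = [
    {"event_id": 7, "updated_at": 1, "status": "created"},
    {"event_id": 7, "updated_at": 2, "status": "paid"},    # re-delivered, newer
    {"event_id": 8, "updated_at": 1, "status": "created"},
]
result = deduplicate(events)
print(sorted((e["event_id"], e["status"]) for e in result))
# [(7, 'paid'), (8, 'created')]
```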
3) Files in object storage (daily partitions)
Goal: Incrementally process daily folders, handling late-arriving files.
# Example layout: s3://bucket/sales/dt=2025-06-01/part-000.parquet
state = load_state()  # last_processed_partition = '2025-05-29'
lookback_days = 2
partitions = list_partitions(prefix='sales/')  # e.g., sorted dates
# Resume at the first unprocessed partition, but never later than
# today - lookback, so late files in recent partitions are re-scanned
start_from = min(next_day(state.last_processed_partition), today - lookback_days)
for p in [p for p in partitions if p >= start_from]:
    for f in list_files(p):
        df = read_parquet(f)
        df = deduplicate(df, keys=['order_id'], version='updated_at')
        upsert_into_dw(df, key='order_id', version='updated_at')
# Advance the watermark only up to the last fully processed partition
save_state(last_processed_partition=max_processed_partition)
- Use a small partition lookback to ingest late files safely.
- Upserts make reprocessing safe.
- Consider a manifest of processed files to avoid re-reading large partitions.
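The start-partition computation in this example can be sketched concretely. The start_partition helper is a hypothetical name, and it assumes one partition per day; the rule is to resume at the first unprocessed day but never later than today minus the lookback.

```python
from datetime import date, timedelta

def start_partition(last_processed: date, today: date, lookback_days: int = 2) -> date:
    # Resume at the first unprocessed partition, but never later than
    # today - lookback, so late files landing in recent partitions
    # are re-scanned (upserts make the re-read safe).
    next_unprocessed = last_processed + timedelta(days=1)
    return min(next_unprocessed, today - timedelta(days=lookback_days))

# Last processed 2025-05-29, today 2025-06-01, lookback 2 days:
print(start_partition(date(2025, 5, 29), date(2025, 6, 1)))  # 2025-05-30
```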
Exercises
Do these now.
1) Design an incremental SQL load with watermarks.
Task
Source table: customers(id, email, status, updated_at). Destination: dw.customers. Use a 10-minute lookback, deduplicate by id, and write an idempotent MERGE. Describe how and when you store/update the watermark.
2) File-based lookback plan.
Task
You receive hourly CSVs in /landing/events/dt=YYYY-MM-DD/hr=HH/. Create a step-by-step plan to process incrementally with a 3-hour lookback, deduplicate by event_id, and safely advance the watermark.
Self-check checklist
- [ ] Watermark is durable and updated only after a successful run.
- [ ] There is a lookback/overlap strategy to capture late data.
- [ ] Writes are idempotent (MERGE/upsert) and deduplicated.
- [ ] Deletes are considered (tombstones, soft-deletes, or periodic full compare).
- [ ] Failure recovery resumes from the last saved watermark.
Common mistakes and how to self-check
- No overlap window. Fix: add small lookback by time or partitions, then deduplicate.
- Advancing watermark too early. Fix: update checkpoint only after successful write/commit.
- Assuming timestamps are monotonic. Fix: prefer sequence IDs or add dedup + lookback for clock skew.
- No deduplication. Fix: keep stable keys and last-write-wins logic.
- Ignoring deletes. Fix: process CDC tombstones or schedule periodic reconciliation.
- Single point of failure for state. Fix: store watermarks in reliable metadata storage with concurrency control.
Quick self-audit
- [ ] Can I rerun yesterday’s job without creating duplicates?
- [ ] If the job fails mid-run, do I know exactly where to resume?
- [ ] Can I explain how late-arriving events are captured?
Practical projects
- Build an incremental pipeline from a sample SQL DB to a warehouse using MERGE and a watermark table.
- Create a file-based ingestion flow with a 2-day lookback and a processed-file manifest.
- Prototype a CDC consumer that commits offsets and supports replay without duplication.
Mini challenge
Your API sometimes replays the last 100 events on each call. In one paragraph, describe how you will guarantee no duplicates and never miss an event. Mention your watermark, dedup logic, and when you advance the watermark.
Hint
Use the last processed event ID as the watermark, always request since last_id minus a small buffer (overlap), dedup by event_id, and advance the watermark only after a successful upsert.
Who this is for
- Aspiring and practicing Data Engineers building reliable ingestion pipelines.
- Analytics Engineers who need faster, cheaper data loads.
- Developers designing data movement between systems.
Prerequisites
- Comfort with SQL (SELECT, JOIN, and MERGE/UPSERT).
- Basic understanding of batch vs. streaming loads.
- Familiarity with file formats (CSV/Parquet) and object storage.
Learning path
- Review incremental vs. full loads and why watermarks matter.
- Practice timestamp- and ID-based patterns with lookbacks.
- Add idempotent writes (MERGE/upsert) and dedup.
- Handle deletes and late-arriving data.
- Harden state management and recovery.
Next steps
- Apply incremental loads to one real source today (even a small table).
- Introduce a lookback in your current pipeline and measure correctness.
- Document where your watermark lives and how recovery works.
Ready for the Quick Test?
Take the Quick Test below.