
CDC And Incremental Loads

Learn CDC And Incremental Loads for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Machine Learning Engineer, you need current, trustworthy data without overloading systems. Change Data Capture (CDC) and incremental loads let you move only what changed. This keeps feature stores fresh, model monitoring accurate, and training sets reproducible—while controlling cost and latency.

  • Refresh features only for users/orders that changed.
  • Backfill late events without rewriting entire tables.
  • Maintain audit-ready history for model explainability.
  • Reduce costs by avoiding full scans and full reloads.

Concept explained simply

Full load is like re-copying an entire library every night. Incremental load copies only the new or changed books. CDC is how you detect which books changed.

Mental model

Think of a watermark as a bookmark that remembers "how far you got." Each run starts from that bookmark, reads forward, and updates the bookmark when done. Idempotency means you can safely re-run the same batch without duplicating or corrupting data.
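
A minimal sketch of this run loop in Python, assuming hypothetical helpers load_watermark, read_source_since, upsert_by_key, and save_watermark that wrap your metadata store and warehouse:

# Watermark-driven run loop (sketch); the helper functions below are
# assumptions, not a specific library's API.

def run_incremental_load() -> None:
    watermark = load_watermark("orders")                 # the "bookmark"
    rows = read_source_since("source.orders", watermark)
    if not rows:
        return                                           # nothing changed; bookmark stays put
    upsert_by_key("dw.orders", rows, key="order_id")     # idempotent write (MERGE semantics)
    new_watermark = max(r["updated_at"] for r in rows)
    save_watermark("orders", new_watermark)              # advance only after the write succeeds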

Core concepts

  • Incremental load: Move only new/changed/deleted records since the last run.
  • CDC methods:
    • Timestamp-based: Filter where updated_at > last_watermark.
    • Version/sequence-based: Use a monotonically increasing version or ID.
    • Log-based: Read change logs (insert/update/delete events) from the source.
    • Checksum/Hash: Compare row hashes to detect differences (see the sketch after this list).
  • Watermark: The last successfully processed point in time or sequence.
  • Upsert/Merge: Insert new, update existing, optionally delete removed rows.
  • Idempotency: Re-running produces the same final state (no duplicates).
  • Late/out-of-order events: Handle with a sliding window and dedup rules.
  • Soft vs hard deletes: A hard delete removes the row downstream; a soft delete keeps it but marks it is_deleted = true.
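
As a concrete illustration of the checksum method, here is a small Python sketch that hashes each row and compares it against previously stored hashes; hash_store stands in for wherever you persist per-key hashes, and order_id is assumed to be the primary key:

# Checksum/hash-based change detection (sketch).
import hashlib
import json

def row_hash(row: dict) -> str:
    # Stable serialization so the same content always hashes the same way
    return hashlib.sha256(json.dumps(row, sort_keys=True, default=str).encode()).hexdigest()

def detect_changes(rows: list[dict], hash_store: dict[str, str]) -> list[dict]:
    changed = []
    for row in rows:
        key = row["order_id"]
        h = row_hash(row)
        if hash_store.get(key) != h:      # new key, or content changed since last run
            changed.append(row)
            hash_store[key] = h
    return changed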

Deep dive: Exactly-once vs at-least-once

In distributed systems, exactly-once is often implemented as "at-least-once + idempotent writes + dedup." Use stable keys, merge semantics, and deterministic ordering to achieve effectively-once outcomes.
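
One way to picture this is a consumer that may see the same event more than once but always resolves to a single winner per key. A small sketch, assuming each event carries a primary key (order_id) and a monotonically increasing sequence:

# At-least-once delivery made effectively-once (sketch).
# Duplicates and replays are harmless because a stable key plus a
# deterministic ordering rule (highest sequence wins) decides the winner.

def apply_events(target: dict, events: list[dict]) -> dict:
    for e in events:
        key = e["order_id"]
        current = target.get(key)
        # Only apply if this event is newer than what is already stored
        if current is None or e["sequence"] > current["sequence"]:
            target[key] = e
    return target

# Replaying the same (or an overlapping) batch yields the same target,
# because older or duplicate events never overwrite a newer winner.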

Worked examples

Example 1: Timestamp-based incremental upsert

Scenario: You have a source table orders with updated_at and is_deleted.

-- Input watermark provided by your job metadata
-- :watermark is the last successfully processed updated_at
WITH src AS (
  SELECT *
  FROM source.orders
  WHERE updated_at > :watermark
)
MERGE INTO dw.orders AS t
USING (
  -- If the same order_id can change more than once in a batch, keep only
  -- the latest row per key here before merging
  SELECT order_id, user_id, amount, status, is_deleted, updated_at
  FROM src
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.is_deleted = TRUE THEN DELETE
WHEN MATCHED THEN UPDATE SET
  user_id    = s.user_id,
  amount     = s.amount,
  status     = s.status,
  updated_at = s.updated_at
WHEN NOT MATCHED AND s.is_deleted = FALSE THEN INSERT
  (order_id, user_id, amount, status, is_deleted, updated_at)
  VALUES (s.order_id, s.user_id, s.amount, s.status, s.is_deleted, s.updated_at);

-- After successful merge, advance watermark to max(s.updated_at)
  • Idempotent because MERGE ensures one final row per key.
  • Handles deletes: the source's soft-delete flag drives a true DELETE for existing rows, and already-deleted rows are never inserted.
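
The "advance the watermark after a successful merge" step can be sketched as follows; conn is an assumed database handle, and run_merge, get_max_updated_at, and save_watermark are hypothetical helpers around the SQL shown above:

# Advancing the watermark only after the MERGE commits (sketch).

def run_orders_load(conn, watermark):
    with conn.begin():                        # merge + watermark update in one transaction
        run_merge(conn, watermark)            # the MERGE from Example 1
        new_wm = get_max_updated_at(conn)     # max(updated_at) actually processed this run
        if new_wm is not None:
            save_watermark(conn, "orders", new_wm)
    # On failure the transaction rolls back and the old watermark stays,
    # so the next run safely re-reads the same window.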

Example 2: Log-based CDC with dedup

Scenario: You read change events with op codes (c,u,d) and timestamps (ts_ms).

# Python (pandas-style) sketch; read_changes, merge_into_dw, and
# apply_deletes_in_dw are pipeline helpers assumed to exist
batch = read_changes()

# Deduplicate: keep the latest event per primary key within the batch,
# using a deterministic ordering (ts_ms, then sequence)
latest = (
    batch
    .sort_values(['ts_ms', 'sequence'])
    .drop_duplicates('order_id', keep='last')
)

# Map op codes to actions: c/u -> upsert, d -> delete
upserts = latest[latest['op'].isin(['c', 'u'])]
deletes = latest[latest['op'] == 'd']

# Write idempotently (MERGE semantics in the warehouse)
merge_into_dw(upserts, key='order_id')
apply_deletes_in_dw(deletes, key='order_id')

# Advance the watermark to max(ts_ms) only after a successful commit
new_watermark = batch['ts_ms'].max()
  • Use a stable ordering (ts_ms, sequence) to get the winning event.
  • On retry, same batch produces same final state.
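
To make the idempotency claim concrete, here is a toy version of the write helpers with a dict standing in for the warehouse table; the names mirror the pseudocode above, but the implementation is illustrative only (the real helpers would issue MERGE/DELETE statements):

# Toy illustration that the write path is idempotent (not production code).
# A dict keyed by order_id stands in for dw.orders.

def merge_into_dw(target: dict, upserts: list[dict], key: str = "order_id") -> None:
    for row in upserts:
        target[row[key]] = row            # insert-or-overwrite, one row per key

def apply_deletes_in_dw(target: dict, deletes: list[dict], key: str = "order_id") -> None:
    for row in deletes:
        target.pop(row[key], None)        # deleting an already-missing key is a no-op

# Running both functions twice with the same deduped batch produces the
# same target state, so retries and replays are safe.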

Example 3: Incremental feature refresh with a safety window

Scenario: Compute user_daily_spend features from transactions. Events can arrive up to 24 hours late.

-- Re-read complete days inside the late-arrival window so each daily
-- aggregate is recomputed from full days, never from a partial day
WITH inc AS (
  SELECT user_id, CAST(event_ts AS DATE) AS d, amount
  FROM raw.transactions
  WHERE event_ts >= CAST(:watermark - INTERVAL '24 hours' AS DATE)
), agg AS (
  SELECT user_id, d, SUM(amount) AS daily_spend
  FROM inc
  GROUP BY user_id, d
)
MERGE INTO fs.user_daily_spend AS t
USING agg AS s
ON t.user_id = s.user_id AND t.d = s.d
WHEN MATCHED THEN UPDATE SET daily_spend = s.daily_spend
WHEN NOT MATCHED THEN INSERT (user_id, d, daily_spend) VALUES (s.user_id, s.d, s.daily_spend);

-- Advance watermark to the max processed event_ts
  • Sliding window re-aggregates recent days to fix late arrivals.
  • MERGE ensures correct upserts per (user_id, day).

Who this is for and prerequisites

  • Who: ML engineers, data engineers, and analysts maintaining feature stores or training data pipelines.
  • Prerequisites: Basic SQL (SELECT, JOIN), understanding of primary keys, comfort with batch jobs or micro-batches.

Learning path

  1. Identify keys and change indicators (timestamps, versions, logs).
  2. Define watermark strategy and where it is stored.
  3. Design idempotent write path (MERGE/upsert + delete handling).
  4. Add deduplication and late-event windows.
  5. Harden with data quality checks and monitoring.
  6. Optimize performance (partitioning, indexes, batch sizing).

Design choices

  • Sources with updated_at: Use timestamp-based CDC with watermarks.
  • Sources with versions/sequences: Prefer sequence-based filters to avoid clock drift issues.
  • Event logs: Use log-based CDC; dedupe by key + highest offset/sequence.
  • Deletes: Decide on soft-delete (is_deleted) vs hard delete to match downstream needs.
  • Watermark scope: Global, per-table, or per-partition depending on data skew.

Choosing a watermark granularity
  • Per-table: simplest; good when volume is moderate and ordering is stable.
  • Per-partition/shard: use for high volume or uneven arrival patterns.
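
Per-partition watermarks amount to keeping one bookmark per partition key instead of a single global value. A small Python sketch, assuming partitions are identified by strings and reusing the hypothetical helpers from earlier (load_partition_watermarks, read_partition_since, upsert_by_key, save_partition_watermarks):

# Per-partition watermarks (sketch): one bookmark per partition key.

def process_partitions(partitions: list[str]) -> None:
    watermarks = load_partition_watermarks("orders")    # e.g. {"2024-06-01": ts, ...}
    for p in partitions:
        wm = watermarks.get(p)                          # None means "load everything"
        rows = read_partition_since("source.orders", p, wm)
        if not rows:
            continue
        upsert_by_key("dw.orders", rows, key="order_id")
        watermarks[p] = max(r["updated_at"] for r in rows)
    save_partition_watermarks("orders", watermarks)     # persist all bookmarks after the run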

Data quality and correctness

  • Deduplication: Keep the latest version per key in a batch.
  • Constraints: Validate not null on keys; reject malformed rows.
  • Row count expectations: Alert on big swings vs baseline.
  • Change reasonability: Track update rates; sudden spikes may indicate source bugs.

Self-check snippet
-- Duplicate key probe in target after write
SELECT order_id, COUNT(*) c
FROM dw.orders
GROUP BY order_id
HAVING COUNT(*) > 1;

Performance tips

  • Filter at source using watermarks to avoid full scans.
  • Batch writes (multi-row merges) and use bulk load where possible.
  • Partition by date/key; index join keys in targets.
  • Keep micro-batches small enough to meet SLAs but large enough to amortize overhead.
  • Avoid wide MERGE clauses; only update changed columns when supported.

Exercises

Available to everyone; only logged-in users get saved progress.

  • Exercise 1: Incremental upsert for an orders table (mirror of Exercise ex1 below).
  • Exercise 2: Idempotent append-only CDC consumer with dedup (mirror of Exercise ex2 below).
  • Checklist:
    • Identified primary key(s).
    • Chose appropriate CDC method.
    • Defined watermark and how it advances.
    • Ensured idempotent writes.
    • Included delete handling.
    • Added dedup for replays/out-of-order events.

Common mistakes and how to self-check

  • Relying on system time instead of source timestamps. Self-check: Compare min/max source updated_at in each run with watermark.
  • Forgetting delete handling. Self-check: Pick a sample of deleted keys and verify absence or is_deleted=true downstream.
  • No dedup. Self-check: Count rows per key after write; ensure uniqueness.
  • Advancing watermark before successful commit. Self-check: Only move watermark after the target commit finishes.
  • Too narrow window for late events. Self-check: Monitor late-arrival rate and adjust window.
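
Several of these self-checks can run as lightweight assertions at the end of each job. A sketch, where query_scalar is an assumed helper that runs a SQL statement and returns a single value:

# Post-run self-checks (sketch).

def post_run_checks(conn, watermark_before, watermark_after) -> None:
    # Watermark must never move backwards, and only after a successful commit
    assert watermark_after >= watermark_before, "watermark went backwards"

    # No duplicate keys after the write (mirrors the duplicate-key probe above)
    dupes = query_scalar(conn, """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM dw.orders GROUP BY order_id HAVING COUNT(*) > 1
        ) d
    """)
    assert dupes == 0, f"{dupes} duplicate order_id values in dw.orders"

    # Watermark should not outrun the source's own timestamps
    max_src = query_scalar(conn, "SELECT MAX(updated_at) FROM source.orders")
    assert max_src is None or watermark_after <= max_src, "watermark ahead of source"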

Practical projects

  • Build an incremental loader for a users table using updated_at. Store watermark in a metadata table and implement MERGE with soft deletes.
  • Create a mini CDC consumer for an append-only event log. Implement within-batch dedup and a retry that proves idempotency.
  • Feature refresh job: Recompute a 48-hour sliding window for three features (daily_spend, daily_orders, avg_basket). Validate with unit checks.

Next steps

  • Add orchestration (scheduling, retries) and alerting.
  • Instrument row-level counts and watermarks for observability.
  • Generalize your MERGE patterns into reusable components.

Mini challenge

Your source emits out-of-order updates for the same order_id within 10 minutes. Design a write strategy to ensure the latest event wins while staying idempotent. Hint: deterministic ordering key and MERGE.

Quick Test

Take the quick test to check your understanding. Available to everyone; only logged-in users get saved progress.

Practice Exercises

2 exercises to complete

Instructions

You have source.orders with columns: order_id (PK), user_id, amount, status, is_deleted (bool), updated_at (timestamp). Build an incremental job that:

  • Reads only rows where updated_at > :watermark
  • MERGEs into dw.orders (upsert fields; delete when is_deleted=true)
  • Advances the watermark safely after a successful commit

Provide: the MERGE SQL and a sentence describing when/how you advance the watermark.

Expected Output
A MERGE statement that upserts by order_id and deletes when is_deleted=true. A clear note that watermark advances to max(updated_at) only after a successful write.

CDC And Incremental Loads — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

