Who this is for
Data Engineers and platform-minded developers who need to move data reliably from OLTP systems into warehouses or data lakes without full reloads.
Prerequisites
- Comfort with SQL (SELECT, JOIN, WHERE, MERGE/UPSERT)
- Basic understanding of relational databases and transaction logs
- Familiarity with batch vs. streaming concepts
Why this matters
Real teams use CDC to:
- Keep analytics tables fresh without expensive full-table scans
- Replicate production DB changes to a warehouse or lake within minutes
- Feed downstream microservices with reliable change streams
- Enable point-in-time recovery or audit trails
Real tasks you will do
- Design a high-water mark incremental load for a PostgreSQL table with millions of rows
- Configure log-based CDC to capture inserts/updates/deletes
- Apply idempotent merges so replays don’t corrupt data
- Handle schema changes and backfills safely
Concept explained simply
Change Data Capture (CDC) is a way to move only what changed since last time. Instead of copying an entire table, you track inserts, updates, and deletes and apply them downstream.
Mental model
Think of your source table as a diary. CDC is copying only new diary entries since the last page you copied. You bookmark the last page number (watermark) and continue from there next time. If your diary also shows edits (transaction log), you can replay exactly what happened in the right order.
Three common CDC styles
- Query-based incremental: SELECT rows where updated_at > last_watermark
- Log-based CDC: read the database transaction log (e.g., binlog/LSN) for exact row-level changes
- Trigger-based: DB triggers write changes into side tables (simple, but adds load and complexity)
Key patterns and options
Query-based incremental (high‑water mark)
- Requires a monotonic column (e.g., updated_at or an increasing ID)
- Store the last processed value; fetch strictly greater (>) to avoid duplicates
- Handle clock skew and late updates with a small overlap window and dedupe downstream
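The bullets above can be sketched in a few lines. This is a minimal, in-memory illustration (the table rows, column names, and 2-minute overlap are made up for the example): fetch everything newer than the watermark minus an overlap window, then dedupe by primary key so the re-read rows do not become duplicates.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory "source table": rows with a primary key and updated_at.
ROWS = [
    {"id": 1, "updated_at": datetime(2026, 1, 7, 11, 59)},
    {"id": 2, "updated_at": datetime(2026, 1, 7, 12, 1)},
    {"id": 3, "updated_at": datetime(2026, 1, 7, 12, 5)},
]

OVERLAP = timedelta(minutes=2)  # re-read a small window to tolerate clock skew


def fetch_delta(rows, last_watermark):
    """Fetch rows changed since the watermark, minus an overlap window.

    The overlap deliberately re-reads recently seen rows; the caller must
    dedupe (e.g., by primary key, keeping the newest updated_at).
    """
    cutoff = last_watermark - OVERLAP
    return [r for r in rows if r["updated_at"] > cutoff]


def dedupe_latest(rows):
    """Keep only the newest version of each primary key."""
    latest = {}
    for r in sorted(rows, key=lambda r: r["updated_at"]):
        latest[r["id"]] = r  # later (newer) rows overwrite earlier ones
    return list(latest.values())


last_watermark = datetime(2026, 1, 7, 12, 0)
delta = dedupe_latest(fetch_delta(ROWS, last_watermark))
new_watermark = max(r["updated_at"] for r in delta)  # persist for the next run
```

The same shape works against a real database: the `fetch_delta` filter becomes the `WHERE updated_at > ...` clause, and the dedupe happens in staging before the MERGE.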
Log-based CDC
- Reads the database’s change log for exact inserts/updates/deletes in order
- Gives low latency and correct ordering; supports deletes well
- Needs connectors and careful offset management (at-least-once is typical)
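The offset-management point deserves a concrete sketch. This is not any real connector's API, just an illustration of the at-least-once discipline: commit the offset only after the batch is durably applied, and make the apply step tolerate replays.

```python
# Simulated change events with log sequence numbers (LSNs), in commit order.
events = [
    {"lsn": 101, "op": "c", "key": 1, "after": {"v": "a"}},
    {"lsn": 102, "op": "u", "key": 1, "after": {"v": "b"}},
    {"lsn": 103, "op": "d", "key": 1, "after": None},
]

target = {}           # downstream current-state table (key -> row)
committed_offset = 0  # last durably applied LSN


def apply_batch(batch):
    """Apply a batch of log events; safe to call again with the same batch."""
    global committed_offset
    for e in batch:
        if e["lsn"] <= committed_offset:
            continue  # replayed event from a restart: skip (idempotent)
        if e["op"] in ("c", "u"):
            target[e["key"]] = e["after"]
        elif e["op"] == "d":
            target.pop(e["key"], None)
    # Commit the offset only AFTER the batch is applied; a crash before this
    # line replays the batch instead of silently losing it.
    committed_offset = batch[-1]["lsn"]


apply_batch(events)
apply_batch(events)  # replay of the same batch is a no-op
```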
Trigger-based CDC
- DB triggers capture changes into a delta table
- Easy to reason about; may add write overhead and require DBA involvement
Initial snapshot and backfills
- First, copy the current full state (snapshot)
- Then, switch to incremental changes using high-water mark or log offsets
- Plan replays: if the consumer fails, you can reprocess from last persisted offset
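One subtlety in the snapshot-then-incremental handoff is worth showing: record the starting watermark before the copy begins, so rows that change during a long snapshot are re-read by the first incremental run rather than lost. A minimal in-memory sketch (table and column names are illustrative):

```python
from datetime import datetime

source = {
    1: {"id": 1, "updated_at": datetime(2026, 1, 1)},
    2: {"id": 2, "updated_at": datetime(2026, 1, 5)},
}


def run_snapshot(source, now):
    """Full copy of current state; returns the copy and the starting watermark."""
    start_watermark = now  # captured BEFORE the copy begins
    target = {k: dict(v) for k, v in source.items()}
    return target, start_watermark


target, watermark = run_snapshot(source, now=datetime(2026, 1, 6))

# A row changes while/after the snapshot...
source[2]["updated_at"] = datetime(2026, 1, 7)
source[2]["name"] = "updated"

# ...and the first incremental run picks it up from the saved watermark.
delta = [r for r in source.values() if r["updated_at"] > watermark]
for r in delta:
    target[r["id"]] = dict(r)
```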
Ordering, idempotency, deduplication
- Ordering: apply changes in commit order (log-based) or by event time
- Idempotency: repeatable merges; applying the same event twice yields the same result
- Deduplication: use keys like (pk, event_sequence or LSN) to keep only the latest per key
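The deduplication rule above is a one-pass reduction: per primary key, keep only the event with the highest sequence number. A small sketch with made-up order events:

```python
# At-least-once delivery means the same event (or an older one) can arrive
# again; keep only the highest LSN per primary key before merging.
events = [
    {"pk": "o-1", "lsn": 5, "status": "paid"},
    {"pk": "o-1", "lsn": 3, "status": "pending"},  # older event: must lose
    {"pk": "o-2", "lsn": 4, "status": "shipped"},
    {"pk": "o-2", "lsn": 4, "status": "shipped"},  # exact duplicate: absorbed
]


def latest_per_key(events):
    """Return the winning (highest-LSN) event for each primary key."""
    best = {}
    for e in events:
        cur = best.get(e["pk"])
        if cur is None or e["lsn"] > cur["lsn"]:
            best[e["pk"]] = e
    return best


latest = latest_per_key(events)
```

This is the procedural equivalent of the `ROW_NUMBER() OVER (PARTITION BY key ORDER BY ...)` pattern used in the SQL examples later in this section.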
Deletes and tombstones
- Soft delete: mark a flag (deleted=true)
- Hard delete: emit a delete event and remove downstream rows
- Compaction: in append-only lakes, periodically merge to a clean table
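Compaction ties these bullets together: replay the append-only log in commit order, let the last write per key win, and treat a delete event as a tombstone that removes the key from the compacted view. A minimal sketch (event fields are illustrative):

```python
# Append-only event log; "d" is a tombstone (hard delete).
log = [
    {"seq": 1, "op": "c", "key": "a", "row": {"qty": 1}},
    {"seq": 2, "op": "u", "key": "a", "row": {"qty": 2}},
    {"seq": 3, "op": "c", "key": "b", "row": {"qty": 9}},
    {"seq": 4, "op": "d", "key": "a", "row": None},
]


def compact(log):
    """Collapse an append-only log into a clean current-state table."""
    state = {}
    for e in sorted(log, key=lambda e: e["seq"]):  # apply in commit order
        if e["op"] == "d":
            state.pop(e["key"], None)  # tombstone removes the key
        else:
            state[e["key"]] = e["row"]
    return state


current = compact(log)
```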
Schema changes
- Detect new columns; default them safely downstream
- For breaking changes (type changes), stage changes and perform backfills
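A simple way to "default new columns safely" is to conform each incoming record to the target's known columns and quarantine anything unknown for later review, instead of failing the load. The column names and defaults below are made up for illustration:

```python
# Target schema with per-column defaults; None means "nullable".
TARGET_COLUMNS = {"id": None, "name": None, "email": None, "segment": "unknown"}


def conform(event):
    """Shape an event to the target schema; return (row, unexpected columns)."""
    row = {col: event.get(col, default) for col, default in TARGET_COLUMNS.items()}
    extras = {k: v for k, v in event.items() if k not in TARGET_COLUMNS}
    return row, extras  # extras can be logged or staged for a later backfill


# Source has dropped "email" and added a new "plan" column.
row, extras = conform({"id": 7, "name": "Ada", "plan": "pro"})
```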
Warehouse/lake application patterns
- Merge pattern (UPSERT): MERGE target USING staged_changes ON key THEN UPDATE/INSERT; handle delete events
- Append + periodic compact: append all events; compact into current-state tables
- SCD2 for history: create new versions on updates, close old versions with valid_to timestamps
Worked examples
Example 1: Incremental by updated_at (PostgreSQL)
-- Assume last_watermark = '2026-01-07T12:00:00Z'
WITH delta AS (
SELECT *
FROM public.customers
WHERE updated_at > TIMESTAMP '2026-01-07 12:00:00+00'
)
-- Stage then merge into warehouse.customers_current
MERGE INTO warehouse.customers_current t
USING delta s
ON (t.customer_id = s.customer_id)
WHEN MATCHED THEN UPDATE SET
name = s.name,
email = s.email,
updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, name, email, updated_at)
VALUES (s.customer_id, s.name, s.email, s.updated_at);
-- Save new watermark = MAX(s.updated_at)
Why this works
We read only rows after the last processed timestamp, then save the new maximum timestamp as the next starting point. A small overlap window (e.g., >= last_watermark minus 2 minutes) plus downstream dedupe protects against clock skew and late-arriving updates.
Example 2: Log-based stream with Kafka-style events
-- Events have fields: op (c/u/d), key (order_id), ts, and after/before objects
-- Deduplicate by (order_id, ts) keeping the latest ts per key
WITH ranked AS (
SELECT e.*, ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) AS rn
FROM staging.order_events e
)
, latest AS (
SELECT * FROM ranked WHERE rn = 1
)
MERGE INTO dw.orders t
USING latest s
ON (t.order_id = s.key)
WHEN MATCHED AND s.op IN ('u','c') THEN UPDATE SET
status = s.after.status,
amount = s.after.amount,
updated_at = s.ts
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN NOT MATCHED AND s.op IN ('u','c') THEN INSERT (order_id, status, amount, updated_at)
VALUES (s.key, s.after.status, s.after.amount, s.ts);
Key takeaway
At-least-once delivery is common. Rank and keep the latest event per key before you MERGE to avoid duplicate or out-of-order updates.
Example 3: SCD Type 2 in a lakehouse
-- Input: CDC changes for products
-- Goal: keep history of attribute changes
-- A single MERGE cannot both close the old version and insert the new one
-- for the same matched row, so feed each changed row in twice: once with a
-- real merge_key (closes the old version) and once with a NULL merge_key
-- (inserts the new version, since NULL never matches).
MERGE INTO lake.products_scd2 t
USING (
  SELECT s.product_id AS merge_key, s.* FROM staged_changes s
  UNION ALL
  SELECT NULL AS merge_key, s.*
  FROM staged_changes s
  JOIN lake.products_scd2 cur
    ON s.product_id = cur.product_id AND cur.is_current = true
  WHERE s.op = 'u'
    AND (cur.name != s.after.name OR cur.price != s.after.price)
) s
ON (t.product_id = s.merge_key AND t.is_current = true)
WHEN MATCHED AND s.op = 'u' AND
  (t.name != s.after.name OR t.price != s.after.price) THEN UPDATE SET
  is_current = false,
  valid_to = s.ts
WHEN MATCHED AND s.op = 'd' THEN UPDATE SET
  is_current = false,
  valid_to = s.ts
WHEN NOT MATCHED AND s.op IN ('c','u') THEN INSERT (
  product_id, name, price, valid_from, valid_to, is_current
) VALUES (
  s.product_id, s.after.name, s.after.price, s.ts, NULL, true
);
Why SCD2
You preserve historical versions while marking the latest as current. CDC events become version boundaries.
Step-by-step: build a small CDC pipeline
- Pick the capture method: high-water mark (simple) or log-based (more precise).
- Plan the initial snapshot: copy full table once; record the starting watermark/offset.
- Stage changes: land deltas into a raw/staging area first.
- Deduplicate and order: use keys and event timestamps or log sequence numbers.
- Apply changes: MERGE/UPSERT into target; include delete handling.
- Persist progress: store watermark/offset in a control table.
- Monitor: alert on lag, null watermarks, or rising dedupe rates.
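The steps above can be tied together in one small loop. This is an in-memory sketch, not production code: the control table, overlap window, and soft-delete flag are all illustrative, but the ordering of the steps (stage, dedupe, merge, then persist the watermark only on success) is the part that matters.

```python
from datetime import datetime, timedelta

control = {"customers": datetime(2026, 1, 7, 12, 0)}  # watermark control table
target = {}                                           # dw.customers_current


def run_batch(source_rows, table="customers", overlap=timedelta(minutes=2)):
    wm = control[table]
    # 1) Stage: pull deltas with an overlap window for clock skew.
    staged = [r for r in source_rows if r["updated_at"] > wm - overlap]
    if not staged:
        return
    # 2) Dedupe: keep only the newest version per primary key.
    latest = {}
    for r in sorted(staged, key=lambda r: r["updated_at"]):
        latest[r["id"]] = r
    # 3) Merge: upsert, with delete handling.
    for r in latest.values():
        if r.get("deleted"):
            target.pop(r["id"], None)
        else:
            target[r["id"]] = r
    # 4) Persist progress only after a successful apply.
    control[table] = max(r["updated_at"] for r in staged)


rows = [
    {"id": 1, "updated_at": datetime(2026, 1, 7, 12, 3)},
    {"id": 1, "updated_at": datetime(2026, 1, 7, 12, 4), "deleted": True},
    {"id": 2, "updated_at": datetime(2026, 1, 7, 12, 5)},
]
run_batch(rows)
run_batch(rows)  # re-running the same batch leaves the same end state
```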
Gotchas to watch
- Time zones: normalize timestamps to UTC
- Clock skew: add overlap windows and dedupe
- Large deletes: batch deletes or mark soft-deletes first
- Schema drift: default new columns and backfill later
Exercises
Attempt each exercise on your own before opening any solutions.
Exercise 1: Design a watermark-based incremental
- Source: PostgreSQL table public.customers(customer_id PK, name, email, updated_at TIMESTAMPTZ)
- Task: Draft the SQL to pull deltas using updated_at and a MERGE to apply into dw.customers_current
- Decide how you store and update the watermark
Exercise 2: Idempotent upsert from CDC events
- Given events with fields (op, key, ts, after, before) and possible duplicates
- Task: Show how you dedupe by (key, ts) and MERGE into dw.orders
- Include delete handling
Self-check checklist
- You used > (strictly greater than) for the watermark filter
- You saved the new MAX(updated_at) after a successful load
- Your MERGE applies deletes correctly
- You deduplicated by a stable ordering key (ts or LSN)
- Your pipeline is idempotent if re-run
Common mistakes
- Using >= watermark instead of > and forgetting dedupe, causing duplicates
- Ignoring deletes; downstream tables never remove rows
- Not persisting offsets/watermarks safely; restarts reprocess too much or too little
- No overlap window; missing late-arriving changes
- Assuming event time equals commit order; rank carefully when ordering matters
How to self-check
- Re-run the same batch twice; results must be identical (idempotent)
- Delete a row upstream; verify downstream is deleted or marked deleted
- Inject duplicates; verify dedupe removes extras
- Change schema by adding a column; downstream should not fail hard
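The first three checks can be automated as a tiny harness. The `apply_batch` function here is a stand-in for your real load step (its event fields are illustrative); the point is the test shape: run the same batch twice, inject a duplicate, and assert nothing changes.

```python
import copy


def apply_batch(target, batch):
    """Stand-in for the real load: newest event per key wins, deletes propagate."""
    for e in batch:
        if e["op"] == "d":
            target.pop(e["key"], None)
        else:
            prev = target.get(e["key"])
            if prev is None or e["ts"] >= prev["ts"]:
                target[e["key"]] = e


batch = [
    {"op": "c", "key": 1, "ts": 10, "v": "a"},
    {"op": "u", "key": 1, "ts": 11, "v": "b"},
    {"op": "u", "key": 1, "ts": 11, "v": "b"},  # injected duplicate
    {"op": "d", "key": 2, "ts": 12},            # upstream delete
]

target = {2: {"op": "c", "key": 2, "ts": 5, "v": "x"}}
apply_batch(target, batch)
first = copy.deepcopy(target)

apply_batch(target, batch)   # re-run the same batch
assert target == first       # idempotent: identical results
assert 2 not in target       # delete propagated downstream
```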
Practical projects
- Build a CDC pipeline that snapshots a source table and then switches to incremental updates every 5 minutes. Track watermarks in a control table.
- Create a lakehouse current-state table and a history SCD2 table fed from the same CDC stream. Add daily compaction.
- Add monitoring: a simple dashboard showing lag (minutes since last change), number of deduped events, and last successful watermark.
Learning path
- Start with high-water mark incrementals on a single table
- Add delete handling and idempotent MERGE patterns
- Move to log-based CDC with staging and ordering
- Introduce SCD2 and compaction in a data lake
- Scale to multiple tables and monitor lag/health
Next steps
- Implement CDC on one critical table in your environment
- Add a control table for watermarks/offsets
- Extend to 3–5 tables and standardize your MERGE templates
Mini challenge
You receive CDC events for invoices where updates can arrive out of order. In one run, you see invoice 123 updated at 10:02 and then an older event at 10:01. Show how you would ensure the 10:02 state wins in your target table without losing history. Hint: rank by event time or sequence and use SCD2 or a current-state MERGE with a WHERE s.ts > t.updated_at guard.