

CDC Concepts And Change Capture

Learn CDC concepts and change capture for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Who this is for

Data Engineers and platform-minded developers who need to move data reliably from OLTP systems into warehouses or data lakes without full reloads.

Prerequisites

  • Comfort with SQL (SELECT, JOIN, WHERE, MERGE/UPSERT)
  • Basic understanding of relational databases and transaction logs
  • Familiarity with batch vs. streaming concepts

Why this matters

Real teams use CDC to:

  • Keep analytics tables fresh without expensive full-table scans
  • Replicate production DB changes to a warehouse or lake within minutes
  • Feed downstream microservices with reliable change streams
  • Enable point-in-time recovery or audit trails

Real tasks you will do
  • Design a high-water mark incremental for a PostgreSQL table with millions of rows
  • Configure log-based CDC to capture inserts/updates/deletes
  • Apply idempotent merges so replays don’t corrupt data
  • Handle schema changes and backfills safely

Concept explained simply

Change Data Capture (CDC) is a way to move only what changed since last time. Instead of copying an entire table, you track inserts, updates, and deletes and apply them downstream.

Mental model

Think of your source table as a diary. CDC is copying only new diary entries since the last page you copied. You bookmark the last page number (watermark) and continue from there next time. If your diary also shows edits (transaction log), you can replay exactly what happened in the right order.

Three common CDC styles
  • Query-based incremental: SELECT rows where updated_at > last_watermark
  • Log-based CDC: read the database transaction log (e.g., binlog/LSN) for exact row-level changes
  • Trigger-based: DB triggers write changes into side tables (simple, but adds load and complexity)

Key patterns and options

Query-based incremental (high‑water mark)

  • Requires a monotonic column (e.g., updated_at or an increasing ID)
  • Store the last processed value; fetch strictly greater (>) to avoid duplicates
  • Handle clock skew and late updates with a small overlap window and dedupe downstream
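
The bullets above can be sketched in Python with sqlite3 (a minimal sketch; the table, column names, and epoch-second timestamps are all illustrative, not from any specific system):

```python
import sqlite3

def fetch_delta(conn, last_watermark, overlap_seconds=120):
    """Fetch rows changed after the watermark, minus a small overlap
    window to tolerate clock skew; duplicates are removed downstream."""
    cur = conn.execute(
        "SELECT customer_id, name, updated_at FROM customers "
        "WHERE updated_at >= ? ORDER BY updated_at",
        (last_watermark - overlap_seconds,),
    )
    return cur.fetchall()

# Demo with epoch-second timestamps for simplicity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Lin", 250), (3, "Sam", 400)],
)
delta = fetch_delta(conn, last_watermark=250)
new_watermark = max(row[2] for row in delta)  # persist only after a successful load
```

The overlap window deliberately re-reads a little recent history, which is why the downstream merge must be idempotent.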

Log-based CDC

  • Reads the database’s change log for exact inserts/updates/deletes in order
  • Gives low latency and correct ordering; supports deletes well
  • Needs connectors and careful offset management (at-least-once is typical)
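
A toy illustration of why at-least-once delivery falls out of offset management: the offset is committed only after processing, so a crash in between replays events on restart (the consumer loop, log layout, and `apply` callback are invented for this example):

```python
# Toy log consumer: offsets are committed only after processing succeeds,
# so a crash between "apply" and "commit" replays events (at-least-once).
def consume(log, committed_offset, apply):
    for offset in range(committed_offset + 1, len(log)):
        apply(log[offset])           # side effect; may re-run after a crash
        committed_offset = offset    # a real system persists this durably
    return committed_offset

log = [{"op": "c", "id": 1}, {"op": "u", "id": 1}, {"op": "d", "id": 1}]
applied = []
last_offset = consume(log, committed_offset=-1, apply=applied.append)
```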

Trigger-based CDC

  • DB triggers capture changes into a delta table
  • Easy to reason about; may add write overhead and require DBA involvement

Initial snapshot and backfills

  • First, copy the current full state (snapshot)
  • Then, switch to incremental changes using high-water mark or log offsets
  • Plan replays: if the consumer fails, you can reprocess from last persisted offset
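
A minimal sketch of the snapshot-then-incremental handoff (all names and the integer watermarks are illustrative):

```python
def snapshot(source_rows):
    """Step 1: copy full current state and record the starting watermark."""
    target = {r["id"]: dict(r) for r in source_rows}
    watermark = max((r["updated_at"] for r in source_rows), default=0)
    return target, watermark

def apply_increment(target, changes, watermark):
    """Step 2: apply only rows strictly newer than the watermark."""
    for r in changes:
        if r["updated_at"] > watermark:
            target[r["id"]] = dict(r)
    return max([watermark] + [r["updated_at"] for r in changes])

rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
target, wm = snapshot(rows)                                      # wm = 20
wm = apply_increment(target, [{"id": 1, "updated_at": 30}], wm)  # wm = 30
```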

Ordering, idempotency, deduplication

  • Ordering: apply changes in commit order (log-based) or by event time
  • Idempotency: repeatable merges; applying the same event twice yields the same result
  • Deduplication: use keys like (pk, event_sequence or LSN) to keep only the latest per key
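
The deduplication rule can be sketched as a small Python function (the field names `pk` and `seq` are placeholders for your primary key and LSN/sequence number):

```python
def dedupe_latest(events):
    """Keep only the latest event per primary key, ordered by sequence.
    Same effect as ROW_NUMBER() OVER (PARTITION BY pk ORDER BY seq DESC) = 1."""
    latest = {}
    for e in events:
        cur = latest.get(e["pk"])
        if cur is None or e["seq"] > cur["seq"]:
            latest[e["pk"]] = e
    return latest

events = [
    {"pk": 1, "seq": 5, "status": "new"},
    {"pk": 1, "seq": 7, "status": "paid"},
    {"pk": 1, "seq": 7, "status": "paid"},   # duplicate delivery
    {"pk": 2, "seq": 3, "status": "new"},
]
latest = dedupe_latest(events)
```

Because the duplicate's `seq` is not strictly greater, replaying it is a no-op, which is exactly the idempotency property described above.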

Deletes and tombstones

  • Soft delete: mark a flag (deleted=true)
  • Hard delete: emit a delete event and remove downstream rows
  • Compaction: in append-only lakes, periodically merge to a clean table
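
A sketch of soft vs. hard delete handling on a dict-shaped target (the event shape mirrors the op/after convention used in the worked examples below; the helper itself is invented):

```python
def apply_event(state, event, soft=False):
    """Apply one CDC event to a dict keyed by primary key."""
    if event["op"] in ("c", "u"):
        state[event["pk"]] = dict(event["after"], deleted=False)
    elif event["op"] == "d":
        if soft:
            state[event["pk"]]["deleted"] = True   # tombstone flag
        else:
            state.pop(event["pk"], None)           # hard delete

    return state

soft_state = apply_event({}, {"op": "c", "pk": 1, "after": {"name": "widget"}})
soft_state = apply_event(soft_state, {"op": "d", "pk": 1}, soft=True)
hard_state = apply_event({1: {"name": "widget", "deleted": False}}, {"op": "d", "pk": 1})
```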

Schema changes

  • Detect new columns; default them safely downstream
  • For breaking changes (type changes), stage changes and perform backfills
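
For additive drift, a downstream loader can align columns before applying changes. A sketch with sqlite3 (the `align_schema` helper and table are illustrative; as noted above, type changes need staging plus a backfill instead):

```python
import sqlite3

def align_schema(conn, table, incoming_columns):
    """Add any columns present in incoming data but missing downstream,
    defaulting them to NULL so existing rows stay valid."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for col in incoming_columns:
        if col not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} TEXT DEFAULT NULL")
    return sorted(row[1] for row in conn.execute(f"PRAGMA table_info({table})"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
cols = align_schema(conn, "customers", ["customer_id", "name", "loyalty_tier"])
```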

Warehouse/lake application patterns
  • Merge pattern (UPSERT): MERGE target USING staged_changes ON key THEN UPDATE/INSERT; handle delete events
  • Append + periodic compact: append all events; compact into current-state tables
  • SCD2 for history: create new versions on updates, close old versions with valid_to timestamps

Worked examples

Example 1: Incremental by updated_at (PostgreSQL)

-- Assume last_watermark = '2026-01-07T12:00:00Z'
WITH delta AS (
  SELECT *
  FROM public.customers
  WHERE updated_at > TIMESTAMPTZ '2026-01-07 12:00:00+00'
)
-- Stage then merge into warehouse.customers_current
MERGE INTO warehouse.customers_current t
USING delta s
ON (t.customer_id = s.customer_id)
WHEN MATCHED THEN UPDATE SET
  name = s.name,
  email = s.email,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, name, email, updated_at)
VALUES (s.customer_id, s.name, s.email, s.updated_at);
-- Save new watermark = MAX(s.updated_at)

Why this works

We only read rows after the last processed timestamp and then save the new max timestamp as our next start point. A small overlap window (e.g., >= last_watermark minus 2 minutes) plus dedupe can protect against clock skew.

Example 2: Log-based stream with Kafka-style events

-- Events have fields: op (c/u/d), key (order_id), ts, and after/before objects
-- Deduplicate per key (order_id), keeping only the event with the latest ts
WITH ranked AS (
  SELECT e.*, ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) AS rn
  FROM staging.order_events e
)
, latest AS (
  SELECT * FROM ranked WHERE rn = 1
)
MERGE INTO dw.orders t
USING latest s
ON (t.order_id = s.key)
WHEN MATCHED AND s.op IN ('u','c') THEN UPDATE SET
  status = s.after.status,
  amount = s.after.amount,
  updated_at = s.ts
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN NOT MATCHED AND s.op IN ('u','c') THEN INSERT (order_id, status, amount, updated_at)
VALUES (s.key, s.after.status, s.after.amount, s.ts);

Key takeaway

At-least-once delivery is common. Rank and keep the latest per key before you MERGE to avoid duplicate or out-of-order updates.

Example 3: SCD Type 2 in a lakehouse

-- Input: CDC changes for products
-- Goal: keep history of attribute changes
-- Step 1: close the current version when attributes change or the row is deleted
MERGE INTO lake.products_scd2 t
USING staged_changes s
ON (t.product_id = s.product_id AND t.is_current = true)
WHEN MATCHED AND s.op IN ('c','u') AND
     (t.name != s.after.name OR t.price != s.after.price) THEN UPDATE SET
  is_current = false,
  valid_to = s.ts
WHEN MATCHED AND s.op = 'd' THEN UPDATE SET
  is_current = false,
  valid_to = s.ts
WHEN NOT MATCHED AND s.op IN ('c','u') THEN INSERT (
  product_id, name, price, valid_from, valid_to, is_current
) VALUES (
  s.product_id, s.after.name, s.after.price, s.ts, NULL, true
);
-- Step 2: a single MERGE cannot both close a matched row and insert its
-- replacement for the same key, so add new versions for rows closed in step 1
INSERT INTO lake.products_scd2 (product_id, name, price, valid_from, valid_to, is_current)
SELECT s.product_id, s.after.name, s.after.price, s.ts, NULL, true
FROM staged_changes s
JOIN lake.products_scd2 t
  ON t.product_id = s.product_id AND t.valid_to = s.ts
WHERE s.op IN ('c','u');

Why SCD2

You preserve historical versions while marking the latest as current. CDC events become version boundaries.

Step-by-step: build a small CDC pipeline

  1. Pick the capture method: high-water mark (simple) or log-based (more precise).
  2. Plan the initial snapshot: copy full table once; record the starting watermark/offset.
  3. Stage changes: land deltas into a raw/staging area first.
  4. Deduplicate and order: use keys and event timestamps or log sequence numbers.
  5. Apply changes: MERGE/UPSERT into target; include delete handling.
  6. Persist progress: store watermark/offset in a control table.
  7. Monitor: alert on lag, null watermarks, or rising dedupe rates.
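
Steps 1 through 6 can be condensed into a toy batch with sqlite3 (the control, source, and target tables are illustrative, and SQLite's ON CONFLICT upsert stands in for MERGE):

```python
import sqlite3

def run_batch(conn):
    """Stage deltas, dedupe to the latest row per key, upsert into the
    target, then persist the new watermark in the control table."""
    wm = conn.execute(
        "SELECT watermark FROM control WHERE pipeline = 'customers'"
    ).fetchone()[0]
    delta = conn.execute(
        "SELECT customer_id, name, MAX(updated_at) FROM source "
        "WHERE updated_at > ? GROUP BY customer_id",  # dedupe: latest per key
        (wm,),
    ).fetchall()
    for cid, name, ts in delta:
        conn.execute(
            "INSERT INTO target (customer_id, name, updated_at) VALUES (?, ?, ?) "
            "ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name, "
            "updated_at = excluded.updated_at",
            (cid, name, ts),
        )
    if delta:  # persist progress only after a successful apply
        conn.execute(
            "UPDATE control SET watermark = ? WHERE pipeline = 'customers'",
            (max(r[2] for r in delta),),
        )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (customer_id INTEGER, name TEXT, updated_at INTEGER);
    CREATE TABLE target (customer_id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER);
    CREATE TABLE control (pipeline TEXT, watermark INTEGER);
    INSERT INTO control VALUES ('customers', 0);
    INSERT INTO source VALUES (1, 'Ada', 10), (1, 'Ada B.', 20), (2, 'Lin', 15);
""")
run_batch(conn)
run_batch(conn)  # second run is a no-op: the watermark already advanced
rows = conn.execute("SELECT customer_id, name FROM target ORDER BY customer_id").fetchall()
```

Running the batch twice leaves the target unchanged, which is the idempotency property the monitoring step should verify.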

Gotchas to watch
  • Time zones: normalize timestamps to UTC
  • Clock skew: add overlap windows and dedupe
  • Large deletes: batch deletes or mark soft-deletes first
  • Schema drift: default new columns and backfill later

Exercises

These mirror the exercises below. Do them here, then open the solutions only after you have attempted them.

Exercise 1: Design a watermark-based incremental

  • Source: PostgreSQL table public.customers(customer_id PK, name, email, updated_at TIMESTAMPTZ)
  • Task: Draft the SQL to pull deltas using updated_at and a MERGE to apply into dw.customers_current
  • Decide how you store and update the watermark

Exercise 2: Idempotent upsert from CDC events

  • Given events with fields (op, key, ts, after, before) and possible duplicates
  • Task: Show how you dedupe by (key, ts) and MERGE into dw.orders
  • Include delete handling

Self-check checklist
  • You used > (strictly greater than) for the watermark filter
  • You saved the new MAX(updated_at) after a successful load
  • Your MERGE applies deletes correctly
  • You deduplicated by a stable ordering key (ts or LSN)
  • Your pipeline is idempotent if re-run

Common mistakes

  • Using >= watermark instead of > and forgetting dedupe, causing duplicates
  • Ignoring deletes; downstream tables never remove rows
  • Not persisting offsets/watermarks safely; restarts reprocess too much or too little
  • No overlap window; missing late-arriving changes
  • Assuming event time equals commit order; rank carefully when ordering matters

How to self-check
  • Re-run the same batch twice; results must be identical (idempotent)
  • Delete a row upstream; verify downstream is deleted or marked deleted
  • Inject duplicates; verify dedupe removes extras
  • Change schema by adding a column; downstream should not fail hard

Practical projects

  • Build a CDC pipeline that snapshots a source table and then switches to incremental updates every 5 minutes. Track watermarks in a control table.
  • Create a lakehouse current-state table and a history SCD2 table fed from the same CDC stream. Add daily compaction.
  • Add monitoring: a simple dashboard showing lag (minutes since last change), number of deduped events, and last successful watermark.

Learning path

  1. Start with high-water mark incrementals on a single table
  2. Add delete handling and idempotent MERGE patterns
  3. Move to log-based CDC with staging and ordering
  4. Introduce SCD2 and compaction in a data lake
  5. Scale to multiple tables and monitor lag/health

Next steps

  • Implement CDC on one critical table in your environment
  • Add a control table for watermarks/offsets
  • Extend to 3–5 tables and standardize your MERGE templates

Mini challenge

You receive CDC events for invoices where updates can arrive out of order. In one run, you see invoice 123 updated at 10:02 and then an older event at 10:01. Show how you would ensure the 10:02 state wins in your target table without losing history. Hint: rank by event time or sequence and use SCD2 or a current-state MERGE with a WHERE s.ts > t.updated_at guard.
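
One possible sketch of the current-state guard from the hint, in Python (dict-based, with string timestamps purely for the demo):

```python
def apply_current_state(target, events):
    """Apply events in arrival order, but let a row win only if its event
    time is newer than what the target holds: the 's.ts > t.updated_at' guard."""
    for e in events:
        cur = target.get(e["key"])
        if cur is None or e["ts"] > cur["ts"]:
            target[e["key"]] = e
    return target

# Invoice 123: the 10:02 update arrives before the older 10:01 event.
target = {}
apply_current_state(target, [
    {"key": 123, "ts": "10:02", "status": "paid"},
    {"key": 123, "ts": "10:01", "status": "pending"},  # late; must not win
])
```

For the history side, the same guard decides whether a late event opens a new SCD2 version or only backfills a past interval.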

Practice Exercises

2 exercises to complete

Instructions

Source: PostgreSQL public.customers(customer_id PK, name, email, updated_at TIMESTAMPTZ). Target: dw.customers_current.

  • Write a SELECT that fetches only rows after a stored watermark value.
  • Write a MERGE/UPSERT to apply into dw.customers_current.
  • Describe where you will store and update the watermark.

Expected Output
SQL using updated_at > last_watermark; MERGE/UPSERT applying changes; clear plan to persist new MAX(updated_at) in a control table after successful load.

CDC Concepts And Change Capture — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

