Who this is for
Data Engineers and platform-minded developers who need to move data reliably from OLTP systems into warehouses or data lakes without full reloads.
Prerequisites
- Comfort with SQL (SELECT, JOIN, WHERE, MERGE/UPSERT)
- Basic understanding of relational databases and transaction logs
- Familiarity with batch vs. streaming concepts
Why this matters
Real teams use CDC to:
- Keep analytics tables fresh without expensive full-table scans
- Replicate production DB changes to a warehouse or lake within minutes
- Feed downstream microservices with reliable change streams
- Enable point-in-time recovery or audit trails
Real tasks you will do
- Design a high-water mark incremental load for a PostgreSQL table with millions of rows
- Configure log-based CDC to capture inserts/updates/deletes
- Apply idempotent merges so replays don’t corrupt data
- Handle schema changes and backfills safely
Concept explained simply
Change Data Capture (CDC) is a way to move only what changed since last time. Instead of copying an entire table, you track inserts, updates, and deletes and apply them downstream.
Mental model
Think of your source table as a diary. CDC is copying only new diary entries since the last page you copied. You bookmark the last page number (watermark) and continue from there next time. If your diary also shows edits (transaction log), you can replay exactly what happened in the right order.
Three common CDC styles
- Query-based incremental: SELECT rows where updated_at > last_watermark
- Log-based CDC: read the database transaction log (e.g., binlog/LSN) for exact row-level changes
- Trigger-based: DB triggers write changes into side tables (simple, but adds load and complexity)
Key patterns and options
Query-based incremental (high‑water mark)
- Requires a monotonic column (e.g., updated_at or an increasing ID)
- Store the last processed value; fetch strictly greater (>) to avoid duplicates
- Handle clock skew and late updates with a small overlap window and dedupe downstream
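The bullets above can be sketched in a few lines. This is a minimal, in-memory illustration (the table rows, column names, and 2-minute overlap are made up for the example): fetch everything newer than the watermark minus an overlap window, then dedupe by primary key so the re-read rows do not become duplicates.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory "source table": rows with a primary key and updated_at.
ROWS = [
    {"id": 1, "updated_at": datetime(2026, 1, 7, 11, 59)},
    {"id": 2, "updated_at": datetime(2026, 1, 7, 12, 1)},
    {"id": 3, "updated_at": datetime(2026, 1, 7, 12, 5)},
]

OVERLAP = timedelta(minutes=2)  # re-read a small window to tolerate clock skew


def fetch_delta(rows, last_watermark):
    """Fetch rows changed since the watermark, minus an overlap window.

    The overlap deliberately re-reads recently seen rows; the caller must
    dedupe (e.g., by primary key, keeping the newest updated_at).
    """
    cutoff = last_watermark - OVERLAP
    return [r for r in rows if r["updated_at"] > cutoff]


def dedupe_latest(rows):
    """Keep only the newest version of each primary key."""
    latest = {}
    for r in sorted(rows, key=lambda r: r["updated_at"]):
        latest[r["id"]] = r  # later (newer) rows overwrite earlier ones
    return list(latest.values())


last_watermark = datetime(2026, 1, 7, 12, 0)
delta = dedupe_latest(fetch_delta(ROWS, last_watermark))
new_watermark = max(r["updated_at"] for r in delta)  # persist for the next run
```

The same shape works against a real database: the `fetch_delta` filter becomes the `WHERE updated_at > ...` clause, and the dedupe happens in staging before the MERGE.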
Log-based CDC
- Reads the database’s change log for exact inserts/updates/deletes in order
- Gives low latency and correct ordering; supports deletes well
- Needs connectors and careful offset management (at-least-once is typical)
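The offset-management point deserves a concrete sketch. This is not any real connector's API, just an illustration of the at-least-once discipline: commit the offset only after the batch is durably applied, and make the apply step tolerate replays.

```python
# Simulated change events with log sequence numbers (LSNs), in commit order.
events = [
    {"lsn": 101, "op": "c", "key": 1, "after": {"v": "a"}},
    {"lsn": 102, "op": "u", "key": 1, "after": {"v": "b"}},
    {"lsn": 103, "op": "d", "key": 1, "after": None},
]

target = {}           # downstream current-state table (key -> row)
committed_offset = 0  # last durably applied LSN


def apply_batch(batch):
    """Apply a batch of log events; safe to call again with the same batch."""
    global committed_offset
    for e in batch:
        if e["lsn"] <= committed_offset:
            continue  # replayed event from a restart: skip (idempotent)
        if e["op"] in ("c", "u"):
            target[e["key"]] = e["after"]
        elif e["op"] == "d":
            target.pop(e["key"], None)
    # Commit the offset only AFTER the batch is applied; a crash before this
    # line replays the batch instead of silently losing it.
    committed_offset = batch[-1]["lsn"]


apply_batch(events)
apply_batch(events)  # replay of the same batch is a no-op
```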
Trigger-based CDC
- DB triggers capture changes into a delta table
- Easy to reason about; may add write overhead and require DBA involvement
Initial snapshot and backfills
- First, copy the current full state (snapshot)
- Then, switch to incremental changes using high-water mark or log offsets
- Plan replays: if the consumer fails, you can reprocess from last persisted offset
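One subtlety in the snapshot-then-incremental handoff is worth showing: record the starting watermark before the copy begins, so rows that change during a long snapshot are re-read by the first incremental run rather than lost. A minimal in-memory sketch (table and column names are illustrative):

```python
from datetime import datetime

source = {
    1: {"id": 1, "updated_at": datetime(2026, 1, 1)},
    2: {"id": 2, "updated_at": datetime(2026, 1, 5)},
}


def run_snapshot(source, now):
    """Full copy of current state; returns the copy and the starting watermark."""
    start_watermark = now  # captured BEFORE the copy begins
    target = {k: dict(v) for k, v in source.items()}
    return target, start_watermark


target, watermark = run_snapshot(source, now=datetime(2026, 1, 6))

# A row changes while/after the snapshot...
source[2]["updated_at"] = datetime(2026, 1, 7)
source[2]["name"] = "updated"

# ...and the first incremental run picks it up from the saved watermark.
delta = [r for r in source.values() if r["updated_at"] > watermark]
for r in delta:
    target[r["id"]] = dict(r)
```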
Ordering, idempotency, deduplication
- Ordering: apply changes in commit order (log-based) or by event time
- Idempotency: repeatable merges; applying the same event twice yields the same result
- Deduplication: use keys like (pk, event_sequence or LSN) to keep only the latest per key
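The deduplication rule above is a one-pass reduction: per primary key, keep only the event with the highest sequence number. A small sketch with made-up order events:

```python
# At-least-once delivery means the same event (or an older one) can arrive
# again; keep only the highest LSN per primary key before merging.
events = [
    {"pk": "o-1", "lsn": 5, "status": "paid"},
    {"pk": "o-1", "lsn": 3, "status": "pending"},  # older event: must lose
    {"pk": "o-2", "lsn": 4, "status": "shipped"},
    {"pk": "o-2", "lsn": 4, "status": "shipped"},  # exact duplicate: absorbed
]


def latest_per_key(events):
    """Return the winning (highest-LSN) event for each primary key."""
    best = {}
    for e in events:
        cur = best.get(e["pk"])
        if cur is None or e["lsn"] > cur["lsn"]:
            best[e["pk"]] = e
    return best


latest = latest_per_key(events)
```

This is the procedural equivalent of the `ROW_NUMBER() OVER (PARTITION BY key ORDER BY ...)` pattern used in the SQL examples later in this section.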
Deletes and tombstones
- Soft delete: mark a flag (deleted=true)
- Hard delete: emit a delete event and remove downstream rows
- Compaction: in append-only lakes, periodically merge to a clean table
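Compaction ties these bullets together: replay the append-only log in commit order, let the last write per key win, and treat a delete event as a tombstone that removes the key from the compacted view. A minimal sketch (event fields are illustrative):

```python
# Append-only event log; "d" is a tombstone (hard delete).
log = [
    {"seq": 1, "op": "c", "key": "a", "row": {"qty": 1}},
    {"seq": 2, "op": "u", "key": "a", "row": {"qty": 2}},
    {"seq": 3, "op": "c", "key": "b", "row": {"qty": 9}},
    {"seq": 4, "op": "d", "key": "a", "row": None},
]


def compact(log):
    """Collapse an append-only log into a clean current-state table."""
    state = {}
    for e in sorted(log, key=lambda e: e["seq"]):  # apply in commit order
        if e["op"] == "d":
            state.pop(e["key"], None)  # tombstone removes the key
        else:
            state[e["key"]] = e["row"]
    return state


current = compact(log)
```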
Schema changes
- Detect new columns; default them safely downstream
- For breaking changes (type changes), stage changes and perform backfills
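A simple way to "default new columns safely" is to conform each incoming record to the target's known columns and quarantine anything unknown for later review, instead of failing the load. The column names and defaults below are made up for illustration:

```python
# Target schema with per-column defaults; None means "nullable".
TARGET_COLUMNS = {"id": None, "name": None, "email": None, "segment": "unknown"}


def conform(event):
    """Shape an event to the target schema; return (row, unexpected columns)."""
    row = {col: event.get(col, default) for col, default in TARGET_COLUMNS.items()}
    extras = {k: v for k, v in event.items() if k not in TARGET_COLUMNS}
    return row, extras  # extras can be logged or staged for a later backfill


# Source has dropped "email" and added a new "plan" column.
row, extras = conform({"id": 7, "name": "Ada", "plan": "pro"})
```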
Warehouse/lake application patterns
- Merge pattern (UPSERT): MERGE target USING staged_changes ON key THEN UPDATE/INSERT; handle delete events
- Append + periodic compact: append all events; compact into current-state tables
- SCD2 for history: create new versions on updates, close old versions with valid_to timestamps
Worked examples
Example 1: Incremental by updated_at (PostgreSQL)
-- Assume last_watermark = '2026-01-07T12:00:00Z'
WITH delta AS (
SELECT *
FROM public.customers
WHERE updated_at > TIMESTAMP '2026-01-07 12:00:00+00'
)
-- Stage then merge into warehouse.customers_current
MERGE INTO warehouse.customers_current t
USING delta s
ON (t.customer_id = s.customer_id)
WHEN MATCHED THEN UPDATE SET
name = s.name,
email = s.email,
updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, name, email, updated_at)
VALUES (s.customer_id, s.name, s.email, s.updated_at);
-- Save new watermark = MAX(s.updated_at)
Why this works
We read only rows after the last processed timestamp, then save the new maximum timestamp as the next starting point. A small overlap window (e.g., >= last_watermark minus 2 minutes) plus downstream dedupe protects against clock skew and late-arriving updates.
Example 2: Log-based stream with Kafka-style events
-- Events have fields: op (c/u/d), key (order_id), ts, and after/before objects
-- Deduplicate by (order_id, ts) keeping the latest ts per key
WITH ranked AS (
SELECT e.*, ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) AS rn
FROM staging.order_events e
)
, latest AS (
SELECT * FROM ranked WHERE rn = 1
)
MERGE INTO dw.orders t
USING latest s
ON (t.order_id = s.key)
WHEN MATCHED AND s.op IN ('u','c') THEN UPDATE SET
status = s.after.status,
amount = s.after.amount,
updated_at = s.ts
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN NOT MATCHED AND s.op IN ('u','c') THEN INSERT (order_id, status, amount, updated_at)
VALUES (s.key, s.after.status, s.after.amount, s.ts);
Key takeaway
At-least-once delivery is common. Rank and keep the latest event per key before you MERGE to avoid duplicate or out-of-order updates.
Example 3: SCD Type 2 in a lakehouse
-- Input: CDC changes for products
-- Goal: keep history of attribute changes
-- A single MERGE cannot both close the old version and insert the new one
-- for the same matched row, so feed each changed row in twice: once with a
-- real merge_key (closes the old version) and once with a NULL merge_key
-- (inserts the new version, since NULL never matches).
MERGE INTO lake.products_scd2 t
USING (
  SELECT s.product_id AS merge_key, s.* FROM staged_changes s
  UNION ALL
  SELECT NULL AS merge_key, s.*
  FROM staged_changes s
  JOIN lake.products_scd2 cur
    ON s.product_id = cur.product_id AND cur.is_current = true
  WHERE s.op = 'u'
    AND (cur.name != s.after.name OR cur.price != s.after.price)
) s
ON (t.product_id = s.merge_key AND t.is_current = true)
WHEN MATCHED AND s.op = 'u' AND
  (t.name != s.after.name OR t.price != s.after.price) THEN UPDATE SET
  is_current = false,
  valid_to = s.ts
WHEN MATCHED AND s.op = 'd' THEN UPDATE SET
  is_current = false,
  valid_to = s.ts
WHEN NOT MATCHED AND s.op IN ('c','u') THEN INSERT (
  product_id, name, price, valid_from, valid_to, is_current
) VALUES (
  s.product_id, s.after.name, s.after.price, s.ts, NULL, true
);
Why SCD2
You preserve historical versions while marking the latest as current. CDC events become version boundaries.
Step-by-step: build a small CDC pipeline
- Pick the capture method: high-water mark (simple) or log-based (more precise).
- Plan the initial snapshot: copy full table once; record the starting watermark/offset.
- Stage changes: land deltas into a raw/staging area first.
- Deduplicate and order: use keys and event timestamps or log sequence numbers.
- Apply changes: MERGE/UPSERT into target; include delete handling.
- Persist progress: store watermark/offset in a control table.
- Monitor: alert on lag, null watermarks, or rising dedupe rates.
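The steps above can be tied together in one small loop. This is an in-memory sketch, not production code: the control table, overlap window, and soft-delete flag are all illustrative, but the ordering of the steps (stage, dedupe, merge, then persist the watermark only on success) is the part that matters.

```python
from datetime import datetime, timedelta

control = {"customers": datetime(2026, 1, 7, 12, 0)}  # watermark control table
target = {}                                           # dw.customers_current


def run_batch(source_rows, table="customers", overlap=timedelta(minutes=2)):
    wm = control[table]
    # 1) Stage: pull deltas with an overlap window for clock skew.
    staged = [r for r in source_rows if r["updated_at"] > wm - overlap]
    if not staged:
        return
    # 2) Dedupe: keep only the newest version per primary key.
    latest = {}
    for r in sorted(staged, key=lambda r: r["updated_at"]):
        latest[r["id"]] = r
    # 3) Merge: upsert, with delete handling.
    for r in latest.values():
        if r.get("deleted"):
            target.pop(r["id"], None)
        else:
            target[r["id"]] = r
    # 4) Persist progress only after a successful apply.
    control[table] = max(r["updated_at"] for r in staged)


rows = [
    {"id": 1, "updated_at": datetime(2026, 1, 7, 12, 3)},
    {"id": 1, "updated_at": datetime(2026, 1, 7, 12, 4), "deleted": True},
    {"id": 2, "updated_at": datetime(2026, 1, 7, 12, 5)},
]
run_batch(rows)
run_batch(rows)  # re-running the same batch leaves the same end state
```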
Gotchas to watch
- Time zones: normalize timestamps to UTC
- Clock skew: add overlap windows and dedupe
- Large deletes: batch deletes or mark soft-deletes first
- Schema drift: default new columns and backfill later
Exercises
Attempt each exercise on your own before opening any solutions.
Exercise 1: Design a watermark-based incremental
- Source: PostgreSQL table public.customers(customer_id PK, name, email, updated_at TIMESTAMPTZ)
- Task: Draft the SQL to pull deltas using updated_at and a MERGE to apply into dw.customers_current
- Decide how you store and update the watermark
Exercise 2: Idempotent upsert from CDC events
- Given events with fields (op, key, ts, after, before) and possible duplicates
- Task: Show how you dedupe by (key, ts) and MERGE into dw.orders
- Include delete handling
Self-check checklist
- You used > (strictly greater than) for the watermark filter
- You saved the new MAX(updated_at) after a successful load
- Your MERGE applies deletes correctly
- You deduplicated by a stable ordering key (ts or LSN)
- Your pipeline is idempotent if re-run
Common mistakes
- Using >= watermark instead of > and forgetting dedupe, causing duplicates
- Ignoring deletes; downstream tables never remove rows
- Not persisting offsets/watermarks safely; restarts reprocess too much or too little
- No overlap window; missing late-arriving changes
- Assuming event time equals commit order; rank carefully when ordering matters
How to self-check
- Re-run the same batch twice; results must be identical (idempotent)
- Delete a row upstream; verify downstream is deleted or marked deleted
- Inject duplicates; verify dedupe removes extras
- Change schema by adding a column; downstream should not fail hard
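The first three checks can be automated as a tiny harness. The `apply_batch` function here is a stand-in for your real load step (its event fields are illustrative); the point is the test shape: run the same batch twice, inject a duplicate, and assert nothing changes.

```python
import copy


def apply_batch(target, batch):
    """Stand-in for the real load: newest event per key wins, deletes propagate."""
    for e in batch:
        if e["op"] == "d":
            target.pop(e["key"], None)
        else:
            prev = target.get(e["key"])
            if prev is None or e["ts"] >= prev["ts"]:
                target[e["key"]] = e


batch = [
    {"op": "c", "key": 1, "ts": 10, "v": "a"},
    {"op": "u", "key": 1, "ts": 11, "v": "b"},
    {"op": "u", "key": 1, "ts": 11, "v": "b"},  # injected duplicate
    {"op": "d", "key": 2, "ts": 12},            # upstream delete
]

target = {2: {"op": "c", "key": 2, "ts": 5, "v": "x"}}
apply_batch(target, batch)
first = copy.deepcopy(target)

apply_batch(target, batch)   # re-run the same batch
assert target == first       # idempotent: identical results
assert 2 not in target       # delete propagated downstream
```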
Practical projects
- Build a CDC pipeline that snapshots a source table and then switches to incremental updates every 5 minutes. Track watermarks in a control table.
- Create a lakehouse current-state table and a history SCD2 table fed from the same CDC stream. Add daily compaction.
- Add monitoring: a simple dashboard showing lag (minutes since last change), number of deduped events, and last successful watermark.
Learning path
- Start with high-water mark incrementals on a single table
- Add delete handling and idempotent MERGE patterns
- Move to log-based CDC with staging and ordering
- Introduce SCD2 and compaction in a data lake
- Scale to multiple tables and monitor lag/health
Next steps
- Implement CDC on one critical table in your environment
- Add a control table for watermarks/offsets
- Extend to 3–5 tables and standardize your MERGE templates
Mini challenge
You receive CDC events for invoices where updates can arrive out of order. In one run, you see invoice 123 updated at 10:02 and then an older event at 10:01. Show how you would ensure the 10:02 state wins in your target table without losing history. Hint: rank by event time or sequence and use SCD2 or a current-state MERGE with a WHERE s.ts > t.updated_at guard.