Why this matters
As a Data Architect, you must design pipelines that move only changed data reliably. CDC and incremental loads reduce cost, shrink load windows, enable near real-time analytics, and prevent stale dashboards. Real tasks include planning watermark strategies, choosing between log-based CDC and query-based increments, handling deletes, and ensuring idempotent, ordered loads in production.
Typical scenarios you will face
- Migrating an OLTP database to a warehouse without daily full refreshes
- Near real-time event streams feeding operational reporting
- Slowly Changing Dimensions (SCD) for customer profiles
- Backfilling late-arriving events safely
Concept explained simply
Change Data Capture (CDC) means detecting what changed since last time. Incremental loads are the practical process of loading only those changes.
- CDC sources: database logs (e.g., transaction logs), change tables, timestamps, version numbers, or message streams.
- Incremental load: a job that reads changes after a checkpoint (watermark), merges them into targets, and records a new checkpoint.
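As a minimal sketch of that loop, the checkpoint can live in a small control table. The example below reuses the dw.control_watermarks(key, value) table named later in the exercises; everything else (types, the literal timestamp, the 'orders.updated_at' key) is a placeholder, and the syntax assumes a warehouse with standard SQL.

```sql
-- Minimal checkpoint store; one row per source table or partition.
CREATE TABLE dw.control_watermarks (
  key   VARCHAR(200) PRIMARY KEY,   -- e.g., 'orders.updated_at'
  value VARCHAR(200)                -- stored as text, cast on read
);

-- 1) Read the last checkpoint for this source
SELECT value AS last_watermark
FROM dw.control_watermarks
WHERE key = 'orders.updated_at';

-- 2) Extract and merge only the rows changed after last_watermark
--    (shown in the worked examples further down)

-- 3) Advance the checkpoint only after the merge commits
UPDATE dw.control_watermarks
SET value = '2024-05-01 10:00:00'   -- the cutoff this run actually used
WHERE key = 'orders.updated_at';
```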
Mental model
Think of your data flow as a conveyor belt:
- The belt carries only new or changed boxes (rows).
- A gate keeps track of the last box that passed (watermark/checkpoint).
- At the end, a sorter merges boxes into shelves: insert new, update existing, remove deleted.
Deep dive: ETL vs. ELT with CDC
ETL: transform changes before loading to target tables; good for strict governance and curated models. ELT: land raw changes first, then transform inside the warehouse/lake; good for agility and scale. CDC fits both—choose based on cost, latency, and control needs.
Core techniques and patterns
- Watermarks and checkpoints: last processed timestamp, ID, or log position stored durably
- Idempotency: re-running a batch produces the same result (use merge/upsert semantics, deduplication)
- Ordering: apply changes in commit order so stale updates never overwrite newer values (a sketch combining this with deduplication follows this list)
- Delete handling: tombstones, soft-deletes, or periodic reconciliation
- Schema evolution: detect new columns, default values, and backfill strategy
- Backfill and reprocessing: re-run limited windows with safeguards
- Late and out-of-order data: windowed processing and sequence checks
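One common way to get deduplication and ordering together is to collapse each batch to the newest change per business key before merging. The sketch below reuses staging.orders_delta and order_id from the exercises; change_seq is an assumed ordering column standing in for a commit LSN, version number, or updated_at.

```sql
-- Keep only the latest change per key in this batch; re-running the query on
-- the same batch always returns the same rows, which supports idempotent merges.
WITH ranked AS (
  SELECT
    d.*,
    ROW_NUMBER() OVER (
      PARTITION BY d.order_id        -- business key
      ORDER BY d.change_seq DESC     -- newest change first
    ) AS rn
  FROM staging.orders_delta AS d
)
SELECT *
FROM ranked
WHERE rn = 1;
```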
Choosing a CDC method
- Log-based: lowest latency, captures deletes and before/after values; needs access to database logs
- Timestamp/Version-based: easy to start; ensure reliable update timestamps and monotonicity
- Change tables/Triggers: explicit change records; adds some write overhead
- Message streams: events emitted by source apps; schema discipline required
Worked examples
Example 1: Timestamp-based incremental load with watermark
Scenario: Source table orders has updated_at and soft-deletes via is_deleted.
- Read last_watermark from control table
- SELECT rows WHERE updated_at > last_watermark AND updated_at <= current_cutoff
- MERGE into warehouse.orders on business key order_id: insert new, update changed, set deleted_at if is_deleted=1
- Update last_watermark to current_cutoff
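A sketch of these four steps in SQL, assuming a warehouse with ANSI MERGE and the dw.control_watermarks control table; the :parameters, the date arithmetic, and any column beyond order_id, updated_at, and is_deleted are illustrative and vary by dialect.

```sql
-- Step 1: read the checkpoint and choose a cutoff slightly in the past.
--   :last_watermark  = SELECT value FROM dw.control_watermarks WHERE key = 'orders.updated_at'
--   :current_cutoff  = CURRENT_TIMESTAMP - INTERVAL '1' MINUTE

-- Steps 2 and 3: merge only the changed window into the target.
MERGE INTO warehouse.orders AS t
USING (
  SELECT order_id, total, status, is_deleted, updated_at
  FROM source.orders
  WHERE updated_at >  :last_watermark
    AND updated_at <= :current_cutoff
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  t.total      = s.total,
  t.status     = s.status,
  t.updated_at = s.updated_at,
  t.deleted_at = CASE WHEN s.is_deleted = 1 THEN s.updated_at ELSE NULL END
WHEN NOT MATCHED THEN INSERT
  (order_id, total, status, updated_at, deleted_at)
  VALUES (s.order_id, s.total, s.status, s.updated_at,
          CASE WHEN s.is_deleted = 1 THEN s.updated_at ELSE NULL END);

-- Step 4: advance the watermark only after the merge succeeds.
UPDATE dw.control_watermarks
SET value = :current_cutoff
WHERE key = 'orders.updated_at';
```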
Notes
- Use a cutoff (e.g., now minus 1 minute) so clock skew and still-committing transactions do not cause missed rows
- If timestamps can go backward, use log-based CDC or version numbers
Example 2: Log-based CDC to bronze/silver
Scenario: Database emits change events (insert/update/delete) with commit_lsn and row payload.
- Ingest raw events to bronze with commit_lsn
- Deduplicate by (table, primary_key, commit_lsn)
- Apply in commit order to silver tables using MERGE; deletes become tombstones or hard deletes depending on policy
- Downstream models (gold) use type-2 dimensions where needed
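A sketch of the apply step, assuming the bronze changes for one table land in a bronze.orders_changes table with op ('I'/'U'/'D') and commit_lsn columns (all names and the :last_commit_lsn parameter are illustrative), and that the warehouse's MERGE accepts conditional WHEN clauses, as most modern warehouses do.

```sql
-- Apply the newest change per key from bronze to silver, honoring commit order.
MERGE INTO silver.orders AS t
USING (
  SELECT *
  FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (
             PARTITION BY c.order_id
             ORDER BY c.commit_lsn DESC
           ) AS rn
    FROM bronze.orders_changes AS c
    WHERE c.commit_lsn > :last_commit_lsn      -- per-table checkpoint
  ) AS latest
  WHERE latest.rn = 1
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'D' THEN
  DELETE                                        -- or mark a tombstone, per policy
WHEN MATCHED THEN
  UPDATE SET t.total = s.total, t.status = s.status
WHEN NOT MATCHED AND s.op <> 'D' THEN
  INSERT (order_id, total, status)
  VALUES (s.order_id, s.total, s.status);

-- After success, store MAX(commit_lsn) from this batch as the new last_commit_lsn.
```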
Notes
- Maintain last_commit_lsn checkpoint per table
- Ensure exactly-once or effective idempotency via keys and MERGE conditions
Example 3: SCD Type 2 for customer profile
Scenario: Changes to customer tier should preserve history.
- Identify changed rows since watermark
- For changed keys, close current records (set valid_to = change_time, is_current = 0)
- Insert new records with valid_from = change_time, valid_to = null, is_current = 1
- Handle deletes by closing records or adding a deleted flag
Tip
Use a hash of business attributes to detect meaningful changes and avoid churn on non-business fields, as in the sketch below.
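A sketch of the close-and-insert pattern combined with the hash tip. It assumes a dim.customer table with (customer_id, tier, status, attr_hash, valid_from, valid_to, is_current), a staging.customer_delta table carrying the changed rows with a change_time column, and an available MD5 function; all of these names are illustrative, and UPDATE ... FROM syntax varies by warehouse.

```sql
-- Step 1: close current versions whose business attributes actually changed.
UPDATE dim.customer AS d
SET valid_to   = s.change_time,
    is_current = 0
FROM staging.customer_delta AS s
WHERE d.customer_id = s.customer_id
  AND d.is_current = 1
  AND d.attr_hash <> MD5(CONCAT(COALESCE(s.tier, ''), '|', COALESCE(s.status, '')));

-- Step 2: insert new current versions for keys that no longer have an open row
-- (covers both brand-new customers and the versions closed in step 1).
INSERT INTO dim.customer
  (customer_id, tier, status, attr_hash, valid_from, valid_to, is_current)
SELECT s.customer_id, s.tier, s.status,
       MD5(CONCAT(COALESCE(s.tier, ''), '|', COALESCE(s.status, ''))),
       s.change_time, NULL, 1
FROM staging.customer_delta AS s
LEFT JOIN dim.customer AS d
  ON d.customer_id = s.customer_id
 AND d.is_current = 1
WHERE d.customer_id IS NULL;
```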
Design checklist
- Define CDC method per source (log, timestamp, change tables, stream)
- Choose business keys and deduplication rules
- Define delete policy (tombstone, soft-delete, hard delete)
- Store checkpoints per table/partition consistently
- Ensure idempotent MERGE/UPSERT semantics
- Plan for schema evolution and null/default handling
- Set latency targets and acceptable delay buffers
- Create backfill and replay procedures
- Monitor lag, error rates, and change volume
Exercises
Do these now. Then check your answers below or in the Quick Test.
- Exercise 1: Design a CDC plan for a SaaS Orders database. Tables: customers(id, email, status, updated_at), orders(id, customer_id, total, status, updated_at), order_items(id, order_id, sku, qty, updated_at). Requirements: hourly loads to a warehouse, accurate deletes, and ability to backfill one week.
- Exercise 2: Write SQL for an incremental MERGE from staging.orders_delta to dw.orders using a watermark table dw.control_watermarks(key, value). Use order_id as key, handle updates and soft-deletes via is_deleted, and advance the watermark safely.
Checklist to self-verify
- Did you pick a CDC method per table and justify it?
- Is your checkpoint per table and durable?
- Can the load be re-run without duplicates or missed rows?
- How are deletes represented in targets?
- How will you backfill a specific past window?
Common mistakes and self-check
- Relying on unreliable timestamps: ensure they update on every change or use log-based CDC
- No cutoff window: risk of missing late transactions; use a small delay buffer
- Non-idempotent loads: missing MERGE logic or unique keys; test re-runs
- Ignoring deletes: leads to inflated metrics; define tombstone/soft-delete rules
- Skipping ordering: applying changes out of commit order creates inconsistencies
- Forgetting schema evolution: new columns arriving cause failures; use safe defaults
Quick self-audit
- Can you stop a job mid-run and safely resume?
- If the same change arrives twice, does the result remain correct?
- How do you detect and correct gaps in processing? (one possible check is sketched below)
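One possible gap check, using the same illustrative names as Example 1 (source.orders, warehouse.orders, dw.control_watermarks): any source row at or before the current watermark that is missing from the target indicates a gap. Zero rows is the expected result.

```sql
-- Source rows the pipeline claims to have covered (<= watermark) but that
-- never reached the target. Investigate any row this returns.
SELECT s.order_id, s.updated_at
FROM source.orders AS s
LEFT JOIN warehouse.orders AS t
  ON t.order_id = s.order_id
WHERE s.updated_at <= (
        SELECT CAST(value AS TIMESTAMP)
        FROM dw.control_watermarks
        WHERE key = 'orders.updated_at'
      )
  AND t.order_id IS NULL;
```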
Practical projects
- Build a mini pipeline: source table with updates and deletes, a staging delta table, a MERGE into a target, and a control watermark (starter DDL is sketched after this list)
- Create an SCD2 dimension for customers with valid_from/valid_to
- Simulate late-arriving rows and prove your pipeline remains consistent
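A starter DDL sketch for the first project, using unqualified table names so it runs in a single-schema sandbox; types and names are illustrative.

```sql
-- The "OLTP" table you will insert into, update, and soft-delete by hand.
CREATE TABLE source_orders (
  order_id   INTEGER PRIMARY KEY,
  total      DECIMAL(12,2),
  status     VARCHAR(50),
  is_deleted SMALLINT DEFAULT 0,
  updated_at TIMESTAMP
);

-- Changed rows extracted each run (rows with updated_at inside the load window).
CREATE TABLE staging_orders_delta (
  order_id   INTEGER,
  total      DECIMAL(12,2),
  status     VARCHAR(50),
  is_deleted SMALLINT,
  updated_at TIMESTAMP
);

-- Merge target; deletes become deleted_at timestamps.
CREATE TABLE dw_orders (
  order_id   INTEGER PRIMARY KEY,
  total      DECIMAL(12,2),
  status     VARCHAR(50),
  deleted_at TIMESTAMP,
  updated_at TIMESTAMP
);

-- One checkpoint row per source table.
CREATE TABLE control_watermarks (
  key   VARCHAR(200) PRIMARY KEY,
  value VARCHAR(200)
);
```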
Mini challenge
Your pipeline processes updates within 5 minutes, but finance reports show occasional double-counting. Hypothesize two root causes related to CDC and describe a concrete test to confirm each.
Possible directions
- Duplicate delivery not handled by MERGE key
- Out-of-order updates overwriting newer values (test sketches for both directions follow below)
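Two test sketches matching those directions, reusing the illustrative warehouse.orders and source.orders names from Example 1.

```sql
-- Test 1: duplicate delivery. More than one target row per business key means
-- redelivered changes were inserted instead of merged.
SELECT order_id, COUNT(*) AS row_count
FROM warehouse.orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Test 2: out-of-order application. A target row older than the newest source
-- version suggests a stale update overwrote a newer one; exclude rows still
-- inside the current load window to avoid false positives from normal lag.
SELECT t.order_id,
       t.updated_at     AS target_version,
       s.max_updated_at AS source_version
FROM warehouse.orders AS t
JOIN (
  SELECT order_id, MAX(updated_at) AS max_updated_at
  FROM source.orders
  GROUP BY order_id
) AS s
  ON s.order_id = t.order_id
WHERE t.updated_at < s.max_updated_at;
```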
Who this is for
Data Architects, Analytics Engineers, and Data Engineers designing reliable, scalable data integration pipelines.
Prerequisites
- Comfort with SQL (SELECT, JOIN, MERGE/UPSERT)
- Basic understanding of OLTP vs. OLAP
- Familiarity with batch vs. streaming concepts
Learning path
- Understand CDC methods and choose per source
- Design watermarks and idempotent merges
- Handle deletes and SCD patterns
- Manage schema evolution and backfills
- Add monitoring and data quality checks
Next steps
- Implement the exercises in a sandbox database
- Add alerts for lag and failed merges
- Extend to streaming CDC for lower latency