CDC And Incremental Loads

Learn CDC And Incremental Loads for free with explanations, exercises, and a quick test (for Data Architects).

Published: January 18, 2026 | Updated: January 18, 2026

Why this matters

As a Data Architect, you must design pipelines that move only changed data, and move it reliably. CDC and incremental loads reduce cost, shrink load windows, enable near real-time analytics, and prevent stale dashboards. Real tasks include planning watermark strategies, choosing between log-based CDC and query-based incremental extracts, handling deletes, and ensuring idempotent, ordered loads in production.

Typical scenarios you will face
  • Migrating an OLTP database to a warehouse without daily full refreshes
  • Near real-time event streams feeding operational reporting
  • Slowly Changing Dimensions (SCD) for customer profiles
  • Backfilling late-arriving events safely

Concept explained simply

Change Data Capture (CDC) means detecting what has changed in a source since the last load. An incremental load is the practical process of loading only those changes.

  • CDC sources: database logs (e.g., transaction logs), change tables, timestamps, version numbers, or message streams.
  • Incremental load: a job that reads changes after a checkpoint (watermark), merges them into targets, and records a new checkpoint.
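
As a minimal sketch of that loop in SQL, assuming a hypothetical control table etl_watermarks(job_name, last_value), a source table orders with an updated_at column, and bind parameters for the checkpoint values:

  -- Read the checkpoint recorded by the previous run
  SELECT last_value
  FROM etl_watermarks
  WHERE job_name = 'orders_incremental';

  -- Pull only rows changed after that checkpoint, up to a fixed cutoff
  -- (:last_watermark is the value read above; :cutoff is e.g. now() minus a small buffer,
  --  and it becomes the new checkpoint once the load succeeds)
  SELECT *
  FROM orders
  WHERE updated_at >  :last_watermark
    AND updated_at <= :cutoff;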

Mental model

Think of your data flow as a conveyor belt:

  • The belt carries only new or changed boxes (rows).
  • A gate keeps track of the last box that passed (watermark/checkpoint).
  • At the end, a sorter merges boxes into shelves: insert new, update existing, remove deleted.

Deep dive: ETL vs. ELT with CDC

  • ETL: transform changes before loading to target tables; good for strict governance and curated models.
  • ELT: land raw changes first, then transform inside the warehouse/lake; good for agility and scale.

CDC fits both; choose based on cost, latency, and control needs.

Core techniques and patterns

  • Watermarks and checkpoints: last processed timestamp, ID, or log position stored durably
  • Idempotency: re-running a batch produces the same result (use merge/upsert semantics and deduplication; see the sketch after this list)
  • Ordering: process changes in commit order to avoid write conflicts
  • Delete handling: tombstones, soft-deletes, or periodic reconciliation
  • Schema evolution: detect new columns, default values, and backfill strategy
  • Backfill and reprocessing: re-run limited windows with safeguards
  • Late and out-of-order data: windowed processing and sequence checks
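
The following sketch combines the deduplication, ordering, and idempotency points above: keep only the newest change per business key before merging, assuming staged changes carry a monotonically increasing change_seq (a log position or version number; all object names here are hypothetical):

  -- Keep only the newest change per business key, so a duplicate delivery or a re-run
  -- can never apply an older change on top of a newer one
  WITH ranked AS (
    SELECT
      c.*,
      ROW_NUMBER() OVER (
        PARTITION BY c.order_id
        ORDER BY c.change_seq DESC
      ) AS rn
    FROM staging.orders_changes AS c
  )
  SELECT *
  FROM ranked
  WHERE rn = 1;   -- feed this result into an idempotent MERGE/UPSERT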

Choosing a CDC method
  • Log-based: lowest latency, captures deletes and before/after values; needs access to database logs
  • Timestamp/Version-based: easy to start; ensure reliable update timestamps and monotonicity
  • Change tables/Triggers: explicit change records; adds some write overhead (a trigger sketch follows this list)
  • Message streams: events emitted by source apps; schema discipline required
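
To make the change-table option concrete, here is a PostgreSQL-style trigger sketch; the table, function, and column names are illustrative assumptions, and other databases use different syntax:

  -- Hypothetical change table populated by a row-level trigger (PostgreSQL syntax)
  CREATE TABLE orders_changes (
    change_id  BIGSERIAL PRIMARY KEY,
    op         CHAR(1)     NOT NULL,              -- 'I', 'U', or 'D'
    order_id   BIGINT      NOT NULL,
    payload    JSONB,                             -- row image after the change (NULL for deletes)
    changed_at TIMESTAMPTZ NOT NULL DEFAULT now()
  );

  CREATE OR REPLACE FUNCTION capture_order_change() RETURNS trigger AS $$
  BEGIN
    IF TG_OP = 'DELETE' THEN
      INSERT INTO orders_changes (op, order_id, payload) VALUES ('D', OLD.id, NULL);
      RETURN OLD;
    ELSE
      INSERT INTO orders_changes (op, order_id, payload)
      VALUES (CASE WHEN TG_OP = 'INSERT' THEN 'I' ELSE 'U' END, NEW.id, to_jsonb(NEW));
      RETURN NEW;
    END IF;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER trg_orders_cdc
  AFTER INSERT OR UPDATE OR DELETE ON orders
  FOR EACH ROW EXECUTE FUNCTION capture_order_change();

Every insert, update, or delete on orders now also writes a change row, which is the write overhead noted in the list; the loader then reads orders_changes incrementally by change_id.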

Worked examples

Example 1: Timestamp-based incremental load with watermark

Scenario: Source table orders has updated_at and soft-deletes via is_deleted.

  1. Read last_watermark from control table
  2. SELECT rows WHERE updated_at > last_watermark AND updated_at <= current_cutoff
  3. MERGE into warehouse.orders on business key order_id: insert new, update changed, set deleted_at if is_deleted=1 (see the SQL sketch after the notes)
  4. Update last_watermark to current_cutoff

Notes
  • Use a cutoff (e.g., now minus 1 minute) to avoid clock skew
  • If timestamps can go backward, use log-based CDC or version numbers
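
A sketch of steps 2-4 in SQL, assuming a MERGE-capable engine, a control table dw.control_watermarks(key, value), and illustrative source/target column names; :last_watermark and :cutoff are bind parameters:

  -- Step 2: stage the rows changed in this window
  CREATE TEMPORARY TABLE orders_delta AS
  SELECT *
  FROM src.orders
  WHERE updated_at >  :last_watermark
    AND updated_at <= :cutoff;

  -- Step 3: idempotent upsert with soft-delete handling
  MERGE INTO dw.orders AS t
  USING orders_delta AS s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET
    customer_id = s.customer_id,
    total       = s.total,
    status      = s.status,
    updated_at  = s.updated_at,
    deleted_at  = CASE WHEN s.is_deleted = 1 THEN s.updated_at END
  WHEN NOT MATCHED THEN INSERT
    (order_id, customer_id, total, status, updated_at, deleted_at)
  VALUES
    (s.order_id, s.customer_id, s.total, s.status, s.updated_at,
     CASE WHEN s.is_deleted = 1 THEN s.updated_at END);

  -- Step 4: advance the watermark only after the merge commits
  UPDATE dw.control_watermarks
  SET value = :cutoff
  WHERE key = 'orders';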

Example 2: Log-based CDC to bronze/silver

Scenario: Database emits change events (insert/update/delete) with commit_lsn and row payload.

  1. Ingest raw events to bronze with commit_lsn
  2. Deduplicate by (table, primary_key, commit_lsn)
  3. Apply in commit order to silver tables using MERGE; deletes become tombstones or hard deletes depending on policy (sketched after the notes)
  4. Downstream models (gold) use type-2 dimensions where needed

Notes
  • Maintain last_commit_lsn checkpoint per table
  • Ensure exactly-once or effective idempotency via keys and MERGE conditions
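
As a sketch of steps 2-3 for an orders table, assuming full-row change events landing in a hypothetical bronze.orders_changes(op, order_id, status, total, updated_at, commit_lsn) and a MERGE-capable engine; because each event carries the full row image, applying only the latest event per key yields the same end state as replaying every event in commit order:

  -- Steps 2-3: deduplicate to the newest event per key in this window, then apply it once
  MERGE INTO silver.orders AS t
  USING (
    SELECT op, order_id, status, total, updated_at
    FROM (
      SELECT
        c.*,
        ROW_NUMBER() OVER (PARTITION BY c.order_id ORDER BY c.commit_lsn DESC) AS rn
      FROM bronze.orders_changes AS c
      WHERE c.commit_lsn > :last_commit_lsn
    ) x
    WHERE x.rn = 1
  ) AS s
    ON t.order_id = s.order_id
  WHEN MATCHED AND s.op = 'D' THEN DELETE
  WHEN MATCHED THEN UPDATE SET
    status     = s.status,
    total      = s.total,
    updated_at = s.updated_at
  WHEN NOT MATCHED AND s.op <> 'D' THEN INSERT
    (order_id, status, total, updated_at)
  VALUES (s.order_id, s.status, s.total, s.updated_at);

  -- Afterwards, advance the per-table last_commit_lsn checkpoint to the highest applied value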

Example 3: SCD Type 2 for customer profile

Scenario: Changes to customer tier should preserve history.

  1. Identify changed rows since watermark
  2. For changed keys, close current records (set valid_to = change_time, is_current = 0)
  3. Insert new records with valid_from = change_time, valid_to = null, is_current = 1 (see the SQL sketch after the tip)
  4. Handle deletes by closing records or adding a deleted flag

Tip

Use a hash of business attributes to detect meaningful changes and avoid churn on non-business fields.
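
One possible sketch of steps 2-3, using the hash idea from the tip; it assumes PostgreSQL-style UPDATE ... FROM, a dim_customer target, and a staging.customer_delta input carrying an attr_hash column (all names are illustrative):

  -- Step 2: close the current version only where business attributes really changed
  UPDATE dim_customer AS d
  SET valid_to   = s.change_time,
      is_current = 0
  FROM staging.customer_delta AS s
  WHERE d.customer_id = s.customer_id
    AND d.is_current = 1
    AND d.attr_hash <> s.attr_hash;

  -- Step 3: open a new current version for new keys and for keys closed in step 2
  INSERT INTO dim_customer
    (customer_id, tier, attr_hash, valid_from, valid_to, is_current)
  SELECT s.customer_id, s.tier, s.attr_hash, s.change_time, NULL, 1
  FROM staging.customer_delta AS s
  LEFT JOIN dim_customer AS d
    ON d.customer_id = s.customer_id
   AND d.is_current = 1
  WHERE d.customer_id IS NULL;   -- no open version left for this key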

Design checklist

  • Define CDC method per source (log, timestamp, change tables, stream)
  • Choose business keys and deduplication rules
  • Define delete policy (tombstone, soft-delete, hard delete)
  • Store checkpoints per table/partition consistently (a control-table sketch follows this checklist)
  • Ensure idempotent MERGE/UPSERT semantics
  • Plan for schema evolution and null/default handling
  • Set latency targets and acceptable delay buffers
  • Create backfill and replay procedures
  • Monitor lag, error rates, and change volume
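
For the checkpoint item above, a minimal control table could look like the following sketch; the names and types are assumptions, not a prescribed schema:

  -- One row per (pipeline, tracked object); updated only after a batch commits successfully
  CREATE TABLE etl_checkpoints (
    pipeline_name   VARCHAR(200) NOT NULL,
    source_object   VARCHAR(200) NOT NULL,   -- table or partition being tracked
    watermark_type  VARCHAR(20)  NOT NULL,   -- 'timestamp', 'id', or 'log_position'
    watermark_value VARCHAR(200) NOT NULL,   -- stored as text so one table covers all types
    updated_at      TIMESTAMP    NOT NULL,
    PRIMARY KEY (pipeline_name, source_object)
  );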

Exercises

Do these now. Then check your answers below or in the Quick Test.

  1. Exercise 1: Design a CDC plan for a SaaS Orders database. Tables: customers(id, email, status, updated_at), orders(id, customer_id, total, status, updated_at), order_items(id, order_id, sku, qty, updated_at). Requirements: hourly loads to a warehouse, accurate deletes, and ability to backfill one week.
  2. Exercise 2: Write SQL for an incremental MERGE from staging.orders_delta to dw.orders using a watermark table dw.control_watermarks(key, value). Use order_id as key, handle updates and soft-deletes via is_deleted, and advance the watermark safely.

Checklist to self-verify
  • Did you pick a CDC method per table and justify it?
  • Is your checkpoint per table and durable?
  • Can the load be re-run without duplicates or missed rows?
  • How are deletes represented in targets?
  • How will you backfill a specific past window?

Common mistakes and self-check

  • Relying on unreliable timestamps: ensure they update on every change or use log-based CDC
  • No cutoff window: risk of missing late transactions; use a small delay buffer
  • Non-idempotent loads: missing MERGE logic or unique keys; test re-runs
  • Ignoring deletes: leads to inflated metrics; define tombstone/soft-delete rules
  • Skipping ordering: applying changes out of commit order creates inconsistencies
  • Forgetting schema evolution: new columns arriving cause failures; use safe defaults

Quick self-audit
  • Can you stop a job mid-run and safely resume?
  • If the same change arrives twice, does the result remain correct?
  • How do you detect and correct gaps in processing?
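
For the last question, one gap check could look like the sketch below, assuming changes carry a sequence number that is expected to be contiguous per table (hypothetical names; for non-contiguous log positions, compare row counts or checksums per window instead):

  -- Any hole in the applied sequence means changes were skipped or lost
  WITH seqs AS (
    SELECT
      change_seq,
      LEAD(change_seq) OVER (ORDER BY change_seq) AS next_seq
    FROM applied_changes
    WHERE table_name = 'orders'
  )
  SELECT change_seq AS gap_starts_after, next_seq AS resumes_at
  FROM seqs
  WHERE next_seq - change_seq > 1;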

Practical projects

  • Build a mini pipeline: source table with updates and deletes, a staging delta table, a MERGE into a target, and a control watermark
  • Create an SCD2 dimension for customers with valid_from/valid_to
  • Simulate late-arriving rows and prove your pipeline remains consistent

Mini challenge

Your pipeline processes updates within 5 minutes, but finance reports show occasional double-counting. Hypothesize two root causes related to CDC and describe a concrete test to confirm each.

Possible directions
  • Duplicate delivery not handled by MERGE key
  • Out-of-order updates overwriting newer values

Who this is for

Data Architects, Analytics Engineers, and Data Engineers designing reliable, scalable data integration pipelines.

Prerequisites

  • Comfort with SQL (SELECT, JOIN, MERGE/UPSERT)
  • Basic understanding of OLTP vs. OLAP
  • Familiarity with batch vs. streaming concepts

Learning path

  1. Understand CDC methods and choose per source
  2. Design watermarks and idempotent merges
  3. Handle deletes and SCD patterns
  4. Manage schema evolution and backfills
  5. Add monitoring and data quality checks

Next steps

  • Implement the exercises in a sandbox database
  • Add alerts for lag and failed merges
  • Extend to streaming CDC for lower latency

Practice Exercises

2 exercises to complete

Instructions

Design a CDC approach for three tables: customers(id, email, status, updated_at), orders(id, customer_id, total, status, updated_at), order_items(id, order_id, sku, qty, updated_at). Requirements: hourly loads to a warehouse, accurate deletes, backfill one week if needed. Specify for each table: CDC method, keys, delete handling, watermarking, ordering, and backfill steps.

Expected Output
A concise plan covering CDC method per table, chosen keys, how deletes are represented, checkpoint strategy, and a concrete backfill procedure.

CDC And Incremental Loads — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
