Why this matters
As a Data Architect, you must design pipelines that move only changed data reliably. CDC and incremental loads reduce cost, shrink load windows, enable near real-time analytics, and prevent stale dashboards. Real tasks include planning watermark strategies, choosing between log-based CDC and query-based increments, handling deletes, and ensuring idempotent, ordered loads in production.
Typical scenarios you will face
- Migrating an OLTP database to a warehouse without daily full refreshes
- Near real-time event streams feeding operational reporting
- Slowly Changing Dimensions (SCD) for customer profiles
- Backfilling late-arriving events safely
Concept explained simply
Change Data Capture (CDC) means detecting what changed since last time. Incremental loads are the practical process of loading only those changes.
- CDC sources: database logs (e.g., transaction logs), change tables, timestamps, version numbers, or message streams.
- Incremental load: a job that reads changes after a checkpoint (watermark), merges them into targets, and records a new checkpoint.
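As a minimal sketch of that loop, the checkpoint can live in a small control table. The example below reuses the dw.control_watermarks(key, value) table named later in the exercises; everything else (types, the literal timestamp, the 'orders.updated_at' key) is a placeholder, and the syntax assumes a warehouse with standard SQL.

```sql
-- Minimal checkpoint store; one row per source table or partition.
CREATE TABLE dw.control_watermarks (
  key   VARCHAR(200) PRIMARY KEY,   -- e.g., 'orders.updated_at'
  value VARCHAR(200)                -- stored as text, cast on read
);

-- 1) Read the last checkpoint for this source
SELECT value AS last_watermark
FROM dw.control_watermarks
WHERE key = 'orders.updated_at';

-- 2) Extract and merge only the rows changed after last_watermark
--    (shown in the worked examples further down)

-- 3) Advance the checkpoint only after the merge commits
UPDATE dw.control_watermarks
SET value = '2024-05-01 10:00:00'   -- the cutoff this run actually used
WHERE key = 'orders.updated_at';
```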
Mental model
Think of your data flow as a conveyor belt:
- The belt carries only new or changed boxes (rows).
- A gate keeps track of the last box that passed (watermark/checkpoint).
- At the end, a sorter merges boxes into shelves: insert new, update existing, remove deleted.
Deep dive: ETL vs. ELT with CDC
ETL: transform changes before loading to target tables; good for strict governance and curated models. ELT: land raw changes first, then transform inside the warehouse/lake; good for agility and scale. CDC fits both—choose based on cost, latency, and control needs.
Core techniques and patterns
- Watermarks and checkpoints: last processed timestamp, ID, or log position stored durably
- Idempotency: re-running a batch produces the same result (use merge/upsert semantics, deduplication)
- Ordering: apply changes in commit order so stale updates never overwrite newer values (a sketch combining this with deduplication follows this list)
- Delete handling: tombstones, soft-deletes, or periodic reconciliation
- Schema evolution: detect new columns, default values, and backfill strategy
- Backfill and reprocessing: re-run limited windows with safeguards
- Late and out-of-order data: windowed processing and sequence checks
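One common way to get deduplication and ordering together is to collapse each batch to the newest change per business key before merging. The sketch below reuses staging.orders_delta and order_id from the exercises; change_seq is an assumed ordering column standing in for a commit LSN, version number, or updated_at.

```sql
-- Keep only the latest change per key in this batch; re-running the query on
-- the same batch always returns the same rows, which supports idempotent merges.
WITH ranked AS (
  SELECT
    d.*,
    ROW_NUMBER() OVER (
      PARTITION BY d.order_id        -- business key
      ORDER BY d.change_seq DESC     -- newest change first
    ) AS rn
  FROM staging.orders_delta AS d
)
SELECT *
FROM ranked
WHERE rn = 1;
```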
Choosing a CDC method
- Log-based: lowest latency, captures deletes and before/after values; needs access to database logs
- Timestamp/Version-based: easy to start; ensure reliable update timestamps and monotonicity
- Change tables/Triggers: explicit change records; adds some write overhead
- Message streams: events emitted by source apps; schema discipline required
Worked examples
Example 1: Timestamp-based incremental load with watermark
Scenario: Source table orders has updated_at and soft-deletes via is_deleted.
- Read last_watermark from control table
- SELECT rows WHERE updated_at > last_watermark AND updated_at <= current_cutoff
- MERGE into warehouse.orders on business key order_id: insert new, update changed, set deleted_at if is_deleted=1
- Update last_watermark to current_cutoff
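A sketch of these four steps in SQL, assuming a warehouse with ANSI MERGE and the dw.control_watermarks control table; the :parameters, the date arithmetic, and any column beyond order_id, updated_at, and is_deleted are illustrative and vary by dialect.

```sql
-- Step 1: read the checkpoint and choose a cutoff slightly in the past.
--   :last_watermark  = SELECT value FROM dw.control_watermarks WHERE key = 'orders.updated_at'
--   :current_cutoff  = CURRENT_TIMESTAMP - INTERVAL '1' MINUTE

-- Steps 2 and 3: merge only the changed window into the target.
MERGE INTO warehouse.orders AS t
USING (
  SELECT order_id, total, status, is_deleted, updated_at
  FROM source.orders
  WHERE updated_at >  :last_watermark
    AND updated_at <= :current_cutoff
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  t.total      = s.total,
  t.status     = s.status,
  t.updated_at = s.updated_at,
  t.deleted_at = CASE WHEN s.is_deleted = 1 THEN s.updated_at ELSE NULL END
WHEN NOT MATCHED THEN INSERT
  (order_id, total, status, updated_at, deleted_at)
  VALUES (s.order_id, s.total, s.status, s.updated_at,
          CASE WHEN s.is_deleted = 1 THEN s.updated_at ELSE NULL END);

-- Step 4: advance the watermark only after the merge succeeds.
UPDATE dw.control_watermarks
SET value = :current_cutoff
WHERE key = 'orders.updated_at';
```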
Notes
- Use a cutoff (e.g., now minus 1 minute) so clock skew and still-committing transactions do not cause missed rows
- If timestamps can go backward, use log-based CDC or version numbers
Example 2: Log-based CDC to bronze/silver
Scenario: Database emits change events (insert/update/delete) with commit_lsn and row payload.
- Ingest raw events to bronze with commit_lsn
- Deduplicate by (table, primary_key, commit_lsn)
- Apply in commit order to silver tables using MERGE; deletes become tombstones or hard deletes depending on policy
- Downstream models (gold) use type-2 dimensions where needed
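A sketch of the apply step, assuming the bronze changes for one table land in a bronze.orders_changes table with op ('I'/'U'/'D') and commit_lsn columns (all names and the :last_commit_lsn parameter are illustrative), and that the warehouse's MERGE accepts conditional WHEN clauses, as most modern warehouses do.

```sql
-- Apply the newest change per key from bronze to silver, honoring commit order.
MERGE INTO silver.orders AS t
USING (
  SELECT *
  FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (
             PARTITION BY c.order_id
             ORDER BY c.commit_lsn DESC
           ) AS rn
    FROM bronze.orders_changes AS c
    WHERE c.commit_lsn > :last_commit_lsn      -- per-table checkpoint
  ) AS latest
  WHERE latest.rn = 1
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'D' THEN
  DELETE                                        -- or mark a tombstone, per policy
WHEN MATCHED THEN
  UPDATE SET t.total = s.total, t.status = s.status
WHEN NOT MATCHED AND s.op <> 'D' THEN
  INSERT (order_id, total, status)
  VALUES (s.order_id, s.total, s.status);

-- After success, store MAX(commit_lsn) from this batch as the new last_commit_lsn.
```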
Notes
- Maintain last_commit_lsn checkpoint per table
- Ensure exactly-once or effective idempotency via keys and MERGE conditions
Example 3: SCD Type 2 for customer profile
Scenario: Changes to customer tier should preserve history.
- Identify changed rows since watermark
- For changed keys, close current records (set valid_to = change_time, is_current = 0)
- Insert new records with valid_from = change_time, valid_to = null, is_current = 1
- Handle deletes by closing records or adding a deleted flag
Tip
Use a hash of business attributes to detect meaningful changes and avoid churn on non-business fields, as in the sketch below.
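A sketch of the close-and-insert pattern combined with the hash tip. It assumes a dim.customer table with (customer_id, tier, status, attr_hash, valid_from, valid_to, is_current), a staging.customer_delta table carrying the changed rows with a change_time column, and an available MD5 function; all of these names are illustrative, and UPDATE ... FROM syntax varies by warehouse.

```sql
-- Step 1: close current versions whose business attributes actually changed.
UPDATE dim.customer AS d
SET valid_to   = s.change_time,
    is_current = 0
FROM staging.customer_delta AS s
WHERE d.customer_id = s.customer_id
  AND d.is_current = 1
  AND d.attr_hash <> MD5(CONCAT(COALESCE(s.tier, ''), '|', COALESCE(s.status, '')));

-- Step 2: insert new current versions for keys that no longer have an open row
-- (covers both brand-new customers and the versions closed in step 1).
INSERT INTO dim.customer
  (customer_id, tier, status, attr_hash, valid_from, valid_to, is_current)
SELECT s.customer_id, s.tier, s.status,
       MD5(CONCAT(COALESCE(s.tier, ''), '|', COALESCE(s.status, ''))),
       s.change_time, NULL, 1
FROM staging.customer_delta AS s
LEFT JOIN dim.customer AS d
  ON d.customer_id = s.customer_id
 AND d.is_current = 1
WHERE d.customer_id IS NULL;
```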
Design checklist
- Define CDC method per source (log, timestamp, change tables, stream)
- Choose business keys and deduplication rules
- Define delete policy (tombstone, soft-delete, hard delete)
- Store checkpoints per table/partition consistently
- Ensure idempotent MERGE/UPSERT semantics
- Plan for schema evolution and null/default handling
- Set latency targets and acceptable delay buffers
- Create backfill and replay procedures
- Monitor lag, error rates, and change volume
Exercises
Do these now. Then check your answers below or in the Quick Test.
- Exercise 1: Design a CDC plan for a SaaS Orders database. Tables: customers(id, email, status, updated_at), orders(id, customer_id, total, status, updated_at), order_items(id, order_id, sku, qty, updated_at). Requirements: hourly loads to a warehouse, accurate deletes, and ability to backfill one week.
- Exercise 2: Write SQL for an incremental MERGE from staging.orders_delta to dw.orders using a watermark table dw.control_watermarks(key, value). Use order_id as key, handle updates and soft-deletes via is_deleted, and advance the watermark safely.
Checklist to self-verify
- Did you pick a CDC method per table and justify it?
- Is your checkpoint per table and durable?
- Can the load be re-run without duplicates or missed rows?
- How are deletes represented in targets?
- How will you backfill a specific past window?
Common mistakes and self-check
- Relying on unreliable timestamps: ensure they update on every change or use log-based CDC
- No cutoff window: risk of missing late transactions; use a small delay buffer
- Non-idempotent loads: missing MERGE logic or unique keys; test re-runs
- Ignoring deletes: leads to inflated metrics; define tombstone/soft-delete rules
- Skipping ordering: applying changes out of commit order creates inconsistencies
- Forgetting schema evolution: new columns arriving cause failures; use safe defaults
Quick self-audit
- Can you stop a job mid-run and safely resume?
- If the same change arrives twice, does the result remain correct?
- How do you detect and correct gaps in processing? (one possible check is sketched below)
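One possible gap check, using the same illustrative names as Example 1 (source.orders, warehouse.orders, dw.control_watermarks): any source row at or before the current watermark that is missing from the target indicates a gap. Zero rows is the expected result.

```sql
-- Source rows the pipeline claims to have covered (<= watermark) but that
-- never reached the target. Investigate any row this returns.
SELECT s.order_id, s.updated_at
FROM source.orders AS s
LEFT JOIN warehouse.orders AS t
  ON t.order_id = s.order_id
WHERE s.updated_at <= (
        SELECT CAST(value AS TIMESTAMP)
        FROM dw.control_watermarks
        WHERE key = 'orders.updated_at'
      )
  AND t.order_id IS NULL;
```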
Practical projects
- Build a mini pipeline: source table with updates and deletes, a staging delta table, a MERGE into a target, and a control watermark (starter DDL is sketched after this list)
- Create an SCD2 dimension for customers with valid_from/valid_to
- Simulate late-arriving rows and prove your pipeline remains consistent
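A starter DDL sketch for the first project, using unqualified table names so it runs in a single-schema sandbox; types and names are illustrative.

```sql
-- The "OLTP" table you will insert into, update, and soft-delete by hand.
CREATE TABLE source_orders (
  order_id   INTEGER PRIMARY KEY,
  total      DECIMAL(12,2),
  status     VARCHAR(50),
  is_deleted SMALLINT DEFAULT 0,
  updated_at TIMESTAMP
);

-- Changed rows extracted each run (rows with updated_at inside the load window).
CREATE TABLE staging_orders_delta (
  order_id   INTEGER,
  total      DECIMAL(12,2),
  status     VARCHAR(50),
  is_deleted SMALLINT,
  updated_at TIMESTAMP
);

-- Merge target; deletes become deleted_at timestamps.
CREATE TABLE dw_orders (
  order_id   INTEGER PRIMARY KEY,
  total      DECIMAL(12,2),
  status     VARCHAR(50),
  deleted_at TIMESTAMP,
  updated_at TIMESTAMP
);

-- One checkpoint row per source table.
CREATE TABLE control_watermarks (
  key   VARCHAR(200) PRIMARY KEY,
  value VARCHAR(200)
);
```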
Mini challenge
Your pipeline processes updates within 5 minutes, but finance reports show occasional double-counting. Hypothesize two root causes related to CDC and describe a concrete test to confirm each.
Possible directions
- Duplicate delivery not handled by MERGE key
- Out-of-order updates overwriting newer values (test sketches for both directions follow below)
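Two test sketches matching those directions, reusing the illustrative warehouse.orders and source.orders names from Example 1.

```sql
-- Test 1: duplicate delivery. More than one target row per business key means
-- redelivered changes were inserted instead of merged.
SELECT order_id, COUNT(*) AS row_count
FROM warehouse.orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Test 2: out-of-order application. A target row older than the newest source
-- version suggests a stale update overwrote a newer one; exclude rows still
-- inside the current load window to avoid false positives from normal lag.
SELECT t.order_id,
       t.updated_at     AS target_version,
       s.max_updated_at AS source_version
FROM warehouse.orders AS t
JOIN (
  SELECT order_id, MAX(updated_at) AS max_updated_at
  FROM source.orders
  GROUP BY order_id
) AS s
  ON s.order_id = t.order_id
WHERE t.updated_at < s.max_updated_at;
```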
Who this is for
Data Architects, Analytics Engineers, and Data Engineers designing reliable, scalable data integration pipelines.
Prerequisites
- Comfort with SQL (SELECT, JOIN, MERGE/UPSERT)
- Basic understanding of OLTP vs. OLAP
- Familiarity with batch vs. streaming concepts
Learning path
- Understand CDC methods and choose per source
- Design watermarks and idempotent merges
- Handle deletes and SCD patterns
- Manage schema evolution and backfills
- Add monitoring and data quality checks
Next steps
- Implement the exercises in a sandbox database
- Add alerts for lag and failed merges
- Extend to streaming CDC for lower latency