Why this matters
As an Analytics Engineer, you turn raw operational data into reliable analytics tables. Change Data Capture (CDC) and change tracking let you move only what changed, detect deletes, and maintain slowly changing dimensions without full reloads. This reduces load on source systems, speeds pipelines, and keeps downstream metrics trustworthy.
- Real tasks you will face: incremental models, handling late-arriving updates, detecting soft/hard deletes, building Type 2 history, and ensuring idempotent (safe to re-run) pipelines.
Concept explained simply
CDC is a way to identify and move only the rows that changed in a source since the last run. You track inserts, updates, and deletes, then apply them to a target table using upserts or merges.
- Timestamp-based CDC: use a column like updated_at and a high-water mark to select changed rows.
- Log-based CDC: read the database’s commit/redo log for precise inserts/updates/deletes with minimal source load.
- Trigger-based CDC: database triggers write changes to an audit table. Higher overhead; ok for small/medium systems.
- Change flags: is_deleted, version, or checksum/hash to detect differences (see the hash-compare sketch after this list).
- Apply changes with MERGE/UPSERT, ensuring idempotency and deterministic results.
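When there is no reliable updated_at column and no access to the database log, a full-snapshot hash comparison can stand in for CDC. A minimal sketch, assuming a hypothetical src.products and dwh.products with matching columns (the table and column names are illustrative, not from the examples below):
-- Rows that are new in the source or whose tracked attributes differ from the target
-- (COALESCE nullable columns before hashing, as in the SCD example later on)
SELECT s.*
FROM src.products s
LEFT JOIN dwh.products t ON t.id = s.id
WHERE t.id IS NULL
   OR MD5(CONCAT_WS('|', s.name, CAST(s.price AS VARCHAR), s.category))
      <> MD5(CONCAT_WS('|', t.name, CAST(t.price AS VARCHAR), t.category));
This is the heaviest option, since it scans both tables, so reserve it for sources that offer no better change signal.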
Mental model
Picture a conveyor belt of small packets: each packet says what changed since your last visit. You pick up only the new packets (high-water mark), and you update your warehouse by replacing or adding rows accordingly. If you revisit the belt (re-run a job), you still apply the same state safely (idempotent).
Key terms
- High-water mark: the last change point processed (e.g., max updated_at).
- Idempotent load: re-running produces the same end state (no double-counting).
- Type 1 vs Type 2: Type 1 overwrites values; Type 2 creates history rows with effective dates.
- Soft delete: mark as deleted (is_deleted = true) instead of removing the row.
Worked examples
Example 1: Timestamp-based incremental extract
Goal: Pull only changed orders since the last run using updated_at and a watermark table.
-- Get last processed timestamp for 'orders'
SELECT last_ts FROM control.watermarks WHERE table_name = 'orders';
-- Incremental extract
SELECT *
FROM src.orders
WHERE updated_at > :last_ts;
-- Update the watermark after successful load
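-- Tip: prefer taking MAX(updated_at) from the batch you actually extracted (e.g., the
-- staged changes) rather than the live source table, so rows updated between the
-- extract and this statement are not skipped by the next run.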
UPDATE control.watermarks
SET last_ts = (SELECT MAX(updated_at) FROM src.orders)
WHERE table_name = 'orders';
Use > (strictly greater than) to avoid reprocessing rows at the exact watermark. If your source can produce equal timestamps at the boundary, add a tie-breaker (e.g., id) or use >= with a de-duplication step on the target; a tie-breaker variant is sketched below.
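A minimal sketch of the tie-breaker variant, assuming an increasing id column and that you store :last_id alongside the watermark (both are assumptions, not part of the example above):
-- Rows strictly after the watermark, plus boundary rows with a higher id
SELECT *
FROM src.orders
WHERE updated_at > :last_ts
   OR (updated_at = :last_ts AND id > :last_id);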
Example 2: Apply changes with MERGE (Type 1 upsert + soft delete)
Goal: Keep dwh.orders in sync with src.orders, respecting soft deletes.
MERGE INTO dwh.orders AS T
USING (
SELECT id, status, amount, updated_at, COALESCE(is_deleted, false) AS is_deleted
FROM staging.orders_changes
) AS S
ON T.id = S.id
WHEN MATCHED AND S.is_deleted = true THEN
UPDATE SET is_deleted = true, updated_at = S.updated_at
WHEN MATCHED AND S.is_deleted = false THEN
UPDATE SET status = S.status,
amount = S.amount,
updated_at = S.updated_at,
is_deleted = false
WHEN NOT MATCHED AND S.is_deleted = false THEN
INSERT (id, status, amount, updated_at, is_deleted)
VALUES (S.id, S.status, S.amount, S.updated_at, false);
This is idempotent: re-running with the same S set yields the same T state.
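One practical caveat: MERGE needs at most one source row per id, otherwise it fails or behaves nondeterministically. If a batch in staging.orders_changes can hold several versions of the same id, keep only the latest per id before merging. A minimal sketch (the derived-table shape is one option; some warehouses offer QUALIFY instead):
SELECT id, status, amount, updated_at, COALESCE(is_deleted, false) AS is_deleted
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM staging.orders_changes c
) d
WHERE rn = 1;
Use this query as the S source in the MERGE above.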
Example 3: SCD Type 2 with hash diff
Goal: Track customer attribute history when any tracked attribute changes.
-- Compute hash of tracked attributes
WITH src AS (
SELECT id,
-- COALESCE nullable attributes so NULLs don't yield NULL or colliding hashes
MD5(CONCAT_WS('|', COALESCE(email, ''), COALESCE(plan, ''), COALESCE(country, ''))) AS attr_hash,
email, plan, country, updated_at
FROM staging.customer_changes
)
MERGE INTO dwh.dim_customer AS T
USING (
SELECT s.id, s.attr_hash, s.email, s.plan, s.country, s.updated_at
FROM src s
) AS S
ON T.customer_id = S.id AND T.is_current = true
WHEN MATCHED AND T.attr_hash <> S.attr_hash THEN
-- Close old row
UPDATE SET is_current = false, effective_to = S.updated_at
WHEN NOT MATCHED THEN
-- Brand-new customer id: insert its first current row
INSERT (customer_id, email, plan, country, attr_hash, is_current, effective_from, effective_to)
VALUES (S.id, S.email, S.plan, S.country, S.attr_hash, true, S.updated_at, TIMESTAMP '9999-12-31');
After closing rows, insert the new current rows for those same ids (some warehouses require two MERGEs or a MERGE followed by an INSERT of changed ids).
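One possible shape for that follow-up step, assuming the same staging.customer_changes and dwh.dim_customer columns as above: after the MERGE, any changed id has no current row (its old row was just closed), so insert its new current row.
INSERT INTO dwh.dim_customer
    (customer_id, email, plan, country, attr_hash, is_current, effective_from, effective_to)
SELECT s.id, s.email, s.plan, s.country,
       MD5(CONCAT_WS('|', COALESCE(s.email, ''), COALESCE(s.plan, ''), COALESCE(s.country, ''))) AS attr_hash,
       true, s.updated_at, TIMESTAMP '9999-12-31'
FROM staging.customer_changes s
WHERE NOT EXISTS (
    SELECT 1
    FROM dwh.dim_customer t
    WHERE t.customer_id = s.id
      AND t.is_current = true
);
Re-running is safe: once the new current row exists, the NOT EXISTS filter excludes that id.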
Who this is for
- Analytics Engineers and BI Developers building incremental pipelines.
- Data Analysts moving toward ELT/ETL ownership.
Prerequisites
- Comfortable with SQL (SELECT, JOIN, INSERT, UPDATE, MERGE).
- Basic understanding of data warehouses and staging layers.
- Know what Slowly Changing Dimensions are (Type 1 vs Type 2).
Hands-on exercises
Complete the exercises below. A matching solution is provided via expandable details. Use them to check your work, not to skip the thinking.
- Exercise 1: Timestamp + soft-delete CDC for orders (see Exercises section).
- Exercise 2: Hash-diff SCD Type 2 for customers (see Exercises section).
Self-check checklist
- I store and update a high-water mark safely after a successful load.
- I handle deletes (soft or hard) explicitly.
- My MERGE is idempotent and deterministic.
- For SCD2, I detect attribute changes via hash and manage effective dates.
- I prevent duplicates on boundary conditions (tie-breakers or de-dupe).
Common mistakes and how to self-check
- Boundary duplicates: Using >= without de-duplication. Fix: use > or add tie-breaker (updated_at, id).
- Not updating watermark atomically. Fix: update only after target commit or in the same transaction when possible.
- Ignoring deletes. Fix: ingest delete signals (is_deleted, or hard deletes detected via anti-join) and apply them; see the anti-join sketch after this list.
- Non-idempotent logic. Fix: always MERGE; avoid INSERT-only loads for updated rows.
- Hashing wrong fields. Fix: include only tracked attributes; keep key out of the hash.
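A minimal sketch of hard-delete detection by anti-join, assuming you can land a full snapshot of the source's current primary keys in a hypothetical src.orders_keys table: any target id missing from the snapshot was hard-deleted at the source.
UPDATE dwh.orders AS t
SET is_deleted = true
WHERE t.is_deleted = false
  AND NOT EXISTS (
      SELECT 1 FROM src.orders_keys s WHERE s.id = t.id
  );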
Practical projects
- Build an incremental orders fact: Extract via updated_at, stage changes, MERGE into a fact table, and prove idempotency by re-running.
- Create a dim_customer Type 2: Use hash-diff, maintain is_current, effective_from/to, and simulate late-arriving updates.
- Delete detection demo: Simulate soft and hard deletes and ensure downstream tables reflect removals accurately.
Learning path
- Start: Timestamp-based CDC with a robust watermark.
- Add: Delete handling (soft and hard) and idempotent MERGE patterns.
- Advance: SCD Type 2 with hash-diff and late-arriving data handling.
- Explore: Log-based CDC concepts and operational considerations (throughput, ordering, backfills).
Mini challenge
Your product table gets rare updates but frequent price changes. You must keep price history (Type 2) while overwriting non-critical fields (Type 1). Sketch the tables/columns and outline your MERGE strategy in 5–8 steps. Include your watermark strategy and how you’ll handle late updates.
Next steps
- Implement one incremental model in your environment and prove it’s idempotent (three re-runs, identical end state).
- Add automated checks: row counts, max(updated_at) sanity, and delete parity with source (a simple parity-check query is sketched after this list).
- Document your CDC contract: which columns drive change, delete semantics, and expected freshness.
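A minimal parity-check sketch, assuming soft deletes in dwh.orders as in Example 2: the count of live target rows should match the count of source rows (if the source also soft-deletes, filter both sides the same way).
SELECT
    (SELECT COUNT(*) FROM src.orders) AS source_rows,
    (SELECT COUNT(*) FROM dwh.orders WHERE is_deleted = false) AS target_live_rows;
Alert when the two values diverge beyond your agreed tolerance.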
Quick Test