Why this matters
Upserts and merge patterns let you keep warehouse tables reliably in sync with changing sources. As an ETL Developer, you will:
- Refresh dimensions and facts without duplicating rows.
- Apply change data capture (CDC) safely and idempotently.
- Handle late-arriving data, deletes, and schema changes.
- Guarantee reproducible results across reruns and retries.
Real tasks include building Slowly Changing Dimensions (SCD), syncing operational tables nightly, merging CDC streams, and backfilling historical adjustments without corrupting data.
Concept explained simply
Upsert = Update existing rows that match keys; Insert new rows that do not. Merge = a single statement (or pattern) that handles updates, inserts, and sometimes deletes.
Mental model
Think in four buckets after you stage and deduplicate your source (a classification sketch follows the list):
- New: key not in target → insert.
- Changed: key in target and data differs → update (SCD1) or close current row and insert a new version (SCD2).
- Unchanged: key in target and data same → do nothing.
- Deleted: key missing in source (or flagged by CDC) → soft delete or end-date.
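A minimal sketch of the bucket classification, assuming a deduplicated stage and a src_hash column on both sides (dim_customer and the column names are illustrative):
-- Classify every key into one of the four buckets
SELECT
  COALESCE(s.cust_id, t.cust_id) AS cust_id,
  CASE
    WHEN t.cust_id IS NULL        THEN 'new'        -- insert
    WHEN s.cust_id IS NULL        THEN 'deleted'    -- soft delete / end-date
    WHEN s.src_hash <> t.src_hash THEN 'changed'    -- update or new version
    ELSE 'unchanged'                                -- do nothing
  END AS bucket
FROM stg_customers s
FULL OUTER JOIN dim_customer t
  ON t.cust_id = s.cust_id;
Engines without FULL OUTER JOIN can get the same result by unioning two anti-joins.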
Tip: Diff safely
Create a hash of the business columns (e.g., SHA-256 of name|email|status) and compare hashes to detect changes quickly, as sketched below. This avoids column-by-column comparisons and floating-point pitfalls.
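A minimal hash-computation sketch. The hash function name varies by engine (SHA256, SHA2, HASHBYTES), and raw_customers is an illustrative source name; COALESCE plus an explicit delimiter keep NULLs and adjacent values from colliding:
-- Add a change-detection hash while loading the stage
SELECT
  cust_id,
  name,
  email,
  status,
  SHA256(
    COALESCE(name,   '') || '|' ||
    COALESCE(email,  '') || '|' ||
    COALESCE(status, '')
  ) AS src_hash
FROM raw_customers;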
Core patterns you will use
1) Stage-then-merge
Load the source into a staging table, deduplicate by business key with a deterministic tie-breaker (e.g., updated_at or a sequence number), then MERGE into the target; a dedup sketch follows. This enables auditing, retries, and predictable results.
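A dedup sketch using ROW_NUMBER (the same idea reappears in Example 3 below); raw_customers is an illustrative pre-stage name:
-- Keep only the newest row per business key before merging
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY cust_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM raw_customers
)
SELECT cust_id, name, email, updated_at, src_hash
FROM ranked
WHERE rn = 1;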
2) SCD Type 1 (overwrite)
On change: update the existing row. Use when history is not needed (e.g., correction of typos).
3) SCD Type 2 (history)
On change: end-date the current row (set valid_to) and insert a new row (valid_from now). Keep a current_flag for easy queries.
4) Soft deletes
When source deletes a record or emits a tombstone, set current_flag = false or end-date the row. Preserve history for accurate reporting.
5) Idempotency
Design so re-running the same batch produces the same final table. Use a batch_id, run_ts, or source watermarks to deduplicate; two guard sketches follow.
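Two common guards, sketched with illustrative etl_batch_log and etl_watermark tables (exact syntax varies by engine). Note that a hash-filtered MERGE is already rerun-safe on its own: the second run finds matching hashes and changes nothing.
-- Guard 1: record each batch once; skip the load if it already ran
INSERT INTO etl_batch_log (batch_id, run_ts)
SELECT :batch_id, CURRENT_TIMESTAMP
WHERE NOT EXISTS (
  SELECT 1 FROM etl_batch_log WHERE batch_id = :batch_id
);
-- Guard 2: extract only rows past the last saved watermark
SELECT *
FROM src_orders
WHERE updated_at > (
  SELECT last_watermark FROM etl_watermark WHERE table_name = 'orders'
);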
When to avoid vendor-specific MERGE
Some engines implement MERGE with quirks (for example, errors when a source row matches multiple target rows, or surprising locking behavior). If MERGE is unreliable in your engine, implement the logic using separate UPDATE and INSERT statements against staged, deduplicated data; a two-statement sketch follows. The pattern matters more than the syntax.
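A sketch of the same SCD1 logic as two plain statements, using the customer tables from Example 1 below. UPDATE ... FROM syntax varies by engine, and the two statements should run in one transaction:
-- Step 1: overwrite rows whose content changed
UPDATE dim_customer_scd1 t
SET name = s.name,
    email = s.email,
    updated_at = s.updated_at,
    src_hash = s.src_hash
FROM stg_customers s
WHERE s.cust_id = t.cust_id
  AND s.src_hash <> t.src_hash;
-- Step 2: insert keys not yet in the target
INSERT INTO dim_customer_scd1 (cust_id, name, email, updated_at, src_hash)
SELECT s.cust_id, s.name, s.email, s.updated_at, s.src_hash
FROM stg_customers s
WHERE NOT EXISTS (
  SELECT 1 FROM dim_customer_scd1 t WHERE t.cust_id = s.cust_id
);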
Worked examples
Example 1: SCD Type 1 merge (overwrite changes)
-- Assumptions
-- stg_customers: (cust_id, name, email, updated_at, src_hash)
-- dim_customer_scd1: (cust_id PK, name, email, updated_at, src_hash)
-- 1) Stage is already deduplicated by (cust_id ORDER BY updated_at DESC)
-- 2) Merge new and changed rows
MERGE INTO dim_customer_scd1 t
USING (
SELECT cust_id, name, email, updated_at, src_hash FROM stg_customers
) s
ON t.cust_id = s.cust_id
WHEN MATCHED AND t.src_hash <> s.src_hash THEN
UPDATE SET
t.name = s.name,
t.email = s.email,
t.updated_at = s.updated_at,
t.src_hash = s.src_hash
WHEN NOT MATCHED THEN
INSERT (cust_id, name, email, updated_at, src_hash)
VALUES (s.cust_id, s.name, s.email, s.updated_at, s.src_hash);
Why this works
Comparing precomputed hashes turns change detection into a single equality test per row. Deduplicating by updated_at ensures the newest source row wins.
Example 2: SCD Type 2 merge (history)
-- Assumptions
-- stg_customers: (cust_id, name, email, updated_at, src_hash)
-- dim_customer_scd2:
-- (cust_sk surrogate key, cust_id, name, email,
-- valid_from, valid_to, current_flag, src_hash)
-- close_ts = batch run timestamp
-- Close changed current rows
UPDATE dim_customer_scd2 t
SET valid_to = :close_ts,
current_flag = false
WHERE t.current_flag = true
AND EXISTS (
SELECT 1 FROM stg_customers s
WHERE s.cust_id = t.cust_id
AND s.src_hash <> t.src_hash
);
-- Insert new current versions (new keys or changed rows)
INSERT INTO dim_customer_scd2 (cust_id, name, email, valid_from, valid_to, current_flag, src_hash)
SELECT s.cust_id, s.name, s.email, :close_ts AS valid_from, TIMESTAMP '9999-12-31' AS valid_to,
true AS current_flag, s.src_hash
FROM stg_customers s
LEFT JOIN dim_customer_scd2 t
ON t.cust_id = s.cust_id AND t.current_flag = true
WHERE t.cust_id IS NULL -- brand new key
OR t.src_hash <> s.src_hash; -- changed content
Notes
- Use a far-future valid_to sentinel for open rows.
- Using one batch timestamp for the whole run keeps valid_from/valid_to windows deterministic across reruns.
- Queries filter current_flag = true for the latest version (query sketches follow).
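Two query sketches against dim_customer_scd2 (:as_of_ts is an illustrative parameter):
-- Latest version of every customer
SELECT cust_id, name, email
FROM dim_customer_scd2
WHERE current_flag = true;
-- Point-in-time lookup: the version valid at :as_of_ts
SELECT cust_id, name, email
FROM dim_customer_scd2
WHERE :as_of_ts >= valid_from
  AND :as_of_ts <  valid_to;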
Example 3: CDC upsert with tombstones (soft deletes)
-- Assumptions
-- cdc_orders: (order_id, status, amount, op, op_ts)
-- op in ('I','U','D') and may contain duplicates/out-of-order events
-- orders_current: (order_id PK, status, amount, is_deleted, last_op_ts)
-- 1) Deduplicate CDC by (order_id ORDER BY op_ts DESC)
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY op_ts DESC) AS rn
FROM cdc_orders
), last_event AS (
SELECT order_id, status, amount, op, op_ts FROM ranked WHERE rn = 1
)
-- 2) Apply upsert + soft delete
MERGE INTO orders_current t
USING last_event s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op <> 'D' AND s.op_ts > t.last_op_ts THEN
UPDATE SET t.status = s.status,
t.amount = s.amount,
t.is_deleted = false,
t.last_op_ts = s.op_ts
WHEN MATCHED AND s.op = 'D' AND s.op_ts > t.last_op_ts THEN
UPDATE SET t.is_deleted = true,
t.last_op_ts = s.op_ts
WHEN NOT MATCHED AND s.op <> 'D' THEN
INSERT (order_id, status, amount, is_deleted, last_op_ts)
VALUES (s.order_id, s.status, s.amount, false, s.op_ts);
Why use last_op_ts
It prevents older CDC events from overwriting newer state when data arrives late or out of order.
Design checklist
- Choose a stable business key (or composite) for matching.
- Deduplicate source by key with a deterministic tie-breaker (updated_at, sequence, op_ts).
- Compute a hash of tracked columns for change detection.
- Prefer stage-then-merge to simplify auditing and retries.
- Decide SCD type per table and document it.
- Handle deletes explicitly (soft delete or end-dating).
- Make runs idempotent via batch_id and watermarks.
- Record metadata: load_ts, batch_id, source_system (a manifest sketch follows).
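A minimal run-manifest sketch; all names and types are illustrative, so adjust to your engine:
-- One row per table per load, written at the end of a successful run
CREATE TABLE etl_run_manifest (
  batch_id      VARCHAR(64)  NOT NULL,
  table_name    VARCHAR(128) NOT NULL,
  source_system VARCHAR(64),
  load_ts       TIMESTAMP    NOT NULL,
  rows_inserted INTEGER,
  rows_updated  INTEGER,
  rows_deleted  INTEGER,
  PRIMARY KEY (batch_id, table_name)
);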
Exercises
Do these in your SQL engine of choice. If you cannot execute, write the SQL you would run and reason about the outcome.
Exercise 1: SCD Type 1 upsert
Given tables stg_customers(cust_id, name, email, updated_at, src_hash) and dim_customer_scd1(cust_id, name, email, updated_at, src_hash), write a single MERGE (or equivalent) that inserts new customers and overwrites changed ones using src_hash to detect change.
Expected outcome
dim_customer_scd1 contains every cust_id from stg with the latest values; unchanged rows are untouched; changed rows have updated columns and src_hash.
Hints
- Join on cust_id.
- Update only when hashes differ.
- Insert when not matched.
Solution
MERGE INTO dim_customer_scd1 t
USING (SELECT * FROM stg_customers) s
ON t.cust_id = s.cust_id
WHEN MATCHED AND t.src_hash <> s.src_hash THEN
UPDATE SET name = s.name, email = s.email, updated_at = s.updated_at, src_hash = s.src_hash
WHEN NOT MATCHED THEN
INSERT (cust_id, name, email, updated_at, src_hash)
VALUES (s.cust_id, s.name, s.email, s.updated_at, s.src_hash);
Exercise 2: SCD Type 2 with soft deletes
Given stg_products(prod_id, name, price, updated_at, src_hash, is_deleted) and dim_product_scd2(prod_sk, prod_id, name, price, valid_from, valid_to, current_flag, src_hash), write SQL that:
- End-dates current rows that changed.
- Inserts a new current row for changed or new prod_id.
- End-dates rows when stg marks is_deleted = true (do not insert a new version).
Expected outcome
Each prod_id has at most one current row. Deleted products have no current row (all versions are closed).
Hints
- Use a batch timestamp for valid_from/valid_to.
- Separate UPDATE (close) from INSERT (new versions).
- Skip INSERT when is_deleted = true.
Solution
-- close changed or deleted current rows
UPDATE dim_product_scd2 t
SET valid_to = :batch_ts, current_flag = false
WHERE t.current_flag = true
AND EXISTS (
SELECT 1 FROM stg_products s
WHERE s.prod_id = t.prod_id
AND (s.is_deleted = true OR s.src_hash <> t.src_hash)
);
-- insert new versions for new or changed rows only (not deleted)
INSERT INTO dim_product_scd2 (prod_id, name, price, valid_from, valid_to, current_flag, src_hash)
SELECT s.prod_id, s.name, s.price, :batch_ts, TIMESTAMP '9999-12-31', true, s.src_hash
FROM stg_products s
LEFT JOIN dim_product_scd2 t
ON t.prod_id = s.prod_id AND t.current_flag = true
WHERE s.is_deleted = false
AND (t.prod_id IS NULL OR t.src_hash <> s.src_hash);
Self-check checklist
- I deduplicated stage inputs before merging.
- My change detection is hash-based or compares only tracked columns.
- My logic is idempotent if rerun with the same batch.
- Deletes are handled explicitly (soft delete or end-date).
Common mistakes and how to self-check
- Mistake: Comparing every column including metadata like load_ts. Fix: Hash or compare only business columns.
- Mistake: Not deduplicating source. Fix: Rank by key and watermark; keep rn = 1.
- Mistake: Using non-deterministic tie-breakers. Fix: Use stable updated_at or sequence; if ties, prefer max(src_hash) as secondary key.
- Mistake: Ignoring deletes. Fix: Implement soft deletes or end-date logic.
- Mistake: Relying on transaction order in streams. Fix: Guard with last_op_ts on target to prevent older events overwriting newer state.
- Mistake: MERGE without change filter. Fix: Update only when data truly changed to avoid unnecessary churn and locks.
Practical projects
- Build a Customer SCD2 dimension from daily CSV drops. Include soft deletes and a run manifest table with batch_id and row counts.
- Create a CDC consumer table for Orders that merges inserts, updates, and deletes. Demonstrate late-arrival handling with last_op_ts.
- Migrate an existing Type 1 dimension to Type 2, preserving current values and backfilling valid_from from historical snapshots.
Who this is for
- ETL Developers building warehouse tables and data marts.
- Data Engineers integrating multiple systems with changing data.
- Analytics Engineers maintaining dimensions and facts.
Prerequisites
- Comfort with SQL (JOINs, window functions, INSERT/UPDATE).
- Understanding of keys (natural vs surrogate) and NULL semantics.
- Basic knowledge of staging areas and batch processing.
Learning path
- Review keys, deduplication, and hashing for change detection.
- Practice SCD Type 1 merges on a small dimension.
- Advance to SCD Type 2 with valid_from/valid_to windows.
- Add CDC handling: tie-breakers, tombstones, and idempotency.
- Harden with metadata, audit tables, and data quality checks.
Next steps
- Implement the two exercises on a sandbox dataset.
- Add unit checks: row counts, changed rows, and expected soft-deletes.
- When ready, take the quick test below.
Mini challenge
Design an upsert pipeline for a Products table with late-arriving changes and occasional deletes. Requirements:
- Type 2 history for price and name.
- Soft delete when a tombstone event arrives.
- Idempotent for reruns of the same batch.
- Include a data quality check: the number of changed rows must be within 20% of total rows unless a high-change flag is set (one possible check is sketched below).
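One way to express that check, assuming changed-row counts are recorded in the illustrative etl_run_manifest table sketched earlier:
-- PASS unless changed rows exceed 20% of the target and no override is set
SELECT CASE
         WHEN (m.rows_updated + m.rows_inserted) > 0.20 * t.total_rows
              AND :high_change_flag = false
           THEN 'FAIL'
         ELSE 'PASS'
       END AS dq_status
FROM etl_run_manifest m
CROSS JOIN (SELECT COUNT(*) AS total_rows FROM dim_product_scd2) t
WHERE m.batch_id = :batch_id
  AND m.table_name = 'dim_product_scd2';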
What to submit
A short design note describing keys, staging dedup, merge logic, and checks. Optionally include SQL.