Why this matters
As a Data Engineer, you constantly ingest data that arrives late, duplicated, or corrected. Deduplication prevents double counting and bloated tables. Upserts (update-or-insert) keep your tables accurate without full reloads. You will use these patterns when:
- Loading CDC (change data capture) events into fact and dimension tables
- Building idempotent incremental pipelines in warehouses or lakes
- Maintaining SCD Type 1 dimensions (latest values only)
- Backfilling late-arriving events and reconciling fixes from source systems
Concept explained simply
Deduplication means picking one “winner” row for each key and dropping the rest, using a clear tie-break rule (like most recent event_ts, highest version, or latest ingest timestamp). Upsert means: if the key exists, update it; if not, insert it.
Mental model
Think of your data like a set of folders labeled by primary key. Deduplication is choosing the single most authoritative document in each folder. Upsert is filing a document: if the folder exists, replace its document when the new one is newer; if not, create a new folder for it.
Key concepts and patterns
- Business key vs surrogate key: Dedup and upsert typically use a stable business key (e.g., order_id, user_id+date) rather than a surrogate identity.
- Deterministic tie-breakers: Always define ordering: event_ts (preferred), then version/sequence, then ingest_ts. Explicitly break ties to avoid nondeterminism.
- Idempotency: Running the same job twice should not change results. Enforce with primary/unique keys, MERGE conditions, and WHERE clauses that only accept newer data.
- Late-arriving data: Decide whether older events can overwrite newer rows. Many pipelines keep the latest event only (SCD1) and ignore older ones on conflict.
- Soft deletes: Represent deletes with a flag (is_deleted) or a tombstone event. Your upsert logic should set the flag for matched deletes.
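The soft-delete pattern above can be sketched end to end. This is a minimal illustration using SQLite through Python's sqlite3 module (the ON CONFLICT ... DO UPDATE ... WHERE form needs SQLite 3.24+); the table and column names are hypothetical:

```python
import sqlite3

# Illustrative target table with a soft-delete flag.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        amount     REAL,
        event_ts   INTEGER,
        is_deleted INTEGER NOT NULL DEFAULT 0
    )
""")

def apply_event(order_id, amount, event_ts, op):
    # A delete event ("tombstone") sets the flag instead of removing the row,
    # so downstream consumers can still see that the key existed.
    conn.execute("""
        INSERT INTO orders (order_id, amount, event_ts, is_deleted)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (order_id) DO UPDATE
        SET amount = excluded.amount,
            event_ts = excluded.event_ts,
            is_deleted = excluded.is_deleted
        WHERE excluded.event_ts >= orders.event_ts  -- only newer events win
    """, (order_id, amount, event_ts, 1 if op == "delete" else 0))

apply_event(1, 10.0, 100, "upsert")
apply_event(1, None, 200, "delete")   # tombstone: flag it, don't drop the row
apply_event(1, 9.0, 150, "upsert")    # late event: ignored by the guard
print(conn.execute("SELECT is_deleted, event_ts FROM orders").fetchone())
# → (1, 200)
```

Note how the timestamp guard also protects the tombstone: the late event at ts 150 cannot "resurrect" the deleted row.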
Common SQL strategies
Dedup with window functions:
WITH ranked AS (
SELECT s.*, ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY event_ts DESC, version DESC, _ingest_ts DESC
) AS rn
FROM staging_orders s
)
SELECT * FROM ranked WHERE rn = 1;
Group-and-aggregate (when every field can be safely aggregated; note ANY_VALUE returns an arbitrary row's value, so it is not deterministic):
SELECT order_id,
MAX(event_ts) AS event_ts,
ANY_VALUE(customer_id) AS customer_id
FROM staging_orders
GROUP BY order_id;
Upsert with MERGE (warehouse syntax):
MERGE INTO dw.orders AS t
USING deduped_source AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.event_ts >= t.event_ts THEN
UPDATE SET amount = s.amount,
customer_id = s.customer_id,
event_ts = s.event_ts,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, amount, event_ts, created_at)
VALUES (s.order_id, s.customer_id, s.amount, s.event_ts, CURRENT_TIMESTAMP);
PostgreSQL upsert (ON CONFLICT):
INSERT INTO dw.orders AS t (order_id, customer_id, amount, event_ts)
SELECT order_id, customer_id, amount, event_ts
FROM deduped_source
ON CONFLICT (order_id) DO UPDATE
SET customer_id = EXCLUDED.customer_id,
    amount = EXCLUDED.amount,
    event_ts = EXCLUDED.event_ts
WHERE EXCLUDED.event_ts >= t.event_ts; -- guard: only newer data overwrites
Worked examples
Example 1: Deduplicate events by key + timestamp
Goal: Keep the latest event per event_id, using event_ts, then seq, then _ingest_ts as tie-breakers.
WITH ranked AS (
SELECT e.*,
ROW_NUMBER() OVER (
PARTITION BY event_id
ORDER BY event_ts DESC, seq DESC, _ingest_ts DESC
) AS rn
FROM staging_events e
)
SELECT *
FROM ranked
WHERE rn = 1;
Why it works: ROW_NUMBER keeps one canonical row per key, deterministically choosing the newest.
Example 2: SCD Type 1 upsert for customers
Goal: Update existing customers with the most recent attributes, insert new ones.
MERGE INTO dim_customers AS t
USING (
  SELECT customer_id, email, phone, event_ts
  FROM (
    SELECT customer_id, email, phone, event_ts,
           ROW_NUMBER() OVER (
             PARTITION BY customer_id
             ORDER BY event_ts DESC, _ingest_ts DESC
           ) AS rn
    FROM staging_customers
  ) ranked
  WHERE rn = 1
) AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.event_ts >= t.event_ts THEN
  UPDATE SET email = s.email,
             phone = s.phone,
             event_ts = s.event_ts,
             updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, phone, event_ts, created_at)
  VALUES (s.customer_id, s.email, s.phone, s.event_ts, CURRENT_TIMESTAMP);
Why filter rn = 1 inside the source: if multiple source rows match one target row, most engines raise a "multiple source rows matched" error, and a WHEN MATCHED AND s.rn = 1 condition does not prevent it. Dedup before the MERGE so it sees exactly one row per key.
Example 3: Idempotent micro-batch ingestion
Goal: Process a batch of rows only once even if retried.
-- 1) Load raw into a staging table with a unique constraint on (event_id)
-- Duplicate inserts will be ignored safely.
INSERT INTO staging_events_dedup (event_id, user_id, event_ts, payload)
SELECT event_id, user_id, event_ts, payload
FROM incoming_batch
ON CONFLICT (event_id) DO NOTHING;
-- 2) Upsert from staging into target with a timestamp guard
MERGE INTO fact_events t
USING staging_events_dedup AS s
ON t.event_id = s.event_id
WHEN MATCHED AND s.event_ts >= t.event_ts THEN UPDATE SET
user_id = s.user_id,
event_ts = s.event_ts,
payload = s.payload,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (event_id, user_id, event_ts, payload, created_at)
VALUES (s.event_id, s.user_id, s.event_ts, s.payload, CURRENT_TIMESTAMP);
Why it works: The unique constraint plus ON CONFLICT prevents double inserts. The MERGE ensures only newer data overwrites.
Step-by-step recipe
- Pick a stable key (e.g., order_id). If none exists, build a composite key (e.g., user_id || '-' || event_date || '-' || sequence).
- Define a deterministic order for winners: event_ts DESC, then version/seq DESC, then _ingest_ts DESC.
- Build a CTE that ranks rows with ROW_NUMBER() OVER (PARTITION BY key ORDER BY ...). Filter rn = 1.
- MERGE the deduped CTE into the target with guards preventing older data from overwriting newer.
- Add constraints where possible (unique index on key) to enforce idempotency.
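The whole recipe can be sketched as one runnable script. This uses SQLite via Python's sqlite3 (window functions need SQLite 3.25+); table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (
        order_id INTEGER, amount REAL, event_ts INTEGER, _ingest_ts INTEGER
    );
    CREATE TABLE dw_orders (
        order_id INTEGER PRIMARY KEY,  -- step 5: constraint enforces idempotency
        amount REAL, event_ts INTEGER
    );
    INSERT INTO staging_orders VALUES
        (1, 10.0, 100, 1), (1, 12.0, 200, 2),  -- duplicate key: newest wins
        (2,  5.0, 100, 3);
""")

# Steps 2-4: rank rows per key deterministically, keep rn = 1,
# then upsert with a timestamp guard so older data cannot overwrite newer.
conn.execute("""
    INSERT INTO dw_orders (order_id, amount, event_ts)
    SELECT order_id, amount, event_ts
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY event_ts DESC, _ingest_ts DESC
        ) AS rn
        FROM staging_orders
    )
    WHERE rn = 1
    ON CONFLICT (order_id) DO UPDATE
    SET amount = excluded.amount,
        event_ts = excluded.event_ts
    WHERE excluded.event_ts >= dw_orders.event_ts
""")
print(conn.execute(
    "SELECT order_id, amount FROM dw_orders ORDER BY order_id").fetchall())
# → [(1, 12.0), (2, 5.0)]
```

Running the script's load statement again leaves the table unchanged, which is exactly the idempotency property the recipe aims for.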
Checklist before deploying
- Unique business key identified
- Explicit tie-breakers (no implicit ORDER BY)
- Timezones handled consistently (event_ts normalized to UTC)
- NULLs considered in ordering (decide how to treat null event_ts)
- Idempotency verified by rerunning the job with same input
- MERGE/ON CONFLICT guards prevent older data from overwriting newer
- Soft deletes handled if your source emits tombstones
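The NULL-ordering item on the checklist is easy to verify in isolation. A small sketch using SQLite via Python's sqlite3 (NULLS LAST needs SQLite 3.30+); names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (id INTEGER, event_ts INTEGER);
    INSERT INTO staging VALUES (1, NULL), (1, 100);
""")

# With an explicit NULLS LAST, a row missing its event_ts always loses
# the tie-break instead of sorting engine-dependently.
winner = conn.execute("""
    SELECT id, event_ts FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY id
            ORDER BY event_ts DESC NULLS LAST
        ) AS rn
        FROM staging
    ) WHERE rn = 1
""").fetchone()
print(winner)  # the non-NULL timestamp wins: (1, 100)
```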
Who this is for
- Data Engineers building incremental pipelines in warehouses or lakes
- Analytics Engineers designing reliable dimensional models
- Developers maintaining CDC-based integrations
Prerequisites
- Comfortable with SQL SELECT, JOIN, GROUP BY
- Basic knowledge of window functions and primary/unique keys
- Familiarity with your warehouse/lake MERGE or ON CONFLICT syntax
Learning path
- Start: Dedup with window functions (ROW_NUMBER, QUALIFY/DISTINCT ON)
- Then: Upserts with MERGE/ON CONFLICT and timestamp/version guards
- Next: Incremental models, CDC patterns, and SCD Type 1 vs Type 2
- Advanced: Handling late data, soft deletes, and schema evolution
Common mistakes and how to self-check
- No deterministic tie-breaker: Symptom: row flips between runs. Fix: add explicit ORDER BY with full tie-break chain.
- Using processing time instead of event time: Leads to wrong winners when late events arrive. Fix: prefer event_ts; use ingest_ts only as final tie-break.
- Blind overwrites: MERGE updates without comparing timestamps/version. Fix: add WHEN MATCHED AND s.event_ts >= t.event_ts.
- Forgetting uniqueness constraints: Staging tables accept duplicates. Fix: unique index on key or ON CONFLICT DO NOTHING.
- Null-order surprises: NULLS FIRST/LAST behavior can vary. Fix: specify NULLS LAST and coalesce where needed.
Self-check
- Re-run the job twice; row counts and checksums identical?
- Introduce a late event; does the correct row win per your rule?
- Simulate a retry; do duplicates appear?
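The rerun check above can be automated. A minimal harness, assuming an SQLite target reachable from Python's sqlite3 (table names are illustrative): run the load twice with the same input and compare a checksum of the target.

```python
import sqlite3, hashlib

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (event_id INTEGER, payload TEXT, event_ts INTEGER);
    CREATE TABLE target (
        event_id INTEGER PRIMARY KEY, payload TEXT, event_ts INTEGER
    );
    INSERT INTO staging VALUES (1, 'a', 100), (2, 'b', 100);
""")

def run_load():
    # The WHERE true is an SQLite quirk: it disambiguates the ON CONFLICT
    # clause when the insert source is a SELECT.
    conn.execute("""
        INSERT INTO target
        SELECT event_id, payload, event_ts FROM staging WHERE true
        ON CONFLICT (event_id) DO UPDATE
        SET payload = excluded.payload, event_ts = excluded.event_ts
        WHERE excluded.event_ts >= target.event_ts
    """)

def checksum():
    rows = conn.execute("SELECT * FROM target ORDER BY event_id").fetchall()
    return hashlib.sha256(repr(rows).encode()).hexdigest()

run_load()
first = checksum()
run_load()  # simulate a retry with the same input
assert checksum() == first, "pipeline is not idempotent"
print("idempotent:", checksum() == first)
```

Sorting by the key before hashing makes the checksum stable across runs; in a warehouse you would compute the same fingerprint with an aggregate over a hash column.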
Practical projects
- Build an order facts pipeline that deduplicates by (order_id) and upserts daily batches.
- Create a customer SCD1 dimension with email/phone updates and soft-delete handling.
- Implement a micro-batch CDC loader that is idempotent with ON CONFLICT and MERGE guards.
Exercises
Try these before opening the solutions.
Exercise 1: Deduplicate then upsert events
Staging table: staging_events(event_id, user_id, event_ts, seq, payload, _ingest_ts). Build a query that selects the single best row per event_id (latest event_ts, then seq, then _ingest_ts). Upsert into fact_events(event_id PK, user_id, event_ts, payload, created_at, updated_at) so only newer data overwrites existing rows.
Hint
Use ROW_NUMBER() in a CTE for dedup, then MERGE with a timestamp guard.
Exercise 2: Customer SCD1 upsert with tie-break
Staging: staging_customers(customer_id, email, event_ts, _ingest_ts). Target: dim_customers(customer_id PK, email, event_ts). Keep the newest email per customer and prevent older updates from overwriting.
Hint
Rank rows per customer_id and only upsert rn=1; compare s.event_ts vs t.event_ts in WHEN MATCHED.
Mini challenge
You discover duplicate order_id values in the target after a retry. What safeguards are missing? Propose a minimal change to make the pipeline idempotent.
Possible answer
Add a unique constraint or primary key on target.order_id and ensure MERGE/ON CONFLICT paths are used; also add a WHERE guard to prevent older data from updating newer rows.
Next steps
- Extend your MERGE logic to handle soft deletes with an is_deleted flag.
- Add data quality checks: count distinct keys before/after, and assert no duplicates.
- Automate backfills with the same idempotent logic.