Deduplication And Upserts

Learn deduplication and upserts with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

As a Data Engineer, you constantly ingest data that arrives late, duplicated, or corrected. Deduplication prevents double counting and bloated tables. Upserts (update-or-insert) keep your tables accurate without full reloads. You will use these patterns when:

  • Loading CDC (change data capture) events into fact and dimension tables
  • Building idempotent incremental pipelines in warehouses or lakes
  • Maintaining SCD Type 1 dimensions (latest values only)
  • Backfilling late-arriving events and reconciling fixes from source systems

Concept explained simply

Deduplication means picking one “winner” row for each key and dropping the rest, using a clear tie-break rule (like most recent event_ts, highest version, or latest ingest timestamp). Upsert means: if the key exists, update it; if not, insert it.

Mental model

Think of your data as a set of folders labeled by primary key. Deduplication is choosing the single most authoritative document in each folder and discarding the rest. Upsert is filing an incoming document: if a folder for its key already exists, replace the document inside (typically only when the incoming one is newer); if not, create a new folder.

Key concepts and patterns

  • Business key vs surrogate key: Dedup and upsert typically use a stable business key (e.g., order_id, user_id+date) rather than a surrogate identity.
  • Deterministic tie-breakers: Always define an explicit ordering: event_ts (preferred), then version/sequence, then ingest_ts. Break remaining ties explicitly to avoid nondeterminism.
  • Idempotency: Running the same job twice should not change results. Enforce with primary/unique keys, MERGE conditions, and WHERE clauses that only accept newer data.
  • Late-arriving data: Decide whether older events can overwrite newer rows. Many pipelines keep the latest event only (SCD1) and ignore older ones on conflict.
  • Soft deletes: Represent deletes with a flag (is_deleted) or a tombstone event. Your upsert logic should set the flag for matched deletes (see the sketch after this list).
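
A minimal soft-delete sketch in warehouse-style MERGE (engines such as Snowflake, Databricks, or PostgreSQL 15+, where the first matching WHEN clause wins). The op column marking tombstones and the is_deleted flag are assumed names; the rest reuses the dw.orders example below:

MERGE INTO dw.orders AS t
USING deduped_source AS s
ON t.order_id = s.order_id
-- Tombstone: flag the row instead of physically deleting it
WHEN MATCHED AND s.op = 'delete' AND s.event_ts >= t.event_ts THEN
  UPDATE SET is_deleted = TRUE,
             event_ts = s.event_ts,
             updated_at = CURRENT_TIMESTAMP
-- Regular update: only newer data overwrites
WHEN MATCHED AND s.event_ts >= t.event_ts THEN
  UPDATE SET amount = s.amount,
             is_deleted = FALSE,
             event_ts = s.event_ts,
             updated_at = CURRENT_TIMESTAMP
-- Never insert a row we only know as deleted
WHEN NOT MATCHED AND s.op <> 'delete' THEN
  INSERT (order_id, amount, event_ts, is_deleted, created_at)
  VALUES (s.order_id, s.amount, s.event_ts, FALSE, CURRENT_TIMESTAMP);
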
Common SQL strategies

Dedup with window functions:

WITH ranked AS (
  SELECT s.*, ROW_NUMBER() OVER (
    PARTITION BY order_id
    ORDER BY event_ts DESC, version DESC, _ingest_ts DESC
  ) AS rn
  FROM staging_orders s
)
SELECT * FROM ranked WHERE rn = 1;
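
In engines that support QUALIFY (for example Snowflake, BigQuery, DuckDB, or Databricks), the same dedup can be written without a CTE:

SELECT *
FROM staging_orders
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY order_id
  ORDER BY event_ts DESC, version DESC, _ingest_ts DESC
) = 1;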

Group-and-aggregate (when fields can be aggregated; note that ANY_VALUE returns an arbitrary value per group, so use it only for columns that are constant within a key):

SELECT order_id,
       MAX(event_ts) AS event_ts,
       ANY_VALUE(customer_id) AS customer_id
FROM staging_orders
GROUP BY order_id;

Upsert with MERGE (warehouse syntax):

MERGE INTO dw.orders AS t
USING deduped_source AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.event_ts >= t.event_ts THEN
  UPDATE SET amount = s.amount,
             customer_id = s.customer_id,
             event_ts = s.event_ts,
             updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, amount, event_ts, created_at)
  VALUES (s.order_id, s.customer_id, s.amount, s.event_ts, CURRENT_TIMESTAMP);

PostgreSQL upsert (ON CONFLICT):

INSERT INTO dw.orders AS o (order_id, customer_id, amount, event_ts)
SELECT order_id, customer_id, amount, event_ts
FROM deduped_source
ON CONFLICT (order_id) DO UPDATE
SET customer_id = EXCLUDED.customer_id,
    amount = EXCLUDED.amount,
    event_ts = EXCLUDED.event_ts
WHERE EXCLUDED.event_ts >= o.event_ts;  -- guard: only newer (or equal) data overwrites
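
One PostgreSQL caveat: if deduped_source still contains duplicate order_id values, the statement fails with "ON CONFLICT DO UPDATE command cannot affect row a second time". DISTINCT ON is a convenient way to guarantee one row per key first:

SELECT DISTINCT ON (order_id)
       order_id, customer_id, amount, event_ts
FROM staging_orders
ORDER BY order_id, event_ts DESC, version DESC, _ingest_ts DESC;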

Worked examples

Example 1: Deduplicate events by key + timestamp

Goal: Keep the latest event per event_id, ordering by event_ts, then seq, then _ingest_ts as tie-breakers.

WITH ranked AS (
  SELECT e.*,
         ROW_NUMBER() OVER (
           PARTITION BY event_id
           ORDER BY event_ts DESC, seq DESC, _ingest_ts DESC
         ) AS rn
  FROM staging_events e
)
SELECT *
FROM ranked
WHERE rn = 1;

Why it works: ROW_NUMBER keeps one canonical row per key, deterministically choosing the newest.

Example 2: SCD Type 1 upsert for customers

Goal: Update existing customers with the most recent attributes, insert new ones.

MERGE INTO dim_customers AS t
USING (
  SELECT customer_id, email, phone, event_ts
  FROM (
    SELECT customer_id, email, phone, event_ts,
           ROW_NUMBER() OVER (
             PARTITION BY customer_id
             ORDER BY event_ts DESC, _ingest_ts DESC
           ) AS rn
    FROM staging_customers
  ) ranked
  WHERE rn = 1  -- dedupe before the MERGE; many engines error
                -- when several source rows match one target row
) s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.event_ts >= t.event_ts THEN
  UPDATE SET email = s.email,
             phone = s.phone,
             event_ts = s.event_ts,
             updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, phone, event_ts, created_at)
  VALUES (s.customer_id, s.email, s.phone, s.event_ts, CURRENT_TIMESTAMP);

Why it works: filtering rn = 1 inside the source guarantees at most one row per customer_id, and the event_ts guard keeps older updates from clobbering newer ones.

Example 3: Idempotent micro-batch ingestion

Goal: Process a batch of rows only once even if retried.

-- 1) Load raw into a staging table with a unique constraint on (event_id)
--    Duplicate inserts will be ignored safely.
INSERT INTO staging_events_dedup (event_id, user_id, event_ts, payload)
SELECT event_id, user_id, event_ts, payload
FROM incoming_batch
ON CONFLICT (event_id) DO NOTHING;

-- 2) Upsert from staging into target with a timestamp guard
MERGE INTO fact_events t
USING (
  SELECT * FROM staging_events_dedup
) s
ON t.event_id = s.event_id
WHEN MATCHED AND s.event_ts >= t.event_ts THEN UPDATE SET
  user_id = s.user_id,
  event_ts = s.event_ts,
  payload = s.payload,
  updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (event_id, user_id, event_ts, payload, created_at)
VALUES (s.event_id, s.user_id, s.event_ts, s.payload, CURRENT_TIMESTAMP);

Why it works: The unique constraint plus ON CONFLICT prevents double inserts, and the MERGE guard ensures only newer data overwrites. Note that step 1 uses PostgreSQL's ON CONFLICT while step 2 uses MERGE; PostgreSQL 15+ supports both in one pipeline, and warehouse engines typically use MERGE for both steps.

Step-by-step recipe

  1. Pick a stable key (e.g., order_id). If none exists, build a composite key (e.g., user_id || '-' || event_date || '-' || sequence).
  2. Define a deterministic order for winners: event_ts DESC, then version/seq DESC, then _ingest_ts DESC.
  3. Build a CTE that ranks rows with ROW_NUMBER() OVER (PARTITION BY key ORDER BY ...). Filter rn = 1.
  4. MERGE the deduped CTE into the target with guards preventing older data from overwriting newer.
  5. Add constraints where possible (a unique index on the key) to enforce idempotency, as sketched below.
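
A minimal sketch for step 5 in PostgreSQL syntax, using the fact_events table from the examples (the index name is hypothetical). Many warehouses declare but do not enforce uniqueness, so there you must rely on the dedupe-then-MERGE pattern instead:

-- One row per business key: retries cannot create duplicates
CREATE UNIQUE INDEX IF NOT EXISTS ux_fact_events_event_id
  ON fact_events (event_id);
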
Checklist before deploying

  • Unique business key identified
  • Explicit tie-breakers (no implicit ORDER BY)
  • Timezones handled consistently (event_ts normalized to UTC)
  • NULLs considered in ordering (decide how to treat null event_ts)
  • Idempotency verified by rerunning the job with same input
  • MERGE/ON CONFLICT guards prevent older data from overwriting newer
  • Soft deletes handled if your source emits tombstones

Who this is for

  • Data Engineers building incremental pipelines in warehouses or lakes
  • Analytics Engineers designing reliable dimensional models
  • Developers maintaining CDC-based integrations

Prerequisites

  • Comfortable with SQL SELECT, JOIN, GROUP BY
  • Basic knowledge of window functions and primary/unique keys
  • Familiarity with your warehouse/lake MERGE or ON CONFLICT syntax

Learning path

  1. Start: Dedup with window functions (ROW_NUMBER, QUALIFY/DISTINCT ON)
  2. Then: Upserts with MERGE/ON CONFLICT and timestamp/version guards
  3. Next: Incremental models, CDC patterns, and SCD Type 1 vs Type 2
  4. Advanced: Handling late data, soft deletes, and schema evolution

Common mistakes and how to self-check

  • No deterministic tie-breaker: Symptom: row flips between runs. Fix: add explicit ORDER BY with full tie-break chain.
  • Using processing time instead of event time: Leads to wrong winners when late events arrive. Fix: prefer event_ts; use ingest_ts only as final tie-break.
  • Blind overwrites: MERGE updates without comparing timestamps/version. Fix: add WHEN MATCHED AND s.event_ts >= t.event_ts.
  • Forgetting uniqueness constraints: Staging tables accept duplicates. Fix: unique index on key or ON CONFLICT DO NOTHING.
  • Null-order surprises: NULLS FIRST/LAST defaults vary by engine. Fix: specify NULLS LAST explicitly and coalesce where needed, as in the sketch after this list.
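
An explicit NULL policy in the tie-break, where NULLS FIRST/LAST is supported (for example PostgreSQL or Snowflake):

ROW_NUMBER() OVER (
  PARTITION BY order_id
  ORDER BY event_ts DESC NULLS LAST,
           version DESC NULLS LAST,
           _ingest_ts DESC NULLS LAST
) AS rn
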
Self-check

  • Re-run the job twice; are row counts and checksums identical?
  • Introduce a late event; does the correct row win per your rule?
  • Simulate a retry; do duplicates appear? (A duplicate probe follows this list.)
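
A quick duplicate probe, using the fact_events target from the examples:

-- Expect zero rows after any run or re-run
SELECT event_id, COUNT(*) AS copies
FROM fact_events
GROUP BY event_id
HAVING COUNT(*) > 1;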

Practical projects

  • Build an order facts pipeline that deduplicates by (order_id) and upserts daily batches.
  • Create a customer SCD1 dimension with email/phone updates and soft-delete handling.
  • Implement a micro-batch CDC loader that is idempotent with ON CONFLICT and MERGE guards.

Exercises

These mirror the exercises below. Try them before opening the solutions.

Exercise 1: Deduplicate then upsert events

Staging table: staging_events(event_id, user_id, event_ts, seq, payload, _ingest_ts). Build a query that selects the single best row per event_id (latest event_ts, then seq, then _ingest_ts). Upsert into fact_events(event_id PK, user_id, event_ts, payload, created_at, updated_at) so only newer data overwrites existing rows.

Hint

Use ROW_NUMBER() in a CTE for dedup, then MERGE with a timestamp guard.

Exercise 2: Customer SCD1 upsert with tie-break

Staging: staging_customers(customer_id, email, event_ts, _ingest_ts). Target: dim_customers(customer_id PK, email, event_ts). Keep the newest email per customer and prevent older updates from overwriting.

Hint

Rank rows per customer_id and only upsert rn=1; compare s.event_ts vs t.event_ts in WHEN MATCHED.

Mini challenge

You discover duplicate order_id values in the target after a retry. What safeguards are missing? Propose a minimal change to make the pipeline idempotent.

Possible answer

Add a unique constraint or primary key on target.order_id, ensure inserts go through MERGE/ON CONFLICT paths, and add a WHERE or WHEN MATCHED guard so older data cannot update newer rows (a sketch follows).
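
A minimal version of the first safeguard in PostgreSQL syntax (the constraint name is hypothetical); the guard itself is the same WHERE/WHEN MATCHED condition shown in the upsert examples above:

-- Reject duplicate business keys at the storage layer
ALTER TABLE dw.orders
  ADD CONSTRAINT uq_orders_order_id UNIQUE (order_id);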

Next steps

  • Extend your MERGE logic to handle soft deletes with an is_deleted flag.
  • Add data quality checks: count distinct keys before/after, and assert no duplicates.
  • Automate backfills with the same idempotent logic.

Practice Exercises

2 exercises to complete

Instructions

Staging table schema:

  • staging_events(event_id STRING, user_id STRING, event_ts TIMESTAMP, seq INT, payload STRING, _ingest_ts TIMESTAMP)

Target schema:

  • fact_events(event_id STRING PRIMARY KEY, user_id STRING, event_ts TIMESTAMP, payload STRING, created_at TIMESTAMP, updated_at TIMESTAMP)

Task:

  1. Build a CTE that returns one row per event_id using priority: event_ts DESC, then seq DESC, then _ingest_ts DESC.
  2. Upsert into fact_events so that only newer events overwrite existing rows. Insert if not present.

Expected Output

fact_events contains one row per event_id. Updates occur only when the incoming event_ts is newer than the stored event_ts.

Deduplication And Upserts — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
