Why this matters
As a Data Engineer, you constantly ingest data that arrives late, duplicated, or corrected. Deduplication prevents double counting and bloated tables. Upserts (update-or-insert) keep your tables accurate without full reloads. You will use these patterns when:
- Loading CDC (change data capture) events into fact and dimension tables
- Building idempotent incremental pipelines in warehouses or lakes
- Maintaining SCD Type 1 dimensions (latest values only)
- Backfilling late-arriving events and reconciling fixes from source systems
Concept explained simply
Deduplication means picking one “winner” row for each key and dropping the rest, using a clear tie-break rule (like most recent event_ts, highest version, or latest ingest timestamp). Upsert means: if the key exists, update it; if not, insert it.
Mental model
Think of your data like a set of folders labeled by primary key. Deduplication is choosing the single most authoritative document in each folder. Upsert is filing a document: if the folder exists, replace its document when the new one is newer; if not, create a new folder for it.
Key concepts and patterns
- Business key vs surrogate key: Dedup and upsert typically use a stable business key (e.g., order_id, user_id+date) rather than a surrogate identity.
- Deterministic tie-breakers: Always define ordering: event_ts (preferred), then version/sequence, then ingest_ts. Explicitly break ties to avoid nondeterminism.
- Idempotency: Running the same job twice should not change results. Enforce with primary/unique keys, MERGE conditions, and WHERE clauses that only accept newer data.
- Late-arriving data: Decide whether older events can overwrite newer rows. Many pipelines keep the latest event only (SCD1) and ignore older ones on conflict.
- Soft deletes: Represent deletes with a flag (is_deleted) or a tombstone event. Your upsert logic should set the flag for matched deletes.
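The soft-delete pattern above can be sketched end to end. This is a minimal illustration using SQLite through Python's sqlite3 module (the ON CONFLICT ... DO UPDATE ... WHERE form needs SQLite 3.24+); the table and column names are hypothetical:

```python
import sqlite3

# Illustrative target table with a soft-delete flag.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        amount     REAL,
        event_ts   INTEGER,
        is_deleted INTEGER NOT NULL DEFAULT 0
    )
""")

def apply_event(order_id, amount, event_ts, op):
    # A delete event ("tombstone") sets the flag instead of removing the row,
    # so downstream consumers can still see that the key existed.
    conn.execute("""
        INSERT INTO orders (order_id, amount, event_ts, is_deleted)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (order_id) DO UPDATE
        SET amount = excluded.amount,
            event_ts = excluded.event_ts,
            is_deleted = excluded.is_deleted
        WHERE excluded.event_ts >= orders.event_ts  -- only newer events win
    """, (order_id, amount, event_ts, 1 if op == "delete" else 0))

apply_event(1, 10.0, 100, "upsert")
apply_event(1, None, 200, "delete")   # tombstone: flag it, don't drop the row
apply_event(1, 9.0, 150, "upsert")    # late event: ignored by the guard
print(conn.execute("SELECT is_deleted, event_ts FROM orders").fetchone())
# → (1, 200)
```

Note how the timestamp guard also protects the tombstone: the late event at ts 150 cannot "resurrect" the deleted row.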
Common SQL strategies
Dedup with window functions:
WITH ranked AS (
SELECT s.*, ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY event_ts DESC, version DESC, _ingest_ts DESC
) AS rn
FROM staging_orders s
)
SELECT * FROM ranked WHERE rn = 1;
Group-and-aggregate (when every field can be safely aggregated; note ANY_VALUE returns an arbitrary row's value, so it is not deterministic):
SELECT order_id,
MAX(event_ts) AS event_ts,
ANY_VALUE(customer_id) AS customer_id
FROM staging_orders
GROUP BY order_id;
Upsert with MERGE (warehouse syntax):
MERGE INTO dw.orders AS t
USING deduped_source AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.event_ts >= t.event_ts THEN
UPDATE SET amount = s.amount,
customer_id = s.customer_id,
event_ts = s.event_ts,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, amount, event_ts, created_at)
VALUES (s.order_id, s.customer_id, s.amount, s.event_ts, CURRENT_TIMESTAMP);
PostgreSQL upsert (ON CONFLICT):
INSERT INTO dw.orders AS t (order_id, customer_id, amount, event_ts)
SELECT order_id, customer_id, amount, event_ts
FROM deduped_source
ON CONFLICT (order_id) DO UPDATE
SET customer_id = EXCLUDED.customer_id,
    amount = EXCLUDED.amount,
    event_ts = EXCLUDED.event_ts
WHERE EXCLUDED.event_ts >= t.event_ts; -- guard: only newer data overwrites
Worked examples
Example 1: Deduplicate events by key + timestamp
Goal: Keep the latest event per event_id, using event_ts, then seq, then _ingest_ts as tie-breakers.
WITH ranked AS (
SELECT e.*,
ROW_NUMBER() OVER (
PARTITION BY event_id
ORDER BY event_ts DESC, seq DESC, _ingest_ts DESC
) AS rn
FROM staging_events e
)
SELECT *
FROM ranked
WHERE rn = 1;
Why it works: ROW_NUMBER keeps one canonical row per key, deterministically choosing the newest.
Example 2: SCD Type 1 upsert for customers
Goal: Update existing customers with the most recent attributes, insert new ones.
MERGE INTO dim_customers AS t
USING (
  SELECT customer_id, email, phone, event_ts
  FROM (
    SELECT customer_id, email, phone, event_ts,
           ROW_NUMBER() OVER (
             PARTITION BY customer_id
             ORDER BY event_ts DESC, _ingest_ts DESC
           ) AS rn
    FROM staging_customers
  ) ranked
  WHERE rn = 1
) AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.event_ts >= t.event_ts THEN
  UPDATE SET email = s.email,
             phone = s.phone,
             event_ts = s.event_ts,
             updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, phone, event_ts, created_at)
  VALUES (s.customer_id, s.email, s.phone, s.event_ts, CURRENT_TIMESTAMP);
Why filter rn = 1 inside the source: if multiple source rows match one target row, most engines raise a "multiple source rows matched" error, and a WHEN MATCHED AND s.rn = 1 condition does not prevent it. Dedup before the MERGE so it sees exactly one row per key.
Example 3: Idempotent micro-batch ingestion
Goal: Process a batch of rows only once even if retried.
-- 1) Load raw into a staging table with a unique constraint on (event_id)
-- Duplicate inserts will be ignored safely.
INSERT INTO staging_events_dedup (event_id, user_id, event_ts, payload)
SELECT event_id, user_id, event_ts, payload
FROM incoming_batch
ON CONFLICT (event_id) DO NOTHING;
-- 2) Upsert from staging into target with a timestamp guard
MERGE INTO fact_events t
USING staging_events_dedup AS s
ON t.event_id = s.event_id
WHEN MATCHED AND s.event_ts >= t.event_ts THEN UPDATE SET
user_id = s.user_id,
event_ts = s.event_ts,
payload = s.payload,
updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (event_id, user_id, event_ts, payload, created_at)
VALUES (s.event_id, s.user_id, s.event_ts, s.payload, CURRENT_TIMESTAMP);
Why it works: The unique constraint plus ON CONFLICT prevents double inserts. The MERGE ensures only newer data overwrites.
Step-by-step recipe
- Pick a stable key (e.g., order_id). If none exists, build a composite key (e.g., user_id || '-' || event_date || '-' || sequence).
- Define a deterministic order for winners: event_ts DESC, then version/seq DESC, then _ingest_ts DESC.
- Build a CTE that ranks rows with ROW_NUMBER() OVER (PARTITION BY key ORDER BY ...). Filter rn = 1.
- MERGE the deduped CTE into the target with guards preventing older data from overwriting newer.
- Add constraints where possible (unique index on key) to enforce idempotency.
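The whole recipe can be sketched as one runnable script. This uses SQLite via Python's sqlite3 (window functions need SQLite 3.25+); table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (
        order_id INTEGER, amount REAL, event_ts INTEGER, _ingest_ts INTEGER
    );
    CREATE TABLE dw_orders (
        order_id INTEGER PRIMARY KEY,  -- step 5: constraint enforces idempotency
        amount REAL, event_ts INTEGER
    );
    INSERT INTO staging_orders VALUES
        (1, 10.0, 100, 1), (1, 12.0, 200, 2),  -- duplicate key: newest wins
        (2,  5.0, 100, 3);
""")

# Steps 2-4: rank rows per key deterministically, keep rn = 1,
# then upsert with a timestamp guard so older data cannot overwrite newer.
conn.execute("""
    INSERT INTO dw_orders (order_id, amount, event_ts)
    SELECT order_id, amount, event_ts
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY event_ts DESC, _ingest_ts DESC
        ) AS rn
        FROM staging_orders
    )
    WHERE rn = 1
    ON CONFLICT (order_id) DO UPDATE
    SET amount = excluded.amount,
        event_ts = excluded.event_ts
    WHERE excluded.event_ts >= dw_orders.event_ts
""")
print(conn.execute(
    "SELECT order_id, amount FROM dw_orders ORDER BY order_id").fetchall())
# → [(1, 12.0), (2, 5.0)]
```

Running the script's load statement again leaves the table unchanged, which is exactly the idempotency property the recipe aims for.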
Checklist before deploying
- Unique business key identified
- Explicit tie-breakers (no implicit ORDER BY)
- Timezones handled consistently (event_ts normalized to UTC)
- NULLs considered in ordering (decide how to treat null event_ts)
- Idempotency verified by rerunning the job with same input
- MERGE/ON CONFLICT guards prevent older data from overwriting newer
- Soft deletes handled if your source emits tombstones
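The NULL-ordering item on the checklist is easy to verify in isolation. A small sketch using SQLite via Python's sqlite3 (NULLS LAST needs SQLite 3.30+); names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (id INTEGER, event_ts INTEGER);
    INSERT INTO staging VALUES (1, NULL), (1, 100);
""")

# With an explicit NULLS LAST, a row missing its event_ts always loses
# the tie-break instead of sorting engine-dependently.
winner = conn.execute("""
    SELECT id, event_ts FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY id
            ORDER BY event_ts DESC NULLS LAST
        ) AS rn
        FROM staging
    ) WHERE rn = 1
""").fetchone()
print(winner)  # the non-NULL timestamp wins: (1, 100)
```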
Who this is for
- Data Engineers building incremental pipelines in warehouses or lakes
- Analytics Engineers designing reliable dimensional models
- Developers maintaining CDC-based integrations
Prerequisites
- Comfortable with SQL SELECT, JOIN, GROUP BY
- Basic knowledge of window functions and primary/unique keys
- Familiarity with your warehouse/lake MERGE or ON CONFLICT syntax
Learning path
- Start: Dedup with window functions (ROW_NUMBER, QUALIFY/DISTINCT ON)
- Then: Upserts with MERGE/ON CONFLICT and timestamp/version guards
- Next: Incremental models, CDC patterns, and SCD Type 1 vs Type 2
- Advanced: Handling late data, soft deletes, and schema evolution
Common mistakes and how to self-check
- No deterministic tie-breaker: Symptom: row flips between runs. Fix: add explicit ORDER BY with full tie-break chain.
- Using processing time instead of event time: Leads to wrong winners when late events arrive. Fix: prefer event_ts; use ingest_ts only as final tie-break.
- Blind overwrites: MERGE updates without comparing timestamps/version. Fix: add WHEN MATCHED AND s.event_ts >= t.event_ts.
- Forgetting uniqueness constraints: Staging tables accept duplicates. Fix: unique index on key or ON CONFLICT DO NOTHING.
- Null-order surprises: NULLS FIRST/LAST behavior can vary. Fix: specify NULLS LAST and coalesce where needed.
Self-check
- Re-run the job twice; row counts and checksums identical?
- Introduce a late event; does the correct row win per your rule?
- Simulate a retry; do duplicates appear?
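The rerun check above can be automated. A minimal harness, assuming an SQLite target reachable from Python's sqlite3 (table names are illustrative): run the load twice with the same input and compare a checksum of the target.

```python
import sqlite3, hashlib

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (event_id INTEGER, payload TEXT, event_ts INTEGER);
    CREATE TABLE target (
        event_id INTEGER PRIMARY KEY, payload TEXT, event_ts INTEGER
    );
    INSERT INTO staging VALUES (1, 'a', 100), (2, 'b', 100);
""")

def run_load():
    # The WHERE true is an SQLite quirk: it disambiguates the ON CONFLICT
    # clause when the insert source is a SELECT.
    conn.execute("""
        INSERT INTO target
        SELECT event_id, payload, event_ts FROM staging WHERE true
        ON CONFLICT (event_id) DO UPDATE
        SET payload = excluded.payload, event_ts = excluded.event_ts
        WHERE excluded.event_ts >= target.event_ts
    """)

def checksum():
    rows = conn.execute("SELECT * FROM target ORDER BY event_id").fetchall()
    return hashlib.sha256(repr(rows).encode()).hexdigest()

run_load()
first = checksum()
run_load()  # simulate a retry with the same input
assert checksum() == first, "pipeline is not idempotent"
print("idempotent:", checksum() == first)
```

Sorting by the key before hashing makes the checksum stable across runs; in a warehouse you would compute the same fingerprint with an aggregate over a hash column.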
Practical projects
- Build an order facts pipeline that deduplicates by (order_id) and upserts daily batches.
- Create a customer SCD1 dimension with email/phone updates and soft-delete handling.
- Implement a micro-batch CDC loader that is idempotent with ON CONFLICT and MERGE guards.
Exercises
Try these before opening the solutions.
Exercise 1: Deduplicate then upsert events
Staging table: staging_events(event_id, user_id, event_ts, seq, payload, _ingest_ts). Build a query that selects the single best row per event_id (latest event_ts, then seq, then _ingest_ts). Upsert into fact_events(event_id PK, user_id, event_ts, payload, created_at, updated_at) so only newer data overwrites existing rows.
Hint
Use ROW_NUMBER() in a CTE for dedup, then MERGE with a timestamp guard.
Exercise 2: Customer SCD1 upsert with tie-break
Staging: staging_customers(customer_id, email, event_ts, _ingest_ts). Target: dim_customers(customer_id PK, email, event_ts). Keep the newest email per customer and prevent older updates from overwriting.
Hint
Rank rows per customer_id and only upsert rn=1; compare s.event_ts vs t.event_ts in WHEN MATCHED.
Mini challenge
You discover duplicate order_id values in the target after a retry. What safeguards are missing? Propose a minimal change to make the pipeline idempotent.
Possible answer
Add a unique constraint or primary key on target.order_id and ensure MERGE/ON CONFLICT paths are used; also add a WHERE guard to prevent older data from updating newer rows.
Next steps
- Extend your MERGE logic to handle soft deletes with an is_deleted flag.
- Add data quality checks: count distinct keys before/after, and assert no duplicates.
- Automate backfills with the same idempotent logic.