
Deduplication And Upserts

Learn Deduplication And Upserts for free with explanations, exercises, and a quick test (for Analytics Engineers).

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

As an Analytics Engineer, you turn messy source data into reliable models for dashboards, reverse ETL, and ML features. Real-world data is noisy: duplicate events, late-arriving updates, and partial records. Deduplication ensures one truth per business key; upserts make your models idempotent and always current. You will use these patterns daily when building staging models, facts, and dimensions.

Real tasks:

  • Clean event logs to one row per event_id.
  • Merge the latest customer attributes into a customer_dim table.
  • Incrementally refresh large tables without reprocessing everything.

Concept explained simply

Deduplication means keeping exactly one correct row for each natural key (like order_id or user_id + event_time). Upsert means updating the existing row if the key already exists, and inserting a new row otherwise. Together, they make your pipelines correct and repeatable even if the same file or batch lands twice.
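
Before fixing duplicates, confirm they exist. A minimal check, reusing the orders_raw and order_id names from the examples below:

-- Find natural keys that currently have more than one row
SELECT order_id, COUNT(*) AS copies
FROM orders_raw
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY copies DESC;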

Mental model

Think of each natural key as a folder. Many copies try to enter the same folder; dedup picks the best single file to keep. Then upsert places that file into a master cabinet: if a file exists, replace it with the newer one; if not, add it.

Key terms

  • Natural key: Business identifier (order_id, email, event_id).
  • Surrogate key: Engineered integer or hash used for joins (see the sketch after this list).
  • Idempotent: Running the job multiple times produces the same result.
  • Tie-breaker: Rule to choose one row when multiple candidates exist (e.g., MAX(updated_at)).
  • Change timestamp: Field used to detect the newest version (updated_at, _load_time).
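
As a small illustration of the surrogate-key idea, many teams derive a hash of the natural key. This is a sketch only; hash functions and cast syntax vary by warehouse (MD5 is common but not universal):

-- Hash the natural key into a surrogate used for joins
SELECT
  MD5(CAST(order_id AS VARCHAR)) AS order_sk,  -- surrogate key
  order_id                                     -- natural key, kept for upserts
FROM orders_raw;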

Worked examples

Example 1: Deduplicate orders by latest update

Goal: One row per order_id, keeping the most recent change.

-- Source: orders_raw can contain duplicates and late-arriving rows
WITH ranked AS (
  SELECT
    o.*,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY updated_at DESC, ingested_at DESC
    ) AS rn
  FROM orders_raw o
)
SELECT *
FROM ranked
WHERE rn = 1;

  • Partition: order_id (natural key)
  • Tie-breakers: updated_at, then ingested_at
  • Result: exactly one row per order_id
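
Where the warehouse supports QUALIFY (for example Snowflake, BigQuery, or DuckDB), the same dedup can be written without a CTE:

SELECT *
FROM orders_raw
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY order_id
  ORDER BY updated_at DESC, ingested_at DESC
) = 1;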

Example 2: Deduplicate events keeping the first occurrence

Some event pipelines want the first seen version. Adjust the ORDER BY.

WITH ranked AS (
  SELECT
    e.*,
    ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY occurred_at ASC, ingested_at ASC
    ) AS rn
  FROM clickstream_raw e
)
SELECT *
FROM ranked
WHERE rn = 1;

This pattern is useful when the earliest occurrence is authoritative (e.g., first-touch attribution).

Example 3: Upsert into a dimension using MERGE

Use a MERGE to update existing keys and insert new ones. This ANSI-like form works similarly in many warehouses:

MERGE INTO dim_customer AS t
USING staging_customer_dedup AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at >= t.updated_at THEN -- skip stale incoming rows
  UPDATE SET
    name = s.name,
    email = s.email,
    segment = s.segment,
    updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, segment, updated_at)
  VALUES (s.customer_id, s.name, s.email, s.segment, s.updated_at);

Key points:

  • Match on the natural key: customer_id
  • Optional safeguard: only update if the incoming record is newer
  • Idempotent if staging is deduplicated

Example 4: Incremental dedup with daily partitions

When tables are huge, dedup within the updated date window only.

-- Assume we load only rows where updated_at >= {{max_loaded_updated_at}}
WITH candidate AS (
  SELECT *
  FROM orders_raw
  WHERE updated_at >= {{max_loaded_updated_at}}
), ranked AS (
  SELECT
    c.*,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY updated_at DESC, ingested_at DESC
    ) rn
  FROM candidate c
)
SELECT *
FROM ranked
WHERE rn = 1;

Then MERGE the result into the target table. Maintain a watermark like max_loaded_updated_at to bound the work.
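
One way to maintain the watermark is a small control table advanced after each successful load. The etl_watermarks table, its columns, and the orders_fact target name are naming assumptions for this sketch:

-- After the MERGE succeeds, advance the watermark
-- (etl_watermarks and orders_fact are hypothetical names)
UPDATE etl_watermarks
SET max_loaded_updated_at = (SELECT MAX(updated_at) FROM orders_fact)
WHERE table_name = 'orders_fact';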

Practical templates

Template: Keep latest per key

WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY <change_ts> DESC, <ingest_ts> DESC) rn
  FROM <source>
)
SELECT * FROM ranked WHERE rn = 1;

Template: Keep first per key

WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY <change_ts> ASC, <ingest_ts> ASC) rn
  FROM <source>
)
SELECT * FROM ranked WHERE rn = 1;

Template: MERGE upsert (Type 1)

MERGE INTO <target> t
USING <staging_dedup> s
  ON t.<key> = s.<key>
WHEN MATCHED THEN UPDATE SET
  <col1> = s.<col1>,
  <col2> = s.<col2>,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (<key>, <col1>, <col2>, updated_at)
VALUES (s.<key>, s.<col1>, s.<col2>, s.updated_at);

Exercises you can practice now

These mirror the graded exercises below. Aim for idempotent solutions.

  1. Exercise 1 (ex1): Deduplicate a user profile stream to one row per user_id, keeping the most recent update. If two rows tie on updated_at, break ties with ingested_at. Output should include user_id, name, email, updated_at.
  2. Exercise 2 (ex2): Upsert order statuses from a staging table into a target fact table. Update when the incoming updated_at is newer; otherwise skip. Insert if the order_id is new.

Checklist before you run

  • Your staging CTE selects only necessary columns.
  • Window function uses correct PARTITION BY and ORDER BY.
  • Tie-breakers are explicit and deterministic.
  • MERGE has a clear ON clause and guarded UPDATE condition.
  • Re-running produces the same target table.

Common mistakes and how to self-check

  • Missing tie-breakers: If change timestamps are equal, add a secondary sort (ingested_at).
  • Dedup after join: Always dedup the source first; joins can multiply duplicates (see the sketch after this list).
  • Updating with older data: Add a WHERE or MERGE condition to update only if s.updated_at >= t.updated_at.
  • Using surrogate key as the match key: Use the business/natural key for upserts.
  • Non-idempotent loads: Ensure staging is deduplicated; avoid INSERT-only into final tables for changing data.
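
To make the dedup-before-join rule concrete, here is a sketch reusing this lesson's example tables (the customer_id column on orders_raw is an assumption):

-- Dedup the noisy source first, then join
WITH orders_dedup AS (
  SELECT *
  FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (
             PARTITION BY order_id
             ORDER BY updated_at DESC, ingested_at DESC
           ) AS rn
    FROM orders_raw o
  ) ranked
  WHERE rn = 1
)
SELECT d.order_id, d.updated_at, c.segment
FROM orders_dedup d
JOIN dim_customer c
  ON c.customer_id = d.customer_id;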

Self-check prompts

  • If you run the job twice, does the row count change unexpectedly?
  • Can you explain your tie-breaker rules in one sentence?
  • Is there exactly one row per natural key in the final table?
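
These prompts translate into a check you can run after each load (dim_customer and customer_id come from Example 3):

-- Both counts should match, and neither should change on re-run
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT customer_id) AS distinct_keys
FROM dim_customer;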

Practical projects

  • Build a customer_dim Type 1 model: dedup CRM changes, then MERGE into a dimension with audit columns (inserted_at, updated_at); see the sketch after this list.
  • Clickstream sessionization: dedup events by event_id, then aggregate sessions per user per day.
  • Incremental order fact: daily incremental dedup + MERGE using a watermark and a backfill switch for emergencies.
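
For the first project, the audit columns extend the Type 1 MERGE from Example 3: inserted_at is set once at insert and never updated. A sketch, assuming CURRENT_TIMESTAMP is available (the exact function varies by warehouse):

MERGE INTO dim_customer AS t
USING staging_customer_dedup AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at >= t.updated_at THEN
  UPDATE SET
    name = s.name,
    email = s.email,
    segment = s.segment,
    updated_at = s.updated_at  -- inserted_at is never touched on update
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, segment, inserted_at, updated_at)
  VALUES (s.customer_id, s.name, s.email, s.segment, CURRENT_TIMESTAMP, s.updated_at);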

Who this is for

  • Analytics Engineers and BI Developers building reliable datasets.
  • Data Analysts who maintain staging models.
  • Data Engineers who implement ELT patterns in warehouses.

Prerequisites

  • Comfortable with SQL SELECT, JOIN, and window functions.
  • Basic understanding of dimensions and facts.
  • Familiarity with timestamps and data types.

Learning path

  1. Master window functions (ROW_NUMBER, RANK).
  2. Practice deterministic tie-breakers.
  3. Learn MERGE upserts and incremental loads.
  4. Add data quality checks (row counts, unique keys).
  5. Automate with scheduled runs and watermarks.

Next steps

  • Apply these templates to your staging layers today.
  • Add assertions for key uniqueness in your models.
  • Take the quick test to check your understanding.

Mini challenge

You receive the same daily file twice, with overlapping customer rows and mixed updated_at values. Write a two-step plan to ensure the final dimension is correct and idempotent. Hint: Dedup in staging, then guarded MERGE.

Quick Test note

You can take the Quick Test now. Anyone can attempt it for free; only logged-in learners will have their progress saved.

Practice Exercises

2 exercises to complete

Instructions

You have a table user_profile_raw(user_id, name, email, updated_at, ingested_at). Create a SELECT that returns one row per user_id, keeping the most recent updated_at. If there is a tie on updated_at, keep the row with the greatest ingested_at. Return columns: user_id, name, email, updated_at.

Constraints:

  • Use a window function.
  • Solution must be idempotent.

Expected Output

A result set with exactly one row per user_id, reflecting the most recent user attributes.

Deduplication And Upserts — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

