Why this matters
As an Analytics Engineer, you turn messy source data into reliable models for dashboards, reverse ETL, and ML features. Real-world data is noisy: duplicate events, late-arriving updates, and partial records. Deduplication ensures one truth per business key; upserts make your models idempotent and always current. You will use these patterns daily when building staging models, facts, and dimensions.
Real tasks:
- Clean event logs to one row per event_id.
- Merge the latest customer attributes into a customer_dim table.
- Incrementally refresh large tables without reprocessing everything.
Concept explained simply
Deduplication means keeping exactly one correct row for each natural key (like order_id or user_id + event_time). Upsert means updating the existing row if the key already exists, and inserting a new row otherwise. Together, they make your pipelines correct and repeatable even if the same file or batch lands twice.
Mental model
Think of each natural key as a folder. Many copies try to enter the same folder; dedup picks the best single file to keep. Then upsert places that file into a master cabinet: if a file exists, replace it with the newer one; if not, add it.
Key terms
- Natural key: Business identifier (order_id, email, event_id).
- Surrogate key: Engineered integer or hash used for joins (a short sketch follows this list).
- Idempotent: Running the job multiple times produces the same result as running it once.
- Tie-breaker: Rule to choose one row when multiple candidates exist (e.g., MAX(updated_at)).
- Change timestamp: Field used to detect the newest version (updated_at, _load_time).
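To make the surrogate key idea concrete, here is a minimal sketch that derives a hash-based key from the natural key; MD5 and the VARCHAR cast are assumptions, and the exact function names vary by warehouse.
-- Minimal sketch: a hash-based surrogate key alongside the natural key.
-- MD5 and CAST(... AS VARCHAR) are assumptions; adjust to your warehouse.
SELECT
    MD5(CAST(order_id AS VARCHAR)) AS order_sk,  -- surrogate key used for joins
    order_id,                                    -- natural (business) key
    updated_at                                   -- change timestamp, used as tie-breaker
FROM orders_raw;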
Worked examples
Example 1: Deduplicate orders by latest update
Goal: One row per order_id, keeping the most recent change.
-- Source: orders_raw can contain duplicates and late-arriving rows
WITH ranked AS (
    SELECT
        o.*,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC, ingested_at DESC
        ) AS rn
    FROM orders_raw o
)
SELECT *
FROM ranked
WHERE rn = 1;
- Partition: order_id (natural key)
- Tie-breakers: updated_at, then ingested_at
- Result: exactly one row per order_id
Example 2: Deduplicate events keeping the first occurrence
Some event pipelines want the first-seen version; adjust the ORDER BY to ascending.
WITH ranked AS (
    SELECT
        e.*,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY occurred_at ASC, ingested_at ASC
        ) AS rn
    FROM clickstream_raw e
)
SELECT *
FROM ranked
WHERE rn = 1;
This pattern is useful when the earliest occurrence is authoritative (e.g., first-touch attribution).
Example 3: Upsert into a dimension using MERGE
Use a MERGE to update existing keys and insert new ones. This ANSI-style form works, with minor dialect differences, in most modern warehouses:
MERGE INTO dim_customer AS t
USING staging_customer_dedup AS s
    ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at >= t.updated_at THEN
    UPDATE SET
        name = s.name,
        email = s.email,
        segment = s.segment,
        updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, segment, updated_at)
    VALUES (s.customer_id, s.name, s.email, s.segment, s.updated_at);
Key points:
- Match on the natural key: customer_id
- Optional safeguard: only update if the incoming record is newer
- Idempotent if staging is deduplicated
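If your warehouse does not support MERGE, an INSERT ... ON CONFLICT upsert gives the same Type 1 behavior. The sketch below uses PostgreSQL syntax and assumes a unique constraint on customer_id.
-- PostgreSQL-style alternative; assumes a UNIQUE or PRIMARY KEY constraint on customer_id.
INSERT INTO dim_customer (customer_id, name, email, segment, updated_at)
SELECT customer_id, name, email, segment, updated_at
FROM staging_customer_dedup
ON CONFLICT (customer_id) DO UPDATE SET
    name = EXCLUDED.name,
    email = EXCLUDED.email,
    segment = EXCLUDED.segment,
    updated_at = EXCLUDED.updated_at
WHERE EXCLUDED.updated_at >= dim_customer.updated_at;  -- guarded update: only accept newer data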
Example 4: Incremental dedup with daily partitions
When tables are huge, deduplicate only within the window of recently updated rows instead of the full history.
-- Assume we load only rows where updated_at >= {{max_loaded_updated_at}}
WITH candidate AS (
    SELECT *
    FROM orders_raw
    WHERE updated_at >= {{max_loaded_updated_at}}
), ranked AS (
    SELECT
        c.*,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC, ingested_at DESC
        ) AS rn
    FROM candidate c
)
SELECT *
FROM ranked
WHERE rn = 1;
Then MERGE the result into the target table. Maintain a watermark like max_loaded_updated_at to bound the work.
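The watermark can live in a small control table and be advanced after each successful load. A minimal sketch, where etl_watermarks and the target name orders_fact are hypothetical:
-- Hypothetical control table and target name; adjust to your setup.
-- Run after the MERGE succeeds so a failed load never advances the watermark.
UPDATE etl_watermarks
SET max_loaded_updated_at = (SELECT MAX(updated_at) FROM orders_fact)
WHERE table_name = 'orders_fact';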
Practical templates
Template: Keep latest per key
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY <change_ts> DESC, <ingest_ts> DESC) AS rn
    FROM <source>
)
SELECT * FROM ranked WHERE rn = 1;
Template: Keep first per key
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY <change_ts> ASC, <ingest_ts> ASC) AS rn
    FROM <source>
)
SELECT * FROM ranked WHERE rn = 1;
Template: MERGE upsert (Type 1)
MERGE INTO <target> t
USING <staging_dedup> s
    ON t.<key> = s.<key>
WHEN MATCHED THEN UPDATE SET
    <col1> = s.<col1>,
    <col2> = s.<col2>,
    updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (<key>, <col1>, <col2>, updated_at)
    VALUES (s.<key>, s.<col1>, s.<col2>, s.updated_at);
Exercises you can practice now
These mirror the graded exercises below. Aim for idempotent solutions.
- Exercise 1 (ex1): Deduplicate a user profile stream to one row per user_id, keeping the most recent update. If two rows tie on updated_at, break ties with ingested_at. Output should include user_id, name, email, updated_at.
- Exercise 2 (ex2): Upsert order statuses from a staging table into a target fact table. Update when the incoming updated_at is newer; otherwise skip. Insert if the order_id is new.
Checklist before you run
- Your staging CTE selects only necessary columns.
- Window function uses correct PARTITION BY and ORDER BY.
- Tie-breakers are explicit and deterministic.
- MERGE has a clear ON clause and guarded UPDATE condition.
- Re-running produces the same target table.
Common mistakes and how to self-check
- Missing tie-breakers: If change timestamps are equal, add a secondary sort (ingested_at).
- Dedup after join: Always dedup the source first; joins can multiply duplicates (see the sketch after this list).
- Updating with older data: Add a WHERE or MERGE condition to update only if s.updated_at >= t.updated_at.
- Using surrogate key as the match key: Use the business/natural key for upserts.
- Non-idempotent loads: Ensure staging is deduplicated; avoid INSERT-only into final tables for changing data.
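For the dedup-after-join mistake, the fix is to deduplicate each input in its own CTE before joining so the join cannot fan out. A minimal sketch with illustrative table and column names:
-- Illustrative names; dedup to one row per order_id before joining.
WITH orders_dedup AS (
    SELECT *
    FROM (
        SELECT o.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
        FROM orders_raw o
    ) ranked
    WHERE rn = 1
)
SELECT d.order_id, c.segment
FROM orders_dedup d
JOIN dim_customer c
    ON c.customer_id = d.customer_id;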
Self-check prompts
- If you run the job twice, does the row count change unexpectedly?
- Can you explain your tie-breaker rules in one sentence?
- Is there exactly one row per natural key in the final table? (A quick check is sketched below.)
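A simple uniqueness check you can run after any load; <target> and <key> are placeholders:
-- Returns zero rows when the natural key is unique in <target>.
SELECT <key>, COUNT(*) AS row_count
FROM <target>
GROUP BY <key>
HAVING COUNT(*) > 1;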
Practical projects
- Build a customer_dim Type 1 model: dedup CRM changes, then MERGE into a dimension with audit columns (inserted_at, updated_at). A sketch of the audit columns follows this list.
- Clickstream sessionization: dedup events by event_id, then aggregate sessions per user per day.
- Incremental order fact: daily incremental dedup + MERGE using a watermark and a backfill switch for emergencies.
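For the customer_dim project, a minimal sketch of the audit columns, assuming CURRENT_TIMESTAMP is available and trimming the column list for brevity: inserted_at is stamped once on insert and never touched on update.
-- Sketch only; hypothetical column list, guarded the same way as Example 3.
MERGE INTO dim_customer t
USING staging_customer_dedup s
    ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at >= t.updated_at THEN UPDATE SET
    name = s.name,
    updated_at = s.updated_at                 -- audit: latest source change applied
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, inserted_at, updated_at)
    VALUES (s.customer_id, s.name, CURRENT_TIMESTAMP, s.updated_at);  -- inserted_at set once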
Who this is for
- Analytics Engineers and BI Developers building reliable datasets.
- Data Analysts who maintain staging models.
- Data Engineers who implement ELT patterns in warehouses.
Prerequisites
- Comfortable with SQL SELECT, JOIN, and window functions.
- Basic understanding of dimensions and facts.
- Familiarity with timestamps and data types.
Learning path
- Master window functions (ROW_NUMBER, RANK).
- Practice deterministic tie-breakers.
- Learn MERGE upserts and incremental loads.
- Add data quality checks (row counts, unique keys).
- Automate with scheduled runs and watermarks.
Next steps
- Apply these templates to your staging layers today.
- Add assertions for key uniqueness in your models.
- Take the quick test to check your understanding.
Mini challenge
You receive the same daily file twice, with overlapping customer rows and mixed updated_at values. Write a two-step plan to ensure the final dimension is correct and idempotent. Hint: Dedup in staging, then guarded MERGE.
Quick Test note
You can take the Quick Test now. Anyone can attempt it for free; only logged-in learners will have their progress saved.