Why this matters
As an Analytics Engineer, you turn messy source data into reliable models for dashboards, reverse ETL, and ML features. Real-world data is noisy: duplicate events, late-arriving updates, and partial records. Deduplication ensures one truth per business key; upserts make your models idempotent and always current. You will use these patterns daily when building staging models, facts, and dimensions.
Real tasks:
- Clean event logs to one row per event_id.
- Merge the latest customer attributes into a customer_dim table.
- Incrementally refresh large tables without reprocessing everything.
Concept explained simply
Deduplication means keeping exactly one correct row for each natural key (like order_id or user_id + event_time). Upsert means updating the existing row if the key already exists, and inserting a new row otherwise. Together, they make your pipelines correct and repeatable even if the same file or batch lands twice.
Mental model
Think of each natural key as a folder. Many copies try to enter the same folder; dedup picks the best single file to keep. Then upsert places that file into a master cabinet: if a file exists, replace it with the newer one; if not, add it.
Key terms
- Natural key: Business identifier (order_id, email, event_id).
- Surrogate key: Engineered integer or hash used for joins (a short sketch follows this list).
- Idempotent: Running the job multiple times produces the same result as running it once.
- Tie-breaker: Rule to choose one row when multiple candidates exist (e.g., MAX(updated_at)).
- Change timestamp: Field used to detect the newest version (updated_at, _load_time).
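To make the surrogate key idea concrete, here is a minimal sketch that derives a hash-based key from the natural key; MD5 and the VARCHAR cast are assumptions, and the exact function names vary by warehouse.
-- Minimal sketch: a hash-based surrogate key alongside the natural key.
-- MD5 and CAST(... AS VARCHAR) are assumptions; adjust to your warehouse.
SELECT
    MD5(CAST(order_id AS VARCHAR)) AS order_sk,  -- surrogate key used for joins
    order_id,                                    -- natural (business) key
    updated_at                                   -- change timestamp, used as tie-breaker
FROM orders_raw;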
Worked examples
Example 1: Deduplicate orders by latest update
Goal: One row per order_id, keeping the most recent change.
-- Source: orders_raw can contain duplicates and late-arriving rows
WITH ranked AS (
    SELECT
        o.*,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC, ingested_at DESC
        ) AS rn
    FROM orders_raw o
)
SELECT *
FROM ranked
WHERE rn = 1;
- Partition: order_id (natural key)
- Tie-breakers: updated_at, then ingested_at
- Result: exactly one row per order_id
Example 2: Deduplicate events keeping the first occurrence
Some event pipelines want the first-seen version; adjust the ORDER BY to ascending.
WITH ranked AS (
    SELECT
        e.*,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY occurred_at ASC, ingested_at ASC
        ) AS rn
    FROM clickstream_raw e
)
SELECT *
FROM ranked
WHERE rn = 1;
This pattern is useful when the earliest occurrence is authoritative (e.g., first-touch attribution).
Example 3: Upsert into a dimension using MERGE
Use a MERGE to update existing keys and insert new ones. This ANSI-style form works, with minor dialect differences, in most modern warehouses:
MERGE INTO dim_customer AS t
USING staging_customer_dedup AS s
    ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at >= t.updated_at THEN
    UPDATE SET
        name = s.name,
        email = s.email,
        segment = s.segment,
        updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, segment, updated_at)
    VALUES (s.customer_id, s.name, s.email, s.segment, s.updated_at);
Key points:
- Match on the natural key: customer_id
- Optional safeguard: only update if the incoming record is newer
- Idempotent if staging is deduplicated
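If your warehouse does not support MERGE, an INSERT ... ON CONFLICT upsert gives the same Type 1 behavior. The sketch below uses PostgreSQL syntax and assumes a unique constraint on customer_id.
-- PostgreSQL-style alternative; assumes a UNIQUE or PRIMARY KEY constraint on customer_id.
INSERT INTO dim_customer (customer_id, name, email, segment, updated_at)
SELECT customer_id, name, email, segment, updated_at
FROM staging_customer_dedup
ON CONFLICT (customer_id) DO UPDATE SET
    name = EXCLUDED.name,
    email = EXCLUDED.email,
    segment = EXCLUDED.segment,
    updated_at = EXCLUDED.updated_at
WHERE EXCLUDED.updated_at >= dim_customer.updated_at;  -- guarded update: only accept newer data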
Example 4: Incremental dedup with daily partitions
When tables are huge, deduplicate only within the window of recently updated rows instead of the full history.
-- Assume we load only rows where updated_at >= {{max_loaded_updated_at}}
WITH candidate AS (
    SELECT *
    FROM orders_raw
    WHERE updated_at >= {{max_loaded_updated_at}}
), ranked AS (
    SELECT
        c.*,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC, ingested_at DESC
        ) AS rn
    FROM candidate c
)
SELECT *
FROM ranked
WHERE rn = 1;
Then MERGE the result into the target table. Maintain a watermark like max_loaded_updated_at to bound the work.
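The watermark can live in a small control table and be advanced after each successful load. A minimal sketch, where etl_watermarks and the target name orders_fact are hypothetical:
-- Hypothetical control table and target name; adjust to your setup.
-- Run after the MERGE succeeds so a failed load never advances the watermark.
UPDATE etl_watermarks
SET max_loaded_updated_at = (SELECT MAX(updated_at) FROM orders_fact)
WHERE table_name = 'orders_fact';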
Practical templates
Template: Keep latest per key
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY <change_ts> DESC, <ingest_ts> DESC) AS rn
    FROM <source>
)
SELECT * FROM ranked WHERE rn = 1;
Template: Keep first per key
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY <change_ts> ASC, <ingest_ts> ASC) AS rn
    FROM <source>
)
SELECT * FROM ranked WHERE rn = 1;
Template: MERGE upsert (Type 1)
MERGE INTO <target> t
USING <staging_dedup> s
    ON t.<key> = s.<key>
WHEN MATCHED THEN UPDATE SET
    <col1> = s.<col1>,
    <col2> = s.<col2>,
    updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (<key>, <col1>, <col2>, updated_at)
    VALUES (s.<key>, s.<col1>, s.<col2>, s.updated_at);
Exercises you can practice now
These mirror the graded exercises below. Aim for idempotent solutions.
- Exercise 1 (ex1): Deduplicate a user profile stream to one row per user_id, keeping the most recent update. If two rows tie on updated_at, break ties with ingested_at. Output should include user_id, name, email, updated_at.
- Exercise 2 (ex2): Upsert order statuses from a staging table into a target fact table. Update when the incoming updated_at is newer; otherwise skip. Insert if the order_id is new.
Checklist before you run
- Your staging CTE selects only necessary columns.
- Window function uses correct PARTITION BY and ORDER BY.
- Tie-breakers are explicit and deterministic.
- MERGE has a clear ON clause and guarded UPDATE condition.
- Re-running produces the same target table.
Common mistakes and how to self-check
- Missing tie-breakers: If change timestamps are equal, add a secondary sort (ingested_at).
- Dedup after join: Always dedup the source first; joins can multiply duplicates (see the sketch after this list).
- Updating with older data: Add a WHERE or MERGE condition to update only if s.updated_at >= t.updated_at.
- Using surrogate key as the match key: Use the business/natural key for upserts.
- Non-idempotent loads: Ensure staging is deduplicated; avoid INSERT-only into final tables for changing data.
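For the dedup-after-join mistake, the fix is to deduplicate each input in its own CTE before joining so the join cannot fan out. A minimal sketch with illustrative table and column names:
-- Illustrative names; dedup to one row per order_id before joining.
WITH orders_dedup AS (
    SELECT *
    FROM (
        SELECT o.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
        FROM orders_raw o
    ) ranked
    WHERE rn = 1
)
SELECT d.order_id, c.segment
FROM orders_dedup d
JOIN dim_customer c
    ON c.customer_id = d.customer_id;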
Self-check prompts
- If you run the job twice, does the row count change unexpectedly?
- Can you explain your tie-breaker rules in one sentence?
- Is there exactly one row per natural key in the final table? (A quick check is sketched below.)
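A simple uniqueness check you can run after any load; <target> and <key> are placeholders:
-- Returns zero rows when the natural key is unique in <target>.
SELECT <key>, COUNT(*) AS row_count
FROM <target>
GROUP BY <key>
HAVING COUNT(*) > 1;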
Practical projects
- Build a customer_dim Type 1 model: dedup CRM changes, then MERGE into a dimension with audit columns (inserted_at, updated_at). A sketch of the audit columns follows this list.
- Clickstream sessionization: dedup events by event_id, then aggregate sessions per user per day.
- Incremental order fact: daily incremental dedup + MERGE using a watermark and a backfill switch for emergencies.
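For the customer_dim project, a minimal sketch of the audit columns, assuming CURRENT_TIMESTAMP is available and trimming the column list for brevity: inserted_at is stamped once on insert and never touched on update.
-- Sketch only; hypothetical column list, guarded the same way as Example 3.
MERGE INTO dim_customer t
USING staging_customer_dedup s
    ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at >= t.updated_at THEN UPDATE SET
    name = s.name,
    updated_at = s.updated_at                 -- audit: latest source change applied
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, inserted_at, updated_at)
    VALUES (s.customer_id, s.name, CURRENT_TIMESTAMP, s.updated_at);  -- inserted_at set once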
Who this is for
- Analytics Engineers and BI Developers building reliable datasets.
- Data Analysts who maintain staging models.
- Data Engineers who implement ELT patterns in warehouses.
Prerequisites
- Comfortable with SQL SELECT, JOIN, and window functions.
- Basic understanding of dimensions and facts.
- Familiarity with timestamps and data types.
Learning path
- Master window functions (ROW_NUMBER, RANK).
- Practice deterministic tie-breakers.
- Learn MERGE upserts and incremental loads.
- Add data quality checks (row counts, unique keys).
- Automate with scheduled runs and watermarks.
Next steps
- Apply these templates to your staging layers today.
- Add assertions for key uniqueness in your models.
- Take the quick test to check your understanding.
Mini challenge
You receive the same daily file twice, with overlapping customer rows and mixed updated_at values. Write a two-step plan to ensure the final dimension is correct and idempotent. Hint: Dedup in staging, then guarded MERGE.
Quick Test note
You can take the Quick Test now. Anyone can attempt it for free; only logged-in learners will have their progress saved.