Deduplication Strategies

Learn Deduplication Strategies for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

Duplicates creep in from retries, CDC noise, late-arriving files, and multi-source merges. As an ETL Developer, you will:

  • Load customer, product, and transaction data reliably without double-counting.
  • Design idempotent pipelines so re-runs don’t inflate metrics.
  • Merge multiple sources (CRM, web, billing) into a single truth table.
  • Clean CDC streams that emit the same event more than once.
  • Enforce uniqueness while keeping the “best” version of a record.

Concept explained simply

Deduplication removes repeated records so downstream analytics and apps see one correct version. There are two main types:

  • Exact duplicates: same business key and same content (or same business key with a deterministic rule to keep one).
  • Near-duplicates: similar but not identical (fuzzy matching, e.g., “Jon Smith” vs “John Smith”).

Three decisions define your strategy:

  • Scope: dedupe within a batch, within a time window, or across full history.
  • Key: which fields define “same entity”? (e.g., email, customer_id, order_id).
  • Tie-breaker (survivorship): which record wins? (e.g., latest updated_at, highest version, non-null fields preferred).
Example survivorship rules
  • Keep most recent by updated_at; break ties by highest version.
  • Prefer non-null phone/email; otherwise keep earliest created_at.
  • For orders: keep the record with the latest event_time within 24h.

Mental model

Imagine a stack of index cards per entity (grouped by your business key). Deduplication is choosing one card to represent the truth. The rules you set (which stack a card belongs to, how you sort the cards, and which attributes you keep) determine whether the outcome is consistent and repeatable.

Core patterns you’ll use

1) Unique constraints and MERGE/UPSERT

Use a unique index on the business key and load with MERGE/UPSERT. The database prevents duplicates and applies updates deterministically.
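
A minimal sketch, assuming an ANSI-style MERGE (SQL Server / Snowflake syntax shown; dialects differ slightly) and illustrative table and column names. Note that the source must already be deduplicated to one row per key, or MERGE will fail on multiple matches:

-- Enforce uniqueness on the business key in the target (index syntax varies by engine)
CREATE UNIQUE INDEX ux_dim_customer_email ON dim_customer (email);

-- Upsert from an already-deduplicated staging set
MERGE INTO dim_customer AS tgt
USING staging_customers_deduped AS src
  ON tgt.email = src.email
WHEN MATCHED AND src.updated_at > tgt.updated_at THEN
  UPDATE SET first_name = src.first_name,
             last_name  = src.last_name,
             updated_at = src.updated_at
WHEN NOT MATCHED THEN
  INSERT (email, first_name, last_name, updated_at)
  VALUES (src.email, src.first_name, src.last_name, src.updated_at);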

2) SQL windowing (ROW_NUMBER)

For batch staging tables, partition by the business key, order by your tie-breaker, and keep only the rows where row_number = 1.

SELECT * FROM (
  SELECT t.*, ROW_NUMBER() OVER (
    PARTITION BY email ORDER BY updated_at DESC, version DESC
  ) AS rn
  FROM staging_customers t
) x WHERE rn = 1;

3) Hash-based dedupe

Create a stable hash of canonicalized fields (trim, lowercase, normalize). Keep one row per (business_key, content_hash) pair.

4) Incremental + idempotent loads

Design loads so you can safely rerun without creating extra rows. Use deterministic keys and MERGE patterns, not append-only inserts.
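
One minimal sketch of an idempotent pattern, with illustrative table names and a hypothetical load_date column: replace the affected partition wholesale so a re-run produces the same result (transaction syntax varies by engine).

-- Re-loadable daily partition: delete-then-insert inside one transaction
BEGIN;
DELETE FROM fact_orders
WHERE load_date = DATE '2026-01-11';

INSERT INTO fact_orders (order_id, amount, load_date)
SELECT order_id, amount, DATE '2026-01-11'
FROM staging_orders_deduped;
COMMIT;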

5) Streaming with event-time + watermark

In streams, duplicates can arrive late and out of order. Use event-time windows, watermarks, and state keyed by the business key to drop duplicates within a TTL.

6) CDC awareness

Deduplicate on primary key + source LSN/commit_ts to remove emitter retries. For upserts, keep last operation per key in the batch.
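
As a sketch, assuming the CDC feed lands in a staging table that carries the source's commit LSN (table and column names are illustrative):

-- Keep only the last operation per primary key within the batch
WITH ranked AS (
  SELECT c.*, ROW_NUMBER() OVER (
    PARTITION BY order_id
    ORDER BY commit_lsn DESC          -- or commit_ts, whichever your source provides
  ) AS rn
  FROM cdc_orders c
)
SELECT * FROM ranked WHERE rn = 1;    -- apply this row (insert/update/delete) to the target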

7) Fuzzy/near-duplicate matching

When no reliable key exists (e.g., people without unique IDs), use similarity on names, phones, and addresses. Always log matches and thresholds.
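
A hedged sketch, assuming PostgreSQL with the pg_trgm extension (similarity() is Postgres-specific); the blocking field, table names, and 0.6 threshold are illustrative and should be tuned against audited samples:

-- Requires: CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT a.lead_id AS lead_a,
       b.lead_id AS lead_b,
       similarity(a.full_name, b.full_name) AS name_score
FROM leads a
JOIN leads b
  ON a.lead_id < b.lead_id                     -- avoid self- and mirror-pairs
 AND a.postal_code = b.postal_code             -- block on a cheap exact field first
WHERE similarity(a.full_name, b.full_name) > 0.6
ORDER BY name_score DESC;                      -- log these candidate pairs for review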

Worked examples

Example 1: Batch dedupe by email with window function

Goal: Keep the most recent customer per email.

-- Staging has multiple rows per email
WITH ranked AS (
  SELECT c.*, ROW_NUMBER() OVER (
    PARTITION BY email ORDER BY updated_at DESC, version DESC
  ) AS rn
  FROM staging_customers c
)
SELECT customer_id, email, first_name, last_name, updated_at
FROM ranked
WHERE rn = 1;

Why it works: We define one winner per email using updated_at and version.

Example 2: Hash-based dedupe across multi-source customer records

Goal: Avoid storing identical content duplicates when merging CRM + Billing.

-- Canonicalize critical fields, then hash; COALESCE keeps NULL fields from nullifying the hash
SELECT *,
  MD5(CONCAT(
    LOWER(TRIM(COALESCE(email, ''))), '|',
    LOWER(TRIM(COALESCE(phone, ''))), '|',
    LOWER(TRIM(COALESCE(last_name, '')))
  )) AS content_hash
INTO staging_norm
FROM incoming_customers;

-- Keep one per (business_key, content_hash)
WITH ranked AS (
  SELECT s.*, ROW_NUMBER() OVER (
    PARTITION BY customer_key, content_hash ORDER BY updated_at DESC
  ) AS rn
  FROM staging_norm s
)
SELECT * FROM ranked WHERE rn = 1;

Why it works: Identical normalized content maps to the same hash; we drop duplicates safely.

Example 3: Streaming dedupe with event-time and watermark (concept)

Goal: Drop duplicate order events within 30 minutes using event-time.

-- Pseudo-logic
KeyBy: order_id
State: last_seen_event_time per order_id
Watermark: event_time - 30 minutes
On event e:
  if state[order_id] exists and e.event_time <= state[order_id] + 30m:
    drop (duplicate)
  else:
    emit e
    state[order_id] = max(state[order_id], e.event_time)
State TTL: 2 hours

Why it works: Event-time and watermark handle late events while preventing unbounded state growth.

Step-by-step: designing a dedupe job

  1. Clarify the business key (e.g., email, order_id). If missing, define a composite or use fuzzy matching.
  2. Choose the scope (batch-only, historical, or streaming window).
  3. Decide survivorship rules: how to pick the winner and which attributes to keep.
  4. Select a pattern: MERGE/UPSERT, window function, hash-based, or streaming with watermark.
  5. Implement guardrails: unique constraints, idempotent loads, logging of dropped duplicates.
  6. Validate with counts: total in, total out, duplicates dropped, and spot-check winners (a query sketch follows this list).
  7. Automate monitoring: alert on sudden duplicate spikes or rule drift.
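
For step 6, a minimal validation query; the table names are assumptions, so adapt them to your own staging and output tables.

SELECT
  (SELECT COUNT(*) FROM staging_customers)              AS total_in,
  (SELECT COUNT(*) FROM customers_deduped)              AS total_out,
  (SELECT COUNT(*) FROM staging_customers)
    - (SELECT COUNT(*) FROM customers_deduped)          AS duplicates_dropped,
  (SELECT COUNT(DISTINCT email) FROM staging_customers) AS distinct_keys_in;
-- Expect total_out = distinct_keys_in when exactly one row survives per key.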

Practical projects

  • Build a batch dedupe pipeline for customers using a window function and survivorship rules.
  • Implement a MERGE-based idempotent loader into a dimension table.
  • Create a streaming job that deduplicates order events within 30 minutes using event-time.
  • Prototype fuzzy matching for leads without reliable IDs; tune thresholds and audit logs.

Common mistakes and self-check

  • Using processing-time instead of event-time in streams. Self-check: Do late events still dedupe correctly?
  • No survivorship rules. Self-check: If two records tie, which one wins and why?
  • Choosing unstable keys (e.g., temporary IDs). Self-check: Does the key persist across systems?
  • Lack of idempotency. Self-check: Can you re-run yesterday’s load without new duplicates?
  • Ignoring normalization before hashing. Self-check: Do minor differences in case/whitespace create false negatives?
  • Unbounded streaming state. Self-check: Are watermarks and TTLs configured?

Exercises

Complete these to build confidence. The quick test is available to everyone; only logged-in users get saved progress.

Exercise 1: Batch dedupe with ROW_NUMBER

Table: staging_customers(email, first_name, last_name, updated_at, version). Keep one row per email: the latest updated_at; if tied, highest version.

  • Write SQL that returns deduped rows only.
  • Ensure you can re-run without changing results.
Hint

Use ROW_NUMBER partitioned by email; order by updated_at DESC, version DESC; filter rn = 1.

Solution
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY email ORDER BY updated_at DESC, version DESC
  ) AS rn
  FROM staging_customers
)
SELECT email, first_name, last_name, updated_at, version
FROM ranked WHERE rn = 1;

Exercise 2: Streaming dedupe window

You receive events(order_id, event_time, status). De-duplicate by order_id within 30 minutes using event-time and watermark. Describe the logic or write pseudo-SQL.

  • Define the key, window, and state/TTL.
  • Explain what happens to very late events (> 30 minutes).
Hint

Key by order_id, maintain last seen event_time, drop if within 30 minutes window; use watermark to bound lateness; add state TTL.

Solution
-- Pseudo-logic
KeyBy: order_id; Watermark: event_time - 30m; State TTL: 2h
On event e:
  if state[order_id] exists and e.event_time <= state[order_id] + 30m then drop
  else emit and update state[order_id] = max(state[order_id], e.event_time)
Late (> 30m beyond watermark): treat as new (policy-dependent) or route to dead-letter.

Checklist

  • I defined a clear business key.
  • I chose a window or historical scope.
  • I wrote explicit survivorship rules.
  • My approach is idempotent.
  • I know how I will validate counts and winners.

Mini challenge

You ingest daily customer snapshots from two sources. Some emails differ in case and contain spaces. Keep one record per email, preferring non-null phone and the latest updated_at. Describe the normalization, the tie-breaker, and how you’d implement it using either MERGE or a window function.

Tip

Normalize email (lowercase, trim), compute a canonical key, then apply a window or MERGE with a unique index on the canonical key.
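
One possible window-function sketch, with illustrative table names (the two sources are assumed to be unioned into combined_snapshots first):

WITH normalized AS (
  SELECT s.*, LOWER(TRIM(email)) AS email_key                -- canonical key
  FROM combined_snapshots s
),
ranked AS (
  SELECT n.*, ROW_NUMBER() OVER (
    PARTITION BY email_key
    ORDER BY CASE WHEN phone IS NOT NULL THEN 0 ELSE 1 END,  -- prefer non-null phone
             updated_at DESC                                  -- then latest update
  ) AS rn
  FROM normalized n
)
SELECT * FROM ranked WHERE rn = 1;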

Who this is for

  • ETL Developers and Data Engineers building reliable batch/stream pipelines.
  • Analytics Engineers merging multi-source dimensions.

Prerequisites

  • Intermediate SQL (window functions, MERGE/UPSERT).
  • Basic understanding of event-time vs processing-time for streaming.
  • Familiarity with unique constraints and indexes.

Learning path

  • Start: Keys and constraints → Window functions → MERGE/UPSERT.
  • Next: Event-time semantics and watermarks.
  • Advanced: Fuzzy matching and survivorship design.

Next steps

  • Instrument dedupe metrics (duplicates dropped, winners by rule).
  • Create audit logs for dropped duplicates (see the sketch below).
  • Document your keys and survivorship rules so others can reuse them.
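
For the audit-log item above, a minimal sketch (the audit table and its columns are illustrative): persist the losing rows instead of silently discarding them.

-- Rows with rn > 1 are the dropped duplicates; write them to an audit table
WITH ranked AS (
  SELECT c.*, ROW_NUMBER() OVER (
    PARTITION BY email ORDER BY updated_at DESC, version DESC
  ) AS rn
  FROM staging_customers c
)
INSERT INTO dedupe_audit_log (email, updated_at, version, dropped_at, rule)
SELECT email, updated_at, version, CURRENT_TIMESTAMP, 'latest updated_at, highest version'
FROM ranked
WHERE rn > 1;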

Ready for the quick test?

Take the short quiz to check your understanding. The quick test is available to everyone; only logged-in users get saved progress.


Deduplication Strategies — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

