Topic 3 of 12

Duplicate Removal

Learn Duplicate Removal for free with explanations, exercises, and a quick test for Data Analysts.

Published: December 19, 2025 | Updated: December 19, 2025

Why this matters

Duplicates distort metrics, create bad customer experiences, and waste marketing spend. Data Analysts remove duplicates to ensure accurate reporting and reliable models.

  • Customer analytics: One person may appear multiple times due to typos or form variations. Deduping aligns lifetime value and churn rates.
  • Sales pipeline: Duplicate leads inflate pipeline and cause double outreach.
  • Product analytics: Event duplicates overcount conversions or sessions.

Note: The quick test on this page is available to everyone. Only logged‑in users will see saved progress.

Concept explained simply

A duplicate is a record that represents the same real-world entity as another record according to a rule you define (the dedupe key). Removal means keeping one record per entity and dropping (or merging) the rest.

Mental model

  • Define: What makes two rows "the same" here? (Exact match? Same email? Same name+DOB?)
  • Normalize: Make comparable (trim, lower-case, standardize dates/phones).
  • Group: Partition by dedupe key.
  • Select survivor: Keep the "best" row (latest update, most complete).
  • Audit: Count before/after, sample check, and log what changed.
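The five steps above can be sketched end-to-end in pandas (the column names `email`, `name`, and `updated_at` are illustrative):

```python
import pandas as pd

# Toy data with one key duplicate (hypothetical columns)
df = pd.DataFrame({
    "email": ["Alice@Example.com ", "alice@example.com", "bob@example.com"],
    "name": ["Alice", "Alice Smith", "Bob"],
    "updated_at": pd.to_datetime(["2024-05-02", "2024-05-05", "2024-05-01"]),
})

# Define + Normalize: the dedupe key is the trimmed, lower-cased email
df["email_key"] = df["email"].str.strip().str.lower()

# Group + Select survivor: keep the newest row per key
survivors = (df.sort_values("updated_at")
               .drop_duplicates("email_key", keep="last"))

# Audit: counts before and after
rows_before, rows_after = len(df), len(survivors)
print(rows_before, rows_after, rows_before - rows_after)  # 3 2 1
```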

Before you dedupe: normalize the data

  • Trim spaces: remove leading/trailing and collapse multiple spaces.
  • Standardize case: names/emails to lower-case when appropriate.
  • Unify nulls: convert empty strings to NULL.
  • Standardize dates/phones: consistent formats (e.g., ISO dates, E.164 phones when possible).
  • Remove punctuation in keys if it’s irrelevant (e.g., hyphens in phone numbers).
  • Canonicalize values: e.g., "St." vs "Street" if used in matching keys.
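A minimal Python sketch of the trim/case/null/phone steps (the `clean_text` and `clean_phone` helpers are illustrative, not a standard API):

```python
import re

def clean_text(s):
    """Trim, collapse internal whitespace, and lower-case; empty -> None."""
    if s is None:
        return None
    s = re.sub(r"\s+", " ", s).strip().lower()
    return s or None  # unify empty strings with NULL/None

def clean_phone(s):
    """Keep digits only so '555-0100' and '(555) 0100' compare equal."""
    if s is None:
        return None
    digits = re.sub(r"\D", "", s)
    return digits or None

print(clean_text("  Alice  "))    # alice
print(clean_phone("555-0100"))    # 5550100
```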
Why normalization first?

Without normalization, "Alice " and "alice" won’t match. Normalize once, then dedupe with confidence.

Types of duplicates

  • Exact row duplicates: Every column identical. Solution: DISTINCT or drop exact duplicates.
  • Key duplicates: Rows share an identifier like email or user_id but differ in other fields. Solution: group by the key and select a survivor.
  • Near/fuzzy duplicates: Slight differences ("Jon" vs "John", "ACME Ltd" vs "Acme Limited"). Solution: similarity matching and manual review thresholds.
Edge cases to consider
  • Reused emails/phones shared by different people (rare but possible).
  • Household vs individual (decide if household should be the entity).
  • Different time windows (dedupe within 30 days vs all time).
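The first two duplicate types map directly onto pandas calls (a sketch with made-up rows):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "a@x.com"],
    "phone": ["111", "111", "222"],
})

# Exact row duplicates: every column identical (like SQL DISTINCT)
exact = df.drop_duplicates()  # the two identical rows collapse into one

# Key duplicates: same email but differing phone -> pick one survivor per key
by_key = df.drop_duplicates(subset="email", keep="first")

print(len(exact), len(by_key))  # 2 1
```

Near/fuzzy duplicates need similarity matching instead; see the `SequenceMatcher` example further down.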

Worked examples

Example 1 — Excel/Sheets (key duplicates by email)

  1. Normalize columns with helper formulas:
    • Email_clean: =LOWER(TRIM(A2))
    • Name_clean: =PROPER(TRIM(B2))
  2. Create a key column in column C: just =Email_clean, or a composite =Email_clean&"|"&Name_clean if email alone is not unique enough (use named ranges or the equivalent cell references).
  3. Sort by Updated_at descending to prefer the newest record.
  4. Use a helper flag to mark first occurrence: =IF(COUNTIF($C$2:C2,C2)=1,1,0) where C is the key column.
  5. Filter rows where flag=1; copy to a clean sheet as deduped output.
Alternative: built-in tools
  • Excel: Data → Remove Duplicates (select key columns). Sort first to control which record is kept.
  • Sheets: Use =UNIQUE(range) on the key, then VLOOKUP/XLOOKUP to pull survivor details.

Example 2 — SQL (keep latest per email)

-- Normalize then dedupe by email; keep the latest update
WITH normalized AS (
  SELECT
    *,
    LOWER(TRIM(email)) AS email_key
  FROM raw_leads
), ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY email_key ORDER BY updated_at DESC NULLS LAST) AS rn
  FROM normalized
)
SELECT * EXCEPT (rn)  -- BigQuery/DuckDB syntax; list columns explicitly in other dialects
FROM ranked
WHERE rn = 1;
Remove exact duplicates quickly
SELECT DISTINCT * FROM some_table;

Good for identical rows, but use window functions for "keep best" logic.

Example 3 — Python (pandas)

import pandas as pd

# Load and normalize
df = pd.read_csv('leads.csv')
df['email_key'] = df['email'].str.strip().str.lower()
df['name_clean'] = df['name'].str.strip().str.title()

# Keep the most recently updated row per email
# Sort so the "last" is the newest
df = df.sort_values('updated_at')
dedup = df.drop_duplicates(subset=['email_key'], keep='last')

# Optional: prefer the row with most non-null fields within each group
# dedup = (df.assign(non_nulls=df.notna().sum(axis=1))
#            .sort_values(['email_key','non_nulls','updated_at'])
#            .drop_duplicates('email_key', keep='last'))
Simple fuzzy flag (near-duplicates) without extra libraries
from difflib import SequenceMatcher

def similar(a, b):
    # Similarity ratio (0-1) between two strings, ignoring case and edge whitespace
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Flag pairs in a small dataset with high similarity
pairs = []
for i in range(len(dedup)):
    for j in range(i+1, len(dedup)):
        r1, r2 = dedup.iloc[i], dedup.iloc[j]
        s = similar(r1['name_clean'], r2['name_clean'])
        if s >= 0.9:
            pairs.append((r1['name_clean'], r2['name_clean'], s))

# Review 'pairs' manually before merging

Use this for candidate flags and manual review when datasets are small.

Choosing the survivor (which record to keep)

  • Freshness: highest updated_at or most recent event.
  • Completeness: more non-null fields or verified contact info.
  • Priority source: CRM over webform, paid source over unknown.
  • Tie-breakers: earliest created_at, smallest id.
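One way to express these survivor rules in pandas is to sort by each criterion from worst to best and keep the last row per key (the `source_rank` priority mapping is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({
    "email_key": ["a@x.com"] * 3,
    "source": ["webform", "crm", "webform"],
    "updated_at": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-03"]),
    "id": [3, 1, 2],
})

# Hypothetical priority: CRM beats webform
rank = {"crm": 2, "webform": 1}
df["source_rank"] = df["source"].map(rank)

# Sort worst-to-best: source priority, then freshness; id descending so the
# smallest id sorts last and wins remaining ties
survivor = (df.sort_values(["source_rank", "updated_at", "id"],
                           ascending=[True, True, False])
              .drop_duplicates("email_key", keep="last"))
print(survivor[["source", "id"]])  # the CRM row survives
```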
Merge fields when keeping one row
  • Pick non-null values across duplicates: coalesce logic.
  • Prefer verified phone/email when available.
-- SQL example: merge best phone
SELECT email_key,
       COALESCE(MAX(CASE WHEN phone_verified THEN phone END),
                MAX(phone)) AS phone,
       MAX_BY(full_name, updated_at) AS full_name -- some SQL dialects
FROM normalized
GROUP BY email_key;
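A pandas analogue of the coalesce logic above, preferring a verified phone and otherwise falling back to any non-null value (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "email_key": ["a@x.com", "a@x.com"],
    "phone": [None, "555-0100"],
    "phone_verified": [False, True],
})

def best_phone(group):
    # Prefer verified phones; fall back to any non-null phone
    verified = group.loc[group["phone_verified"], "phone"].dropna()
    if not verified.empty:
        return verified.iloc[0]
    others = group["phone"].dropna()
    return others.iloc[0] if not others.empty else None

merged = df.groupby("email_key")[["phone", "phone_verified"]].apply(best_phone)
print(merged)  # a@x.com -> 555-0100
```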

Quality checks and audit

  • Counts: rows_before, rows_after, duplicates_removed.
  • Spot check: sample 10 grouped duplicates; ensure correct survivor kept.
  • Reproducible pipeline: keep dedupe key and rule in code or documented steps.
  • Safety: write deduped results to a new table/sheet; never overwrite source until validated.
Quick self-check list
  • Did I normalize before grouping?
  • Is my key appropriate for this dataset?
  • Did I define which record to keep and why?
  • Did I verify counts and sample groups?
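A sketch of the audit step, logging counts and writing results to a new file instead of overwriting the source (the file name is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "email_key": ["a@x.com", "a@x.com", "b@x.com"],
    "updated_at": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-03"]),
})
deduped = (df.sort_values("updated_at")
             .drop_duplicates("email_key", keep="last"))

# Audit summary: counts plus the rule used, so the run is reproducible
summary = {
    "rows_before": len(df),
    "rows_after": len(deduped),
    "duplicates_removed": len(df) - len(deduped),
    "dedupe_key": "email_key",
    "survivor_rule": "latest updated_at",
}
print(summary)

# Safety: write to a NEW file; leave the source untouched until validated
deduped.to_csv("leads_deduped.csv", index=False)
```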

Exercises (do these, then try the quick test)

These mirror the exercises below so you can practice in your own tools.

  1. SQL — Keep latest per email. Normalize emails, partition by email, keep the newest based on updated_at. Output the deduped table plus a count of how many were removed.
  2. Pandas — Clean and dedupe. Lower-case and trim emails, drop duplicates keeping the last by updated_at, and compute how many duplicates you removed.
  • Checklist:
    • Normalization applied correctly
    • Key chosen intentionally
    • Survivor logic implemented
    • Counts validated

Common mistakes

  • Skipping normalization: silent mismatches remain.
  • Using DISTINCT when you need "keep best": DISTINCT may drop important data randomly.
  • Choosing the wrong key: name alone is risky; prefer stable IDs when possible.
  • No audit: not tracking before/after counts makes errors hard to catch.
  • Over-aggressive fuzzy rules: merging truly different entities.
How to self-check
  • Compare counts by key before and after: any unexpected spikes?
  • Manually inspect 10 random duplicate groups and verify survivor logic.
  • Run a second pass of duplicate detection to confirm none remain.
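These self-checks can be automated in pandas (a sketch; `df` and `deduped` stand for your before and after tables):

```python
import pandas as pd

df = pd.DataFrame({
    "email_key": ["a@x.com", "a@x.com", "b@x.com", "c@x.com"],
    "updated_at": pd.to_datetime(["2024-05-01", "2024-05-02",
                                  "2024-05-01", "2024-05-01"]),
})
deduped = (df.sort_values("updated_at")
             .drop_duplicates("email_key", keep="last"))

# 1) Counts by key before vs after: which keys had duplicates?
counts_before = df["email_key"].value_counts()
dup_keys = counts_before[counts_before > 1].index.tolist()
print("keys with duplicates:", dup_keys)  # ['a@x.com']

# 2) Inspect the groups that were collapsed before trusting the result
print(df[df["email_key"].isin(dup_keys)].sort_values("email_key"))

# 3) Second pass: confirm no duplicates remain
assert not deduped["email_key"].duplicated().any()
```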

Practical projects

  • Build a dedupe notebook/script that cleans a leads CSV and outputs a summary report (rows before/after, groups merged).
  • Create a SQL view that always returns deduped customers with clear survivor rules.
  • Design an Excel template with helper columns and a one-click Remove Duplicates workflow.

Who this is for

  • Data Analysts who prepare datasets for reporting and modeling.
  • Operations analysts cleaning CRM/marketing exports.
  • Anyone consolidating records from multiple sources.

Prerequisites

  • Basic SQL or spreadsheets; optional Python/pandas.
  • Familiarity with trimming, case conversion, sorting.
  • Comfort reading simple window functions (ROW_NUMBER) if using SQL.

Learning path

  1. Normalize data consistently (text/date/phone).
  2. Define dedupe keys for your use case.
  3. Implement survivor logic (freshness, completeness).
  4. Validate with counts and samples.
  5. Optional: explore fuzzy matching and review queues.

Mini challenge

You have these rows:

email,name,phone,updated_at
ALICE@EXAMPLE.COM,Alice , ,2024-05-02
alice@example.com, Alice Smith,555-0100,2024-05-05
bob@example.com,Bob,555-2222,2024-05-01
  • Task: Normalize emails, dedupe by email, keep the most recent row. Which name and phone survive for Alice? Write down your answer, then implement in your preferred tool.

Expected idea: email_key=alice@example.com; survivor from 2024-05-05 with name "Alice Smith" and phone 555-0100.
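A quick pandas check of the mini challenge, reproducing the three rows above:

```python
import pandas as pd
from io import StringIO

csv = """email,name,phone,updated_at
ALICE@EXAMPLE.COM,Alice , ,2024-05-02
alice@example.com, Alice Smith,555-0100,2024-05-05
bob@example.com,Bob,555-2222,2024-05-01
"""
df = pd.read_csv(StringIO(csv))

# Normalize the key, then keep the most recent row per email
df["email_key"] = df["email"].str.strip().str.lower()
survivors = (df.sort_values("updated_at")
               .drop_duplicates("email_key", keep="last"))

alice = survivors[survivors["email_key"] == "alice@example.com"].iloc[0]
print(alice["name"].strip(), alice["phone"])  # Alice Smith 555-0100
```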

Next steps

  • Automate your best approach (SQL view/notebook/template) so dedupe is repeatable.
  • Document the dedupe rule in your team’s data guide.
  • Take the quick test below to check understanding.

Practice Exercises

2 exercises to complete

Instructions

You have a table raw_leads with columns: id, email, full_name, phone, updated_at. Some emails repeat with different updates.

1) Normalize emails (trim + lower). 2) Partition by normalized email. 3) Keep the most recently updated row per email. 4) Return a deduped table and the number of removed rows.

Sample data
id,email,full_name,phone,updated_at
1,ALICE@EXAMPLE.COM,Alice, ,2024-05-02
2,alice@example.com,Alice Smith,555-0100,2024-05-05
3,bob@example.com,Bob,555-2222,2024-05-01
4,bob@example.com,Bob A.,,2024-05-03
Expected Output
Deduped rows should be the latest per email: id=2 (alice@example.com) and id=4 (bob@example.com). Removed rows count: 2.
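One possible pandas solution for this exercise using the sample data (a sketch, not the only valid approach):

```python
import pandas as pd
from io import StringIO

csv = """id,email,full_name,phone,updated_at
1,ALICE@EXAMPLE.COM,Alice, ,2024-05-02
2,alice@example.com,Alice Smith,555-0100,2024-05-05
3,bob@example.com,Bob,555-2222,2024-05-01
4,bob@example.com,Bob A.,,2024-05-03
"""
df = pd.read_csv(StringIO(csv))

# 1) Normalize emails (trim + lower)
df["email_key"] = df["email"].str.strip().str.lower()

# 2-3) Partition by normalized email; keep the most recently updated row
deduped = (df.sort_values("updated_at")
             .drop_duplicates("email_key", keep="last"))

# 4) Deduped table plus the number of removed rows
removed = len(df) - len(deduped)
print(sorted(deduped["id"].tolist()), removed)  # [2, 4] 2
```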

Duplicate Removal — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

