Why this matters
Duplicates distort metrics, create bad customer experiences, and waste marketing spend. Data Analysts remove duplicates to ensure accurate reporting and reliable models.
- Customer analytics: One person may appear multiple times due to typos or form variations. Deduping keeps lifetime value and churn rates accurate.
- Sales pipeline: Duplicate leads inflate pipeline and cause double outreach.
- Product analytics: Event duplicates overcount conversions or sessions.
Note: The quick test on this page is available to everyone. Only logged‑in users will see saved progress.
Concept explained simply
A duplicate is a record that represents the same real-world entity as another record according to a rule you define (the dedupe key). Removal means keeping one record per entity and dropping (or merging) the rest.
Mental model
- Define: What makes two rows "the same" here? (Exact match? Same email? Same name+DOB?)
- Normalize: Make comparable (trim, lower-case, standardize dates/phones).
- Group: Partition by dedupe key.
- Select survivor: Keep the "best" row (latest update, most complete).
- Audit: Count before/after, sample check, and log what changed.
Before you dedupe: normalize the data
- Trim spaces: remove leading/trailing and collapse multiple spaces.
- Standardize case: names/emails to lower-case when appropriate.
- Unify nulls: convert empty strings to NULL.
- Standardize dates/phones: consistent formats (e.g., ISO dates, E.164 phones when possible).
- Remove punctuation in keys if it’s irrelevant (e.g., hyphens in phone numbers).
- Canonicalize values: e.g., "St." vs "Street" if used in matching keys.
Why normalization first?
Without normalization, "Alice " and "alice" won’t match. Normalize once, then dedupe with confidence.
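The normalization checklist above can be sketched in pandas (the column names `email` and `phone` are illustrative assumptions):

```python
import pandas as pd

# Toy data with the usual problems: stray spaces, mixed case, empty strings
df = pd.DataFrame({
    'email': ['  Alice@Example.COM', ''],
    'phone': ['555-0100', '555-2222'],
})

# Trim, collapse repeated spaces, and lower-case the email key
df['email_key'] = (df['email'].str.strip()
                              .str.replace(r'\s+', ' ', regex=True)
                              .str.lower())

# Unify nulls: empty strings become real missing values
df['email_key'] = df['email_key'].replace('', pd.NA)

# Remove irrelevant punctuation from phone keys (keep digits and '+')
df['phone_key'] = df['phone'].str.replace(r'[^0-9+]', '', regex=True)
```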
Types of duplicates
- Exact row duplicates: Every column identical. Solution: DISTINCT or drop exact duplicates.
- Key duplicates: Rows share an identifier like email or user_id but differ in other fields. Solution: group by the key and select a survivor.
- Near/fuzzy duplicates: Slight differences ("Jon" vs "John", "ACME Ltd" vs "Acme Limited"). Solution: similarity matching and manual review thresholds.
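The first two types differ in a single line of pandas; a tiny sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'a@x.com'],
    'phone': ['111', '111', '222'],
})

# Exact row duplicates: every column identical (the first two rows)
exact = df.drop_duplicates()          # 2 rows remain

# Key duplicates: same email, different phone; needs survivor logic
by_key = df.drop_duplicates('email')  # 1 row remains (keeps the first by default)
```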
Edge cases to consider
- Reused emails/phones in different people (rare but possible).
- Household vs individual (decide if household should be the entity).
- Different time windows (dedupe within 30 days vs all time).
Worked examples
Example 1 — Excel/Sheets (key duplicates by email)
- Normalize columns with helper formulas:
  - Email_clean: =LOWER(TRIM(A2))
  - Name_clean: =PROPER(TRIM(B2))
- Create a composite key if needed: use Email_clean alone, or =Email_clean&"|"&Name_clean.
- Sort by Updated_at descending to prefer the newest record.
- Use a helper flag to mark the first occurrence: =IF(COUNTIF($C$2:C2,C2)=1,1,0), where C is the key column.
- Filter rows where flag=1; copy to a clean sheet as the deduped output.
Alternative: built-in tools
- Excel: Data → Remove Duplicates (select key columns). Sort first to control which record is kept.
- Sheets: Use =UNIQUE(range) on the key, then VLOOKUP/XLOOKUP to pull survivor details.
Example 2 — SQL (keep latest per email)
-- Normalize then dedupe by email; keep the latest update
WITH normalized AS (
  SELECT
    LOWER(TRIM(email)) AS email_key,
    TRIM(full_name) AS full_name_clean,
    t.*
  FROM raw_leads AS t
), ranked AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY email_key ORDER BY updated_at DESC NULLS LAST) AS rn
FROM normalized
)
SELECT * EXCEPT(rn)  -- EXCEPT(...) is BigQuery syntax; in other dialects, list the columns explicitly
FROM ranked
WHERE rn = 1;
Remove exact duplicates quickly
SELECT DISTINCT * FROM some_table;
Good for identical rows, but use window functions for "keep best" logic.
Example 3 — Python (pandas)
import pandas as pd
# Load and normalize
df = pd.read_csv('leads.csv')
df['email_key'] = df['email'].str.strip().str.lower()
df['name_clean'] = df['name'].str.strip().str.title()
# Keep the most recently updated row per email
# Sort so the "last" is the newest
df = df.sort_values('updated_at')
dedup = df.drop_duplicates(subset=['email_key'], keep='last')
# Optional: prefer the row with most non-null fields within each group
# dedup = (df.assign(non_nulls=df.notna().sum(axis=1))
# .sort_values(['email_key','non_nulls','updated_at'])
# .drop_duplicates('email_key', keep='last'))
Simple fuzzy flag (near-duplicates) without extra libraries
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
# Flag pairs in a small dataset with high similarity
pairs = []
for i in range(len(dedup)):
    for j in range(i + 1, len(dedup)):
        r1, r2 = dedup.iloc[i], dedup.iloc[j]
        s = similar(r1['name_clean'], r2['name_clean'])
        if s >= 0.9:
            pairs.append((r1['name_clean'], r2['name_clean'], s))
# Review 'pairs' manually before merging
Use this for candidate flags and manual review when datasets are small.
Choosing the survivor (which record to keep)
- Freshness: highest updated_at or most recent event.
- Completeness: more non-null fields or verified contact info.
- Priority source: CRM over webform, paid source over unknown.
- Tie-breakers: earliest created_at, smallest id.
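Completeness with a freshness tie-breaker can be expressed in pandas; a sketch in which the `non_nulls` helper column and the data are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'email_key': ['a@x.com', 'a@x.com'],
    'name': ['Alice', 'Alice Smith'],
    'phone': [None, '555-0100'],
    'updated_at': pd.to_datetime(['2024-05-05', '2024-05-02']),
})

# Rank rows by completeness, then freshness; keep='last' takes the best
survivor = (df.assign(non_nulls=df.notna().sum(axis=1))
              .sort_values(['non_nulls', 'updated_at'])
              .drop_duplicates('email_key', keep='last'))
```

Here the older row wins because it is more complete; reorder the sort columns if freshness should dominate instead.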
Merge fields when keeping one row
- Pick non-null values across duplicates: coalesce logic.
- Prefer verified phone/email when available.
-- SQL example: merge best phone
SELECT email_key,
COALESCE(MAX(CASE WHEN phone_verified THEN phone END),
MAX(phone)) AS phone,
MAX_BY(full_name, updated_at) AS full_name -- some SQL dialects
FROM normalized
GROUP BY email_key;
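The same coalesce idea works in pandas: `groupby(...).first()` skips nulls per column, so it picks the first non-null value across the duplicates in each group (a sketch with assumed columns):

```python
import pandas as pd

df = pd.DataFrame({
    'email_key': ['a@x.com', 'a@x.com'],
    'phone': [None, '555-0100'],
    'full_name': ['Alice', None],
})

# first() takes the first non-null entry per column within each group,
# behaving like COALESCE across a group of duplicates
merged = df.groupby('email_key', as_index=False).first()
```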
Quality checks and audit
- Counts: rows_before, rows_after, duplicates_removed.
- Spot check: sample 10 grouped duplicates; ensure correct survivor kept.
- Reproducible pipeline: keep dedupe key and rule in code or documented steps.
- Safety: write deduped results to a new table/sheet; never overwrite source until validated.
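A minimal audit sketch in pandas, assuming `df` is the raw table and the deduped result is written out separately:

```python
import pandas as pd

df = pd.DataFrame({'email_key': ['a@x.com', 'a@x.com', 'b@x.com']})
dedup = df.drop_duplicates('email_key')

# Counts for the audit log
rows_before = len(df)
rows_after = len(dedup)
duplicates_removed = rows_before - rows_after

# Spot check: keys that had duplicates, for manual review
dupe_keys = df.loc[df.duplicated('email_key', keep=False), 'email_key'].unique()
```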
Quick self-check list
- Did I normalize before grouping?
- Is my key appropriate for this dataset?
- Did I define which record to keep and why?
- Did I verify counts and sample groups?
Exercises (do these, then try the quick test)
These mirror the exercises below so you can practice in your own tools.
- SQL — Keep latest per email. Normalize emails, partition by email, keep the newest based on updated_at. Output the deduped table plus a count of how many were removed.
- Pandas — Clean and dedupe. Lower-case and trim emails, drop duplicates keeping the last by updated_at, and compute how many duplicates you removed.
- Checklist:
- Normalization applied correctly
- Key chosen intentionally
- Survivor logic implemented
- Counts validated
Common mistakes
- Skipping normalization: silent mismatches remain.
- Using DISTINCT when you need "keep best": DISTINCT may drop important data randomly.
- Choosing the wrong key: name alone is risky; prefer stable IDs when possible.
- No audit: not tracking before/after counts makes errors hard to catch.
- Over-aggressive fuzzy rules: merging truly different entities.
How to self-check
- Compare counts by key before and after: any unexpected spikes?
- Manually inspect 10 random duplicate groups and verify survivor logic.
- Run a second pass of duplicate detection to confirm none remain.
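The second pass can be a one-line assertion on the output (assuming a deduped frame and key column as in the earlier examples):

```python
import pandas as pd

dedup = pd.DataFrame({'email_key': ['a@x.com', 'b@x.com']})

# Every key should now appear exactly once
assert dedup['email_key'].is_unique, "duplicates remain after dedupe"
```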
Practical projects
- Build a dedupe notebook/script that cleans a leads CSV and outputs a summary report (rows before/after, groups merged).
- Create a SQL view that always returns deduped customers with clear survivor rules.
- Design an Excel template with helper columns and a one-click Remove Duplicates workflow.
Who this is for
- Data Analysts who prepare datasets for reporting and modeling.
- Operations analysts cleaning CRM/marketing exports.
- Anyone consolidating records from multiple sources.
Prerequisites
- Basic SQL or spreadsheets; optional Python/pandas.
- Familiarity with trimming, case conversion, sorting.
- Comfort reading simple window functions (ROW_NUMBER) if using SQL.
Learning path
- Normalize data consistently (text/date/phone).
- Define dedupe keys for your use case.
- Implement survivor logic (freshness, completeness).
- Validate with counts and samples.
- Optional: explore fuzzy matching and review queues.
Mini challenge
You have these rows:
email,name,phone,updated_at
ALICE@EXAMPLE.COM,Alice , ,2024-05-02
alice@example.com, Alice Smith,555-0100,2024-05-05
bob@example.com,Bob,555-2222,2024-05-01
- Task: Normalize emails, dedupe by email, keep the most recent row. Which name and phone survive for Alice? Write down your answer, then implement in your preferred tool.
Expected idea: email_key=alice@example.com; survivor from 2024-05-05 with name "Alice Smith" and phone 555-0100.
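One way to check that answer, running pandas on the rows above:

```python
import pandas as pd
from io import StringIO

csv = """email,name,phone,updated_at
ALICE@EXAMPLE.COM,Alice , ,2024-05-02
alice@example.com, Alice Smith,555-0100,2024-05-05
bob@example.com,Bob,555-2222,2024-05-01"""

df = pd.read_csv(StringIO(csv))
df['email_key'] = df['email'].str.strip().str.lower()

# ISO dates sort correctly even as strings
survivors = (df.sort_values('updated_at')
               .drop_duplicates('email_key', keep='last'))

alice = survivors.loc[survivors['email_key'] == 'alice@example.com'].iloc[0]
print(alice['name'].strip(), alice['phone'])  # Alice Smith 555-0100
```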
Next steps
- Automate your best approach (SQL view/notebook/template) so dedupe is repeatable.
- Document the dedupe rule in your team’s data guide.
- Take the quick test below to check understanding.