Why this matters
As a BI Analyst, you turn raw data into decisions. Duplicates inflate counts, mislead KPIs, and break joins. Outliers skew averages, distort trends, and can hide real issues or opportunities. Reliable dashboards start with clean inputs.
- Customer analytics: deduplicate emails before counting active users.
- Sales dashboards: check order_id uniqueness to avoid double revenue.
- Operations: flag sensor spikes so SLAs and alerts stay trustworthy.
Who this is for
- BI Analysts and data-focused PMs who own dashboards and reports.
- Junior data practitioners learning practical quality checks.
- Anyone asked to explain unexpected counts or odd spikes.
Prerequisites
- Basic SQL (SELECT, GROUP BY, WHERE).
- Comfort with spreadsheets or a scripting tool (e.g., Python) for quick calculations.
- Understanding of your dataset’s “grain” (what one row represents).
Concept explained simply
Duplicates are rows that represent the same real-world entity more than once (exactly the same row or sharing the same business key like email or order_id). Outliers are values far from the rest of the data; they can be errors or legitimate rare events. You detect both to prevent wrong metrics and to drive better decisions.
Mental model
Think of your dataset as a collection of labeled boxes on shelves. Duplicates are two boxes with the same label; you only want one per label. Outliers are boxes that are too big or too small to fit with the others; you inspect them before deciding to keep or move them.
Core techniques
Checking duplicates
- Exact duplicates: rows identical across all columns.
-- Find exact duplicate rows (list every column in SELECT and GROUP BY)
SELECT col1, col2, col3, COUNT(*) AS dup_count
FROM your_table
GROUP BY col1, col2, col3
HAVING COUNT(*) > 1;
- Key duplicates: multiple rows share a business key (e.g., order_id, email).
-- Find duplicate keys
SELECT key_col, COUNT(*) AS cnt
FROM your_table
GROUP BY key_col
HAVING COUNT(*) > 1;
-- Keep the best record per key (example: latest update)
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY key_col ORDER BY update_ts DESC
  ) AS rn
  FROM your_table
)
SELECT * FROM ranked WHERE rn = 1; -- deduped
- Near-duplicates: small differences in text (e.g., "Jon" vs "John"). Use standardization (trim, lower, remove punctuation) and, if needed, fuzzy matching; see the Python sketch below.
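For near-duplicates, here is a minimal Python sketch of the standardize-then-fuzzy-match idea, using pandas and the standard-library difflib; the name column and the 0.85 similarity cutoff are illustrative assumptions, not fixed rules.
import re
import pandas as pd
from difflib import SequenceMatcher
# Illustrative data; the column name is an assumption for this sketch.
df = pd.DataFrame({"name": [" Jon Smith", "John Smith", "Ana  Diaz", "ana diaz."]})
def standardize(text):
    # Trim, lowercase, strip punctuation, and collapse repeated spaces.
    text = re.sub(r"[^\w\s]", "", text.strip().lower())
    return re.sub(r"\s+", " ", text)
df["name_std"] = df["name"].apply(standardize)
# Exact duplicates after standardization.
print(df[df.duplicated("name_std", keep=False)])
# Fuzzy matching: flag pairs whose similarity exceeds the assumed 0.85 cutoff.
names = df["name_std"].tolist()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        ratio = SequenceMatcher(None, names[i], names[j]).ratio()
        if ratio > 0.85:
            print(f"Possible near-duplicate: {names[i]!r} ~ {names[j]!r} ({ratio:.2f})")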
Detecting outliers
- IQR method (robust to skew):
- IQR = Q3 − Q1
- Lower fence = Q1 − 1.5 × IQR
- Upper fence = Q3 + 1.5 × IQR
-- SQL example for fences (percentile aggregates via a CTE)
WITH stats AS (
  SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
  FROM t
)
SELECT t.*,
  CASE WHEN t.value < s.q1 - 1.5*(s.q3 - s.q1)
         OR t.value > s.q3 + 1.5*(s.q3 - s.q1)
       THEN 1 ELSE 0 END AS is_outlier
FROM t
CROSS JOIN stats s;
- Z-score (good for roughly normal data): z = (x − mean) / std. Flag |z| > 3 (a common rule); see the pandas sketch after this list.
-- SQL (window) z-score
SELECT *,
  (value - AVG(value) OVER ()) / NULLIF(STDDEV_SAMP(value) OVER (), 0) AS z
FROM t;
- Business rules: e.g., order_value must be between 0 and 10,000.
- Time series: compare to rolling median/mean to catch sudden spikes.
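A minimal pandas sketch of the z-score and rolling-median checks on a small, made-up daily series; the 3-sigma and 3×-median thresholds follow the conventions above and should be tuned to your data.
import pandas as pd
# Small, made-up series with one obvious spike (500).
s = pd.Series([100, 102, 98, 101, 99, 100, 103, 97, 100, 101,
               99, 102, 98, 100, 101, 500, 99, 100, 102, 98, 101])
# Z-score: distance from the mean in standard deviations.
z = (s - s.mean()) / s.std()
print(s[z.abs() > 3].tolist())  # [500]
# Rolling median: flag points far above the recent typical level.
rolling_med = s.rolling(window=7, min_periods=3).median()
print(s[s > 3 * rolling_med].tolist())  # [500]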
Worked examples
Example 1 — SQL: Duplicate orders by order_id
-- 1) Find duplicate order_ids
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
-- 2) Inspect all rows for duplicate order_ids
SELECT o.*
FROM orders o
JOIN (
  SELECT order_id
  FROM orders
  GROUP BY order_id
  HAVING COUNT(*) > 1
) d USING(order_id);
-- 3) Deduplicate by latest status_date
WITH ranked AS (
  SELECT o.*, ROW_NUMBER() OVER (
    PARTITION BY order_id ORDER BY status_date DESC
  ) AS rn
  FROM orders o
)
SELECT * FROM ranked WHERE rn = 1; -- one row per order_id
Result: You get a clean one-row-per-order_id dataset using the most recent record.
Example 2 — Spreadsheet: Duplicate customer emails
- In a helper column: =LOWER(TRIM(A2)) to normalize email.
- Duplicate flag: =IF(COUNTIF(B:B,B2)>1, "Dup", "OK").
- Or get counts with a Pivot Table on normalized email.
Result: Quickly see which emails appear multiple times and decide which record to keep.
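The same check in pandas, for when the list is too large for a spreadsheet; the column names and the keep-the-latest rule are assumptions for illustration.
import pandas as pd
# Made-up customer list; column names are assumptions for this sketch.
df = pd.DataFrame({
    "email": [" A@Example.com", "b@example.com", "a@example.com ", "c@example.com"],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-02-01", "2024-03-10", "2024-03-15"]),
})
# Normalize, mirroring =LOWER(TRIM(...)) in the spreadsheet.
df["email_norm"] = df["email"].str.strip().str.lower()
# Flag duplicates (the equivalent of the COUNTIF check).
print(df[df.duplicated("email_norm", keep=False)])
# Keep the most recent record per normalized email (an assumed survivorship rule).
deduped = df.sort_values("signup_date").drop_duplicates("email_norm", keep="last")
print(deduped)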
Example 3 — Python: IQR outliers for order_value
import pandas as pd
s = pd.Series([12,13,14,15,16,17,18,55,19,20,21,22,100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
outliers = s[(s < lower) | (s > upper)]
print(q1, q3, lower, upper)
print(outliers.tolist()) # [55, 100]
Result: 55 and 100 are flagged as outliers.
Step-by-step recipe
- Confirm the grain: what does one row represent? Choose the key that must be unique (e.g., order_id).
- Profile duplicates: count duplicates by key and inspect samples.
- Pick a deduping rule: latest timestamp, highest quality flag, or merge logic.
- Profile outliers: choose IQR or z-score; compute fences; review tails.
- Decide action: keep, cap, correct, or exclude. Document your rule.
- Re-run metrics after cleaning: confirm numbers now make sense (a pandas sketch of the full recipe follows below).
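Here is a compact pandas sketch of the recipe end to end; the orders columns (order_id, status_date, order_value) and the keep-the-latest dedupe rule are assumptions you would swap for your own.
import pandas as pd
# Hypothetical orders extract; column names and rules are assumptions.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4, 5],
    "status_date": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-03",
                                   "2024-05-03", "2024-05-04", "2024-05-04"]),
    "order_value": [120.0, 80.0, 80.0, 95.0, 20000.0, 110.0],
})
# 1) Profile duplicates on the chosen key.
print(f"duplicate rate: {orders.duplicated('order_id').mean():.1%}")
# 2) Dedupe rule: keep the latest status_date per order_id.
clean = orders.sort_values("status_date").drop_duplicates("order_id", keep="last")
# 3) Profile outliers with IQR fences.
q1, q3 = clean["order_value"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["is_outlier"] = ~clean["order_value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
# 4) Decide the action per flagged row (keep, cap, correct, exclude) and document it.
# 5) Re-run the metric after cleaning and compare.
print("revenue, raw:", orders["order_value"].sum())
print("revenue, deduped:", clean["order_value"].sum())
print("revenue, deduped and outliers excluded:",
      clean.loc[~clean["is_outlier"], "order_value"].sum())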
Why duplicates happen (and how to prevent)
- Multiple data loads: enforce unique constraints or upserts.
- CRM merges: standardize inputs and validate before ingest.
- Manual entry: add validation, dropdowns, and de-dup checks in forms.
Exercises
These exercises mirror the worked examples above. Compare your answers with the provided solutions, and use the checklist to self-review.
Exercise 1 — Find duplicate customer emails (SQL)
Dataset: customers(customer_id INT, email TEXT)
-- Sample rows
customer_id | email
------------+----------------
1           | a@example.com
2           | b@example.com
3           | a@example.com
4           | c@example.com
5           | c@example.com
- Write SQL to return emails that appear more than once with their counts.
- Return all rows for those duplicate emails.
Expected output (counts): a@example.com → 2; c@example.com → 2
Solution
-- 1) Emails with duplicate counts
SELECT email, COUNT(*) AS cnt
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
-- 2) All rows with duplicate emails
SELECT c.*
FROM customers c
JOIN (
  SELECT email
  FROM customers
  GROUP BY email
  HAVING COUNT(*) > 1
) d USING(email);
Exercise 2 — Flag outliers with IQR
List of order_value: [12, 13, 14, 15, 16, 17, 18, 55, 19, 20, 21, 22, 100]
- Compute Q1, Q3, IQR, lower fence, upper fence.
- Identify outliers.
Expected output: outliers are 55 and 100 (values may vary slightly by percentile method).
Solution
Using QUARTILE.INC (the same linear-interpolation method pandas uses by default):
- Q1 = 15, Q3 = 21 → IQR = 6
- Lower fence = 15 − 1.5×6 = 6
- Upper fence = 21 + 1.5×6 = 30
- Outliers: 55, 100
Spreadsheet tips:
- =QUARTILE.INC(range,1) and =QUARTILE.INC(range,3)
- Compute IQR and fences; flag with =IF(OR(x<lower, x>upper),"Outlier","OK")
Self-check checklist
- I can explain the data grain and key used for uniqueness.
- I can count and inspect duplicate keys.
- I can implement a deterministic dedupe rule (e.g., latest record).
- I can compute IQR fences and/or z-scores and flag outliers.
- I document cleaning decisions and re-validate metrics after changes.
Common mistakes and how to self-check
- Using the wrong key for uniqueness. Self-check: Write a one-sentence "row grain" statement and verify with stakeholders.
- Dropping legitimate rare events. Self-check: Sample outliers and annotate cause (error vs true). Keep if business-valid.
- Relying only on average to spot outliers. Self-check: Always view distribution (min, Q1, median, Q3, max).
- Partial dedup without a tie-breaker. Self-check: If two rows have same key and timestamp, define a secondary rule.
- Forgetting to re-check downstream joins. Self-check: After dedupe, re-run joins and confirm row counts and KPIs.
Practical projects
- Customer 360: Build a deduped customer dimension using email and phone with a clear survivorship rule.
- Sales quality monitor: A daily job that counts duplicate order_ids and flags outlier order_values.
- Marketing list hygiene: Standardize and deduplicate campaign contact lists; measure impact on send volume.
Learning path
- Before this: Data profiling basics (distributions, null checks).
- This lesson: Duplicates and outliers—detect, decide, document.
- Next: Handling missing values, standardization, and validation rules across pipelines.
Next steps
- Automate: Turn your duplicate and outlier checks into repeatable queries or notebook cells.
- Document: Add your rules to a shared data quality playbook.
- Monitor: Schedule counts of duplicates and outliers and alert on thresholds; a minimal sketch follows below.
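To make the monitoring step concrete, here is a minimal sketch in pandas; the thresholds (1% duplicates, 5% outliers) and column names are assumptions to tune for your own pipeline.
import pandas as pd
MAX_DUP_RATE = 0.01      # assumed alert threshold: >1% duplicate order_ids
MAX_OUTLIER_RATE = 0.05  # assumed alert threshold: >5% of values outside the fences
def quality_check(orders: pd.DataFrame) -> None:
    dup_rate = orders.duplicated("order_id").mean()
    q1, q3 = orders["order_value"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outlier_rate = (~orders["order_value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)).mean()
    print(f"duplicate rate: {dup_rate:.2%}, outlier rate: {outlier_rate:.2%}")
    if dup_rate > MAX_DUP_RATE or outlier_rate > MAX_OUTLIER_RATE:
        # Swap the print for your alerting channel (email, Slack, ticket).
        print("ALERT: data quality thresholds exceeded")
Scheduled with whatever you already use (cron, an orchestrator, or a notebook job), this keeps the checks repeatable and comparable week over week.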
Mini challenge
You receive a weekly orders extract where order_id should be unique. This week, duplicate order_ids increased from 0.2% to 3%. Draft a 5-step plan to identify the source, stop the growth, and patch existing data without breaking dashboards. Include how you will validate success.