Why this matters
As a BI Analyst, you turn raw data into decisions. Duplicates inflate counts, mislead KPIs, and break joins. Outliers skew averages, distort trends, and can hide real issues or opportunities. Reliable dashboards start with clean inputs.
- Customer analytics: deduplicate emails before counting active users.
- Sales dashboards: check order_id uniqueness to avoid double revenue.
- Operations: flag sensor spikes so SLAs and alerts stay trustworthy.
Who this is for
- BI Analysts and data-focused PMs who own dashboards and reports.
- Junior data practitioners learning practical quality checks.
- Anyone asked to explain unexpected counts or odd spikes.
Prerequisites
- Basic SQL (SELECT, GROUP BY, WHERE).
- Comfort with spreadsheets or a scripting tool (e.g., Python) for quick calculations.
- Understanding of your dataset’s “grain” (what one row represents).
Concept explained simply
Duplicates are rows that represent the same real-world entity more than once (exactly the same row or sharing the same business key like email or order_id). Outliers are values far from the rest of the data; they can be errors or legitimate rare events. You detect both to prevent wrong metrics and to drive better decisions.
Mental model
Think of your dataset as a collection of labeled boxes on shelves. Duplicates are two boxes with the same label; you only want one per label. Outliers are boxes that are too big or too small to fit with the others; you inspect them before deciding to keep or move them.
Core techniques
Checking duplicates
- Exact duplicates: rows identical across all columns.
-- Find exact duplicate rows (list every column in SELECT and GROUP BY)
SELECT col1, col2, col3, COUNT(*) AS dup_count
FROM your_table
GROUP BY col1, col2, col3
HAVING COUNT(*) > 1;
- Key duplicates: multiple rows share a business key (e.g., order_id, email).
-- Find duplicate keys
SELECT key_col, COUNT(*) AS cnt
FROM your_table
GROUP BY key_col
HAVING COUNT(*) > 1;
-- Keep the best record per key (example: latest update)
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY key_col ORDER BY update_ts DESC
  ) AS rn
  FROM your_table
)
SELECT * FROM ranked WHERE rn = 1; -- deduped
- Near-duplicates: small differences in text (e.g., "Jon" vs "John"). Use standardization (trim, lower, remove punctuation) and, if needed, fuzzy matching; see the Python sketch below.
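For near-duplicates, here is a minimal Python sketch of the standardize-then-fuzzy-match idea, using pandas and the standard-library difflib; the name column and the 0.85 similarity cutoff are illustrative assumptions, not fixed rules.
import re
import pandas as pd
from difflib import SequenceMatcher
# Illustrative data; the column name is an assumption for this sketch.
df = pd.DataFrame({"name": [" Jon Smith", "John Smith", "Ana  Diaz", "ana diaz."]})
def standardize(text):
    # Trim, lowercase, strip punctuation, and collapse repeated spaces.
    text = re.sub(r"[^\w\s]", "", text.strip().lower())
    return re.sub(r"\s+", " ", text)
df["name_std"] = df["name"].apply(standardize)
# Exact duplicates after standardization.
print(df[df.duplicated("name_std", keep=False)])
# Fuzzy matching: flag pairs whose similarity exceeds the assumed 0.85 cutoff.
names = df["name_std"].tolist()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        ratio = SequenceMatcher(None, names[i], names[j]).ratio()
        if ratio > 0.85:
            print(f"Possible near-duplicate: {names[i]!r} ~ {names[j]!r} ({ratio:.2f})")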
Detecting outliers
- IQR method (robust to skew):
- IQR = Q3 − Q1
- Lower fence = Q1 − 1.5 × IQR
- Upper fence = Q3 + 1.5 × IQR
-- SQL example for fences (percentile aggregates via a CTE)
WITH stats AS (
  SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
  FROM t
)
SELECT t.*,
  CASE WHEN t.value < s.q1 - 1.5*(s.q3 - s.q1)
         OR t.value > s.q3 + 1.5*(s.q3 - s.q1)
       THEN 1 ELSE 0 END AS is_outlier
FROM t
CROSS JOIN stats s;
- Z-score (good for roughly normal data): z = (x − mean) / std. Flag |z| > 3 (a common rule); see the pandas sketch after this list.
-- SQL (window) z-score
SELECT *,
  (value - AVG(value) OVER ()) / NULLIF(STDDEV_SAMP(value) OVER (), 0) AS z
FROM t;
- Business rules: e.g., order_value must be between 0 and 10,000.
- Time series: compare to rolling median/mean to catch sudden spikes.
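A minimal pandas sketch of the z-score and rolling-median checks on a small, made-up daily series; the 3-sigma and 3×-median thresholds follow the conventions above and should be tuned to your data.
import pandas as pd
# Small, made-up series with one obvious spike (500).
s = pd.Series([100, 102, 98, 101, 99, 100, 103, 97, 100, 101,
               99, 102, 98, 100, 101, 500, 99, 100, 102, 98, 101])
# Z-score: distance from the mean in standard deviations.
z = (s - s.mean()) / s.std()
print(s[z.abs() > 3].tolist())  # [500]
# Rolling median: flag points far above the recent typical level.
rolling_med = s.rolling(window=7, min_periods=3).median()
print(s[s > 3 * rolling_med].tolist())  # [500]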
Worked examples
Example 1 — SQL: Duplicate orders by order_id
-- 1) Find duplicate order_ids
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
-- 2) Inspect all rows for duplicate order_ids
SELECT o.*
FROM orders o
JOIN (
  SELECT order_id
  FROM orders
  GROUP BY order_id
  HAVING COUNT(*) > 1
) d USING(order_id);
-- 3) Deduplicate by latest status_date
WITH ranked AS (
  SELECT o.*, ROW_NUMBER() OVER (
    PARTITION BY order_id ORDER BY status_date DESC
  ) AS rn
  FROM orders o
)
SELECT * FROM ranked WHERE rn = 1; -- one row per order_id
Result: You get a clean one-row-per-order_id dataset using the most recent record.
Example 2 — Spreadsheet: Duplicate customer emails
- In a helper column: =LOWER(TRIM(A2)) to normalize email.
- Duplicate flag: =IF(COUNTIF(B:B,B2)>1, "Dup", "OK").
- Or get counts with a Pivot Table on normalized email.
Result: Quickly see which emails appear multiple times and decide which record to keep.
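The same check in pandas, for when the list is too large for a spreadsheet; the column names and the keep-the-latest rule are assumptions for illustration.
import pandas as pd
# Made-up customer list; column names are assumptions for this sketch.
df = pd.DataFrame({
    "email": [" A@Example.com", "b@example.com", "a@example.com ", "c@example.com"],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-02-01", "2024-03-10", "2024-03-15"]),
})
# Normalize, mirroring =LOWER(TRIM(...)) in the spreadsheet.
df["email_norm"] = df["email"].str.strip().str.lower()
# Flag duplicates (the equivalent of the COUNTIF check).
print(df[df.duplicated("email_norm", keep=False)])
# Keep the most recent record per normalized email (an assumed survivorship rule).
deduped = df.sort_values("signup_date").drop_duplicates("email_norm", keep="last")
print(deduped)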
Example 3 — Python: IQR outliers for order_value
import pandas as pd
s = pd.Series([12,13,14,15,16,17,18,55,19,20,21,22,100])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
outliers = s[(s < lower) | (s > upper)]
print(q1, q3, lower, upper)
print(outliers.tolist()) # [55, 100]
Result: 55 and 100 are flagged as outliers.
Step-by-step recipe
- Confirm the grain: what does one row represent? Choose the key that must be unique (e.g., order_id).
- Profile duplicates: count duplicates by key and inspect samples.
- Pick a deduping rule: latest timestamp, highest quality flag, or merge logic.
- Profile outliers: choose IQR or z-score; compute fences; review tails.
- Decide action: keep, cap, correct, or exclude. Document your rule.
- Re-run metrics after cleaning: confirm numbers now make sense (a pandas sketch of the full recipe follows below).
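Here is a compact pandas sketch of the recipe end to end; the orders columns (order_id, status_date, order_value) and the keep-the-latest dedupe rule are assumptions you would swap for your own.
import pandas as pd
# Hypothetical orders extract; column names and rules are assumptions.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4, 5],
    "status_date": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-03",
                                   "2024-05-03", "2024-05-04", "2024-05-04"]),
    "order_value": [120.0, 80.0, 80.0, 95.0, 20000.0, 110.0],
})
# 1) Profile duplicates on the chosen key.
print(f"duplicate rate: {orders.duplicated('order_id').mean():.1%}")
# 2) Dedupe rule: keep the latest status_date per order_id.
clean = orders.sort_values("status_date").drop_duplicates("order_id", keep="last")
# 3) Profile outliers with IQR fences.
q1, q3 = clean["order_value"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["is_outlier"] = ~clean["order_value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
# 4) Decide the action per flagged row (keep, cap, correct, exclude) and document it.
# 5) Re-run the metric after cleaning and compare.
print("revenue, raw:", orders["order_value"].sum())
print("revenue, deduped:", clean["order_value"].sum())
print("revenue, deduped and outliers excluded:",
      clean.loc[~clean["is_outlier"], "order_value"].sum())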
Why duplicates happen (and how to prevent)
- Multiple data loads: enforce unique constraints or upserts.
- CRM merges: standardize inputs and validate before ingest.
- Manual entry: add validation, dropdowns, and de-dup checks in forms.
Exercises
These exercises mirror the worked examples above. Compare your answers with the provided solutions, and use the checklist to self-review.
Exercise 1 — Find duplicate customer emails (SQL)
Dataset: customers(customer_id INT, email TEXT)
-- Sample rows
customer_id | email
------------+----------------
1           | a@example.com
2           | b@example.com
3           | a@example.com
4           | c@example.com
5           | c@example.com
- Write SQL to return emails that appear more than once with their counts.
- Return all rows for those duplicate emails.
Expected output (counts): a@example.com → 2; c@example.com → 2
Solution
-- 1) Emails with duplicate counts
SELECT email, COUNT(*) AS cnt
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
-- 2) All rows with duplicate emails
SELECT c.*
FROM customers c
JOIN (
  SELECT email
  FROM customers
  GROUP BY email
  HAVING COUNT(*) > 1
) d USING(email);
Exercise 2 — Flag outliers with IQR
List of order_value: [12, 13, 14, 15, 16, 17, 18, 55, 19, 20, 21, 22, 100]
- Compute Q1, Q3, IQR, lower fence, upper fence.
- Identify outliers.
Expected output: outliers are 55 and 100 (values may vary slightly by percentile method).
Solution
Using QUARTILE.INC (the same linear-interpolation method pandas uses by default):
- Q1 = 15, Q3 = 21 → IQR = 6
- Lower fence = 15 − 1.5×6 = 6
- Upper fence = 21 + 1.5×6 = 30
- Outliers: 55, 100
Spreadsheet tips:
- =QUARTILE.INC(range,1) and =QUARTILE.INC(range,3)
- Compute IQR and fences; flag with =IF(OR(x<lower, x>upper),"Outlier","OK")
Self-check checklist
- I can explain the data grain and key used for uniqueness.
- I can count and inspect duplicate keys.
- I can implement a deterministic dedupe rule (e.g., latest record).
- I can compute IQR fences and/or z-scores and flag outliers.
- I document cleaning decisions and re-validate metrics after changes.
Common mistakes and how to self-check
- Using the wrong key for uniqueness. Self-check: Write a one-sentence "row grain" statement and verify with stakeholders.
- Dropping legitimate rare events. Self-check: Sample outliers and annotate cause (error vs true). Keep if business-valid.
- Relying only on average to spot outliers. Self-check: Always view distribution (min, Q1, median, Q3, max).
- Partial dedup without a tie-breaker. Self-check: If two rows have same key and timestamp, define a secondary rule.
- Forgetting to re-check downstream joins. Self-check: After dedupe, re-run joins and confirm row counts and KPIs.
Practical projects
- Customer 360: Build a deduped customer dimension using email and phone with a clear survivorship rule.
- Sales quality monitor: A daily job that counts duplicate order_ids and flags outlier order_values.
- Marketing list hygiene: Standardize and deduplicate campaign contact lists; measure impact on send volume.
Learning path
- Before this: Data profiling basics (distributions, null checks).
- This lesson: Duplicates and outliers—detect, decide, document.
- Next: Handling missing values, standardization, and validation rules across pipelines.
Next steps
- Automate: Turn your duplicate and outlier checks into repeatable queries or notebook cells.
- Document: Add your rules to a shared data quality playbook.
- Monitor: Schedule counts of duplicates and outliers and alert on thresholds; a minimal sketch follows below.
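To make the monitoring step concrete, here is a minimal sketch in pandas; the thresholds (1% duplicates, 5% outliers) and column names are assumptions to tune for your own pipeline.
import pandas as pd
MAX_DUP_RATE = 0.01      # assumed alert threshold: >1% duplicate order_ids
MAX_OUTLIER_RATE = 0.05  # assumed alert threshold: >5% of values outside the fences
def quality_check(orders: pd.DataFrame) -> None:
    dup_rate = orders.duplicated("order_id").mean()
    q1, q3 = orders["order_value"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outlier_rate = (~orders["order_value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)).mean()
    print(f"duplicate rate: {dup_rate:.2%}, outlier rate: {outlier_rate:.2%}")
    if dup_rate > MAX_DUP_RATE or outlier_rate > MAX_OUTLIER_RATE:
        # Swap the print for your alerting channel (email, Slack, ticket).
        print("ALERT: data quality thresholds exceeded")
Scheduled with whatever you already use (cron, an orchestrator, or a notebook job), this keeps the checks repeatable and comparable week over week.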
Mini challenge
You receive a weekly orders extract where order_id should be unique. This week, duplicate order_ids increased from 0.2% to 3%. Draft a 5-step plan to identify the source, stop the growth, and patch existing data without breaking dashboards. Include how you will validate success.