
Data Cleaning

Learn Data Cleaning for Data Analysts for free: roadmap, examples, subskills, and a skill exam.

Published: December 19, 2025 | Updated: December 19, 2025

What is Data Cleaning and why it matters for Data Analysts

Data Cleaning is the process of fixing or removing incorrect, corrupted, duplicated, or incomplete data. For a Data Analyst, it is the foundation of trustworthy insights. Clean data powers accurate dashboards, reliable A/B test results, sound forecasting, and better stakeholder decisions.

  • Unlocks repeatable analysis and faster time-to-insight.
  • Reduces costly mistakes from bad joins, wrong types, or hidden duplicates.
  • Prepares data for modeling, segmentation, and experimentation.

Who this is for

  • Aspiring and junior Data Analysts who want job-ready skills.
  • Analysts who frequently work with spreadsheets, SQL, or Python and need consistent, reproducible results.
  • Anyone maintaining dashboards or datasets for product, marketing, finance, or operations.

Prerequisites

  • Basic spreadsheet skills (filters, sorting, simple formulas).
  • Beginner SQL (SELECT, WHERE, GROUP BY) or beginner Python (pandas DataFrame basics). Either works; both are a plus.
  • Comfort reading column names, data types, and simple error messages.

Learning path (practical roadmap)

  1. Profile the data: Inspect columns, types, ranges, null rates, unique keys, and row counts before and after each step.
  2. Fix critical integrity issues: Remove true duplicates, repair keys, and correct data types.
  3. Standardize formats: Dates, text casing, whitespace, currencies, units, and categorical levels.
  4. Handle missing data: Decide when to drop, impute (mean/median/mode), or flag as unknown.
  5. Detect and treat outliers: Investigate causes; cap, transform, or exclude with justification.
  6. Validate: Write rules and assertions (e.g., no negative ages, join completeness, unique IDs).
  7. Document and log: Keep a simple checklist or script so anyone can reproduce the cleaning.

Mini tasks to practice each step
  • Profile: Count nulls per column and list top 10 frequent values (see the pandas sketch after this list).
  • Integrity: Prove your primary key is unique or show offending rows.
  • Standardize: Normalize phone numbers to E.164 or set dates to ISO (YYYY-MM-DD).
  • Missing: Compare median vs mean imputation on a skewed column.
  • Outliers: Flag observations outside 1.5×IQR; review 10 random flagged rows.
  • Validate: Add a rule: each order_id must exist in orders table.
  • Document: Save steps as a numbered checklist or a single script.
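
The profiling and integrity tasks above take only a few lines of pandas. A minimal sketch, assuming the data sits in a CSV and that 'status' and 'order_id' are placeholder column names:
import pandas as pd

df = pd.read_csv('orders.csv')  # placeholder input file

# Row count and dtypes give a first picture of the table
print(df.shape)
print(df.dtypes)

# Null percentage per column, worst offenders first
print((df.isna().mean() * 100).round(1).sort_values(ascending=False))

# Top 10 most frequent values of a column you suspect needs cleaning
print(df['status'].value_counts().head(10))  # 'status' is a placeholder column

# Is the suspected primary key unique? If not, show offending rows
print(df['order_id'].is_unique)
print(df[df.duplicated('order_id', keep=False)].head(10))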

Worked examples (SQL and Python)

1) Handling missing values

Situation: price has 12% nulls and is right-skewed. Median imputation is safer than mean.

SQL example
-- Impute nulls with the median via a CTE; percentile_cont(0.5) computes the median on databases without a MEDIAN function
WITH stats AS (
  SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY price) AS med
  FROM sales
)
SELECT s.*, COALESCE(s.price, stats.med) AS price_imputed
FROM sales s CROSS JOIN stats;
Python (pandas)
import pandas as pd
med = df['price'].median()
df['price_imputed'] = df['price'].fillna(med)
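
Step 4 of the roadmap also mentions flagging unknowns. A small sketch of keeping an explicit missingness flag alongside the imputed column, plus the median-vs-mean comparison from the mini tasks (column name as in the example above):
# Record which rows were imputed so downstream analysis can see them
df['price_was_missing'] = df['price'].isna()

# Compare mean vs median imputation on the skewed column before committing
print(df['price'].fillna(df['price'].mean()).describe())
print(df['price'].fillna(df['price'].median()).describe())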

2) Removing true duplicates safely

Goal: Keep the latest record per user_id by created_at.

SQL window function
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
  FROM users_raw
)
SELECT * FROM ranked WHERE rn = 1;
Python (pandas)
df = df.sort_values('created_at', ascending=False)
df_dedup = df.drop_duplicates(subset=['user_id'], keep='first')
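
Two cheap sanity checks after deduplication, using the names from the example above, confirm the key is now unique and show how many rows were dropped:
# The key should now be unique; fail loudly if it is not
assert df_dedup['user_id'].is_unique, "user_id still has duplicates"

# How many rows did deduplication remove?
print(f"Removed {len(df) - len(df_dedup)} duplicate rows out of {len(df)}")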

3) Fixing data types and standardizing formats

Consistent types prevent silent bugs in joins and aggregations.

Dates and currency
# Parse dates robustly and coerce errors to NaT
orders['order_date'] = pd.to_datetime(orders['order_date'], errors='coerce')

# Clean currency like "$1,299.50"
orders['price_usd'] = (orders['price']
  .astype(str)
  .str.replace('[^0-9.-]', '', regex=True)
  .pipe(pd.to_numeric, errors='coerce'))  # empty or invalid strings become NaN
Text casing and whitespace
customers['email_clean'] = (customers['email']
  .str.strip()
  .str.lower())
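
Step 3 of the roadmap also covers categorical levels. A minimal sketch of a mapping dictionary with an "Other" fallback; the column name and labels here are hypothetical:
# Map messy labels onto an agreed set; anything unmapped falls back to "Other"
country_map = {
    'us': 'United States', 'usa': 'United States', 'u.s.': 'United States',
    'uk': 'United Kingdom', 'gb': 'United Kingdom',
}
key = customers['country'].astype(str).str.strip().str.lower()
customers['country_clean'] = key.map(country_map).fillna('Other')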

4) Detecting and treating outliers

Use the IQR rule to flag candidates, then decide action.

Python IQR method
# Compute IQR bounds and flag candidate rows for review
q1, q3 = df['amount'].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5*iqr, q3 + 1.5*iqr
mask = (df['amount'] < low) | (df['amount'] > high)

# Option 1: cap (winsorize)
df['amount_capped'] = df['amount'].clip(lower=low, upper=high)
SQL quick flag
WITH q AS (
  SELECT
    percentile_cont(0.25) WITHIN GROUP (ORDER BY amount) AS q1,
    percentile_cont(0.75) WITHIN GROUP (ORDER BY amount) AS q3
  FROM payments
), bounds AS (
  SELECT q1, q3, (q3 - q1) AS iqr, q1 - 1.5*(q3 - q1) AS low, q3 + 1.5*(q3 - q1) AS high
  FROM q
)
SELECT p.*, (p.amount < b.low OR p.amount > b.high) AS is_outlier
FROM payments p CROSS JOIN bounds b;
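
The subskills below also mention Z-scores as an alternative to the IQR rule. A quick pandas sketch; the threshold of 3 is a common but arbitrary choice and flagged rows should still be reviewed:
# Z-score: how many standard deviations each value sits from the mean
z = (df['amount'] - df['amount'].mean()) / df['amount'].std()
df['amount_z_flag'] = z.abs() > 3
print(df['amount_z_flag'].sum(), "rows flagged by the Z-score rule")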

5) Clean merges and joins

Trim keys, align cases, and audit unmatched rows.

SQL anti-join to find unmatched keys
-- Orders without a customer match
SELECT o.*
FROM orders o
LEFT JOIN customers c
  ON TRIM(LOWER(o.email)) = TRIM(LOWER(c.email))
WHERE c.customer_id IS NULL;
Pandas join with prepared keys
orders['email_key'] = orders['email'].str.strip().str.lower()
customers['email_key'] = customers['email'].str.strip().str.lower()
merged = orders.merge(customers, on='email_key', how='left', indicator=True)

# Investigate non-matches
unmatched = merged[merged['_merge'] == 'left_only']
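
A short audit of join health, continuing the pandas example above: count matched vs unmatched rows and pull a small sample of non-matches to inspect the keys:
# Matched vs unmatched counts from the merge indicator
print(merged['_merge'].value_counts())
print(f"Unmatched rate: {len(unmatched) / len(merged):.1%}")

# Review a random sample of unmatched keys before deciding on fixes
print(unmatched.sample(min(20, len(unmatched)), random_state=0)['email_key'])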

Drills and quick exercises

  • [ ] Compute null percentage per column and list top 3 most-affected columns.
  • [ ] Prove primary key uniqueness; if it fails, export 20 example conflicts.
  • [ ] Convert a mixed-type date column to ISO format with coercion; count parsing failures.
  • [ ] Create standardized text keys (strip + lowercase) and measure join improvement.
  • [ ] Use IQR to cap outliers, then compare mean and standard deviation before vs after.
  • [ ] Write 5 validation rules (e.g., age between 0 and 110, price >= 0); run and log results (see the pass/fail sketch after this list).
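
A minimal pass/fail report for a handful of validation rules, as mentioned in the last drill. The rules, file name, and column names here are illustrative:
import pandas as pd

df = pd.read_csv('orders.csv')  # placeholder input file

# Each rule is a name plus a boolean Series marking rows that PASS the rule
rules = {
    'price_non_negative': df['price'] >= 0,
    'order_date_not_future': pd.to_datetime(df['order_date'], errors='coerce') <= pd.Timestamp.today(),
    'order_id_present': df['order_id'].notna(),
}

report = pd.DataFrame({
    'passed': {name: int(ok.sum()) for name, ok in rules.items()},
    'failed': {name: int((~ok).sum()) for name, ok in rules.items()},
})
print(report)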

Common mistakes and debugging tips

  • Dropping data too aggressively. Prefer targeted fixes; sample and review before delete.
  • Imputing with mean on skewed data. Use median for robustness.
  • Ignoring units and time zones. Standardize to one unit and one timezone convention.
  • Wrong join type. Verify row counts before and after; inspect anti-joins to find losses.
  • No post-clean validation. Add assertions and summary stats each step.
  • Inconsistent category labels. Create a mapping dictionary and a fallback "Other".
  • Not logging steps. Keep a simple, ordered script or checklist to reproduce results.

Quick debugging checklist
  • Row count expected? Compare before vs after each transformation.
  • Key uniqueness? Run distinct count vs row count.
  • Type correctness? Print dtypes and sample 10 rows of critical columns.
  • Join health? Count matched vs unmatched; review 20 random unmatched examples.
  • Distribution shifts? Plot/describe numeric columns before and after cleaning (see the sketch after this checklist).
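
For the last two checks, a compact before/after comparison; df_raw and df_clean are assumed snapshots taken before and after the cleaning steps:
import pandas as pd

# Side-by-side summary statistics for one numeric column (df_raw / df_clean assumed)
summary = pd.concat(
    [df_raw['amount'].describe(), df_clean['amount'].describe()],
    axis=1, keys=['before', 'after'])
print(summary)

# Row count drift across the pipeline
print(f"Rows: {len(df_raw)} -> {len(df_clean)}")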

Mini project: Orders quality pack

Goal: Produce a cleaned, validated orders dataset ready for reporting.

  1. Inputs: customers.csv, orders.csv, payments.csv.
  2. Tasks:
    • Profile each file: row count, null stats, suspected keys.
    • Fix types: dates to ISO, price to numeric USD, ids to strings.
    • Standardize text: emails and names (strip, lower; proper case for names if needed).
    • Handle missing: median impute price if missing, set unknown states to "Unknown".
    • Detect outliers in payment amounts; cap with IQR rule.
    • Create robust join keys; merge into a single wide table.
    • Add validation rules (e.g., total_paid >= 0, order_date <= today).
    • Log steps and produce a short README with decisions.
  3. Deliverables:
    • Cleaned dataset (CSV or table).
    • Validation report (counts of rules passed/failed).
    • Script or notebook with ordered steps.

Acceptance checks
  • No duplicate order_id rows.
  • 0 rows with negative price or payment.
  • < 1% rows with unparsed dates (or justified).
  • All joins have <= 2% unmatched after key standardization, or documented reasons (see the assertion sketch below).
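
The acceptance checks above can be written as assertions at the end of the cleaning script. A sketch, assuming the merged result is a DataFrame named clean with the columns used in the tasks:
# Acceptance checks as assertions; the script fails fast if any check breaks
assert clean['order_id'].is_unique, "duplicate order_id rows"
assert (clean['price'].dropna() >= 0).all(), "negative price values"
assert (clean['total_paid'].dropna() >= 0).all(), "negative payments"

unparsed = clean['order_date'].isna().mean()
assert unparsed < 0.01, f"{unparsed:.1%} of rows have unparsed dates"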

More practical project ideas

  • Marketing leads cleanup: deduplicate by email + phone, standardize country codes, validate required fields.
  • Product catalog normalization: unify units (kg vs lb), fix categories, remove orphan SKUs.
  • Support tickets text cleaning: remove PII, normalize tags, detect duplicates across channels.

Subskills

  • Missing Values Handling — decide when to drop, impute, or flag; compare methods.
  • Duplicate Removal — identify exact and fuzzy duplicates; define primary keys.
  • Data Type Fixes — coerce/convert columns safely and handle parse failures.
  • Standardizing Formats — unify dates, casing, whitespace, currencies, and units.
  • Outlier Detection and Treatment — flag with IQR/Z-score and decide capping/exclusion.
  • Text Cleaning — strip, lower, tokenize, remove noise, and protect PII.
  • Data Validation Rules — write assertions and thresholds; create pass/fail reports.
  • Data Consistency Checks — reconcile totals, ensure referential integrity.
  • Handling Categorical Values — map, group rare labels, align training/production categories.
  • Normalizing and Scaling — standardize or min-max numeric features where needed.
  • Merge and Join Cleanup — prepare keys, audit unmatched, fix join types.
  • Logging and Reproducible Cleaning Pipelines — scripts, notebooks, or SQL with clear steps.

Next steps

  • Apply these steps to one live dataset you use weekly and automate the cleaning.
  • Add validations to your dashboards so bad data is flagged automatically.
  • Move on to Exploratory Data Analysis and Feature Engineering once cleaning feels routine.

Data Cleaning — Skill Exam

This exam checks core Data Cleaning skills for Data Analysts. You can take it for free. Everyone can attempt the exam; logged-in users will have their progress and results saved. Aim for careful, practical answers. You can retake it anytime.

14 questions · 70% to pass
