What is Data Cleaning and why it matters for Data Analysts
Data Cleaning is the process of fixing or removing incorrect, corrupted, duplicated, or incomplete data. For a Data Analyst, it is the foundation of trustworthy insights. Clean data powers accurate dashboards, reliable A/B test results, sound forecasting, and better stakeholder decisions.
- Unlocks repeatable analysis and faster time-to-insight.
- Reduces costly mistakes from bad joins, wrong types, or hidden duplicates.
- Prepares data for modeling, segmentation, and experimentation.
Who this is for
- Aspiring and junior Data Analysts who want job-ready skills.
- Analysts who frequently work with spreadsheets, SQL, or Python and need consistent, reproducible results.
- Anyone maintaining dashboards or datasets for product, marketing, finance, or operations.
Prerequisites
- Basic spreadsheet skills (filters, sorting, simple formulas).
- Beginner SQL (SELECT, WHERE, GROUP BY) or beginner Python (pandas DataFrame basics). Either works; both are a plus.
- Comfort reading column names, data types, and simple error messages.
Learning path (practical roadmap)
- Profile the data: Inspect columns, types, ranges, null rates, unique keys, and row counts before and after each step (see the profiling sketch after this list).
- Fix critical integrity issues: Remove true duplicates, repair keys, and correct data types.
- Standardize formats: Dates, text casing, whitespace, currencies, units, and categorical levels.
- Handle missing data: Decide when to drop, impute (mean/median/mode), or flag as unknown.
- Detect and treat outliers: Investigate causes; cap, transform, or exclude with justification.
- Validate: Write rules and assertions (e.g., no negative ages, join completeness, unique IDs).
- Document and log: Keep a simple checklist or script so anyone can reproduce the cleaning.
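To make the profiling step concrete, here is a minimal pandas sketch. It assumes the table has already been loaded into a DataFrame called df (a placeholder name); adapt the load step to your source.
import pandas as pd
# df = pd.read_csv('your_table.csv')  # placeholder load step
print(df.shape)                                        # row and column counts
print(df.dtypes)                                       # column types
print(df.isna().mean().sort_values(ascending=False))   # null rate per column
print(df.nunique())                                    # distinct values per column (candidate keys)
print(df.describe(include='all'))                      # ranges, top values, basic stats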
Mini tasks to practice each step
- Profile: Count nulls per column and list top 10 frequent values.
- Integrity: Prove your primary key is unique or show offending rows (a sketch follows this list).
- Standardize: Normalize phone numbers to E.164 or set dates to ISO (YYYY-MM-DD).
- Missing: Compare median vs mean imputation on a skewed column.
- Outliers: Flag observations outside 1.5×IQR; review 10 random flagged rows.
- Validate: Add a rule: each order_id must exist in orders table.
- Document: Save steps as a numbered checklist or a single script.
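A minimal sketch for the integrity task above, assuming a pandas DataFrame df whose candidate primary key is a hypothetical column named id:
# A valid primary key has no nulls and no repeats
is_valid_key = df['id'].notna().all() and df['id'].is_unique
# If the check fails, surface the offending rows for review
offenders = df[df['id'].isna() | df['id'].duplicated(keep=False)]
print(is_valid_key, len(offenders))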
Worked examples (SQL and Python)
1) Handling missing values
Situation: price has 12% nulls and is right-skewed, so median imputation is safer than the mean.
SQL example
-- Impute with the median computed in a CTE (percentile_cont(0.5) returns the median in databases without a MEDIAN function)
WITH stats AS (
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY price) AS med
FROM sales
)
SELECT s.*, COALESCE(s.price, stats.med) AS price_imputed
FROM sales s CROSS JOIN stats;
Python (pandas)
import pandas as pd
med = df['price'].median()
df['price_imputed'] = df['price'].fillna(med)
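It can also help to record which rows were imputed, so downstream users can tell estimates from observed values; a small optional addition using a hypothetical flag column name:
# Flag rows whose price was originally missing ('price' itself is untouched by the fill above)
df['price_was_missing'] = df['price'].isna()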
2) Removing true duplicates safely
Goal: Keep the latest record per (user_id) by created_at.
SQL window function
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM users_raw
)
SELECT * FROM ranked WHERE rn = 1;
Python (pandas)
# Sort newest first so keep='first' retains the latest record per user_id
df = df.sort_values('created_at', ascending=False)
df_dedup = df.drop_duplicates(subset=['user_id'], keep='first')
3) Fixing data types and standardizing formats
Consistent types prevent silent bugs in joins and aggregations.
Dates and currency
# Parse dates robustly and coerce errors to NaT
orders['order_date'] = pd.to_datetime(orders['order_date'], errors='coerce')
# Clean currency strings like "$1,299.50": strip symbols and separators, then coerce to numeric
orders['price_usd'] = pd.to_numeric(
    orders['price']
        .astype(str)
        .str.replace('[^0-9.-]', '', regex=True),
    errors='coerce')
Text casing and whitespace
customers['email_clean'] = (customers['email']
.str.strip()
.str.lower())
4) Detecting and treating outliers
Use the IQR rule to flag candidates, then decide action.
Python IQR method
q1, q3 = df['amount'].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5*iqr, q3 + 1.5*iqr
# Flag candidates outside the IQR fences for review
df['is_outlier'] = (df['amount'] < low) | (df['amount'] > high)
# Option 1: cap (winsorize) at the fences
df['amount_capped'] = df['amount'].clip(lower=low, upper=high)
SQL quick flag
WITH q AS (
SELECT
percentile_cont(0.25) WITHIN GROUP (ORDER BY amount) AS q1,
percentile_cont(0.75) WITHIN GROUP (ORDER BY amount) AS q3
FROM payments
), bounds AS (
SELECT q1, q3, (q3 - q1) AS iqr, q1 - 1.5*(q3 - q1) AS low, q3 + 1.5*(q3 - q1) AS high
FROM q
)
SELECT p.*, (p.amount < b.low OR p.amount > b.high) AS is_outlier
FROM payments p CROSS JOIN bounds b;
5) Clean merges and joins
Trim keys, align cases, and audit unmatched rows.
SQL anti-join to find unmatched keys
-- Orders without a customer match
SELECT o.*
FROM orders o
LEFT JOIN customers c
ON TRIM(LOWER(o.email)) = TRIM(LOWER(c.email))
WHERE c.customer_id IS NULL;
Pandas join with prepared keys
orders['email_key'] = orders['email'].str.strip().str.lower()
customers['email_key'] = customers['email'].str.strip().str.lower()
merged = orders.merge(customers, on='email_key', how='left', indicator=True)
# Investigate non-matches
unmatched = merged[merged['_merge'] == 'left_only']
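To audit join health at a glance, the indicator column added by merge can be summarized; a quick check on the same merged frame:
# Share of rows by match status; with a left join, expect only 'both' and 'left_only'
print(merged['_merge'].value_counts(normalize=True))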
Drills and quick exercises
- [ ] Compute null percentage per column and list top 3 most-affected columns.
- [ ] Prove primary key uniqueness; if it fails, export 20 example conflicts.
- [ ] Convert a mixed-type date column to ISO format with coercion; count the parsing failures.
- [ ] Create standardized text keys (strip + lowercase) and measure join improvement.
- [ ] Use IQR to cap outliers, then compare mean and standard deviation before vs after.
- [ ] Write 5 validation rules (e.g., age between 0 and 110, price >= 0); run and log results.
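A minimal sketch for the validation-rules drill above, assuming a pandas DataFrame df with age and price columns (rule names are placeholders):
rules = {
    'age_between_0_and_110': df['age'].between(0, 110),
    'price_non_negative': df['price'] >= 0,
}
# Log pass/fail counts per rule (missing values count as failures here)
for name, passed in rules.items():
    print(name, int(passed.sum()), 'passed', int((~passed).sum()), 'failed')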
Common mistakes and debugging tips
- Dropping data too aggressively. Prefer targeted fixes; sample and review rows before deleting.
- Imputing with mean on skewed data. Use median for robustness.
- Ignoring units and time zones. Standardize to one unit and one timezone convention.
- Wrong join type. Verify row counts before and after; inspect anti-joins to find losses.
- No post-clean validation. Add assertions and summary stats at each step.
- Inconsistent category labels. Create a mapping dictionary and a fallback "Other" (see the sketch after this list).
- Not logging steps. Keep a simple, ordered script or checklist to reproduce results.
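For the inconsistent-category point above, a minimal mapping sketch, assuming a hypothetical country column with messy labels:
country_map = {
    'us': 'United States', 'usa': 'United States', 'u.s.': 'United States',
    'uk': 'United Kingdom', 'gb': 'United Kingdom',
}
key = df['country'].astype(str).str.strip().str.lower()
# Map known labels; anything unmapped falls back to "Other"
df['country_clean'] = key.map(country_map).fillna('Other')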
Quick debugging checklist
- Row count expected? Compare before vs after each transformation.
- Key uniqueness? Run distinct count vs row count.
- Type correctness? Print dtypes and sample 10 rows of critical columns.
- Join health? Count matched vs unmatched; review 20 random unmatched examples.
- Distribution shifts? Plot/describe numeric columns before and after cleaning.
Mini project: Orders quality pack
Goal: Produce a cleaned, validated orders dataset ready for reporting.
- Inputs: customers.csv, orders.csv, payments.csv.
- Tasks:
- Profile each file: row count, null stats, suspected keys.
- Fix types: dates to ISO, price to numeric USD, ids to strings.
- Standardize text: emails and names (strip, lower; proper case for names if needed).
- Handle missing: median impute price if missing, set unknown states to "Unknown".
- Detect outliers in payment amounts; cap with IQR rule.
- Create robust join keys; merge into a single wide table.
- Add validation rules (e.g., total_paid >= 0, order_date <= today).
- Log steps and produce a short README with decisions.
- Deliverables:
- Cleaned dataset (CSV or table).
- Validation report (counts of rules passed/failed).
- Script or notebook with ordered steps.
Acceptance checks
- No duplicate order_id rows.
- 0 rows with negative price or payment.
- < 1% of rows with unparsed dates (or documented reasons).
- All joins have <= 2% unmatched after key standardization, or documented reasons.
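A minimal sketch of how the acceptance checks might be scripted, assuming a cleaned pandas DataFrame named orders_clean (a placeholder) with order_id, price, order_date parsed to datetime, and the _merge indicator kept from the join step:
assert orders_clean['order_id'].is_unique, 'duplicate order_id rows'
assert (orders_clean['price'].dropna() >= 0).all(), 'negative price found'
unparsed = orders_clean['order_date'].isna().mean()
assert unparsed < 0.01, f'unparsed dates: {unparsed:.2%}'
unmatched = (orders_clean['_merge'] == 'left_only').mean()
assert unmatched <= 0.02, f'unmatched after join: {unmatched:.2%}'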
More practical project ideas
- Marketing leads cleanup: deduplicate by email + phone, standardize country codes, validate required fields.
- Product catalog normalization: unify units (kg vs lb), fix categories, remove orphan SKUs.
- Support tickets text cleaning: remove PII, normalize tags, detect duplicates across channels.
Subskills
- Missing Values Handling — decide when to drop, impute, or flag; compare methods.
- Duplicate Removal — identify exact and fuzzy duplicates; define primary keys.
- Data Type Fixes — coerce/convert columns safely and handle parse failures.
- Standardizing Formats — unify dates, casing, whitespace, currencies, and units.
- Outlier Detection and Treatment — flag with IQR/Z-score and decide capping/exclusion.
- Text Cleaning — strip, lower, tokenize, remove noise, and protect PII.
- Data Validation Rules — write assertions and thresholds; create pass/fail reports.
- Data Consistency Checks — reconcile totals, ensure referential integrity.
- Handling Categorical Values — map, group rare labels, align training/production categories.
- Normalizing and Scaling — standardize or min-max numeric features where needed.
- Merge and Join Cleanup — prepare keys, audit unmatched, fix join types.
- Logging and Reproducible Cleaning Pipelines — scripts, notebooks, or SQL with clear steps.
Next steps
- Apply these steps to one live dataset you use weekly and automate the cleaning.
- Add validations to your dashboards so bad data is flagged automatically.
- Move on to Exploratory Data Analysis and Feature Engineering once cleaning feels routine.