What is Data Cleaning and why it matters for Data Analysts
Data Cleaning is the process of fixing or removing incorrect, corrupted, duplicated, or incomplete data. For a Data Analyst, it is the foundation of trustworthy insights. Clean data powers accurate dashboards, reliable A/B test results, sound forecasting, and better stakeholder decisions.
- Unlocks repeatable analysis and faster time-to-insight.
- Reduces costly mistakes from bad joins, wrong types, or hidden duplicates.
- Prepares data for modeling, segmentation, and experimentation.
Who this is for
- Aspiring and junior Data Analysts who want job-ready skills.
- Analysts who frequently work with spreadsheets, SQL, or Python and need consistent, reproducible results.
- Anyone maintaining dashboards or datasets for product, marketing, finance, or operations.
Prerequisites
- Basic spreadsheet skills (filters, sorting, simple formulas).
- Beginner SQL (SELECT, WHERE, GROUP BY) or beginner Python (pandas DataFrame basics). Either works; both are a plus.
- Comfort reading column names, data types, and simple error messages.
Learning path (practical roadmap)
- Profile the data: Inspect columns, types, ranges, null rates, unique keys, and row counts before and after each step (see the profiling sketch after this list).
- Fix critical integrity issues: Remove true duplicates, repair keys, and correct data types.
- Standardize formats: Dates, text casing, whitespace, currencies, units, and categorical levels.
- Handle missing data: Decide when to drop, impute (mean/median/mode), or flag as unknown.
- Detect and treat outliers: Investigate causes; cap, transform, or exclude with justification.
- Validate: Write rules and assertions (e.g., no negative ages, join completeness, unique IDs).
- Document and log: Keep a simple checklist or script so anyone can reproduce the cleaning.
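To make the profiling step concrete, here is a minimal pandas sketch. It assumes the table has already been loaded into a DataFrame called df (a placeholder name); adapt the load step to your source.
import pandas as pd
# df = pd.read_csv('your_table.csv')  # placeholder load step
print(df.shape)                                        # row and column counts
print(df.dtypes)                                       # column types
print(df.isna().mean().sort_values(ascending=False))   # null rate per column
print(df.nunique())                                    # distinct values per column (candidate keys)
print(df.describe(include='all'))                      # ranges, top values, basic stats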
Mini tasks to practice each step
- Profile: Count nulls per column and list top 10 frequent values.
- Integrity: Prove your primary key is unique or show offending rows (a sketch follows this list).
- Standardize: Normalize phone numbers to E.164 or set dates to ISO (YYYY-MM-DD).
- Missing: Compare median vs mean imputation on a skewed column.
- Outliers: Flag observations outside 1.5×IQR; review 10 random flagged rows.
- Validate: Add a rule: each order_id must exist in orders table.
- Document: Save steps as a numbered checklist or a single script.
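A minimal sketch for the integrity task above, assuming a pandas DataFrame df whose candidate primary key is a hypothetical column named id:
# A valid primary key has no nulls and no repeats
is_valid_key = df['id'].notna().all() and df['id'].is_unique
# If the check fails, surface the offending rows for review
offenders = df[df['id'].isna() | df['id'].duplicated(keep=False)]
print(is_valid_key, len(offenders))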
Worked examples (SQL and Python)
1) Handling missing values
Situation: price has 12% nulls and is right-skewed, so median imputation is safer than the mean.
SQL example
-- Impute with the median computed in a CTE (percentile_cont(0.5) returns the median in databases without a MEDIAN function)
WITH stats AS (
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY price) AS med
FROM sales
)
SELECT s.*, COALESCE(s.price, stats.med) AS price_imputed
FROM sales s CROSS JOIN stats;
Python (pandas)
import pandas as pd
med = df['price'].median()
df['price_imputed'] = df['price'].fillna(med)
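It can also help to record which rows were imputed, so downstream users can tell estimates from observed values; a small optional addition using a hypothetical flag column name:
# Flag rows whose price was originally missing ('price' itself is untouched by the fill above)
df['price_was_missing'] = df['price'].isna()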
2) Removing true duplicates safely
Goal: Keep the latest record per (user_id) by created_at.
SQL window function
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM users_raw
)
SELECT * FROM ranked WHERE rn = 1;
Python (pandas)
# Sort newest first so keep='first' retains the latest record per user_id
df = df.sort_values('created_at', ascending=False)
df_dedup = df.drop_duplicates(subset=['user_id'], keep='first')
3) Fixing data types and standardizing formats
Consistent types prevent silent bugs in joins and aggregations.
Dates and currency
# Parse dates robustly and coerce errors to NaT
orders['order_date'] = pd.to_datetime(orders['order_date'], errors='coerce')
# Clean currency strings like "$1,299.50": strip symbols and separators, then coerce to numeric
orders['price_usd'] = pd.to_numeric(
    orders['price']
        .astype(str)
        .str.replace('[^0-9.-]', '', regex=True),
    errors='coerce')
Text casing and whitespace
customers['email_clean'] = (customers['email']
.str.strip()
.str.lower())
4) Detecting and treating outliers
Use the IQR rule to flag candidates, then decide action.
Python IQR method
q1, q3 = df['amount'].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5*iqr, q3 + 1.5*iqr
# Flag candidates outside the IQR fences for review
df['is_outlier'] = (df['amount'] < low) | (df['amount'] > high)
# Option 1: cap (winsorize) at the fences
df['amount_capped'] = df['amount'].clip(lower=low, upper=high)
SQL quick flag
WITH q AS (
SELECT
percentile_cont(0.25) WITHIN GROUP (ORDER BY amount) AS q1,
percentile_cont(0.75) WITHIN GROUP (ORDER BY amount) AS q3
FROM payments
), bounds AS (
SELECT q1, q3, (q3 - q1) AS iqr, q1 - 1.5*(q3 - q1) AS low, q3 + 1.5*(q3 - q1) AS high
FROM q
)
SELECT p.*, (p.amount < b.low OR p.amount > b.high) AS is_outlier
FROM payments p CROSS JOIN bounds b;
5) Clean merges and joins
Trim keys, align cases, and audit unmatched rows.
SQL anti-join to find unmatched keys
-- Orders without a customer match
SELECT o.*
FROM orders o
LEFT JOIN customers c
ON TRIM(LOWER(o.email)) = TRIM(LOWER(c.email))
WHERE c.customer_id IS NULL;
Pandas join with prepared keys
orders['email_key'] = orders['email'].str.strip().str.lower()
customers['email_key'] = customers['email'].str.strip().str.lower()
merged = orders.merge(customers, on='email_key', how='left', indicator=True)
# Investigate non-matches
unmatched = merged[merged['_merge'] == 'left_only']
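To audit join health at a glance, the indicator column added by merge can be summarized; a quick check on the same merged frame:
# Share of rows by match status; with a left join, expect only 'both' and 'left_only'
print(merged['_merge'].value_counts(normalize=True))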
Drills and quick exercises
- [ ] Compute null percentage per column and list top 3 most-affected columns.
- [ ] Prove primary key uniqueness; if it fails, export 20 example conflicts.
- [ ] Convert a mixed-type date column to ISO format with coercion; count the parsing failures.
- [ ] Create standardized text keys (strip + lowercase) and measure join improvement.
- [ ] Use IQR to cap outliers, then compare mean and standard deviation before vs after.
- [ ] Write 5 validation rules (e.g., age between 0 and 110, price >= 0); run and log results.
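A minimal sketch for the validation-rules drill above, assuming a pandas DataFrame df with age and price columns (rule names are placeholders):
rules = {
    'age_between_0_and_110': df['age'].between(0, 110),
    'price_non_negative': df['price'] >= 0,
}
# Log pass/fail counts per rule (missing values count as failures here)
for name, passed in rules.items():
    print(name, int(passed.sum()), 'passed', int((~passed).sum()), 'failed')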
Common mistakes and debugging tips
- Dropping data too aggressively. Prefer targeted fixes; sample and review rows before deleting.
- Imputing with mean on skewed data. Use median for robustness.
- Ignoring units and time zones. Standardize to one unit and one timezone convention.
- Wrong join type. Verify row counts before and after; inspect anti-joins to find losses.
- No post-clean validation. Add assertions and summary stats at each step.
- Inconsistent category labels. Create a mapping dictionary and a fallback "Other" (see the sketch after this list).
- Not logging steps. Keep a simple, ordered script or checklist to reproduce results.
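For the inconsistent-category point above, a minimal mapping sketch, assuming a hypothetical country column with messy labels:
country_map = {
    'us': 'United States', 'usa': 'United States', 'u.s.': 'United States',
    'uk': 'United Kingdom', 'gb': 'United Kingdom',
}
key = df['country'].astype(str).str.strip().str.lower()
# Map known labels; anything unmapped falls back to "Other"
df['country_clean'] = key.map(country_map).fillna('Other')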
Quick debugging checklist
- Row count expected? Compare before vs after each transformation.
- Key uniqueness? Run distinct count vs row count.
- Type correctness? Print dtypes and sample 10 rows of critical columns.
- Join health? Count matched vs unmatched; review 20 random unmatched examples.
- Distribution shifts? Plot/describe numeric columns before and after cleaning.
Mini project: Orders quality pack
Goal: Produce a cleaned, validated orders dataset ready for reporting.
- Inputs: customers.csv, orders.csv, payments.csv.
- Tasks:
- Profile each file: row count, null stats, suspected keys.
- Fix types: dates to ISO, price to numeric USD, ids to strings.
- Standardize text: emails and names (strip, lower; proper case for names if needed).
- Handle missing: median impute price if missing, set unknown states to "Unknown".
- Detect outliers in payment amounts; cap with IQR rule.
- Create robust join keys; merge into a single wide table.
- Add validation rules (e.g., total_paid >= 0, order_date <= today).
- Log steps and produce a short README with decisions.
- Deliverables:
- Cleaned dataset (CSV or table).
- Validation report (counts of rules passed/failed).
- Script or notebook with ordered steps.
Acceptance checks
- No duplicate order_id rows.
- 0 rows with negative price or payment.
- < 1% of rows with unparsed dates (or documented reasons).
- All joins have <= 2% unmatched after key standardization, or documented reasons.
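A minimal sketch of how the acceptance checks might be scripted, assuming a cleaned pandas DataFrame named orders_clean (a placeholder) with order_id, price, order_date parsed to datetime, and the _merge indicator kept from the join step:
assert orders_clean['order_id'].is_unique, 'duplicate order_id rows'
assert (orders_clean['price'].dropna() >= 0).all(), 'negative price found'
unparsed = orders_clean['order_date'].isna().mean()
assert unparsed < 0.01, f'unparsed dates: {unparsed:.2%}'
unmatched = (orders_clean['_merge'] == 'left_only').mean()
assert unmatched <= 0.02, f'unmatched after join: {unmatched:.2%}'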
More practical project ideas
- Marketing leads cleanup: deduplicate by email + phone, standardize country codes, validate required fields.
- Product catalog normalization: unify units (kg vs lb), fix categories, remove orphan SKUs.
- Support tickets text cleaning: remove PII, normalize tags, detect duplicates across channels.
Subskills
- Missing Values Handling — decide when to drop, impute, or flag; compare methods.
- Duplicate Removal — identify exact and fuzzy duplicates; define primary keys.
- Data Type Fixes — coerce/convert columns safely and handle parse failures.
- Standardizing Formats — unify dates, casing, whitespace, currencies, and units.
- Outlier Detection and Treatment — flag with IQR/Z-score and decide capping/exclusion.
- Text Cleaning — strip, lower, tokenize, remove noise, and protect PII.
- Data Validation Rules — write assertions and thresholds; create pass/fail reports.
- Data Consistency Checks — reconcile totals, ensure referential integrity.
- Handling Categorical Values — map, group rare labels, align training/production categories.
- Normalizing and Scaling — standardize or min-max numeric features where needed.
- Merge and Join Cleanup — prepare keys, audit unmatched, fix join types.
- Logging and Reproducible Cleaning Pipelines — scripts, notebooks, or SQL with clear steps.
Next steps
- Apply these steps to one live dataset you use weekly and automate the cleaning.
- Add validations to your dashboards so bad data is flagged automatically.
- Move on to Exploratory Data Analysis and Feature Engineering once cleaning feels routine.