
Missing Values Handling

Learn Missing Values Handling for free with explanations, exercises, and a quick test (for Data Analysts).

Published: December 19, 2025 | Updated: December 19, 2025

Who this is for

Data analysts who need clean, reliable datasets for dashboards, descriptive analysis, and modeling. If you work with CSVs, spreadsheets, or databases where some values are blank, null, or suspicious placeholders, this is for you.

Prerequisites

  • Basic data types: numeric, categorical, date/time
  • Comfort with spreadsheets or a scripting tool (e.g., Python, R) to compute counts/percentages
  • Understanding of business context for the dataset

Why this matters

Real-world data often has gaps: missing ages, blank cities, or days without sales. As a Data Analyst, you will:

  • Prepare accurate KPIs: A single wrong imputation can shift averages and trends.
  • Ensure fair comparisons: Missing data can bias segment analysis (e.g., new vs. returning customers).
  • Enable modeling: Many algorithms cannot handle missing values directly.
  • Build trust: Clear handling and documentation make results reproducible and defensible.

Concept explained simply

Missing values are holes in your dataset. Some holes happen at random, some are systematic (e.g., a field was optional). Your job is to identify the hole type, decide if it matters, and choose whether to fill it, flag it, or drop it.

Mental model

Think of your dataset as a puzzle. Missing pieces can be:

  • Truly lost: No information exists (e.g., a customer skipped the field).
  • Hidden in plain sight: Placeholder values like -999, "N/A", or empty strings.
  • Meaningful absence: Missing might actually mean something (e.g., missing discount means no discount used).

Three common patterns:

  • MCAR (Missing Completely At Random): Missingness is pure chance. Simple imputations usually OK.
  • MAR (Missing At Random): Missingness depends on observed data. Stratified or model-based imputation helps.
  • MNAR (Missing Not At Random): Missingness depends on the missing value itself. Often best handled with domain rules and indicator flags.

How to detect missing values

  • Explicit missing: null, NA, NaN, None.
  • Implicit missing: empty strings, whitespace-only, zeros used as placeholders, special codes (e.g., 999, -1).
  • Time-series gaps: missing dates/timestamps.
  • Row-level holes: rows with many missing fields may be dropped.
Quick audit checklist
  • Count missing per column and overall percentage.
  • Check unique placeholders (e.g., -999, "Unknown").
  • Compare distributions before/after removing missing.
  • Check missingness by segment (e.g., by region or signup channel).
  • For time-series: find missing dates and run length of gaps.
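
The audit checklist above can be sketched in a few lines of pandas. This is a minimal sketch with made-up data and hypothetical column names; the key idea is to normalize implicit placeholders to real NaN before counting.

```python
import pandas as pd
import numpy as np

# Toy data with explicit nulls, an empty-string placeholder, and -999 codes
df = pd.DataFrame({
    "age": [28, np.nan, 34, 31, np.nan, 29],
    "city": ["Denver", "", "Austin", None, "Denver", "Austin"],
    "score": [-999, 10, 12, -999, 11, 13],  # -999 is a placeholder code
})

# Normalize implicit placeholders to real NaN first
df["city"] = df["city"].replace({"": np.nan})
df["score"] = df["score"].replace({-999: np.nan})

# Count and percentage missing per column
missing = df.isna().sum()
pct = df.isna().mean().mul(100).round(1)
print(pd.DataFrame({"n_missing": missing, "pct_missing": pct}))
```

If you skip the normalization step, the empty strings and -999 codes would silently count as valid values, which is exactly the "hidden in plain sight" trap described above.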

Choosing a strategy

Match the strategy to the variable type, business meaning, amount of missingness, and analysis goal.

Drop rows/columns
  • When: Very small fraction of rows missing, or a column is >60–80% missing and not critical.
  • Pros: Simple, no biases introduced by guessing.
  • Cons: Lose data; can bias if not MCAR.
Impute simple: mean, median, mode
  • Numeric: median is robust to outliers; mean for symmetric distributions.
  • Categorical: mode or a special category like "Unknown".
  • Pros: Fast, transparent.
  • Cons: Can shrink variance; may bias relationships.
Domain rules and constants
  • Examples: missing discount = 0; missing shipment date = not shipped yet.
  • Pros: Often most accurate when grounded in process knowledge.
  • Cons: Requires validation with stakeholders.
Forward/Backward fill (time-series)
  • When: Sensors, daily metrics, inventories.
  • Tip: Limit max fill window to avoid propagating long periods.
Interpolation (time-series)
  • Linear for short gaps; more advanced methods if needed.
  • Preserves trend; avoid across structural breaks (e.g., promo days).
Model-based imputation (e.g., KNN, regression)
  • When: MAR patterns and enough data.
  • Pros: Uses relationships between variables.
  • Cons: Complexity; risk of leakage if done improperly.
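
Model-based imputation can be sketched without extra libraries: below, a simple least-squares regression (numpy only) predicts a variable's missing values from a related observed variable. The data values are invented for illustration; in practice you would use a library imputer and guard against leakage by fitting on training data only.

```python
import numpy as np

age    = np.array([25, 30, 35, 40, 45], dtype=float)
income = np.array([50, np.nan, 60, 65, np.nan])  # thousands; two values missing

observed = ~np.isnan(income)

# Fit income ~ age on complete rows only
slope, intercept = np.polyfit(age[observed], income[observed], 1)

# Predict the missing incomes from age
filled = income.copy()
filled[~observed] = slope * age[~observed] + intercept
print(filled)  # [50. 55. 60. 65. 70.]
```

Because the fit uses the relationship between variables, the filled values respect the trend in the data instead of collapsing to a single constant, which is the main advantage over mean/median imputation under MAR.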

Always consider adding a missing indicator flag (0/1). It preserves the information that a value was missing and can reveal meaningful patterns.
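
A minimal sketch of the indicator pattern in pandas (hypothetical column names): create the flag from the original missingness before any filling happens.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [28, np.nan, 34], "city": ["Denver", None, "Austin"]})

# Add a 0/1 flag for every column that has at least one missing value
for col in list(df.columns):
    if df[col].isna().any():
        df[f"{col}_was_missing"] = df[col].isna().astype(int)

print(df)
```

Create flags first, impute second: once a value is filled, the information that it was originally missing is gone.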

Decision steps you can reuse

  1. Quantify: For each column, compute % missing and list placeholder codes.
  2. Diagnose: Check by segment/time. Is missingness random or patterned?
  3. Select: Choose drop, simple impute, domain rule, time-series method, or model-based.
  4. Flag: Add a _was_missing indicator.
  5. Validate: Compare distributions and key metrics before vs. after.
  6. Document: Note rules, thresholds, and rationale in a data dictionary.
Mini task per step
  • Quantify: Write down top 3 columns by % missing.
  • Diagnose: Name 1 segment with higher missingness.
  • Select: State which method you’ll use and why.
  • Flag: Name your indicator column conventions (e.g., age_was_missing).
  • Validate: List 2 metrics you’ll re-check (mean, variance, group share).
  • Document: One-sentence rule per column.

Worked examples

Example 1 — Numeric (median + flag)

age: [28, null, 34, 31, null, 29]
% missing: 2/6 = 33.3%
Distribution (non-missing): [28, 34, 31, 29]; sorted: [28, 29, 31, 34]; median = 30
Decision: Impute median=30, add age_was_missing
Result:
age: [28, 30, 34, 31, 30, 29]
age_was_missing: [0,1,0,0,1,0]
Validation: Mean before (non-missing)=30.5; after (with impute)≈30.33. A small shift toward the median and reduced variance are expected with median imputation; note both in your report.
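
The same median-plus-flag approach in pandas, including the before/after check:

```python
import pandas as pd
import numpy as np

age = pd.Series([28, np.nan, 34, 31, np.nan, 29])

median = age.median()             # computed from non-missing values only
flag = age.isna().astype(int)     # age_was_missing indicator
filled = age.fillna(median)

print("median:", median)
print("mean before:", age.mean(), "| mean after:", round(filled.mean(), 2))
```

Note that `Series.mean()` and `Series.median()` skip NaN by default, so "before" statistics come from the observed values only.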

Example 2 — Categorical (Unknown vs. mode)

city: ["Denver", "", "Austin", null, "Denver"]
Placeholders: empty string, null
Non-missing: ["Denver","Austin","Denver"] mode = "Denver"
Decision A: Use "Unknown" to avoid over-amplifying "Denver"
Result: ["Denver","Unknown","Austin","Unknown","Denver"]
Add city_was_missing flag
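
Example 2 as a pandas sketch: normalize the empty-string placeholder first, then fill with "Unknown" and flag.

```python
import pandas as pd
import numpy as np

city = pd.Series(["Denver", "", "Austin", None, "Denver"])
city = city.replace({"": np.nan})  # empty string is an implicit missing value

flag = city.isna().astype(int)     # city_was_missing indicator
filled = city.fillna("Unknown")
print(filled.tolist())  # ['Denver', 'Unknown', 'Austin', 'Unknown', 'Denver']
```

Using "Unknown" keeps the category shares of the real cities intact; filling with the mode would have pushed "Denver" from 2/5 to 4/5 of the column.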

Example 3 — Time-series (forward fill with limit + interpolation)

date: D1 D2 D3 D4 D5 D6
sales: 10 null null 14 null 16
Strategy: forward fill with max 1-day window, then linear interpolate remaining short gaps
Step 1 (ffill limit 1): [10,10,null,14,14,16]
Step 2 (interpolate remaining): gap at D3 between 10 and 14 => 12
Final: [10,10,12,14,14,16]
Flags: sales_was_missing: [0,1,1,0,1,0]
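
Example 3 as a pandas sketch: forward fill with a 1-step limit, then linear interpolation for the remaining short gap.

```python
import pandas as pd
import numpy as np

sales = pd.Series([10, np.nan, np.nan, 14, np.nan, 16], dtype=float)
flag = sales.isna().astype(int)              # sales_was_missing indicator

step1 = sales.ffill(limit=1)                 # [10, 10, NaN, 14, 14, 16]
step2 = step1.interpolate(method="linear")   # remaining gap filled as 12
print(step2.tolist())  # [10.0, 10.0, 12.0, 14.0, 14.0, 16.0]
```

The `limit` argument is what keeps a long outage from being silently painted over with a stale value; anything the limited fill leaves open is handled by interpolation or left flagged.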

Example 4 — MNAR (domain rule)

discount_applied: [null, 5, null, 10]
Domain: missing means 0 (no discount)
Result: [0,5,0,10]; add discount_was_missing flag to track the original absence
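
Example 4 as a pandas sketch: apply the domain rule (missing means 0) while keeping a flag so the original absence is not lost.

```python
import pandas as pd
import numpy as np

discount = pd.Series([np.nan, 5, np.nan, 10])
flag = discount.isna().astype(int)   # discount_was_missing indicator
filled = discount.fillna(0)
print(filled.tolist())  # [0.0, 5.0, 0.0, 10.0]
```

The flag matters here because "no discount" and "discount not recorded" are different facts, even though both end up as 0 in the cleaned column.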

Exercises

  1. Exercise 1 — Audit and impute a mixed dataset
    Instructions

    Dataset (CSV rows):

    id,age,income,city,joined_at,discount_code
    1,28,52000,Denver,2023-05-01,
    2,,61000,,2023-05-03,SPRING
    3,34,,Austin,2023-05-07,
    4,31,59000,Denver,,
    5,,54000,,2023-05-09,
    • Compute % missing by column.
    • Decide a strategy for each column (drop/impute/flag).
    • Apply: numeric -> median; city -> "Unknown"; joined_at -> leave missing and add flag; discount_code -> missing means none ("NONE").
    • Show the cleaned rows for ids 1–3.
    Expected Output

    A short table for ids 1–3 with imputed age/income where needed, city filled as "Unknown" when missing, joined_at left missing plus a joined_at_was_missing flag, and discount_code set to "NONE" when missing plus a discount_was_missing flag. Include the % missing summary by column.
  2. Exercise 2 — Time-series with gaps
    Instructions

    Daily users:

    day: 1 2 3 4 5 6 7 8 9 10
    users: 12 null null 15 16 null 18 null null 21
    • Forward fill with a maximum window of 2 days.
    • Then linearly interpolate any remaining single-day gaps.
    • Add users_was_missing flag.
    • Provide the final series.

Checklist

  • I listed all placeholder codes and true nulls.
  • I quantified missingness by column and by segment.
  • I chose strategies suited to data type and business meaning.
  • I added _was_missing indicators where appropriate.
  • I validated key metrics before vs. after.
  • I documented every rule in plain language.

Common mistakes and self-check

  • Using mean imputation on skewed numeric data. Self-check: Compare median/quantiles before vs. after.
  • Overusing mode for categorical data, inflating a popular category. Self-check: Track category shares.
  • Filling long time-series gaps with forward fill. Self-check: Limit window; review flagged long gaps.
  • Ignoring MNAR semantics. Self-check: Ask, does missing actually mean something?
  • Not flagging imputed values. Self-check: Ensure _was_missing exists for critical fields.

Practical projects

  • Customer profile cleaning: Age, income, city, and signup date with 10–20% missing. Deliver a one-page data quality report.
  • Sales daily time-series: Fill short gaps, leave long gaps flagged. Produce a clean series and a chart of gap lengths.
  • Product catalog: Normalize placeholders (N/A, -, blank) and create a codebook of rules used.

Learning path

  • Start: Missing detection and placeholder normalization.
  • Next: Column-wise strategies (numeric, categorical, time-series).
  • Advance: MAR vs. MNAR diagnosis and model-based imputation.
  • Always: Validation and documentation habits.

Mini challenge

Pick one dataset you use weekly. In 20 minutes, list top 3 columns by missingness, choose a strategy for each, and create a one-paragraph note explaining your choices to a stakeholder.

Quick Test

Take the quick test below to check your understanding. The test is available to everyone. If you are logged in, your progress will be saved automatically.


Missing Values Handling — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

