Why this matters
Real datasets have gaps: blank cells, NaN values, missing timestamps. As a Data Analyst, you will:
- Prepare datasets for dashboards where missing values would break KPIs.
- Clean features before modeling to avoid biased or failed training.
- Merge sources with different completeness and ensure results are trustworthy.
Mastering isna, fillna, dropna, and interpolation lets you diagnose and fix these gaps quickly and safely.
Concept explained simply
Think of your dataset as Swiss cheese: data is the cheese; holes are missing values (NaN/None/NaT). You can:
- Find holes: .isna() / .notna() (boolean masks).
- Fill holes: .fillna() (with a value, mean/median, forward/backward fill).
- Cut out too-holey pieces: .dropna().
- Estimate smooth values between known points: .interpolate() (best for ordered/time data).
Mental model:
- Detect: where and how much is missing?
- Decide: drop, fill with constant/statistic, group-wise fill, or interpolate.
- Document: why you chose that strategy.
- Validate: check distributions and counts after fixing.
Quick reference
Detect missing values
import pandas as pd
# Booleans
mask = df.isna() # same shape as df
# Counts
col_counts = df.isna().sum()
row_counts = df.isna().sum(axis=1)
# Percent per column
pct = df.isna().mean() * 100
# Filter rows with any NA in a subset
subset_rows = df[df[["age","income"]].isna().any(axis=1)]
Fill missing values
# Fill with a scalar
df["age"] = df["age"].fillna(df["age"].median())
# Fill different columns with different values
df = df.fillna({"city":"Unknown","score":0})
# Forward/backward fill (useful after sorting); .ffill()/.bfill() replace the deprecated fillna(method=...)
df = df.sort_values("date")
df["price"] = df["price"].ffill(limit=2)  # fill forward, at most 2 consecutive gaps
df["price"] = df["price"].bfill(limit=1)  # then fill backward, at most 1
Drop missing values
# Drop rows with any missing values
df1 = df.dropna()
# Drop rows if ALL values are missing
df2 = df.dropna(how="all")
# Keep rows that have at least 'thresh' non-NA values
# e.g., require at least 3 non-NA values in the row
df3 = df.dropna(thresh=3)
# Only consider a subset of columns when deciding to drop
df4 = df.dropna(subset=["age","income"])
Interpolate values
# Time-based interpolation requires a sorted DatetimeIndex
df = df.sort_values("date").set_index("date")
# Interpolate numeric columns by time
df["sales"] = df["sales"].interpolate(method="time", limit=2)
# Linear interpolation on index order (not by time)
df["temp"] = df["temp"].interpolate(method="linear")
# After interpolation, consider filling any remaining edge gaps
df["sales"] = df["sales"].ffill().bfill()
Worked examples
Example 1: Diagnose missingness
import pandas as pd
df = pd.DataFrame({
"age": [34, None, 29, None, 41],
"income": [55000, 72000, None, 64000, None],
"city": ["NY", "LA", None, "NY", "CHI"]
})
print("Missing count per column:\n", df.isna().sum())
print("Missing % per column:\n", (df.isna().mean()*100).round(1))
# Rows with any missing in age or income
print(df[df[["age","income"]].isna().any(axis=1)])
Why it helps: you see severity and which features need attention.
Example 2: Fill numerics with median, categoricals with mode; group-wise
import numpy as np
df = pd.DataFrame({
"region": ["N","N","S","S","S"],
"price": [10.0, np.nan, 12.0, np.nan, 15.0],
"category": ["A", None, "B", None, "B"]
})
# Numeric: group-wise median imputation (avoid leakage across groups)
df["price"] = df.groupby("region")["price"].transform(
lambda s: s.fillna(s.median())
)
# Categorical: global mode
global_mode = df["category"].mode(dropna=True).iat[0]
df["category"] = df["category"].fillna(global_mode)
print(df)
Tip: For modeling, compute imputation stats on training data only to avoid leakage.
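To make the tip concrete, here is a minimal sketch of train-only imputation; the split and the DataFrame name (df_model) are assumed for illustration, and the column names match this example's style:
# Hypothetical 80/20 split of an assumed DataFrame df_model
train = df_model.iloc[:800].copy()
test = df_model.iloc[800:].copy()
# Learn imputation values from the training rows only
impute_values = {
    "age": train["age"].median(),
    "income": train["income"].median(),
    "city": train["city"].mode(dropna=True).iat[0],
}
# Reuse the same learned values on test (and later on new data)
train = train.fillna(impute_values)
test = test.fillna(impute_values)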
Example 3: Drop rows/columns with too many NAs
df = pd.DataFrame({
"a": [1, None, 3, None],
"b": [None, None, 5, 6],
"c": [7, 8, None, None]
})
# Keep rows with at least 2 non-NA values
rows_ok = df.dropna(thresh=2)
# Drop rows missing in critical columns only
critical = df.dropna(subset=["a","b"]) # requires both a and b present
# Drop columns with >= 50% missing
col_missing_pct = df.isna().mean()
df_reduced = df.loc[:, col_missing_pct < 0.5]
print(rows_ok)
print(critical)
print(df_reduced.columns.tolist())
Rule of thumb: Prefer targeted dropping using subset/thresh over dropping everything with any NA.
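To see why, a quick row-count comparison on the same small df from this example:
# Default dropna(): any NA in any column removes the row
print(len(df))                       # 4 rows before
print(len(df.dropna()))              # 0 rows left: every row has at least one NA
# Targeted dropping keeps far more data
print(len(df.dropna(subset=["a"])))  # 2 rows where 'a' is present
print(len(df.dropna(thresh=2)))      # 2 rows with at least 2 non-NA values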
Example 4: Interpolate time series safely
dates = pd.date_range("2023-01-01", periods=7, freq="D")
df = pd.DataFrame({
"date": dates,
"sales": [10, None, None, 16, None, 20, None]
}).sort_values("date").set_index("date")
# Interpolate by time; only fill runs up to 2 long
interp = df.copy()
interp["sales"] = interp["sales"].interpolate(method="time", limit=2)
# Fill remaining edges with last/next known
interp["sales"] = interp["sales"].fillna(method="ffill").fillna(method="bfill")
print(interp)
Use time-based interpolation when the index is a DatetimeIndex and the data changes smoothly.
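The difference between method="time" and method="linear" only shows up when timestamps are unevenly spaced; a small sketch with made-up readings shows it:
# The gap before the missing value is 1 day, the gap after it is 3 days
ts = pd.Series(
    [10.0, None, 22.0],
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05"]),
)
print(ts.interpolate(method="linear"))  # treats points as equally spaced -> 16.0
print(ts.interpolate(method="time"))    # weights by elapsed time -> 13.0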
Choosing a strategy (checklist)
- Is the feature critical for the task? If yes, prefer targeted fixes; avoid dropping it casually.
- Is the data numeric and changes smoothly over time? If yes, consider interpolation.
- Are missing values random? If not, document the pattern (e.g., new products missing price).
- Do you have groups (regions, stores)? Consider group-wise imputation to keep local structure.
- Will this feed a model? Compute imputation stats on training only and store them for inference.
- After fixing, did distributions and row counts stay sensible? Re-check with describe() and value_counts().
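Before answering these questions, it helps to summarize the damage per column. A small helper sketch (the name profile_missing is an assumption, not a pandas function):
import pandas as pd

def profile_missing(df):
    """Summarize missingness per column to inform drop/fill/interpolate decisions."""
    return pd.DataFrame({
        "missing_count": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "dtype": df.dtypes,
    }).sort_values("missing_pct", ascending=False)

# Columns with very high percentages are drop candidates;
# low-percentage numeric columns are candidates for median fill or interpolation
print(profile_missing(df))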
Exercises
Work locally in a notebook or Python REPL. Keep notes on your choices and why.
Exercise 1 — Diagnose and flag missingness
Dataset idea: customer records with columns: id, age, income, city, last_purchase_date.
- [ ] Compute missing count and percent per column.
- [ ] Filter rows where age or income is missing.
- [ ] Add a boolean column age_missing (True/False).
- [ ] Create a row-level missing_count; list rows with missing_count >= 2.
Note: Expected outputs are counts, percentages, filtered rows, and new columns.
Exercise 2 — Impute, drop, then validate
Dataset idea: daily_sales with columns: date, store, sales, price, promo_flag.
- [ ] Sort by date; set a DatetimeIndex.
- [ ] Group-wise fill price by store median.
- [ ] Fill promo_flag missing as False.
- [ ] Interpolate sales by time with limit=2, then ffill/bfill edges.
- [ ] Drop rows where sales and price are both still missing (subset).
- [ ] Validate by checking that sales is numeric, monotonicity isn’t forced, and no giant jumps were introduced.
Note: Document the before/after NA counts and a few sample rows.
Common mistakes and self-check
- Forward-filling without sorting time. Self-check: Did you sort by date before ffill/bfill?
- Using global mean for grouped data. Self-check: Would group-level medians better reflect local behavior?
- Dropping too aggressively. Self-check: Did you use subset/thresh instead of dropna() with defaults?
- Interpolating categorical fields. Self-check: Only interpolate numeric continuous data.
- Data leakage in ML. Self-check: Are imputation stats computed only from training data?
- Forgetting to validate results. Self-check: Recompute isna().sum() and compare distributions before/after.
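A minimal validation sketch tying the last two self-checks together, assuming you kept the raw frame as df_raw and produced df_clean (both names are placeholders):
# Missingness should have changed only where you intended
print("NA before:\n", df_raw.isna().sum())
print("NA after:\n", df_clean.isna().sum())
# Distributions should stay sensible: compare summaries side by side
numeric_cols = df_raw.select_dtypes("number").columns
print(pd.concat({"before": df_raw[numeric_cols].describe(),
                 "after": df_clean[numeric_cols].describe()}, axis=1))
# Row counts: any drop should be explainable, not accidental
print(len(df_raw), "->", len(df_clean), "rows")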
Practical projects
- Retail dashboard cleanup: Clean a month of sales data (sales, price, stock) with missing values; produce a simple daily KPI chart that doesn’t break on NAs.
- Sensor stream repair: Interpolate missing temperature readings using time-based methods and compare with last-observation-carried-forward.
- Customer segmentation prep: Impute age/income sensibly, document choices, and export a clean dataset with a data dictionary of imputations.
Who this is for
- Aspiring and junior Data Analysts preparing data for reporting and modeling.
- Anyone dealing with CSV/Excel files that often contain blanks or inconsistent data.
Prerequisites
- Basic Python and pandas (DataFrame, Series, indexing).
- Comfort with boolean masks and simple aggregations.
Learning path
- Refresh pandas basics: selection, boolean masks, groupby.
- Diagnose missingness with isna and summaries.
- Apply fillna strategies: constants, statistics, group-wise.
- Use dropna with subset and thresh.
- Interpolate time series appropriately.
- Validate and document your choices.
Next steps
- Automate your cleaning steps into reusable functions (see the sketch after this list).
- Add data validation checks before and after cleaning.
- Practice on 2–3 different datasets with varying missingness patterns.
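As a starting point for the first two bullets, here is a hedged sketch of one reusable cleaning function; the column names (date, store, price, sales) and the fill rules are placeholders to adapt per dataset:
def clean_sales(df):
    """Sketch of a cleaning pipeline: sort, group-wise fill, interpolate, validate."""
    out = df.sort_values("date").copy()
    # Group-wise median for price (placeholder rule)
    out["price"] = out.groupby("store")["price"].transform(lambda s: s.fillna(s.median()))
    # Row-order interpolation for sales (date is still a column here), then fill edges
    out["sales"] = out["sales"].interpolate(limit=2).ffill().bfill()
    # Validation: fail loudly if critical columns still contain NAs
    assert out[["sales", "price"]].notna().all().all(), "NAs remain in critical columns"
    return out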
Mini challenge
You have hourly web traffic data with columns: timestamp, page_views, avg_session_duration. 10% of page_views and 5% of duration are missing, often in 1–3 hour runs.
- Choose and apply an interpolation strategy (limit=3) for both metrics.
- Explain why you picked time vs linear method.
- Show before/after missing counts and one chart-friendly summary (min/median/max).
Ready for the quick test?
Take the short test to check your understanding. Everyone can take it; log in to save your progress.