Why this matters
Real datasets are messy. As a Data Analyst, you will constantly handle blanks, NAs, special values like -1 or "unknown", and mixed types. Correct handling of missing values affects:
- KPIs and dashboards (avoid biased averages and counts).
- Experiments and funnel analysis (don’t drop critical rows by accident).
- Feature creation and modeling (proper nullable types prevent silent coercions).
Typical day-to-day tasks this unlocks
- Reading CSVs with custom NA markers and preserving integers with missing values.
- Creating robust filters and groupby aggregations that respect missing data.
- Merging datasets and coalescing fields safely without breaking types.
Who this is for
- Aspiring or junior Data Analysts who work primarily in pandas.
- Analysts migrating from spreadsheets where blanks behave differently.
- Anyone tired of float-ifying integer columns just because of a single NA.
Prerequisites
- Basic pandas: DataFrame/Series, selection, groupby.
- Python basics: lists, dicts, functions.
- Comfort with reading CSVs and simple data cleaning.
Concept explained simply
Missing values are placeholders for "we don’t know this yet". Pandas supports a dedicated missing marker pd.NA and a family of nullable dtypes so you can keep correct types even when data is incomplete.
- Detect: isna()/isnull() (aliases), notna().
- Handle: fillna(), dropna(), interpolate(), where()/mask().
- Convert: convert_dtypes() (automatic), or astype('Int64'), astype('Float64'), astype('boolean'), astype('string').
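A minimal sketch of all three verbs on a toy Series (the values and dtype choice are just for illustration):
import pandas as pd

s = pd.Series([10, None, 30], dtype='Int64')  # None becomes <NA> under the nullable dtype
print(s.isna().tolist())          # [False, True, False] -- detect
print(s.fillna(0).tolist())       # [10, 0, 30]          -- handle
print(s.astype('Float64').dtype)  # Float64              -- convert, <NA> preserved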
Mental model
Think of three layers:
- Representation: What is missing? (pd.NA, np.nan, NaT, blanks, or sentinels like -1)
- Type: Can this column store missing values without changing its logical type? (Use nullable dtypes.)
- Decision: For each analysis step, choose to drop, impute, or keep missing values. Always make the choice explicit.
Nullable dtypes in pandas
- Integers: Int8/16/32/64 store integers and <NA> without casting to float.
- Unsigned integers: UInt8/16/32/64 plus <NA>.
- Floats: Float32/64, nullable versions that use <NA> (distinct from classic float64 with np.nan).
- Booleans: boolean (lowercase) supports True/False/<NA> with three-valued logic.
- Strings: string (not object) supports <NA>.
Datetime-like types commonly use NaT for missing, but isna() handles them consistently. You can mix pd.NA and NaT in workflows; use isna() to check missing uniformly.
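A small sketch of that uniformity (the column names are made up for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'amount': [1.5, np.nan],                          # classic float64 with NaN
    'label': pd.array(['a', pd.NA], dtype='string'),  # nullable string with <NA>
    'when': [pd.Timestamp('2024-01-01'), pd.NaT],     # datetime64 with NaT
})
print(df.isna())  # one boolean frame flags NaN, <NA>, and NaT alike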
Key behaviors to remember
- Comparisons: Any comparison with pd.NA yields <NA>, not True/False. Use isna()/notna().
- Boolean logic: With dtype boolean, True & <NA> is <NA>; False & <NA> is False.
- Arithmetic: Operations with <NA> usually yield <NA>. Impute or fillna() if needed.
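A short demo of all three behaviors on throwaway values:
import pandas as pd

print(pd.NA == 1)  # <NA> -- comparisons propagate instead of returning True/False

b = pd.Series([True, False, pd.NA], dtype='boolean')
print((b & pd.NA).tolist())  # [<NA>, False, <NA>] -- three-valued logic

n = pd.Series([1, pd.NA], dtype='Int64')
print((n + 10).tolist())     # [11, <NA>] -- arithmetic propagates <NA>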
Worked examples
1) Keep integers as integers with missing
import pandas as pd
s = pd.Series([1, 2, None, 4]) # float64 because of None
s = s.astype('Int64') # now nullable integer
a_median = int(s.dropna().median()) # 2
s_filled = s.fillna(a_median) # OK: stays Int64
print(s.dtype, s_filled.tolist())
# Int64 [1, 2, 2, 4]
Why this works
Using Int64 avoids the classic float-cast for missing integers. Filling with an int preserves the dtype.
2) Nullable boolean and safe filtering
df = pd.DataFrame({'opted_in': [True, pd.NA, False, pd.NA]})
df['opted_in'] = df['opted_in'].astype('boolean')
only_opted = df[df['opted_in'] == True] # explicit comparison, avoids <NA> ambiguity
rate = df['opted_in'].mean(skipna=True) # ignores <NA> by default
print(only_opted.index.tolist(), round(rate, 2))
# [0] 0.5
Tip
A direct filter like df[df['opted_in']] drops <NA> rows (they are treated as False). Be explicit if you need different behavior.
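Two explicit alternatives, reusing df from this example; which is right depends on whether <NA> should mean "not opted in" or merely "unknown":
explicit_opted = df[df['opted_in'].fillna(False)]  # <NA> counted as False; keeps row 0
known_only = df[df['opted_in'].notna()]            # keep only rows with a known answer; rows 0 and 2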
3) Replace sentinels and coalesce after merge
left = pd.DataFrame({'id': [1,2,3], 'city_left': ['NY', '-1', 'SF']})
right = pd.DataFrame({'id': [1,2,3], 'city_right': ['NYC', None, 'San Fran']})
merged = left.merge(right, on='id', how='left')
merged['city_left'] = merged['city_left'].replace('-1', pd.NA).astype('string')
merged['city'] = merged['city_right'].astype('string').combine_first(merged['city_left'])
print(merged[['id','city']])
# id city
# 0 1 NYC
# 1 2 <NA>
# 2 3 San Fran
Why combine_first
combine_first picks non-missing values from the first Series, then falls back to the second. Safer than manual fillna chains after merges.
4) Reading CSV with custom missing markers
# Suppose the file uses '', 'NA', and 'unknown' as missing
# and we want an integer id with missing allowed.
df = pd.read_csv(
'file.csv',
na_values=['', 'NA', 'unknown'],
keep_default_na=True,
dtype={'id': 'Int64'}
)
# Optionally align dtypes across the frame
df = df.convert_dtypes()
Step-by-step: Adopting nullable dtypes
- Standardize missing markers: Replace blanks, 'NA', -1, 'unknown' with pd.NA.
- Choose dtypes: Use convert_dtypes(), then explicitly set critical columns (e.g., astype('Int64')).
- Decide per column: Drop, impute, or keep missing, and document your choice.
- Validate: Re-check dtypes and counts: df.dtypes, df.isna().sum().
- Lock in: Save cleaned datasets to a reliable format (e.g., Parquet) that preserves dtypes.
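A minimal end-to-end sketch of the five steps on an in-memory frame (the columns and the -1 sentinel are illustrative):
import pandas as pd

raw = pd.DataFrame({'age': [34, None, -1], 'score': [7.5, None, 9.0]})  # -1 marks missing

clean = raw.replace(-1, pd.NA)               # 1) standardize markers
clean = clean.convert_dtypes()               # 2) auto-pick nullable dtypes...
clean['age'] = clean['age'].astype('Int64')  #    ...then pin critical columns explicitly
clean['age'] = clean['age'].fillna(int(clean['age'].dropna().median()))  # 3) impute age, keep score missing
print(clean.dtypes)        # 4) validate types...
print(clean.isna().sum())  #    ...and missing counts
# clean.to_parquet('clean.parquet')  # 5) lock in: Parquet preserves nullable dtypes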
Common mistakes (and self-check)
- Mistake: Comparing with == to find missing. Fix: Use isna()/notna() (see the demo after this list).
- Mistake: Letting one NA force integers to float. Fix: Use Int64 and friends.
- Mistake: Blind dropna() causing data loss. Fix: Use subset= and consider imputation (also shown below).
- Mistake: Filling booleans with strings like 'False'. Fix: Convert to boolean dtype first; fill with True/False or pd.NA.
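The first and third mistakes in action, on throwaway data:
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print((s == np.nan).sum())  # 0 -- NaN never compares equal, so == finds nothing
print(s.isna().sum())       # 1 -- the reliable way to count missing

df = pd.DataFrame({'a': [1, None], 'b': [None, 2]})
print(len(df.dropna()))              # 0 -- blind dropna() wipes every row here
print(len(df.dropna(subset=['a'])))  # 1 -- scoped to the column that matters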
Self-check quick scan
- Do df.isna().sum() before and after cleaning.
- Print df.dtypes and confirm logical types (Int64, Float64, boolean, string).
- Spot-check 5–10 rows around previously missing values.
Exercises
These mirror the exercise cards below. Try to solve them before opening the solutions.
Exercise 1 — Clean and convert with nullable dtypes
Goal: Replace custom missing markers, convert to nullable dtypes, and impute.
- Create a DataFrame with columns age, income, subscribed using the given data in the exercise card below.
- Treat '', 'NA', and -1 as missing where appropriate.
- Convert: age -> Int64, income -> Float64, subscribed -> boolean (map yes/no -> True/False).
- Impute: age with median (integer), income with median, subscribed with False.
- Verify dtypes and that there are 0 missing values.
Exercise 2 — Conditional imputation with mask/where
Goal: Fill missing discounts differently by region.
- Ensure discount is Float64.
- For region 'A', fill missing discounts with 0.0.
- For other regions, fill missing with the overall mean of existing discounts.
- Print the final discount list and dtype.
Checklist before you move on
- You replaced inconsistent missing markers with pd.NA.
- You used nullable dtypes instead of default numpy ones where needed.
- Your boolean logic didn’t accidentally drop <NA> rows.
- You validated changes with isna().sum() and dtypes.
Practical projects
- Customer data audit: Load two CSVs (customers, orders), standardize missing, convert to nullable dtypes, merge, and produce a data quality report (missing by column, by segment).
- Pricing gaps: Given product prices over months, identify where price is missing, forward-fill within product, then flag products still missing after fill as needing manual review.
- Subscription funnel: Build a boolean/nullable-boolean feature set for steps (viewed, trial_started, paid). Carefully count conversion rates, handling <NA> correctly.
Learning path
- Before this: Reading Data with pandas - CSV/Parquet, date parsing.
- This subskill: Missing values and nullable dtypes.
- Next: Groupby nuances with missing categories, joins, and coalescing fields.
Next steps
- Refactor one of your past notebooks to use nullable dtypes and recheck metrics.
- Adopt a standard missing-data policy for your team: markers, dtypes, and imputation rules.
- Take the quick test below to confirm mastery.
Mini challenge
You have a column country_code with values like 'US', 'GB', '', 'N/A', and None. Convert it to string dtype, normalize missing to <NA>, and compute the top 3 countries by count excluding <NA>. How would you verify there are no stray placeholders left?