Why this matters
Real datasets are messy. As a Data Analyst, you will constantly handle blanks, NAs, special values like -1 or "unknown", and mixed types. Correct handling of missing values affects:
- KPIs and dashboards (avoid biased averages and counts).
- Experiments and funnel analysis (don’t drop critical rows by accident).
- Feature creation and modeling (proper nullable types prevent silent coercions).
Typical day-to-day tasks this unlocks
- Reading CSVs with custom NA markers and preserving integers with missing values.
- Creating robust filters and groupby aggregations that respect missing data.
- Merging datasets and coalescing fields safely without breaking types.
Who this is for
- Aspiring or junior Data Analysts who work primarily in pandas.
- Analysts migrating from spreadsheets where blanks behave differently.
- Anyone tired of float-ifying integer columns just because of a single NA.
Prerequisites
- Basic pandas: DataFrame/Series, selection, groupby.
- Python basics: lists, dicts, functions.
- Comfort with reading CSVs and simple data cleaning.
Concept explained simply
Missing values are placeholders for "we don’t know this yet". Pandas supports a dedicated missing marker pd.NA and a family of nullable dtypes so you can keep correct types even when data is incomplete.
- Detect: isna()/isnull() (aliases), notna().
- Handle: fillna(), dropna(), interpolate(), where()/mask().
- Convert: convert_dtypes() (automatic), or astype('Int64'), astype('Float64'), astype('boolean'), astype('string').
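A minimal sketch of all three verbs on a toy Series (the values and dtype choice are just for illustration):
import pandas as pd

s = pd.Series([10, None, 30], dtype='Int64')  # None becomes <NA> under the nullable dtype
print(s.isna().tolist())          # [False, True, False] -- detect
print(s.fillna(0).tolist())       # [10, 0, 30]          -- handle
print(s.astype('Float64').dtype)  # Float64              -- convert, <NA> preserved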
Mental model
Think of three layers:
- Representation: What is missing? (pd.NA, np.nan, NaT, blanks, or sentinels like -1)
- Type: Can this column store missing values without changing its logical type? (Use nullable dtypes.)
- Decision: For each analysis step, choose to drop, impute, or keep missing values. Always make the choice explicit.
Nullable dtypes in pandas
- Integers: Int8/16/32/64 store integers and <NA> without casting to float.
- Unsigned integers: UInt8/16/32/64 plus <NA>.
- Floats: Float32/64, nullable versions that use <NA> (distinct from classic float64 with np.nan).
- Booleans: boolean (lowercase) supports True/False/<NA> with three-valued logic.
- Strings: string (not object) supports <NA>.
Datetime-like types commonly use NaT for missing, but isna() handles them consistently. You can mix pd.NA and NaT in workflows; use isna() to check missing uniformly.
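A small sketch of that uniformity (the column names are made up for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'amount': [1.5, np.nan],                          # classic float64 with NaN
    'label': pd.array(['a', pd.NA], dtype='string'),  # nullable string with <NA>
    'when': [pd.Timestamp('2024-01-01'), pd.NaT],     # datetime64 with NaT
})
print(df.isna())  # one boolean frame flags NaN, <NA>, and NaT alike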
Key behaviors to remember
- Comparisons: Any comparison with pd.NA yields <NA>, not True/False. Use isna()/notna().
- Boolean logic: With dtype boolean, True & <NA> is <NA>; False & <NA> is False.
- Arithmetic: Operations with <NA> usually yield <NA>. Impute or fillna() if needed.
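A short demo of all three behaviors on throwaway values:
import pandas as pd

print(pd.NA == 1)  # <NA> -- comparisons propagate instead of returning True/False

b = pd.Series([True, False, pd.NA], dtype='boolean')
print((b & pd.NA).tolist())  # [<NA>, False, <NA>] -- three-valued logic

n = pd.Series([1, pd.NA], dtype='Int64')
print((n + 10).tolist())     # [11, <NA>] -- arithmetic propagates <NA>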
Worked examples
1) Keep integers as integers with missing
import pandas as pd
s = pd.Series([1, 2, None, 4]) # float64 because of None
s = s.astype('Int64') # now nullable integer
a_median = int(s.dropna().median()) # 2
s_filled = s.fillna(a_median) # OK: stays Int64
print(s.dtype, s_filled.tolist())
# Int64 [1, 2, 2, 4]
Why this works
Using Int64 avoids the classic float-cast for missing integers. Filling with an int preserves the dtype.
2) Nullable boolean and safe filtering
df = pd.DataFrame({'opted_in': [True, pd.NA, False, pd.NA]})
df['opted_in'] = df['opted_in'].astype('boolean')
only_opted = df[df['opted_in'] == True] # explicit comparison, avoids <NA> ambiguity
rate = df['opted_in'].mean(skipna=True) # ignores <NA> by default
print(only_opted.index.tolist(), round(rate, 2))
# [0] 0.5
Tip
A direct filter like df[df['opted_in']] drops <NA> rows (they are treated as False). Be explicit if you need different behavior.
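Two explicit alternatives, reusing df from this example; which is right depends on whether <NA> should mean "not opted in" or merely "unknown":
explicit_opted = df[df['opted_in'].fillna(False)]  # <NA> counted as False; keeps row 0
known_only = df[df['opted_in'].notna()]            # keep only rows with a known answer; rows 0 and 2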
3) Replace sentinels and coalesce after merge
left = pd.DataFrame({'id': [1,2,3], 'city_left': ['NY', '-1', 'SF']})
right = pd.DataFrame({'id': [1,2,3], 'city_right': ['NYC', None, 'San Fran']})
merged = left.merge(right, on='id', how='left')
merged['city_left'] = merged['city_left'].replace('-1', pd.NA).astype('string')
merged['city'] = merged['city_right'].astype('string').combine_first(merged['city_left'])
print(merged[['id','city']])
# id city
# 0 1 NYC
# 1 2 <NA>
# 2 3 San Fran
Why combine_first
combine_first picks non-missing values from the first Series, then falls back to the second. Safer than manual fillna chains after merges.
4) Reading CSV with custom missing markers
# Suppose the file uses '', 'NA', and 'unknown' as missing
# and we want an integer id with missing allowed.
df = pd.read_csv(
'file.csv',
na_values=['', 'NA', 'unknown'],
keep_default_na=True,
dtype={'id': 'Int64'}
)
# Optionally align dtypes across the frame
df = df.convert_dtypes()
Step-by-step: Adopting nullable dtypes
- Standardize missing markers: Replace blanks, 'NA', -1, 'unknown' with pd.NA.
- Choose dtypes: Use convert_dtypes(), then explicitly set critical columns (e.g., astype('Int64')).
- Decide per column: Drop, impute, or keep missing, and document your choice.
- Validate: Re-check dtypes and counts: df.dtypes, df.isna().sum().
- Lock in: Save cleaned datasets to a reliable format (e.g., Parquet) that preserves dtypes.
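A minimal end-to-end sketch of the five steps on an in-memory frame (the columns and the -1 sentinel are illustrative):
import pandas as pd

raw = pd.DataFrame({'age': [34, None, -1], 'score': [7.5, None, 9.0]})  # -1 marks missing

clean = raw.replace(-1, pd.NA)               # 1) standardize markers
clean = clean.convert_dtypes()               # 2) auto-pick nullable dtypes...
clean['age'] = clean['age'].astype('Int64')  #    ...then pin critical columns explicitly
clean['age'] = clean['age'].fillna(int(clean['age'].dropna().median()))  # 3) impute age, keep score missing
print(clean.dtypes)        # 4) validate types...
print(clean.isna().sum())  #    ...and missing counts
# clean.to_parquet('clean.parquet')  # 5) lock in: Parquet preserves nullable dtypes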
Common mistakes (and self-check)
- Mistake: Comparing with == to find missing. Fix: Use isna()/notna() (see the demo after this list).
- Mistake: Letting one NA force integers to float. Fix: Use Int64 and friends.
- Mistake: Blind dropna() causing data loss. Fix: Use subset= and consider imputation (also shown below).
- Mistake: Filling booleans with strings like 'False'. Fix: Convert to boolean dtype first; fill with True/False or pd.NA.
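The first and third mistakes in action, on throwaway data:
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print((s == np.nan).sum())  # 0 -- NaN never compares equal, so == finds nothing
print(s.isna().sum())       # 1 -- the reliable way to count missing

df = pd.DataFrame({'a': [1, None], 'b': [None, 2]})
print(len(df.dropna()))              # 0 -- blind dropna() wipes every row here
print(len(df.dropna(subset=['a'])))  # 1 -- scoped to the column that matters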
Self-check quick scan
- Do df.isna().sum() before and after cleaning.
- Print df.dtypes and confirm logical types (Int64, Float64, boolean, string).
- Spot-check 5–10 rows around previously missing values.
Exercises
These mirror the exercise cards below. Try to solve them before opening the solutions.
Exercise 1 — Clean and convert with nullable dtypes
Goal: Replace custom missing markers, convert to nullable dtypes, and impute.
- Create a DataFrame with columns age, income, subscribed using the given data in the exercise card below.
- Treat '', 'NA', and -1 as missing where appropriate.
- Convert: age -> Int64, income -> Float64, subscribed -> boolean (map yes/no -> True/False).
- Impute: age with median (integer), income with median, subscribed with False.
- Verify dtypes and that there are 0 missing values.
Exercise 2 — Conditional imputation with mask/where
Goal: Fill missing discounts differently by region.
- Ensure discount is Float64.
- For region 'A', fill missing discounts with 0.0.
- For other regions, fill missing with the overall mean of existing discounts.
- Print the final discount list and dtype.
Checklist before you move on
- You replaced inconsistent missing markers with pd.NA.
- You used nullable dtypes instead of default numpy ones where needed.
- Your boolean logic didn’t accidentally drop <NA> rows.
- You validated changes with isna().sum() and dtypes.
Practical projects
- Customer data audit: Load two CSVs (customers, orders), standardize missing, convert to nullable dtypes, merge, and produce a data quality report (missing by column, by segment).
- Pricing gaps: Given product prices over months, identify where price is missing, forward-fill within product, then flag products still missing after fill as needing manual review.
- Subscription funnel: Build a boolean/nullable-boolean feature set for steps (viewed, trial_started, paid). Carefully count conversion rates, handling <NA> correctly.
Learning path
- Before this: Reading Data with pandas - CSV/Parquet, date parsing.
- This subskill: Missing values and nullable dtypes.
- Next: Groupby nuances with missing categories, joins, and coalescing fields.
Next steps
- Refactor one of your past notebooks to use nullable dtypes and recheck metrics.
- Adopt a standard missing-data policy for your team: markers, dtypes, and imputation rules.
- Take the quick test below to confirm mastery.
Mini challenge
You have a column country_code with values like 'US', 'GB', '', 'N/A', and None. Convert it to string dtype, normalize missing to <NA>, and compute the top 3 countries by count excluding <NA>. How would you verify there are no stray placeholders left?