
Working With Missing and Nullable Dtypes

Learn to work with missing values and nullable dtypes for free, with explanations, exercises, and a quick test (for Data Analysts).

Published: December 20, 2025 | Updated: December 20, 2025

Why this matters

Real datasets are messy. As a Data Analyst, you will constantly handle blanks, NAs, special values like -1 or "unknown", and mixed types. Correct handling of missing values affects:

  • KPIs and dashboards (avoid biased averages and counts).
  • Experiments and funnel analysis (don’t drop critical rows by accident).
  • Feature creation and modeling (proper nullable types prevent silent coercions).

Typical day-to-day tasks this unlocks
  • Reading CSVs with custom NA markers and preserving integers with missing values.
  • Creating robust filters and groupby aggregations that respect missing data.
  • Merging datasets and coalescing fields safely without breaking types.

Note: You can take the quick test on this page. Anyone can practice; saved progress is available to logged-in users.

Who this is for

  • Aspiring or junior Data Analysts who work primarily in pandas.
  • Analysts migrating from spreadsheets where blanks behave differently.
  • Anyone tired of float-ifying integer columns just because of a single NA.

Prerequisites

  • Basic pandas: DataFrame/Series, selection, groupby.
  • Python basics: lists, dicts, functions.
  • Comfort with reading CSVs and simple data cleaning.

Concept explained simply

Missing values are placeholders for "we don’t know this yet". Pandas supports a dedicated missing marker pd.NA and a family of nullable dtypes so you can keep correct types even when data is incomplete.

  • Detect: isna()/isnull() (aliases), notna()/notnull().
  • Handle: fillna(), dropna(), interpolate(), where()/mask().
  • Convert: convert_dtypes() (auto), or astype('Int64'), astype('Float64'), astype('boolean'), astype('string').
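A quick sketch of detect → handle → convert on toy data:

```python
import pandas as pd

s = pd.Series([10, None, 30])       # float64: None became NaN
print(s.isna().tolist())            # [False, True, False]
print(s.fillna(0).tolist())         # [10.0, 0.0, 30.0]
print(s.convert_dtypes().dtype)     # Int64: the integers are recovered
```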

Mental model

Think of three layers:

  1. Representation: What is missing? (pd.NA, np.nan, NaT, blanks, or sentinels like -1)
  2. Type: Can this column store missing values without changing its logical type? (Use nullable dtypes.)
  3. Decision: For each analysis step, choose to drop, impute, or keep missing values. Always make the choice explicit.

Nullable dtypes in pandas

  • Integers: Int8/16/32/64 store integers and <NA> without casting to float.
  • Unsigned Integers: UInt8/16/32/64 plus <NA>.
  • Floats: Float32/64 nullable versions using <NA> (distinct from classic float64 with np.nan).
  • Booleans: boolean (lowercase) supports True/False/<NA> with three-valued logic.
  • Strings: string (not object) supports <NA>.

Datetime-like types commonly use NaT for missing, but isna() handles them consistently. You can mix pd.NA and NaT in workflows; use isna() to check missing uniformly.
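A minimal sketch of constructing these dtypes directly (toy values):

```python
import pandas as pd

ints  = pd.Series([1, None, 3], dtype='Int64')           # 1, <NA>, 3
bools = pd.Series([True, None, False], dtype='boolean')
strs  = pd.Series(['a', None, 'b'], dtype='string')
dates = pd.Series(pd.to_datetime(['2024-01-01', None]))  # NaT for missing
print(ints.dtype, bools.dtype, strs.dtype, dates.dtype)
print(ints.isna().tolist(), dates.isna().tolist())       # uniform detection
```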

Key behaviors to remember
  • Comparisons: Any comparison with pd.NA yields <NA>, not True/False. Use isna()/notna().
  • Boolean logic: dtype boolean uses three-valued (Kleene) logic: True & <NA> is <NA>; False & <NA> is False; True | <NA> is True.
  • Arithmetic: Operations with <NA> often yield <NA>. Impute or fillna() if needed.
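These three behaviors can be checked directly:

```python
import pandas as pd

print(pd.NA == 1)            # <NA>, not False -- comparisons propagate
print(pd.isna(pd.NA == 1))   # True: use isna() to test for missing
b = pd.Series([True, False, None], dtype='boolean')
print((b & pd.NA).tolist())  # [<NA>, False, <NA>]  (Kleene logic)
print(pd.NA + 1)             # <NA>: arithmetic propagates too
```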

Worked examples

1) Keep integers as integers with missing

import pandas as pd
s = pd.Series([1, 2, None, 4])           # float64 because of None
s = s.astype('Int64')                    # now nullable integer
a_median = int(s.dropna().median())      # 2
s_filled = s.fillna(a_median)            # OK: stays Int64
print(s.dtype, s_filled.tolist())
# Int64 [1, 2, 2, 4]

Why this works

Using Int64 avoids the classic float-cast for missing integers. Filling with an int preserves the dtype.

2) Nullable boolean and safe filtering

df = pd.DataFrame({'opted_in': [True, pd.NA, False, pd.NA]})
df['opted_in'] = df['opted_in'].astype('boolean')
only_opted = df[df['opted_in'].fillna(False)]  # explicit: <NA> -> False before filtering
rate = df['opted_in'].mean(skipna=True)  # ignores <NA> by default
print(only_opted.index.tolist(), round(rate, 2))
# [0] 0.5

Tip

A direct filter like df[df['opted_in']] silently drops <NA> rows (the mask treats <NA> as False). Make the choice explicit when it matters.
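The difference can be sketched on toy data (behavior of recent pandas, where <NA> in a boolean mask is treated as False):

```python
import pandas as pd

df = pd.DataFrame({'opted_in': pd.array([True, pd.NA, False, pd.NA], dtype='boolean')})
implicit = df[df['opted_in']]                # <NA> rows silently dropped
explicit = df[df['opted_in'].fillna(False)]  # same rows kept, intent visible
print(implicit.index.tolist(), explicit.index.tolist())  # [0] [0]
```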

3) Replace sentinels and coalesce after merge

left = pd.DataFrame({'id': [1,2,3], 'city_left': ['NY', '-1', 'SF']})
right = pd.DataFrame({'id': [1,2,3], 'city_right': ['NYC', None, 'San Fran']})
merged = left.merge(right, on='id', how='left')
merged['city_left'] = merged['city_left'].replace('-1', pd.NA).astype('string')
merged['city'] = merged['city_right'].astype('string').combine_first(merged['city_left'])
print(merged[['id','city']])
#    id      city
# 0   1       NYC
# 1   2      <NA>
# 2   3  San Fran

Why combine_first

combine_first picks non-missing values from the first Series, then falls back to the second. Safer than manual fillna chains after merges.
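For Series with aligned indexes, combine_first behaves like fillna with a Series; a toy sketch:

```python
import pandas as pd

a = pd.Series([pd.NA, 'x', pd.NA], dtype='string')
b = pd.Series(['p', 'q', pd.NA], dtype='string')
print(a.combine_first(b).tolist())  # ['p', 'x', <NA>]
print(a.fillna(b).tolist())         # same here; combine_first also unions indexes
```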

4) Reading CSV with custom missing markers

# Suppose the file uses '', 'NA', and 'unknown' as missing
# and we want an integer id with missing allowed.
df = pd.read_csv(
    'file.csv',
    na_values=['', 'NA', 'unknown'],
    keep_default_na=True,
    dtype={'id': 'Int64'}
)
# Optionally align dtypes across the frame
df = df.convert_dtypes()
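The same call can be tried without a file by wrapping text in io.StringIO (toy CSV, hypothetical columns):

```python
import pandas as pd
from io import StringIO

csv_text = "id,score\n1,10\nNA,unknown\n3,\n"
df = pd.read_csv(
    StringIO(csv_text),
    na_values=['', 'NA', 'unknown'],
    dtype={'id': 'Int64'},
)
print(df['id'].tolist())       # [1, <NA>, 3] -- integers survive the NA
print(df.isna().sum().sum())   # 3 missing cells in total
```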

Step-by-step: Adopting nullable dtypes

  1. Standardize missing markers: Replace blanks, 'NA', -1, 'unknown' with pd.NA.
  2. Choose dtypes: Use convert_dtypes() then explicitly set critical columns (e.g., astype('Int64')).
  3. Decide per column: Drop, impute, or keep missing—and document your choice.
  4. Validate: Re-check dtypes and counts: df.dtypes, df.isna().sum().
  5. Lock in: Save cleaned datasets to a reliable format (e.g., Parquet) preserving dtypes.
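The five steps might look like this on a toy frame (column names and imputation choices are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'qty': [5, -1, None], 'note': ['ok', 'unknown', 'late']})

# 1) Standardize markers: -1 (numeric sentinel) and 'unknown' -> pd.NA
df['qty'] = df['qty'].astype('Int64').replace(-1, pd.NA)
df['note'] = df['note'].replace('unknown', pd.NA)
# 2) Infer nullable dtypes for the remaining columns
df = df.convert_dtypes()
# 3) Decide per column: impute qty with 0, keep note missing
df['qty'] = df['qty'].fillna(0)
# 4) Validate
print(df.dtypes.to_dict())        # qty: Int64, note: string
print(df.isna().sum().to_dict())  # qty: 0, note: 1
# 5) Lock in (Parquet preserves nullable dtypes)
# df.to_parquet('clean.parquet')
```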

Common mistakes (and self-check)

  • Mistake: Comparing with == to find missing. Fix: Use isna()/notna().
  • Mistake: Letting one NA force integers to float. Fix: Use Int64 and friends.
  • Mistake: Blind dropna() causing data loss. Fix: Use subset= and consider imputation.
  • Mistake: Filling booleans with strings like 'False'. Fix: Convert to boolean dtype first; fill with True/False or pd.NA.

Self-check quick scan
  • Do df.isna().sum() before and after cleaning.
  • Print df.dtypes and confirm logical types (Int64, Float64, boolean, string).
  • Spot-check 5–10 rows around previously missing values.

Exercises

These mirror the exercise cards below. Try to solve each one before opening the solutions.

Exercise 1 — Clean and convert with nullable dtypes

Goal: Replace custom missing markers, convert to nullable dtypes, and impute.

  • Create a DataFrame with columns age, income, subscribed using the given data in the exercise card below.
  • Treat '', 'NA', and -1 as missing where appropriate.
  • Convert: age -> Int64, income -> Float64, subscribed -> boolean (map yes/no -> True/False).
  • Impute: age with median (integer), income with median, subscribed with False.
  • Verify dtypes and that there are 0 missing values.

Exercise 2 — Conditional imputation with mask/where

Goal: Fill missing discounts differently by region.

  • Ensure discount is Float64.
  • For region 'A', fill missing discounts with 0.0.
  • For other regions, fill missing with the overall mean of existing discounts.
  • Print the final discount list and dtype.
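Not a full solution, but the core mask/fillna pattern looks like this on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['A', 'A', 'B', 'B'],
    'discount': pd.array([0.1, None, None, 0.3], dtype='Float64'),
})
overall_mean = df['discount'].mean()   # mean of existing values only
missing = df['discount'].isna()
# Region 'A': missing -> 0.0; everything else: missing -> overall mean
df['discount'] = df['discount'].mask(missing & (df['region'] == 'A'), 0.0)
df['discount'] = df['discount'].fillna(overall_mean)
print(df['discount'].tolist(), df['discount'].dtype)
```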

Checklist before you move on
  • You replaced inconsistent missing markers with pd.NA.
  • You used nullable dtypes instead of default numpy ones where needed.
  • Your boolean logic didn’t accidentally drop <NA> rows.
  • You validated changes with isna().sum() and dtypes.

Practical projects

  • Customer data audit: Load two CSVs (customers, orders), standardize missing, convert to nullable dtypes, merge, and produce a data quality report (missing by column, by segment).
  • Pricing gaps: Given product prices over months, identify where price is missing, forward-fill within product, then flag products still missing after fill as needing manual review.
  • Subscription funnel: Build a boolean/nullable-boolean feature set for steps (viewed, trial_started, paid). Carefully count conversion rates handling <NA> correctly.
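The forward-fill-within-product step of the pricing-gaps project can be sketched as (hypothetical frame):

```python
import pandas as pd

prices = pd.DataFrame({
    'product': ['a', 'a', 'a', 'b', 'b'],
    'month':   [1, 2, 3, 1, 2],
    'price':   pd.array([10, None, None, None, 20], dtype='Int64'),
})
# Forward-fill only within each product, never across products
prices['price'] = prices.groupby('product')['price'].ffill()
prices['needs_review'] = prices['price'].isna()   # still missing after fill
print(prices['price'].tolist())         # [10, 10, 10, <NA>, 20]
print(prices['needs_review'].tolist())  # [False, False, False, True, False]
```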

Learning path

  • Before this: Reading Data with pandas - CSV/Parquet, date parsing.
  • This subskill: Missing values and nullable dtypes.
  • Next: Groupby nuances with missing categories, joins, and coalescing fields.

Next steps

  • Refactor one of your past notebooks to use nullable dtypes and recheck metrics.
  • Adopt a standard missing-data policy for your team: markers, dtypes, and imputation rules.
  • Take the quick test below to confirm mastery.

Mini challenge

You have a column country_code with values like 'US', 'GB', '', 'N/A', and None. Convert it to string dtype, normalize missing to <NA>, and compute the top 3 countries by count excluding <NA>. How would you verify there are no stray placeholders left?
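One possible sketch (toy values; verification via isin and isna):

```python
import pandas as pd

cc = pd.Series(['US', 'GB', '', 'N/A', None, 'US'])
cc = cc.replace({'': pd.NA, 'N/A': pd.NA}).astype('string')
top3 = cc.value_counts(dropna=True).head(3)
print(top3.to_dict())                    # {'US': 2, 'GB': 1}
print(bool(cc.isin(['', 'N/A']).any()))  # False: no stray placeholders left
print(int(cc.isna().sum()))              # 3 proper <NA> values
```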

Practice Exercises

2 exercises to complete

Instructions

Create the DataFrame and perform cleaning:

import pandas as pd
raw = {
    'age': [25, -1, 41, None, 33, ''],
    'income': [52000, '', 'NA', 71000, -1, 63000],
    'subscribed': ['yes', 'no', None, 'yes', '', 'NA']
}
df = pd.DataFrame(raw)

# Steps to perform:
# 1) Standardize missing markers: '', 'NA', -1 (for numeric) -> pd.NA
# 2) Convert dtypes: age->Int64, income->Float64, subscribed->boolean (map yes/no)
# 3) Impute: age with median (int), income with median, subscribed missing with False
# 4) Print dtypes and df.isna().sum()

Expected Output
age dtype: Int64; income dtype: Float64; subscribed dtype: boolean; missing after fill — age: 0, income: 0, subscribed: 0

Working With Missing and Nullable Dtypes — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

