Why this matters
As a Data Analyst, you often load raw data, clean it, and deliver insights or dashboards. A single malformed value (like a negative price or a bad date) can silently break metrics. Assertions are lightweight, code-level guards that stop bad data early. They help you:
- Fail fast when data quality is off.
- Document assumptions (e.g., ages are 0–120, IDs are unique).
- Keep pipelines predictable and easier to debug.
Typical real tasks where assertions help:
- Verifying primary keys are unique before joining tables.
- Ensuring required columns are present and non-empty.
- Checking numeric ranges (e.g., discount between 0 and 1).
- Guarding aggregations (e.g., shares sum to ~100%).
Concept explained simply
An assertion is a check you expect to be true. If it’s false, Python raises an AssertionError and stops the run. In pandas, you build a boolean test with Series/DataFrame methods and assert that it’s all true.
import pandas as pd
# Example: all ages are between 0 and 120
assert df["age"].between(0, 120).all(), "Age out of expected range (0–120)"
Mental model
Think of assertions as gates in your data pipeline:
- Gate 1: After loading data (columns present, dtypes, no unexpected nulls).
- Gate 2: After key transformations (derived columns valid, ranges okay).
- Gate 3: Before exporting or visualizing (aggregates sensible, uniqueness preserved).
Each gate asks: “Do we still trust the data?” If not, stop and fix.
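The three gates could be sketched as small functions that each validate and pass the DataFrame along. The function names and columns here (check_loaded, user_id, age) are illustrative, not a standard API:

```python
import pandas as pd

def check_loaded(df: pd.DataFrame) -> pd.DataFrame:
    # Gate 1: structural checks right after loading
    required = {"user_id", "age"}
    assert required.issubset(df.columns), f"Missing columns: {required - set(df.columns)}"
    assert df["user_id"].notna().all(), "Null user_id found"
    return df

def check_transformed(df: pd.DataFrame) -> pd.DataFrame:
    # Gate 2: values still in expected ranges after transforms
    assert df["age"].between(0, 120).all(), "Age out of expected range (0-120)"
    return df

def check_output(df: pd.DataFrame) -> pd.DataFrame:
    # Gate 3: invariants that must hold before export
    assert df["user_id"].is_unique, "user_id must be unique"
    return df

df = pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 41, 0]})
df = check_output(check_transformed(check_loaded(df)))  # passes all three gates
```

Because each gate returns the DataFrame, the gates chain naturally between pipeline stages.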
Worked examples
Example 1: Range and missing checks
import pandas as pd
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [25, 41, 0],
    "income": [52000, 68000, 30000]
})
# No missing ages
assert df["age"].notna().all(), "Missing ages found"
# Ages between 0 and 120 inclusive
assert df["age"].between(0, 120).all(), "Age out of expected range (0–120)"
Why: Prevents silent nulls and out-of-bound values from skewing means/segments.
Example 2: Unique key and duplicates
df = pd.DataFrame({
    "order_id": [1001, 1002, 1002],
    "amount": [19.9, 25.0, 25.0]
})
assert df["order_id"].is_unique, "order_id must be unique"
Why: Merges or aggregations assume unique keys. A duplicate can double-count revenue.
Example 3: Category membership
allowed = {"bronze", "silver", "gold"}
df = pd.DataFrame({"tier": ["silver", "gold", "bronze"]})
assert df["tier"].isin(allowed).all(), "Unexpected tier value found"
Why: Prevents typos or new, unhandled categories from slipping in.
Example 4: Totals approximately sum to 1
import numpy as np
parts = pd.Series([0.5, 0.3, 0.2])
assert np.isclose(parts.sum(), 1.0, atol=1e-6), "Parts must sum to ~1.0"
Why: Floating-point arithmetic is imprecise; use a tolerance.
Step-by-step: add assertions to your pandas pipeline
- Define assumptions: Write down what must be true (columns, types, ranges, uniqueness).
- Place gates: After load, after key transforms, before output.
- Write precise asserts with messages: The message should tell you what broke and where.
- Use tolerances: For floats, use np.isclose with atol or rtol.
- Bundle checks: Create small helper functions (e.g., validate_customers(df)).
- Fail fast in dev, log in prod: In production, you might wrap asserts to log errors and stop downstream steps.
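As a sketch of the last two steps, a helper can bundle related checks, and a production wrapper can turn an AssertionError into a logged failure instead of a crash. The names validate_customers and run_gate are illustrative:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def validate_customers(df: pd.DataFrame) -> None:
    # Bundle related checks; each assert carries a specific message
    assert df["customer_id"].is_unique, "customer_id must be unique"
    assert df["age"].notna().all(), "Missing ages found"
    assert df["age"].between(0, 120).all(), "Age out of expected range (0-120)"

def run_gate(df: pd.DataFrame) -> bool:
    # In production: log the failure and tell downstream steps to stop
    try:
        validate_customers(df)
        return True
    except AssertionError as err:
        log.error("Validation failed: %s", err)
        return False

good = pd.DataFrame({"customer_id": [1, 2], "age": [30, 45]})
bad = pd.DataFrame({"customer_id": [1, 1], "age": [30, 45]})
ok_good = run_gate(good)  # True
ok_bad = run_gate(bad)    # False; the duplicate-key failure is logged
```

In development you can call validate_customers directly so failures stop the run immediately.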
Exercises (hands-on)
These mirror the graded exercises. Run them in a local notebook or script.
Exercise 1 (ex1): Customer ages
Goal: Ensure ages are present and between 0 and 120. Data should pass.
import pandas as pd
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": [22, 35, 44, 0]
})
# Write assertions here
- Write asserts so the script runs with no error.
- Include clear error messages.
Exercise 2 (ex2): Orders integrity
Goal: Ensure order_id is unique and status is from an allowed set. Data has a duplicate that should fail.
import pandas as pd
df = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "status": ["paid", "shipped", "paid", "canceled"]
})
allowed = {"paid", "shipped", "canceled", "refunded"}
# Write assertions here
- Expect an AssertionError on uniqueness with your custom message.
Exercise 3 (ex3): Campaign allocation
Goal: Ensure shares sum to ~1 and are non-negative. Data should pass.
import pandas as pd
import numpy as np
alloc = pd.Series([0.40, 0.35, 0.25])
# Write assertions here
- Use np.isclose with an absolute tolerance.
Self-check checklist
- Did you include clear, specific messages for every assert?
- Do all range checks cover lower and upper bounds?
- Do category checks use a set of allowed values?
- Did you use tolerances for floating-point comparisons?
- Are key uniqueness checks present before merges/joins?
Common mistakes and how to catch them
- Checking the mask, not its .all(): assert df["age"].between(0, 120) tries to treat a whole Series as a single boolean, and pandas raises ValueError ("The truth value of a Series is ambiguous"). Correct: assert df["age"].between(0, 120).all().
- Forgetting null checks: between returns False for NaN, so nulls fail the range check but with a misleading "out of range" message. Add an explicit notna().all() check to diagnose missing values separately.
- No tolerance on floats: Exact equality on sums can fail due to floating-point error; use np.isclose.
- Ambiguous messages: “assert failed” doesn’t help. Include column names and expected range/set.
- Asserting before converting types: Parse/convert first (e.g., to_datetime) then validate.
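To illustrate the last point, convert first, then validate the converted result. The column name signup_date is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"signup_date": ["2024-01-05", "2024-02-10"]})

# Wrong order: asserting on the raw strings compares text, not dates.
# Right order: parse with errors="coerce", then assert on the result.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
assert df["signup_date"].notna().all(), "Unparseable signup_date found"
assert (df["signup_date"] <= pd.Timestamp.today().normalize()).all(), "Future signup_date found"
```

Using errors="coerce" turns unparseable values into NaT, which the notna() assert then catches with a clear message.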
Quick self-audit snippet
# Replace <col> and values, run after each pipeline stage
assert set(required_cols).issubset(df.columns), f"Missing columns: {set(required_cols) - set(df.columns)}"
assert df["id"].is_unique, "id must be unique"
assert df["price"].ge(0).all(), "Negative prices found"
Practical projects to cement learning
- Sales pipeline gate: Load a CSV of orders, assert schema (columns/types), assert order_id uniqueness, assert non-negative amounts, and assert status in allowed set. Export only if all checks pass.
- Customer demographics validation: After cleaning, assert age range, country in allowed ISO list, and completion rate per row >= 0.8.
- Marketing spend consistency: Assert channel shares per campaign sum to ~1, spends are non-negative, and weekly dates are monotonic increasing.
Who this is for
- Aspiring and practicing Data Analysts who build pandas pipelines.
- Anyone who wants reliable, reproducible data transformations.
Prerequisites
- Basic Python (functions, booleans, exceptions).
- pandas fundamentals (DataFrame, Series, indexing, boolean masks).
- Numpy basics for numeric checks (optional but helpful).
Learning path
- pandas basics: selection, filtering, groupby.
- Data cleaning: types, missing values, parsing dates.
- This module: Assertions for validation gates.
- Testing data code: parametrized checks, fixtures, sample data.
- Automation: Schedule pipelines and alert on failed gates.
Next steps
- Refactor assertions into reusable validation functions.
- Add summary logs showing which checks ran and passed.
- Extend checks to joins (row count invariants) and aggregations (sanity bounds).
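A row-count invariant for a join might look like this sketch (the orders/customers tables are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 10, 20]})
customers = pd.DataFrame({"customer_id": [10, 20], "country": ["DE", "FR"]})

# The key must be unique on the "one" side, or the join duplicates order rows
assert customers["customer_id"].is_unique, "customer_id must be unique in customers"

merged = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")

# Invariant: a many-to-one left join must preserve the left row count
assert len(merged) == len(orders), "Join changed the number of order rows"
assert merged["country"].notna().all(), "Orders with unknown customer found"
```

pandas' own validate="many_to_one" argument raises MergeError on duplicate keys, so it complements the explicit asserts.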
Mini challenge
You receive daily inventory snapshots with columns: sku, date, stock_level. Write three assertions:
- sku is unique per date.
- date is a valid date and not in the future.
- stock_level is integer and non-negative.
Show a possible approach
import pandas as pd
import numpy as np
# assume df has columns: sku, date, stock_level
assert {"sku", "date", "stock_level"}.issubset(df.columns), "Missing required columns"
df["date"] = pd.to_datetime(df["date"], errors="coerce")
assert df["date"].notna().all(), "Invalid dates found"
assert (df["date"] <= pd.Timestamp.today().normalize()).all(), "Future dates found"
assert pd.api.types.is_integer_dtype(df["stock_level"]), "stock_level must be integer"
assert df["stock_level"].ge(0).all(), "Negative stock levels found"
assert not df.duplicated(subset=["date", "sku"]).any(), "Duplicate sku per date"
Quick Test and progress
Take the Quick Test below to check your understanding. Everyone can take the test; only logged-in users will have their progress saved.