
Data Validation With Assertions

Learn data validation with assertions for free, with explanations, exercises, and a quick test, written for Data Analysts.

Published: December 20, 2025 | Updated: December 20, 2025

Why this matters

As a Data Analyst, you often load raw data, clean it, and deliver insights or dashboards. A single malformed value (like a negative price or a bad date) can silently break metrics. Assertions are lightweight, code-level guards that stop bad data early. They help you:

  • Fail fast when data quality is off.
  • Document assumptions (e.g., ages are 0–120, IDs are unique).
  • Keep pipelines predictable and easier to debug.

Typical real tasks where assertions help:

  • Verifying primary keys are unique before joining tables.
  • Ensuring required columns are present and non-empty.
  • Checking numeric ranges (e.g., discount between 0 and 1).
  • Guarding aggregations (e.g., shares sum to ~100%).
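Each of these tasks can be expressed as a one-line assertion in pandas. A minimal sketch (the table and column names here are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "discount": [0.10, 0.25, 0.0],
})

# Required columns are present before doing anything else
required = {"order_id", "discount"}
assert required.issubset(orders.columns), f"Missing columns: {required - set(orders.columns)}"

# Primary key is unique before joining
assert orders["order_id"].is_unique, "order_id must be unique"

# Numeric range: discount between 0 and 1 inclusive
assert orders["discount"].between(0, 1).all(), "discount outside [0, 1]"
```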

Concept explained simply

An assertion is a check you expect to be true. If it’s false, Python raises an AssertionError and stops the run. In pandas, you build a boolean test with Series/DataFrame methods and assert that it’s all true.

import pandas as pd

# Assume df is a DataFrame that has already been loaded with an "age" column
# Example: all ages are between 0 and 120 (inclusive)
assert df["age"].between(0, 120).all(), "Age out of expected range (0–120)"

Mental model

Think of assertions as gates in your data pipeline:

  • Gate 1: After loading data (columns present, dtypes, no unexpected nulls).
  • Gate 2: After key transformations (derived columns valid, ranges okay).
  • Gate 3: Before exporting or visualizing (aggregates sensible, uniqueness preserved).

Each gate asks: “Do we still trust the data?” If not, stop and fix.
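The three gates can be sketched as small functions chained in order. This is one possible shape, not a prescribed API; the gate names, columns, and derived field are illustrative:

```python
import pandas as pd

def gate_load(df: pd.DataFrame) -> pd.DataFrame:
    # Gate 1: structure checks right after loading
    assert {"user_id", "age"}.issubset(df.columns), "Missing required columns"
    assert df["age"].notna().all(), "Missing ages found"
    return df

def gate_transform(df: pd.DataFrame) -> pd.DataFrame:
    # Gate 2: derived columns stay in a sensible range
    df = df.assign(age_decade=df["age"] // 10)
    assert df["age_decade"].between(0, 12).all(), "age_decade out of range"
    return df

def gate_export(df: pd.DataFrame) -> pd.DataFrame:
    # Gate 3: uniqueness preserved before output
    assert df["user_id"].is_unique, "user_id must be unique"
    return df

raw = pd.DataFrame({"user_id": [1, 2], "age": [34, 57]})
clean = gate_export(gate_transform(gate_load(raw)))
```

If any gate fails, the pipeline stops at that point with a message naming the broken assumption, which is exactly the "stop and fix" behavior described above.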

Worked examples

Example 1: Range and missing checks
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [25, 41, 0],
    "income": [52000, 68000, 30000]
})

# No missing ages
assert df["age"].notna().all(), "Missing ages found"
# Ages between 0 and 120 inclusive
assert df["age"].between(0, 120).all(), "Age out of expected range (0–120)"

Why: Prevents silent nulls and out-of-bound values from skewing means/segments.

Example 2: Unique key and duplicates
df = pd.DataFrame({
    "order_id": [1001, 1002, 1002],
    "amount": [19.9, 25.0, 25.0]
})

assert df["order_id"].is_unique, "order_id must be unique"

Why: Merges or aggregations assume unique keys. A duplicate can double-count revenue.

Example 3: Category membership
allowed = {"bronze", "silver", "gold"}
df = pd.DataFrame({"tier": ["silver", "gold", "bronze"]})

assert df["tier"].isin(allowed).all(), "Unexpected tier value found"

Why: Prevents typos or new, unhandled categories from slipping in.

Example 4: Totals approximately sum to 1
import numpy as np

parts = pd.Series([0.5, 0.3, 0.2])
assert np.isclose(parts.sum(), 1.0, atol=1e-6), "Parts must sum to ~1.0"

Why: Floating-point arithmetic is imprecise; use a tolerance.

Step-by-step: add assertions to your pandas pipeline

  1. Define assumptions: Write down what must be true (columns, types, ranges, uniqueness).
  2. Place gates: After load, after key transforms, before output.
  3. Write precise asserts with messages: The message should tell you what broke and where.
  4. Use tolerances: For floats, use np.isclose with atol or rtol.
  5. Bundle checks: Create small helper functions (e.g., validate_customers(df)).
  6. Fail fast in dev, log in prod: In production, you might wrap asserts to log errors and stop downstream steps.

Exercises (hands-on)

These mirror the graded exercises below. Run them in a local notebook or script.

Exercise 1 (ex1): Customer ages

Goal: Ensure ages are present and between 0 and 120. Data should pass.

import pandas as pd

df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": [22, 35, 44, 0]
})

# Write assertions here

  • Write asserts so the script runs with no error.
  • Include clear error messages.

Exercise 2 (ex2): Orders integrity

Goal: Ensure order_id is unique and status is from an allowed set. Data has a duplicate that should fail.

import pandas as pd

df = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "status": ["paid", "shipped", "paid", "canceled"]
})

allowed = {"paid", "shipped", "canceled", "refunded"}

# Write assertions here

  • Expect an AssertionError on uniqueness with your custom message.

Exercise 3 (ex3): Campaign allocation

Goal: Ensure shares sum to ~1 and are non-negative. Data should pass.

import pandas as pd
import numpy as np

alloc = pd.Series([0.40, 0.35, 0.25])

# Write assertions here

  • Use np.isclose with an absolute tolerance.

Self-check checklist

  • Did you include clear, specific messages for every assert?
  • Do all range checks cover lower and upper bounds?
  • Do category checks use a set of allowed values?
  • Did you use tolerances for floating-point comparisons?
  • Are key uniqueness checks present before merges/joins?

Common mistakes and how to catch them

  • Checking the mask, not its .all(): assert df["age"].between(0, 120) asserts an entire boolean Series, and pandas raises ValueError because the truth value of a multi-element Series is ambiguous. Correct: assert df["age"].between(0, 120).all().
  • Forgetting explicit null checks: between() evaluates to False for NaN, so missing values fail a range check with a misleading message. Assert notna().all() first so nulls are reported as nulls.
  • No tolerance on floats: Exact equality on sums can fail due to floating-point error; use np.isclose.
  • Ambiguous messages: “assert failed” doesn’t help. Include column names and expected range/set.
  • Asserting before converting types: Parse/convert first (e.g., to_datetime) then validate.
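The first mistake is worth seeing once. Asserting a boolean Series directly raises ValueError; reducing it with .all() gives the single boolean the assert needs:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 41, 130]})
mask = df["age"].between(0, 120)

# Wrong: asserting the mask itself raises ValueError
# ("The truth value of a Series is ambiguous...")
try:
    assert mask
    print("no error")
except ValueError:
    print("ValueError raised")

# Right: reduce the mask to a single boolean with .all()
print(bool(mask.all()))  # False, because 130 is out of range
```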

Quick self-audit snippet
# Replace <col> and values, run after each pipeline stage
assert set(required_cols).issubset(df.columns), f"Missing columns: {set(required_cols) - set(df.columns)}"
assert df["id"].is_unique, "id must be unique"
assert df["price"].ge(0).all(), "Negative prices found"

Practical projects to cement learning

  • Sales pipeline gate: Load a CSV of orders, assert schema (columns/types), assert order_id uniqueness, assert non-negative amounts, and assert status in allowed set. Export only if all checks pass.
  • Customer demographics validation: After cleaning, assert age range, country in allowed ISO list, and completion rate per row >= 0.8.
  • Marketing spend consistency: Assert channel shares per campaign sum to ~1, spends are non-negative, and weekly dates are monotonically increasing.

Who this is for

  • Aspiring and practicing Data Analysts who build pandas pipelines.
  • Anyone who wants reliable, reproducible data transformations.

Prerequisites

  • Basic Python (functions, booleans, exceptions).
  • pandas fundamentals (DataFrame, Series, indexing, boolean masks).
  • Numpy basics for numeric checks (optional but helpful).

Learning path

  1. pandas basics: selection, filtering, groupby.
  2. Data cleaning: types, missing values, parsing dates.
  3. This module: Assertions for validation gates.
  4. Testing data code: parametrized checks, fixtures, sample data.
  5. Automation: Schedule pipelines and alert on failed gates.

Next steps

  • Refactor assertions into reusable validation functions.
  • Add summary logs showing which checks ran and passed.
  • Extend checks to joins (row count invariants) and aggregations (sanity bounds).
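One way to combine the first two next steps, and the "fail fast in dev, log in prod" idea from the step-by-step section, is a small check runner. The run_checks helper and its check list are illustrative, not a standard API:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("validation")

def run_checks(df: pd.DataFrame, checks, strict: bool = True) -> bool:
    """Run (name, predicate) pairs; raise in strict mode, log and continue otherwise."""
    all_ok = True
    for name, predicate in checks:
        if bool(predicate(df)):
            log.info("PASS %s", name)
        else:
            all_ok = False
            if strict:
                raise AssertionError(f"Check failed: {name}")
            log.error("FAIL %s", name)
    return all_ok

df = pd.DataFrame({"id": [1, 2, 3], "price": [9.9, 0.0, 12.5]})
checks = [
    ("id unique", lambda d: d["id"].is_unique),
    ("price non-negative", lambda d: d["price"].ge(0).all()),
]
assert run_checks(df, checks, strict=False)
```

With strict=True a failed check stops the pipeline immediately; with strict=False every check runs and the summary log shows which ones passed.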

Mini challenge

You receive daily inventory snapshots with columns: sku, date, stock_level. Write three assertions:

  • sku is unique per date.
  • date is a valid date and not in the future.
  • stock_level is integer and non-negative.
A possible approach:
import pandas as pd
import numpy as np

# assume df has columns: sku, date, stock_level
assert {"sku", "date", "stock_level"}.issubset(df.columns), "Missing required columns"
df["date"] = pd.to_datetime(df["date"], errors="coerce")
assert df["date"].notna().all(), "Invalid dates found"
assert (df["date"] <= pd.Timestamp.today().normalize()).all(), "Future dates found"
assert pd.api.types.is_integer_dtype(df["stock_level"]), "stock_level must be integer"
assert df["stock_level"].ge(0).all(), "Negative stock levels found"
assert not df.duplicated(subset=["date", "sku"]).any(), "Duplicate sku per date"

Quick Test and progress

Take the Quick Test below to check your understanding.

Practice Exercises

3 exercises to complete

Instructions

Create assertions to ensure age is present and within 0–120 inclusive. Use the provided DataFrame. The code should run without errors.

import pandas as pd

df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": [22, 35, 44, 0]
})

# Your assertions here

Expected Output

No output (assertions pass).

Data Validation With Assertions — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

