Why this matters
As a Data Analyst, you often load raw data, clean it, and deliver insights or dashboards. A single malformed value (like a negative price or a bad date) can silently break metrics. Assertions are lightweight, code-level guards that stop bad data early. They help you:
- Fail fast when data quality is off.
- Document assumptions (e.g., ages are 0–120, IDs are unique).
- Keep pipelines predictable and easier to debug.
Typical real tasks where assertions help:
- Verifying primary keys are unique before joining tables.
- Ensuring required columns are present and non-empty.
- Checking numeric ranges (e.g., discount between 0 and 1).
- Guarding aggregations (e.g., shares sum to ~100%).
Concept explained simply
An assertion is a check you expect to be true. If it’s false, Python raises an AssertionError and stops the run. In pandas, you build a boolean test with Series/DataFrame methods and assert that it’s all true.
import pandas as pd
# Example: all ages are between 0 and 120
assert df["age"].between(0, 120).all(), "Age out of expected range (0–120)"
Mental model
Think of assertions as gates in your data pipeline:
- Gate 1: After loading data (columns present, dtypes, no unexpected nulls).
- Gate 2: After key transformations (derived columns valid, ranges okay).
- Gate 3: Before exporting or visualizing (aggregates sensible, uniqueness preserved).
Each gate asks: “Do we still trust the data?” If not, stop and fix.
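The three gates could be sketched as small functions that each validate and pass the DataFrame along. The function names and columns here (check_loaded, user_id, age) are illustrative, not a standard API:

```python
import pandas as pd

def check_loaded(df: pd.DataFrame) -> pd.DataFrame:
    # Gate 1: structural checks right after loading
    required = {"user_id", "age"}
    assert required.issubset(df.columns), f"Missing columns: {required - set(df.columns)}"
    assert df["user_id"].notna().all(), "Null user_id found"
    return df

def check_transformed(df: pd.DataFrame) -> pd.DataFrame:
    # Gate 2: values still in expected ranges after transforms
    assert df["age"].between(0, 120).all(), "Age out of expected range (0-120)"
    return df

def check_output(df: pd.DataFrame) -> pd.DataFrame:
    # Gate 3: invariants that must hold before export
    assert df["user_id"].is_unique, "user_id must be unique"
    return df

df = pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 41, 0]})
df = check_output(check_transformed(check_loaded(df)))  # passes all three gates
```

Because each gate returns the DataFrame, the gates chain naturally between pipeline stages.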
Worked examples
Example 1: Range and missing checks
import pandas as pd
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [25, 41, 0],
    "income": [52000, 68000, 30000]
})
# No missing ages
assert df["age"].notna().all(), "Missing ages found"
# Ages between 0 and 120 inclusive
assert df["age"].between(0, 120).all(), "Age out of expected range (0–120)"
Why: Prevents silent nulls and out-of-bound values from skewing means/segments.
Example 2: Unique key and duplicates
df = pd.DataFrame({
    "order_id": [1001, 1002, 1002],
    "amount": [19.9, 25.0, 25.0]
})
assert df["order_id"].is_unique, "order_id must be unique"
Why: Merges or aggregations assume unique keys. A duplicate can double-count revenue.
Example 3: Category membership
allowed = {"bronze", "silver", "gold"}
df = pd.DataFrame({"tier": ["silver", "gold", "bronze"]})
assert df["tier"].isin(allowed).all(), "Unexpected tier value found"
Why: Prevents typos or new, unhandled categories from slipping in.
Example 4: Totals approximately sum to 1
import numpy as np
parts = pd.Series([0.5, 0.3, 0.2])
assert np.isclose(parts.sum(), 1.0, atol=1e-6), "Parts must sum to ~1.0"
Why: Floating-point arithmetic is imprecise; use a tolerance.
Step-by-step: add assertions to your pandas pipeline
- Define assumptions: Write down what must be true (columns, types, ranges, uniqueness).
- Place gates: After load, after key transforms, before output.
- Write precise asserts with messages: The message should tell you what broke and where.
- Use tolerances: For floats, use np.isclose with atol or rtol.
- Bundle checks: Create small helper functions (e.g., validate_customers(df)).
- Fail fast in dev, log in prod: In production, you might wrap asserts to log errors and stop downstream steps.
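As a sketch of the last two steps, a helper can bundle related checks, and a production wrapper can turn an AssertionError into a logged failure instead of a crash. The names validate_customers and run_gate are illustrative:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def validate_customers(df: pd.DataFrame) -> None:
    # Bundle related checks; each assert carries a specific message
    assert df["customer_id"].is_unique, "customer_id must be unique"
    assert df["age"].notna().all(), "Missing ages found"
    assert df["age"].between(0, 120).all(), "Age out of expected range (0-120)"

def run_gate(df: pd.DataFrame) -> bool:
    # In production: log the failure and tell downstream steps to stop
    try:
        validate_customers(df)
        return True
    except AssertionError as err:
        log.error("Validation failed: %s", err)
        return False

good = pd.DataFrame({"customer_id": [1, 2], "age": [30, 45]})
bad = pd.DataFrame({"customer_id": [1, 1], "age": [30, 45]})
ok_good = run_gate(good)  # True
ok_bad = run_gate(bad)    # False; the duplicate-key failure is logged
```

In development you can call validate_customers directly so failures stop the run immediately.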
Exercises (hands-on)
These mirror the graded exercises. Run them in a local notebook or script.
Exercise 1 (ex1): Customer ages
Goal: Ensure ages are present and between 0 and 120. Data should pass.
import pandas as pd
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": [22, 35, 44, 0]
})
# Write assertions here
- Write asserts so the script runs with no error.
- Include clear error messages.
Exercise 2 (ex2): Orders integrity
Goal: Ensure order_id is unique and status is from an allowed set. Data has a duplicate that should fail.
import pandas as pd
df = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "status": ["paid", "shipped", "paid", "canceled"]
})
allowed = {"paid", "shipped", "canceled", "refunded"}
# Write assertions here
- Expect an AssertionError on uniqueness with your custom message.
Exercise 3 (ex3): Campaign allocation
Goal: Ensure shares sum to ~1 and are non-negative. Data should pass.
import pandas as pd
import numpy as np
alloc = pd.Series([0.40, 0.35, 0.25])
# Write assertions here
- Use np.isclose with an absolute tolerance.
Self-check checklist
- Did you include clear, specific messages for every assert?
- Do all range checks cover lower and upper bounds?
- Do category checks use a set of allowed values?
- Did you use tolerances for floating-point comparisons?
- Are key uniqueness checks present before merges/joins?
Common mistakes and how to catch them
- Checking the mask, not its .all(): assert df["age"].between(0, 120) tries to treat a whole Series as a single boolean, and pandas raises ValueError ("The truth value of a Series is ambiguous"). Correct: assert df["age"].between(0, 120).all().
- Forgetting null checks: between returns False for NaN, so nulls fail the range check but with a misleading "out of range" message. Add an explicit notna().all() check to diagnose missing values separately.
- No tolerance on floats: Exact equality on sums can fail due to floating-point error; use np.isclose.
- Ambiguous messages: “assert failed” doesn’t help. Include column names and expected range/set.
- Asserting before converting types: Parse/convert first (e.g., to_datetime) then validate.
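To illustrate the last point, convert first, then validate the converted result. The column name signup_date is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"signup_date": ["2024-01-05", "2024-02-10"]})

# Wrong order: asserting on the raw strings compares text, not dates.
# Right order: parse with errors="coerce", then assert on the result.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
assert df["signup_date"].notna().all(), "Unparseable signup_date found"
assert (df["signup_date"] <= pd.Timestamp.today().normalize()).all(), "Future signup_date found"
```

Using errors="coerce" turns unparseable values into NaT, which the notna() assert then catches with a clear message.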
Quick self-audit snippet
# Replace <col> and values, run after each pipeline stage
assert set(required_cols).issubset(df.columns), f"Missing columns: {set(required_cols) - set(df.columns)}"
assert df["id"].is_unique, "id must be unique"
assert df["price"].ge(0).all(), "Negative prices found"
Practical projects to cement learning
- Sales pipeline gate: Load a CSV of orders, assert schema (columns/types), assert order_id uniqueness, assert non-negative amounts, and assert status in allowed set. Export only if all checks pass.
- Customer demographics validation: After cleaning, assert age range, country in allowed ISO list, and completion rate per row >= 0.8.
- Marketing spend consistency: Assert channel shares per campaign sum to ~1, spends are non-negative, and weekly dates are monotonic increasing.
Who this is for
- Aspiring and practicing Data Analysts who build pandas pipelines.
- Anyone who wants reliable, reproducible data transformations.
Prerequisites
- Basic Python (functions, booleans, exceptions).
- pandas fundamentals (DataFrame, Series, indexing, boolean masks).
- Numpy basics for numeric checks (optional but helpful).
Learning path
- pandas basics: selection, filtering, groupby.
- Data cleaning: types, missing values, parsing dates.
- This module: Assertions for validation gates.
- Testing data code: parametrized checks, fixtures, sample data.
- Automation: Schedule pipelines and alert on failed gates.
Next steps
- Refactor assertions into reusable validation functions.
- Add summary logs showing which checks ran and passed.
- Extend checks to joins (row count invariants) and aggregations (sanity bounds).
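A row-count invariant for a join might look like this sketch (the orders/customers tables are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 10, 20]})
customers = pd.DataFrame({"customer_id": [10, 20], "country": ["DE", "FR"]})

# The key must be unique on the "one" side, or the join duplicates order rows
assert customers["customer_id"].is_unique, "customer_id must be unique in customers"

merged = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")

# Invariant: a many-to-one left join must preserve the left row count
assert len(merged) == len(orders), "Join changed the number of order rows"
assert merged["country"].notna().all(), "Orders with unknown customer found"
```

pandas' own validate="many_to_one" argument raises MergeError on duplicate keys, so it complements the explicit asserts.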
Mini challenge
You receive daily inventory snapshots with columns: sku, date, stock_level. Write three assertions:
- sku is unique per date.
- date is a valid date and not in the future.
- stock_level is integer and non-negative.
Show a possible approach
import pandas as pd
import numpy as np
# assume df has columns: sku, date, stock_level
assert {"sku", "date", "stock_level"}.issubset(df.columns), "Missing required columns"
df["date"] = pd.to_datetime(df["date"], errors="coerce")
assert df["date"].notna().all(), "Invalid dates found"
assert (df["date"] <= pd.Timestamp.today().normalize()).all(), "Future dates found"
assert pd.api.types.is_integer_dtype(df["stock_level"]), "stock_level must be integer"
assert df["stock_level"].ge(0).all(), "Negative stock levels found"
assert not df.duplicated(subset=["date", "sku"]).any(), "Duplicate sku per date"
Quick Test and progress
Take the Quick Test below to check your understanding. Everyone can take the test; only logged-in users will have their progress saved.