Who this is for
Machine Learning Engineers, Data Scientists, and MLOps practitioners who ship models and data pipelines and want reliable, repeatable deployments.
Prerequisites
- Basic Python and unit testing concepts
- Familiarity with dataframes (e.g., pandas) and ML workflows (train-evaluate-serve)
- Comfort with command-line and Git
Why this matters
In ML, code can be correct while results are wrong due to bad data, changed distributions, or mismatched preprocessing. Automated tests catch these issues before they ship. Real tasks you will perform:
- Fail a build if input data schema changes or null rates spike
- Prevent regressions when refactoring feature code
- Verify model outputs are well-formed (shapes, probabilities, ranges)
- Guard evaluation metrics so the model cannot degrade silently
- Detect data drift early and alert the team
Concept explained simply
Automated tests are small programs that quickly check your assumptions. For ML, we test two things: the code (feature functions, training, serving) and the data (schema, quality, drift). If either breaks, your pipeline fails fast.
Mental model
- Contracts at boundaries: Define what must be true at each boundary: raw data in, features out, model in/out, metrics after training.
- Small, fast checks: Prefer tiny datasets and focused assertions that run in seconds.
- Representative samples: Use a “golden” sample dataset that mimics real shape and edge cases (a tiny in-memory sketch follows this list).
- Deterministic by default: Fix seeds and isolate randomness so tests aren’t flaky.
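The “golden” sample can be as small as a hand-written dataframe. Below is a minimal sketch, assuming pandas and the user_id/age/country/label schema used in the worked examples later in this section; the values are illustrative only.

# A tiny in-memory golden sample with edge cases (a missing age and a boundary age)
import numpy as np
import pandas as pd

GOLDEN = pd.DataFrame({
    'user_id': [1, 2, 3],
    'age': [34.0, np.nan, 120.0],   # one missing value, one boundary value
    'country': ['DE', 'US', 'BR'],
    'label': [1, 0, 1],
})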
Test types and where they live
- Unit tests (code): Feature functions, preprocessing, utility code. Fast and frequent.
- Data quality tests (data): Schema, nulls, ranges, unique keys, categorical sets.
- Integration tests (pipeline): End-to-end on a tiny dataset: train, evaluate, produce a model artifact.
- Contract tests (serving): Input/output shapes, dtype, probability constraints.
- Regression tests (metrics): Compare metrics to a saved baseline; fail if worse beyond a threshold.
- Drift/sanity checks (production-sim): Detect significant distribution shifts vs. reference.
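The drift checks in the last item can start as a single statistical comparison against a stored reference sample. The sketch below assumes scipy is available and a hypothetical tests/resources/latest_sample.csv snapshot of recent data; the 0.01 p-value cutoff is an arbitrary choice, not a recommendation.

# Minimal drift check: compare current 'age' values to the golden reference
import pandas as pd
from scipy.stats import ks_2samp

def test_age_distribution_has_not_drifted():
    reference = pd.read_csv('tests/resources/golden.csv')['age'].dropna()
    current = pd.read_csv('tests/resources/latest_sample.csv')['age'].dropna()  # hypothetical snapshot
    statistic, p_value = ks_2samp(reference, current)
    # A very small p-value means the two distributions differ significantly
    assert p_value > 0.01, f"age distribution drifted (KS statistic={statistic:.3f}, p={p_value:.4f})"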
Quick setup steps (copy these into your repo)
- Create a tests/ directory with tests/unit/, tests/data/, and tests/integration/ subdirectories.
- Add a small golden dataset under tests/resources/ (10–100 rows or a few images).
- Seed randomness in tests (e.g., numpy, random, torch, sklearn); a conftest.py sketch follows this list.
- Mock or stub external I/O (cloud, databases); keep tests local.
- Set metric thresholds (e.g., AUC cannot drop by more than 1%).
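For the seeding step, one common pattern is an autouse pytest fixture in tests/conftest.py. This is a minimal sketch assuming pytest and numpy; torch or sklearn seeding would be added in the same place.

# tests/conftest.py
import random

import numpy as np
import pytest

@pytest.fixture(autouse=True)
def fixed_seeds():
    # Runs automatically before every test so random behavior is reproducible
    random.seed(0)
    np.random.seed(0)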
Worked examples
Example 1 — Data schema and quality checks
# Python (run with pytest)
import pandas as pd

def test_raw_schema_and_quality():
    df = pd.read_csv('tests/resources/golden.csv')
    # Schema: required columns and dtypes
    required = {'user_id': 'int64', 'age': 'float64', 'country': 'object', 'label': 'int64'}
    assert set(required.keys()).issubset(df.columns)
    for col, dtype in required.items():
        assert str(df[col].dtype) == dtype
    # Quality: null rates, value ranges, allowed labels
    assert df['user_id'].isna().mean() == 0
    assert df['age'].between(0, 120).mean() > 0.98
    assert df['label'].isin([0, 1]).mean() == 1.0
Example 2 — Unit test for a feature function
def normalize_age(age, min_age=0, max_age=100):
    if age is None:
        return 0.0
    clipped = max(min_age, min(max_age, age))
    return (clipped - min_age) / (max_age - min_age)

def test_normalize_age_edges():
    assert normalize_age(None) == 0.0
    assert normalize_age(-5) == 0.0
    assert normalize_age(150) == 1.0
    assert abs(normalize_age(49) - 0.49) < 1e-9
Example 3 — Integration test for training pipeline
def train_pipeline(data_path, seed=42):
    import random
    import numpy as np
    random.seed(seed)
    np.random.seed(seed)
    # Stub for illustration: a real pipeline would load, split, fit, evaluate,
    # write the model artifact, and return the measured metrics and its path.
    return {'auc': 0.83, 'f1': 0.72}, 'artifacts/model.bin'

def test_integration_train_on_golden():
    import os
    metrics, model_path = train_pipeline('tests/resources/golden.csv', seed=7)
    # Guardrails on the golden dataset
    assert metrics['auc'] >= 0.80
    assert metrics['f1'] >= 0.70
    # Artifact exists
    assert os.path.exists(model_path)
Example 4 — Prediction contract and probability checks
import numpy as np

def model_predict_proba(X):
    # Placeholder standing in for the real model: returns an array of shape
    # (n_samples, 2) whose rows sum to 1.
    p = np.full((X.shape[0], 1), 0.5)
    return np.hstack([p, 1.0 - p])

def test_prediction_contract():
    X = np.ones((5, 10))
    P = model_predict_proba(X)
    assert P.shape == (5, 2)
    assert np.all(P >= 0) and np.all(P <= 1)
    row_sums = P.sum(axis=1)
    assert np.allclose(row_sums, 1.0, atol=1e-6)
How to use these tests in CI
- Run unit and data tests on every pull request. Keep them under 2 minutes.
- Run the tiny integration training on critical branches or nightly.
- Fail fast: if schema or contract tests fail, skip longer jobs.
- Store a small baseline metrics JSON to compare against (regression tests).
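A metrics regression guard against that baseline can be as short as the sketch below. It reuses train_pipeline from Example 3 and assumes a hypothetical tests/resources/baseline_metrics.json file and a 0.01 allowed AUC drop.

# Minimal regression guard against a stored baseline
import json

from my_project.pipeline import train_pipeline  # hypothetical import; use your real module

def test_auc_has_not_regressed():
    with open('tests/resources/baseline_metrics.json') as f:  # hypothetical path
        baseline = json.load(f)                               # e.g. {"auc": 0.83, "f1": 0.72}
    metrics, _ = train_pipeline('tests/resources/golden.csv', seed=7)
    allowed_drop = 0.01
    assert metrics['auc'] >= baseline['auc'] - allowed_drop, (
        f"AUC dropped from {baseline['auc']} to {metrics['auc']}"
    )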
Exercises
Each exercise has a matching item in the Exercises section below. Do them locally with a tiny CSV or an in-memory dataframe.
- Exercise 1: Write a data test that fails if any required column is missing, or if null rate in a numeric column exceeds 2%.
- Exercise 2: Write a regression test that compares a new AUC to a baseline and fails if the drop is > 0.02.
- Checklist for both exercises:
- Use a small golden dataset with at least one edge case
- Make tests deterministic (set seeds, fixed inputs)
- Assertions are clear and include helpful failure messages
- Thresholds are realistic (avoid flaky results)
Common mistakes and self-check
- Flaky tests from randomness: Fix with deterministic seeds and stable thresholds.
- Overly strict thresholds: Allow tiny tolerance; prefer ranges or deltas.
- Testing on full datasets: Makes CI slow; use tiny representative samples.
- No data schema checks: Add required columns, dtypes, and allowed categories.
- Mismatched preprocessing train vs. serve: Test both paths on the same inputs (see the parity sketch after this list).
- Hidden I/O dependencies: Mock external services; avoid network in tests.
- Ignoring drift: Add distribution comparisons on key features.
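For the train-vs-serve mismatch above, a parity test runs both preprocessing paths on the same rows and asserts the results are identical. The two feature functions below are hypothetical stand-ins; in a real project you would import your actual batch and online code.

# Minimal parity check between batch (training) and row-by-row (serving) preprocessing
import pandas as pd

def build_features_train(df):
    # Hypothetical batch path used during training
    return pd.DataFrame({'age_norm': df['age'].fillna(0.0).clip(0, 100) / 100.0})

def build_features_serve(records):
    # Hypothetical row-by-row path used at serving time
    rows = []
    for r in records:
        age = r.get('age')
        if age is None or age != age:   # treat missing/NaN as 0, matching the batch path
            age = 0.0
        rows.append({'age_norm': min(max(age, 0.0), 100.0) / 100.0})
    return pd.DataFrame(rows)

def test_train_and_serve_preprocessing_match():
    raw = pd.read_csv('tests/resources/golden.csv')
    batch = build_features_train(raw).reset_index(drop=True)
    online = build_features_serve(raw.to_dict(orient='records')).reset_index(drop=True)
    pd.testing.assert_frame_equal(batch, online)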
Self-check prompts
- If a column disappears tomorrow, will your CI fail?
- If class imbalance shifts by 20%, do you notice?
- Can you reproduce yesterday’s metrics byte-for-byte with the same seed?
- Do predictions satisfy shape and probability constraints?
Practical projects
- Project A — Titanic tiny pipeline: Build unit tests for feature engineering, add data quality tests (nulls, age range), and an integration test that trains a tiny model and asserts accuracy > baseline.
- Project B — House prices regression: Add schema checks for numeric vs. categorical features, test that RMSE does not worsen > 3% vs. a stored baseline.
- Project C — Image classifier mini: Contract test that input tensors have shape (N,C,H,W), outputs sum to 1 along classes; regression test on top-1 accuracy with a 50-image golden set.
Learning path
- Start with unit tests for feature code
- Add data schema and quality assertions
- Introduce a tiny integration training test
- Layer on contract tests for inference
- Finally, add regression metrics checks and simple drift detection
Next steps
- Create your golden dataset folder and wire tests into your CI job
- Decide baseline metrics and acceptable deltas
- Document your data contract in a README alongside tests
Mini challenge
Add one new automated test that would have caught a recent bug in your project. Keep it under 30 lines and under 1 second runtime.