
Automated Tests For Data And Code

Learn Automated Tests For Data And Code for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Who this is for

Machine Learning Engineers, Data Scientists, and MLOps practitioners who ship models and data pipelines and want reliable, repeatable deployments.

Prerequisites

  • Basic Python and unit testing concepts
  • Familiarity with dataframes (e.g., pandas) and ML workflows (train-evaluate-serve)
  • Comfort with command-line and Git

Why this matters

In ML, code can be correct while results are wrong due to bad data, changed distributions, or mismatched preprocessing. Automated tests catch these issues before they ship. Real tasks you will perform:

  • Fail a build if input data schema changes or null rates spike
  • Prevent regressions when refactoring feature code
  • Verify model outputs are well-formed (shapes, probabilities, ranges)
  • Guard evaluation metrics so the model cannot degrade silently
  • Detect data drift early and alert the team

Concept explained simply

Automated tests are small programs that quickly check your assumptions. For ML, we test two things: the code (feature functions, training, serving) and the data (schema, quality, drift). If either breaks, your pipeline fails fast.

Mental model

  • Contracts at boundaries: Define what must be true at each boundary: raw data in, features out, model in/out, metrics after training.
  • Small, fast checks: Prefer tiny datasets and focused assertions that run in seconds.
  • Representative samples: Use a “golden” sample dataset that mimics real shape and edge cases.
  • Deterministic by default: Fix seeds and isolate randomness so tests aren’t flaky.

Test types and where they live

  • Unit tests (code): Feature functions, preprocessing, utility code. Fast and frequent.
  • Data quality tests (data): Schema, nulls, ranges, unique keys, categorical sets.
  • Integration tests (pipeline): End-to-end on a tiny dataset: train, evaluate, produce a model artifact.
  • Contract tests (serving): Input/output shapes, dtype, probability constraints.
  • Regression tests (metrics): Compare metrics to a saved baseline; fail if worse beyond a threshold.
  • Drift/sanity checks (production-sim): Detect significant distribution shifts vs. reference (see Example 5 below).

Quick setup steps (copy these into your repo)
  1. Create a tests/ directory: tests/unit/, tests/data/, tests/integration/.
  2. Add a small golden dataset under tests/resources/ (10–100 rows or a few images).
  3. Seed randomness in tests (e.g., numpy, random, torch, sklearn).
  4. Mock or stub external I/O (cloud, databases); keep tests local (see the conftest.py sketch after these steps).
  5. Set metric thresholds (e.g., AUC cannot drop by more than 1%).
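
A minimal conftest.py sketch for steps 3 and 4 (the fixture names and seed value are illustrative; add torch/sklearn seeding as your stack requires):

# tests/conftest.py
import random
import socket

import numpy as np
import pytest

@pytest.fixture(autouse=True)
def fixed_seeds():
    # Every test starts from the same seeds, so results are deterministic.
    random.seed(0)
    np.random.seed(0)

@pytest.fixture
def no_network(monkeypatch):
    # Tests that request this fixture fail loudly if they try to open a socket.
    def _blocked(*args, **kwargs):
        raise RuntimeError("Network access is disabled in tests")
    monkeypatch.setattr(socket, "socket", _blocked)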

Worked examples

Example 1 — Data schema and quality checks
# Runnable Python (requires pandas); expects the golden CSV from setup step 2
import pandas as pd

def test_raw_schema_and_quality():
    df = pd.read_csv('tests/resources/golden.csv')
    # Schema: required columns and dtypes
    required = {'user_id': 'int64', 'age': 'float64', 'country': 'object', 'label': 'int64'}
    assert set(required.keys()).issubset(df.columns), "missing required columns"
    for col, dtype in required.items():
        assert str(df[col].dtype) == dtype, f"{col} is {df[col].dtype}, expected {dtype}"
    # Quality: null rates, value ranges, allowed label values
    assert df['user_id'].isna().mean() == 0, "user_id must never be null"
    assert df['age'].isna().mean() <= 0.02, "age null rate above 2%"
    assert df['age'].between(0, 120).mean() > 0.98, "too many ages outside 0-120"
    assert df['label'].isin([0, 1]).mean() == 1.0, "labels must be 0 or 1"
Example 2 — Unit test for a feature function
def normalize_age(age, min_age=0, max_age=100):
    if age is None:
        return 0.0
    clipped = max(min_age, min(max_age, age))
    return (clipped - min_age) / (max_age - min_age)

def test_normalize_age_edges():
    assert normalize_age(None) == 0.0
    assert normalize_age(-5) == 0.0
    assert normalize_age(150) == 1.0
    assert 0.49 < normalize_age(50) < 0.51
Example 3 — Integration test for training pipeline
def train_pipeline(data_path, seed=42):
    # Placeholder pipeline: a real one would load, split, fit, evaluate,
    # and save the trained model before returning metrics and its path.
    import os, random
    import numpy as np
    random.seed(seed); np.random.seed(seed)
    os.makedirs('artifacts', exist_ok=True)
    with open('artifacts/model.bin', 'wb') as f:
        f.write(b'stub')  # stand-in for a serialized model
    return {'auc': 0.83, 'f1': 0.72}, 'artifacts/model.bin'

def test_integration_train_on_golden():
    metrics, model_path = train_pipeline('tests/resources/golden.csv', seed=7)
    # Guardrails: fail the build if metrics drop below the agreed floor
    assert metrics['auc'] >= 0.80
    assert metrics['f1'] >= 0.70
    # Artifact exists
    import os
    assert os.path.exists(model_path)
Example 4 — Prediction contract and probability checks
import numpy as np

def model_predict_proba(X):
    # Placeholder for a real model: returns shape (n_samples, 2) with rows
    # summing to 1 so the contract test below can run end-to-end.
    return np.full((X.shape[0], 2), 0.5)

def test_prediction_contract():
    X = np.ones((5, 10))
    P = model_predict_proba(X)
    assert P.shape == (5, 2)
    assert np.all(P >= 0) and np.all(P <= 1)
    row_sums = P.sum(axis=1)
    assert np.allclose(row_sums, 1.0, atol=1e-6)
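Example 5 — Drift check for a numeric feature
Drift checks were listed under test types above but have no worked example yet; this is a minimal sketch using SciPy's two-sample Kolmogorov-Smirnov test, where the comparison file name and the 0.05 cutoff are illustrative choices (with very small samples this check is noisy).
import pandas as pd
from scipy.stats import ks_2samp

def test_age_distribution_has_not_drifted():
    reference = pd.read_csv('tests/resources/golden.csv')
    current = pd.read_csv('tests/resources/latest_sample.csv')  # hypothetical recent sample
    stat, p_value = ks_2samp(reference['age'].dropna(), current['age'].dropna())
    # A very small p-value suggests the two samples come from different distributions.
    assert p_value > 0.05, f"Possible drift in 'age' (KS stat={stat:.3f}, p={p_value:.4f})"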

How to use these tests in CI

  • Run unit and data tests on every pull request. Keep them under 2 minutes.
  • Run the tiny integration training on critical branches or nightly.
  • Fail fast: if schema or contract tests fail, skip longer jobs.
  • Store a small baseline metrics JSON to compare against (regression tests).
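
A minimal sketch of that baseline comparison (the baseline path, keys, and 0.02 allowed drop are illustrative; in practice the new metrics come from the latest training run):

import json

def test_auc_does_not_regress():
    with open('tests/resources/baseline_metrics.json') as f:
        baseline = json.load(f)      # e.g. {"auc": 0.83, "f1": 0.72}
    new_metrics = {'auc': 0.82}      # illustrative; load these from the latest run
    allowed_drop = 0.02
    assert new_metrics['auc'] >= baseline['auc'] - allowed_drop, (
        f"AUC regressed from {baseline['auc']:.3f} to {new_metrics['auc']:.3f}"
    )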

Exercises

These exercises have matching items in the Practice Exercises section below. Do them locally with a tiny CSV or an in-memory dataframe.

  1. Exercise 1: Write a data test that fails if any required column is missing, or if null rate in a numeric column exceeds 2%.
  2. Exercise 2: Write a regression test that compares a new AUC to a baseline and fails if the drop is > 0.02.
  • Checklist for both exercises:
    • Use a small golden dataset with at least one edge case
    • Make tests deterministic (set seeds, fixed inputs)
    • Assertions are clear and include helpful failure messages
    • Thresholds are realistic (avoid flaky results)


Common mistakes and self-check

  • Flaky tests from randomness: Fix with deterministic seeds and stable thresholds.
  • Overly strict thresholds: Allow tiny tolerance; prefer ranges or deltas.
  • Testing on full datasets: Makes CI slow; use tiny representative samples.
  • No data schema checks: Add required columns, dtypes, and allowed categories.
  • Mismatched preprocessing train vs. serve: Test both paths on the same inputs (see the sketch after this list).
  • Hidden I/O dependencies: Mock external services; avoid network in tests.
  • Ignoring drift: Add distribution comparisons on key features.
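
For the train-vs-serve mismatch above, one approach is to run both preprocessing paths on identical inputs and assert they agree; the two build_* functions below are illustrative placeholders for your project's real paths:

import numpy as np

def build_training_features(raw):
    # Placeholder for the training-time preprocessing path.
    return np.array([raw['age'] / 100.0, 1.0 if raw['country'] == 'DE' else 0.0])

def build_serving_features(raw):
    # Placeholder for the serving-time path, often a separate code base.
    return np.array([raw['age'] / 100.0, 1.0 if raw['country'] == 'DE' else 0.0])

def test_train_and_serve_preprocessing_agree():
    raw = {'age': 37.0, 'country': 'DE'}
    assert np.allclose(build_training_features(raw), build_serving_features(raw), atol=1e-8)
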
Self-check prompts
  • If a column disappears tomorrow, will your CI fail?
  • If class imbalance shifts by 20%, do you notice?
  • Can you reproduce yesterday’s metrics byte-for-byte with the same seed?
  • Do predictions satisfy shape and probability constraints?

Practical projects

  • Project A — Titanic tiny pipeline: Build unit tests for feature engineering, add data quality tests (nulls, age range), and an integration test that trains a tiny model and asserts accuracy > baseline.
  • Project B — House prices regression: Add schema checks for numeric vs. categorical features, and test that RMSE does not worsen by more than 3% vs. a stored baseline.
  • Project C — Image classifier mini: Contract test that input tensors have shape (N,C,H,W), outputs sum to 1 along classes; regression test on top-1 accuracy with a 50-image golden set.

Learning path

  1. Start with unit tests for feature code
  2. Add data schema and quality assertions
  3. Introduce a tiny integration training test
  4. Layer on contract tests for inference
  5. Finally, add regression metrics checks and simple drift detection

Next steps

  • Create your golden dataset folder and wire tests into your CI job
  • Decide baseline metrics and acceptable deltas
  • Document your data contract in a README alongside tests

Mini challenge

Add one new automated test that would have caught a recent bug in your project. Keep it under 30 lines and under 1 second runtime.

Practice Exercises

2 exercises to complete

Instructions

Create a test that loads a small dataframe and asserts:

  • Columns: user_id (int), age (float), country (str), label (int) exist
  • Null rate of age is less than or equal to 0.02
  • Label is only 0 or 1

Use a tiny CSV or construct the dataframe in-memory.

Expected Output
Test passes when schema is correct, null rate ≤ 2%, and labels are in {0,1}. Fails with a clear message otherwise.

Automated Tests For Data And Code — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

