Who this is for
Machine Learning Engineers, Data Scientists, and MLOps practitioners who ship models and data pipelines and want reliable, repeatable deployments.
Prerequisites
- Basic Python and unit testing concepts
- Familiarity with dataframes (e.g., pandas) and ML workflows (train-evaluate-serve)
- Comfort with command-line and Git
Why this matters
In ML, code can be correct while results are wrong due to bad data, changed distributions, or mismatched preprocessing. Automated tests catch these issues before they ship. Real tasks you will perform:
- Fail a build if input data schema changes or null rates spike
- Prevent regressions when refactoring feature code
- Verify model outputs are well-formed (shapes, probabilities, ranges)
- Guard evaluation metrics so the model cannot degrade silently
- Detect data drift early and alert the team
Concept explained simply
Automated tests are small programs that quickly check your assumptions. For ML, we test two things: the code (feature functions, training, serving) and the data (schema, quality, drift). If either breaks, your pipeline fails fast.
Mental model
- Contracts at boundaries: Define what must be true at each boundary: raw data in, features out, model in/out, metrics after training.
- Small, fast checks: Prefer tiny datasets and focused assertions that run in seconds.
- Representative samples: Use a “golden” sample dataset that mimics real shape and edge cases (a tiny in-memory sketch follows this list).
- Deterministic by default: Fix seeds and isolate randomness so tests aren’t flaky.
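The “golden” sample can be as small as a hand-written dataframe. Below is a minimal sketch, assuming pandas and the user_id/age/country/label schema used in the worked examples later in this section; the values are illustrative only.

# A tiny in-memory golden sample with edge cases (a missing age and a boundary age)
import numpy as np
import pandas as pd

GOLDEN = pd.DataFrame({
    'user_id': [1, 2, 3],
    'age': [34.0, np.nan, 120.0],   # one missing value, one boundary value
    'country': ['DE', 'US', 'BR'],
    'label': [1, 0, 1],
})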
Test types and where they live
- Unit tests (code): Feature functions, preprocessing, utility code. Fast and frequent.
- Data quality tests (data): Schema, nulls, ranges, unique keys, categorical sets.
- Integration tests (pipeline): End-to-end on a tiny dataset: train, evaluate, produce a model artifact.
- Contract tests (serving): Input/output shapes, dtype, probability constraints.
- Regression tests (metrics): Compare metrics to a saved baseline; fail if worse beyond a threshold.
- Drift/sanity checks (production-sim): Detect significant distribution shifts vs. reference.
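The drift checks in the last item can start as a single statistical comparison against a stored reference sample. The sketch below assumes scipy is available and a hypothetical tests/resources/latest_sample.csv snapshot of recent data; the 0.01 p-value cutoff is an arbitrary choice, not a recommendation.

# Minimal drift check: compare current 'age' values to the golden reference
import pandas as pd
from scipy.stats import ks_2samp

def test_age_distribution_has_not_drifted():
    reference = pd.read_csv('tests/resources/golden.csv')['age'].dropna()
    current = pd.read_csv('tests/resources/latest_sample.csv')['age'].dropna()  # hypothetical snapshot
    statistic, p_value = ks_2samp(reference, current)
    # A very small p-value means the two distributions differ significantly
    assert p_value > 0.01, f"age distribution drifted (KS statistic={statistic:.3f}, p={p_value:.4f})"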
Quick setup steps (copy these into your repo)
- Create a tests/ directory with tests/unit/, tests/data/, and tests/integration/ subdirectories.
- Add a small golden dataset under tests/resources/ (10–100 rows or a few images).
- Seed randomness in tests (e.g., numpy, random, torch, sklearn); a conftest.py sketch follows this list.
- Mock or stub external I/O (cloud, databases); keep tests local.
- Set metric thresholds (e.g., AUC cannot drop by more than 1%).
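For the seeding step, one common pattern is an autouse pytest fixture in tests/conftest.py. This is a minimal sketch assuming pytest and numpy; torch or sklearn seeding would be added in the same place.

# tests/conftest.py
import random

import numpy as np
import pytest

@pytest.fixture(autouse=True)
def fixed_seeds():
    # Runs automatically before every test so random behavior is reproducible
    random.seed(0)
    np.random.seed(0)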
Worked examples
Example 1 — Data schema and quality checks
# Python (run with pytest)
import pandas as pd

def test_raw_schema_and_quality():
    df = pd.read_csv('tests/resources/golden.csv')
    # Schema: required columns and dtypes
    required = {'user_id': 'int64', 'age': 'float64', 'country': 'object', 'label': 'int64'}
    assert set(required.keys()).issubset(df.columns)
    for col, dtype in required.items():
        assert str(df[col].dtype) == dtype
    # Quality: null rates, value ranges, allowed labels
    assert df['user_id'].isna().mean() == 0
    assert df['age'].between(0, 120).mean() > 0.98
    assert df['label'].isin([0, 1]).mean() == 1.0
Example 2 — Unit test for a feature function
def normalize_age(age, min_age=0, max_age=100):
    if age is None:
        return 0.0
    clipped = max(min_age, min(max_age, age))
    return (clipped - min_age) / (max_age - min_age)

def test_normalize_age_edges():
    assert normalize_age(None) == 0.0
    assert normalize_age(-5) == 0.0
    assert normalize_age(150) == 1.0
    assert abs(normalize_age(49) - 0.49) < 1e-9
Example 3 — Integration test for training pipeline
def train_pipeline(data_path, seed=42):
    import random
    import numpy as np
    random.seed(seed)
    np.random.seed(seed)
    # Stub for illustration: a real pipeline would load, split, fit, evaluate,
    # write the model artifact, and return the measured metrics and its path.
    return {'auc': 0.83, 'f1': 0.72}, 'artifacts/model.bin'

def test_integration_train_on_golden():
    import os
    metrics, model_path = train_pipeline('tests/resources/golden.csv', seed=7)
    # Guardrails on the golden dataset
    assert metrics['auc'] >= 0.80
    assert metrics['f1'] >= 0.70
    # Artifact exists
    assert os.path.exists(model_path)
Example 4 — Prediction contract and probability checks
import numpy as np

def model_predict_proba(X):
    # Placeholder standing in for the real model: returns an array of shape
    # (n_samples, 2) whose rows sum to 1.
    p = np.full((X.shape[0], 1), 0.5)
    return np.hstack([p, 1.0 - p])

def test_prediction_contract():
    X = np.ones((5, 10))
    P = model_predict_proba(X)
    assert P.shape == (5, 2)
    assert np.all(P >= 0) and np.all(P <= 1)
    row_sums = P.sum(axis=1)
    assert np.allclose(row_sums, 1.0, atol=1e-6)
How to use these tests in CI
- Run unit and data tests on every pull request. Keep them under 2 minutes.
- Run the tiny integration training on critical branches or nightly.
- Fail fast: if schema or contract tests fail, skip longer jobs.
- Store a small baseline metrics JSON to compare against (regression tests).
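A metrics regression guard against that baseline can be as short as the sketch below. It reuses train_pipeline from Example 3 and assumes a hypothetical tests/resources/baseline_metrics.json file and a 0.01 allowed AUC drop.

# Minimal regression guard against a stored baseline
import json

from my_project.pipeline import train_pipeline  # hypothetical import; use your real module

def test_auc_has_not_regressed():
    with open('tests/resources/baseline_metrics.json') as f:  # hypothetical path
        baseline = json.load(f)                               # e.g. {"auc": 0.83, "f1": 0.72}
    metrics, _ = train_pipeline('tests/resources/golden.csv', seed=7)
    allowed_drop = 0.01
    assert metrics['auc'] >= baseline['auc'] - allowed_drop, (
        f"AUC dropped from {baseline['auc']} to {metrics['auc']}"
    )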
Exercises
Each exercise has a matching item in the Exercises section below. Do them locally with a tiny CSV or an in-memory dataframe.
- Exercise 1: Write a data test that fails if any required column is missing, or if null rate in a numeric column exceeds 2%.
- Exercise 2: Write a regression test that compares a new AUC to a baseline and fails if the drop is > 0.02.
- Checklist for both exercises:
- Use a small golden dataset with at least one edge case
- Make tests deterministic (set seeds, fixed inputs)
- Assertions are clear and include helpful failure messages
- Thresholds are realistic (avoid flaky results)
Common mistakes and self-check
- Flaky tests from randomness: Fix with deterministic seeds and stable thresholds.
- Overly strict thresholds: Allow tiny tolerance; prefer ranges or deltas.
- Testing on full datasets: Makes CI slow; use tiny representative samples.
- No data schema checks: Add required columns, dtypes, and allowed categories.
- Mismatched preprocessing train vs. serve: Test both paths on the same inputs (see the parity sketch after this list).
- Hidden I/O dependencies: Mock external services; avoid network in tests.
- Ignoring drift: Add distribution comparisons on key features.
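For the train-vs-serve mismatch above, a parity test runs both preprocessing paths on the same rows and asserts the results are identical. The two feature functions below are hypothetical stand-ins; in a real project you would import your actual batch and online code.

# Minimal parity check between batch (training) and row-by-row (serving) preprocessing
import pandas as pd

def build_features_train(df):
    # Hypothetical batch path used during training
    return pd.DataFrame({'age_norm': df['age'].fillna(0.0).clip(0, 100) / 100.0})

def build_features_serve(records):
    # Hypothetical row-by-row path used at serving time
    rows = []
    for r in records:
        age = r.get('age')
        if age is None or age != age:   # treat missing/NaN as 0, matching the batch path
            age = 0.0
        rows.append({'age_norm': min(max(age, 0.0), 100.0) / 100.0})
    return pd.DataFrame(rows)

def test_train_and_serve_preprocessing_match():
    raw = pd.read_csv('tests/resources/golden.csv')
    batch = build_features_train(raw).reset_index(drop=True)
    online = build_features_serve(raw.to_dict(orient='records')).reset_index(drop=True)
    pd.testing.assert_frame_equal(batch, online)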
Self-check prompts
- If a column disappears tomorrow, will your CI fail?
- If class imbalance shifts by 20%, do you notice?
- Can you reproduce yesterday’s metrics byte-for-byte with the same seed?
- Do predictions satisfy shape and probability constraints?
Practical projects
- Project A — Titanic tiny pipeline: Build unit tests for feature engineering, add data quality tests (nulls, age range), and an integration test that trains a tiny model and asserts accuracy > baseline.
- Project B — House prices regression: Add schema checks for numeric vs. categorical features, test that RMSE does not worsen > 3% vs. a stored baseline.
- Project C — Image classifier mini: Contract test that input tensors have shape (N,C,H,W), outputs sum to 1 along classes; regression test on top-1 accuracy with a 50-image golden set.
Learning path
- Start with unit tests for feature code
- Add data schema and quality assertions
- Introduce a tiny integration training test
- Layer on contract tests for inference
- Finally, add regression metrics checks and simple drift detection
Next steps
- Create your golden dataset folder and wire tests into your CI job
- Decide baseline metrics and acceptable deltas
- Document your data contract in a README alongside tests
Mini challenge
Add one new automated test that would have caught a recent bug in your project. Keep it under 30 lines and under 1 second runtime.