Why this matters
In MLOps, tests protect you from costly failures like broken feature engineering, incompatible model artifacts, or an API that returns nonsense predictions after a change. You will use unit, integration, and smoke tests to:
- Fail fast on bugs in data transforms and metrics code.
- Verify that pipelines run end-to-end on a tiny sample before long training jobs consume budget.
- Ensure contracts: schema in, model out, consistent shapes and dtypes, and stable endpoints.
- Confidently merge changes and deploy models with automated gates in CI/CD.
Real tasks you’ll face
- Writing a unit test for a feature scaler that sometimes outputs NaNs.
- Adding an integration test that trains a small logistic regression and saves the model artifact.
- Running a smoke test on a staging endpoint after each deploy to ensure 200 OK and reasonable latency.
Concept explained simply
Three complementary layers of confidence:
- Unit tests: Test one small piece (e.g., function to impute missing values) in isolation. Fast (milliseconds), deterministic, lots of them.
- Integration tests: Test how pieces work together (e.g., load data → preprocess → train → evaluate → save model). Slower (seconds), fewer than unit tests.
- Smoke tests: Quick “does it basically run” checks in CI or post-deploy. Tiny data, minimal steps. Do not check model quality deeply—only safety and basic behavior.
Mental model
Think of a test funnel:
- Wide top (many unit tests) catches most errors cheaply.
- Middle (integration tests) ensures the system works together.
- Narrow bottom (smoke tests) prevents obvious production breakage.
Risk-based selection
Test where risk is highest: schema/feature code, versioned dependencies, serialization formats, and external services (object store, model registry). Use mocks/fakes for external services in unit tests; use temporary local resources for integration tests.
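For example, a unit test can exercise upload logic against a fake client instead of a live registry. A minimal sketch, assuming a hypothetical upload_model helper and a client exposing an upload(path, name) method (both names are illustrative, not a real SDK):
# tests/unit/test_registry_upload.py (illustrative; adapt to your own client API)
from unittest.mock import MagicMock

# Hypothetical helper under test: pushes a local artifact to a model registry.
def upload_model(client, path, name):
    client.upload(path, name)
    return f"{name}:uploaded"

def test_upload_model_uses_fake_client():
    fake_client = MagicMock()  # no network call, no real registry
    result = upload_model(fake_client, "model.pkl", "churn-model")
    fake_client.upload.assert_called_once_with("model.pkl", "churn-model")
    assert result == "churn-model:uploaded"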
ML-specific testing nuances
- Determinism: Set random seeds and fixed small inputs. Use small, in-repo fixtures.
- Numerical tolerance: Compare floats with tolerance (e.g., absolute or relative tolerance) instead of exact equality.
- Data contracts: Validate schema (columns, dtypes, ranges) before pipeline steps (see the sketch after this list).
- Artifacts: Check that models, metrics, and reports are created with expected names and minimal size.
- Performance budget: Keep CI fast; aim for the unit suite to finish in seconds, integration tests in under ~1–2 minutes, and smoke tests in under ~30 seconds.
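A minimal sketch tying several of these nuances together: a seeded fixture, a tolerant float comparison, and a lightweight schema check (it uses pandas, and the column names and ranges are illustrative, not from your project):
# tests/unit/test_nuances.py (illustrative; adapt columns and ranges to your data)
import numpy as np
import pandas as pd
import pytest

def make_fixture(seed=42):
    rng = np.random.default_rng(seed)  # fixed seed -> deterministic test
    return pd.DataFrame({
        "age": rng.integers(18, 90, size=10),
        "income": rng.normal(50_000, 10_000, size=10),
    })

def test_mean_with_tolerance():
    df = make_fixture()
    # Compare floats with a tolerance, never with exact equality.
    assert df["income"].mean() == pytest.approx(50_000, rel=0.3)

def test_schema_contract():
    df = make_fixture()
    assert list(df.columns) == ["age", "income"]   # expected columns
    assert df["age"].between(18, 90).all()         # value range
    assert str(df["age"].dtype).startswith("int")  # dtype check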
Worked examples
Example 1: Unit test for a scaler
Goal: Ensure min-max scaler handles NaNs and constant columns.
# src/features/scaler.py
import numpy as np

def minmax_scale(x):
    x = np.array(x, dtype=float)
    mask = np.isnan(x)
    if np.all(mask):
        return np.zeros_like(x)
    x_no_nan = x[~mask]
    lo, hi = np.min(x_no_nan), np.max(x_no_nan)
    if hi == lo:
        out = np.zeros_like(x)
        out[~mask] = 0.0
        out[mask] = 0.0
        return out
    out = (x - lo) / (hi - lo)
    out[mask] = 0.0
    return out
# tests/unit/test_scaler.py
import numpy as np
from src.features.scaler import minmax_scale

def test_minmax_basic():
    got = minmax_scale([0, 5, 10])
    assert np.allclose(got, [0.0, 0.5, 1.0])

def test_minmax_nans_and_constant():
    got = minmax_scale([np.nan, 3, 3, np.nan])
    assert np.allclose(got, [0.0, 0.0, 0.0, 0.0])
What it proves: determinism, safe NaN handling, and contract (same length in, same length out).
Example 2: Integration test for a tiny pipeline
Goal: Run preprocess → train → evaluate → save artifact on a 20-row synthetic dataset.
# src/pipeline/train_small.py
import os, json, joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def make_data(n=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X, y

def run(output_dir):
    os.makedirs(output_dir, exist_ok=True)
    X, y = make_data()
    clf = LogisticRegression(max_iter=200)
    clf.fit(X, y)
    preds = clf.predict(X)
    acc = accuracy_score(y, preds)
    joblib.dump(clf, os.path.join(output_dir, "model.pkl"))
    with open(os.path.join(output_dir, "metrics.json"), "w") as f:
        json.dump({"accuracy": float(acc)}, f)
    return acc
# tests/integration/test_pipeline_small.py
import json, os, tempfile
from src.pipeline.train_small import run

def test_pipeline_runs_and_saves_artifacts():
    with tempfile.TemporaryDirectory() as d:
        acc = run(d)
        assert 0.5 <= acc <= 1.0  # lenient threshold for tiny data
        assert os.path.exists(os.path.join(d, "model.pkl"))
        with open(os.path.join(d, "metrics.json")) as f:
            metrics = json.load(f)
        assert "accuracy" in metrics
What it proves: components work together, artifacts exist, metrics are within a reasonable range.
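If you also want to guard against truncated or empty artifacts (the "minimal size" nuance above), a small follow-up check is enough; a sketch reusing the same run() helper:
# tests/integration/test_artifact_size.py (sketch; reuses run() from the example above)
import os, tempfile
from src.pipeline.train_small import run

def test_model_artifact_is_not_empty():
    with tempfile.TemporaryDirectory() as d:
        run(d)
        # A pickled model should be comfortably larger than a few bytes.
        assert os.path.getsize(os.path.join(d, "model.pkl")) > 100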
Example 3: Smoke test after deploy
Goal: Check that an inference API responds and that the response shape is correct. This is a pseudo-example you can adapt to your stack.
# tests/smoke/test_inference_smoke.py
import time

def test_inference_smoke():
    # Replace with your client; keep it tiny and fast.
    # Pseudo-client shown as example only.
    def predict(payload):
        # e.g., requests.post(...).json()
        # Here we emulate a fast response.
        return {"predictions": [0.12, 0.87]}

    start = time.time()
    resp = predict({"instances": [[0.1, -0.2, 0.3], [0.0, 0.0, 0.1]]})
    elapsed = time.time() - start
    assert "predictions" in resp
    preds = resp["predictions"]
    assert isinstance(preds, list) and len(preds) == 2
    assert elapsed < 0.5  # sanity latency budget for smoke
What it proves: endpoint contract and latency budget. It does not validate model quality.
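If your stack exposes an HTTP endpoint, the pseudo-client can be swapped for a real call. A sketch assuming the staging URL is passed via a STAGING_URL environment variable and a /predict route (both are assumptions; match them to your platform):
# tests/smoke/test_inference_http.py (sketch; assumes a JSON API at $STAGING_URL/predict)
import os
import time

import pytest
import requests

@pytest.mark.skipif("STAGING_URL" not in os.environ, reason="no staging endpoint configured")
def test_inference_http_smoke():
    url = os.environ["STAGING_URL"].rstrip("/") + "/predict"  # assumed route
    payload = {"instances": [[0.1, -0.2, 0.3]]}
    start = time.time()
    resp = requests.post(url, json=payload, timeout=5)
    elapsed = time.time() - start
    assert resp.status_code == 200
    body = resp.json()
    assert "predictions" in body and len(body["predictions"]) == 1
    assert elapsed < 2.0  # generous latency budget for a remote call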
How to structure tests and run in CI
project/
  src/
    features/
    pipeline/
  tests/
    unit/
    integration/
    smoke/
- Run order in CI: unit → integration → smoke. Fail early to save time.
- Use small, synthetic fixtures checked into the repo (a reusable fixture is sketched below).
- Cache dependencies. Keep tests parallelizable and independent.
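A reusable fixture usually lives in tests/conftest.py so every test can request it by name; a minimal sketch with an illustrative 20-row dataset:
# tests/conftest.py (shared pytest fixtures; dataset shape is illustrative)
import numpy as np
import pytest

@pytest.fixture
def tiny_dataset():
    """Deterministic 20-row dataset shared across unit and integration tests."""
    rng = np.random.default_rng(123)
    X = rng.normal(size=(20, 3))
    y = (X[:, 0] > 0).astype(int)
    return X, y

# Any test can use it by naming it as an argument:
# def test_shapes(tiny_dataset):
#     X, y = tiny_dataset
#     assert X.shape == (20, 3) and y.shape == (20,)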
Typical CI commands
- Unit: pytest tests/unit -q
- Integration: pytest tests/integration -q -k small
- Smoke: pytest tests/smoke -q
Exercises
Do these in your project or a fresh folder. They mirror the graded exercises below.
Exercise 1: Unit test a data transform
- Create src/features/scaler.py with a safe min-max scaler handling NaNs and constant columns.
- Write tests in tests/unit/test_scaler.py for typical, NaN-heavy, and constant inputs.
- Run pytest and make all tests pass.
Starter snippet
# src/features/scaler.py
# Implement minmax_scale(x)
Exercise 2: Integration test a tiny training pipeline
- Create src/pipeline/train_small.py that generates tiny synthetic data, trains a lightweight model, and writes model.pkl and metrics.json.
- Write tests in tests/integration/test_pipeline_small.py that assert: training succeeds, accuracy is within [0.5, 1.0], and artifacts exist.
- Keep runtime under 10 seconds.
Quality checklist
- All tests are deterministic (fixed seeds).
- No network calls in unit/integration tests.
- Artifacts saved to a temporary directory.
- Numerical checks use tolerances where needed.
Common mistakes and how to self-check
- No seeding → flaky tests. Self-check: run tests 5 times; results must be identical.
- Exact float equality → brittle. Self-check: use allclose/tolerances.
- Oversized fixtures → slow CI. Self-check: count rows; target tens, not thousands.
- Testing external services live in unit tests → fragile. Self-check: replace with mocks/fakes; move live checks to dedicated environments.
- Smoke tests asserting strict accuracy → unnecessary. Self-check: smoke only checks run/shape/latency.
Learning path
- Start: Write 5–10 unit tests for core transforms and metrics.
- Next: One integration test for the smallest end-to-end path.
- Then: One smoke test per deployable (batch job, API).
- Finally: Add schema checks and artifact validation to the pipeline.
Practical projects
- Add a test pyramid to an existing ML repo: at least 10 unit, 2 integration, 1 smoke test. Measure and record total runtime.
- Create a reusable pytest fixture that generates a tiny tabular dataset and reuse it across 3 tests.
Next steps
- Introduce data schema validation in preprocessing tests.
- Add contract tests for your model API payloads (see the sketch after this list).
- Gate CI merges on passing unit and integration suites; run smoke after deploy.
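As a starting point for payload contract tests, a sketch that validates request and response shapes with plain assertions (the field names instances and predictions mirror the smoke example above; adapt them to your actual API):
# tests/contract/test_payload_contract.py (sketch; adapt field names to your API)
def validate_request(payload):
    assert isinstance(payload, dict) and "instances" in payload
    assert all(isinstance(row, list) and len(row) == 3 for row in payload["instances"])

def validate_response(body, n_instances):
    assert "predictions" in body
    assert len(body["predictions"]) == n_instances
    assert all(isinstance(p, (int, float)) for p in body["predictions"])

def test_contract_round_trip():
    request = {"instances": [[0.1, -0.2, 0.3], [0.0, 0.0, 0.1]]}
    validate_request(request)
    # Stub response; replace with a real call to your staging client.
    response = {"predictions": [0.12, 0.87]}
    validate_response(response, n_instances=len(request["instances"]))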
Mini challenge
Break your scaler (e.g., return None for NaN inputs) and watch the unit test fail. Fix the implementation until tests pass again. Add one extra edge-case test you did not have before.