Why this matters
In MLOps, tests protect you from costly failures like broken feature engineering, incompatible model artifacts, or an API that returns nonsense predictions after a change. You will use unit, integration, and smoke tests to:
- Fail fast on bugs in data transforms and metrics code.
- Verify that pipelines run end-to-end on a tiny sample before long training jobs consume budget.
- Ensure contracts: schema in, model out, consistent shapes and dtypes, and stable endpoints.
- Confidently merge changes and deploy models with automated gates in CI/CD.
Real tasks you’ll face
- Writing a unit test for a feature scaler that sometimes outputs NaNs.
- Adding an integration test that trains a small logistic regression and saves the model artifact.
- Running a smoke test on a staging endpoint after each deploy to ensure 200 OK and reasonable latency.
Concept explained simply
Three complementary layers of confidence:
- Unit tests: Test one small piece (e.g., function to impute missing values) in isolation. Fast (milliseconds), deterministic, lots of them.
- Integration tests: Test how pieces work together (e.g., load data → preprocess → train → evaluate → save model). Slower (seconds), fewer than unit tests.
- Smoke tests: Quick “does it basically run” checks in CI or post-deploy. Tiny data, minimal steps. Do not check model quality deeply—only safety and basic behavior.
Mental model
Think of a test funnel:
- Wide top (many unit tests) catches most errors cheaply.
- Middle (integration tests) ensures the system works together.
- Narrow bottom (smoke tests) prevents obvious production breakage.
Risk-based selection
Test where risk is highest: schema/feature code, versioned dependencies, serialization formats, and external services (object store, model registry). Use mocks/fakes for external services in unit tests; use temporary local resources for integration tests.
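For example, a unit test can exercise upload logic against a fake client instead of a live registry. A minimal sketch, assuming a hypothetical upload_model helper and a client exposing an upload(path, name) method (both names are illustrative, not a real SDK):
# tests/unit/test_registry_upload.py (illustrative; adapt to your own client API)
from unittest.mock import MagicMock

# Hypothetical helper under test: pushes a local artifact to a model registry.
def upload_model(client, path, name):
    client.upload(path, name)
    return f"{name}:uploaded"

def test_upload_model_uses_fake_client():
    fake_client = MagicMock()  # no network call, no real registry
    result = upload_model(fake_client, "model.pkl", "churn-model")
    fake_client.upload.assert_called_once_with("model.pkl", "churn-model")
    assert result == "churn-model:uploaded"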
ML-specific testing nuances
- Determinism: Set random seeds and fixed small inputs. Use small, in-repo fixtures.
- Numerical tolerance: Compare floats with tolerance (e.g., absolute or relative tolerance) instead of exact equality.
- Data contracts: Validate schema (columns, dtypes, ranges) before pipeline steps (see the sketch after this list).
- Artifacts: Check that models, metrics, and reports are created with expected names and minimal size.
- Performance budget: Keep CI fast; aim for the unit suite to finish in seconds, integration tests in under ~1–2 minutes, and smoke tests in under ~30 seconds.
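A minimal sketch tying several of these nuances together: a seeded fixture, a tolerant float comparison, and a lightweight schema check (it uses pandas, and the column names and ranges are illustrative, not from your project):
# tests/unit/test_nuances.py (illustrative; adapt columns and ranges to your data)
import numpy as np
import pandas as pd
import pytest

def make_fixture(seed=42):
    rng = np.random.default_rng(seed)  # fixed seed -> deterministic test
    return pd.DataFrame({
        "age": rng.integers(18, 90, size=10),
        "income": rng.normal(50_000, 10_000, size=10),
    })

def test_mean_with_tolerance():
    df = make_fixture()
    # Compare floats with a tolerance, never with exact equality.
    assert df["income"].mean() == pytest.approx(50_000, rel=0.3)

def test_schema_contract():
    df = make_fixture()
    assert list(df.columns) == ["age", "income"]   # expected columns
    assert df["age"].between(18, 90).all()         # value range
    assert str(df["age"].dtype).startswith("int")  # dtype check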
Worked examples
Example 1: Unit test for a scaler
Goal: Ensure min-max scaler handles NaNs and constant columns.
# src/features/scaler.py
import numpy as np

def minmax_scale(x):
    x = np.array(x, dtype=float)
    mask = np.isnan(x)
    if np.all(mask):
        return np.zeros_like(x)
    x_no_nan = x[~mask]
    lo, hi = np.min(x_no_nan), np.max(x_no_nan)
    if hi == lo:
        out = np.zeros_like(x)
        out[~mask] = 0.0
        out[mask] = 0.0
        return out
    out = (x - lo) / (hi - lo)
    out[mask] = 0.0
    return out
# tests/unit/test_scaler.py
import numpy as np
from src.features.scaler import minmax_scale

def test_minmax_basic():
    got = minmax_scale([0, 5, 10])
    assert np.allclose(got, [0.0, 0.5, 1.0])

def test_minmax_nans_and_constant():
    got = minmax_scale([np.nan, 3, 3, np.nan])
    assert np.allclose(got, [0.0, 0.0, 0.0, 0.0])
What it proves: determinism, safe NaN handling, and contract (same length in, same length out).
Example 2: Integration test for a tiny pipeline
Goal: Run preprocess → train → evaluate → save artifact on a 20-row synthetic dataset.
# src/pipeline/train_small.py
import os, json, joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def make_data(n=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X, y

def run(output_dir):
    os.makedirs(output_dir, exist_ok=True)
    X, y = make_data()
    clf = LogisticRegression(max_iter=200)
    clf.fit(X, y)
    preds = clf.predict(X)
    acc = accuracy_score(y, preds)
    joblib.dump(clf, os.path.join(output_dir, "model.pkl"))
    with open(os.path.join(output_dir, "metrics.json"), "w") as f:
        json.dump({"accuracy": float(acc)}, f)
    return acc
# tests/integration/test_pipeline_small.py
import json, os, tempfile
from src.pipeline.train_small import run

def test_pipeline_runs_and_saves_artifacts():
    with tempfile.TemporaryDirectory() as d:
        acc = run(d)
        assert 0.5 <= acc <= 1.0  # lenient threshold for tiny data
        assert os.path.exists(os.path.join(d, "model.pkl"))
        with open(os.path.join(d, "metrics.json")) as f:
            metrics = json.load(f)
        assert "accuracy" in metrics
What it proves: components work together, artifacts exist, metrics are within a reasonable range.
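If you also want to guard against truncated or empty artifacts (the "minimal size" nuance above), a small follow-up check is enough; a sketch reusing the same run() helper:
# tests/integration/test_artifact_size.py (sketch; reuses run() from the example above)
import os, tempfile
from src.pipeline.train_small import run

def test_model_artifact_is_not_empty():
    with tempfile.TemporaryDirectory() as d:
        run(d)
        # A pickled model should be comfortably larger than a few bytes.
        assert os.path.getsize(os.path.join(d, "model.pkl")) > 100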
Example 3: Smoke test after deploy
Goal: Check that an inference API responds and that the response shape is correct. This is a pseudo-example you can adapt to your stack.
# tests/smoke/test_inference_smoke.py
import time

def test_inference_smoke():
    # Replace with your client; keep it tiny and fast.
    # Pseudo-client shown as example only.
    def predict(payload):
        # e.g., requests.post(...).json()
        # Here we emulate a fast response.
        return {"predictions": [0.12, 0.87]}

    start = time.time()
    resp = predict({"instances": [[0.1, -0.2, 0.3], [0.0, 0.0, 0.1]]})
    elapsed = time.time() - start
    assert "predictions" in resp
    preds = resp["predictions"]
    assert isinstance(preds, list) and len(preds) == 2
    assert elapsed < 0.5  # sanity latency budget for smoke
What it proves: endpoint contract and latency budget. It does not validate model quality.
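If your stack exposes an HTTP endpoint, the pseudo-client can be swapped for a real call. A sketch assuming the staging URL is passed via a STAGING_URL environment variable and a /predict route (both are assumptions; match them to your platform):
# tests/smoke/test_inference_http.py (sketch; assumes a JSON API at $STAGING_URL/predict)
import os
import time

import pytest
import requests

@pytest.mark.skipif("STAGING_URL" not in os.environ, reason="no staging endpoint configured")
def test_inference_http_smoke():
    url = os.environ["STAGING_URL"].rstrip("/") + "/predict"  # assumed route
    payload = {"instances": [[0.1, -0.2, 0.3]]}
    start = time.time()
    resp = requests.post(url, json=payload, timeout=5)
    elapsed = time.time() - start
    assert resp.status_code == 200
    body = resp.json()
    assert "predictions" in body and len(body["predictions"]) == 1
    assert elapsed < 2.0  # generous latency budget for a remote call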
How to structure tests and run in CI
project/
  src/
    features/
    pipeline/
  tests/
    unit/
    integration/
    smoke/
- Run order in CI: unit → integration → smoke. Fail early to save time.
- Use small, synthetic fixtures checked into the repo (a reusable fixture is sketched below).
- Cache dependencies. Keep tests parallelizable and independent.
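A reusable fixture usually lives in tests/conftest.py so every test can request it by name; a minimal sketch with an illustrative 20-row dataset:
# tests/conftest.py (shared pytest fixtures; dataset shape is illustrative)
import numpy as np
import pytest

@pytest.fixture
def tiny_dataset():
    """Deterministic 20-row dataset shared across unit and integration tests."""
    rng = np.random.default_rng(123)
    X = rng.normal(size=(20, 3))
    y = (X[:, 0] > 0).astype(int)
    return X, y

# Any test can use it by naming it as an argument:
# def test_shapes(tiny_dataset):
#     X, y = tiny_dataset
#     assert X.shape == (20, 3) and y.shape == (20,)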
Typical CI commands
- Unit: pytest tests/unit -q
- Integration: pytest tests/integration -q -k small
- Smoke: pytest tests/smoke -q
Exercises
Do these in your project or a fresh folder. They mirror the graded exercises below.
Exercise 1: Unit test a data transform
- Create src/features/scaler.py with a safe min-max scaler handling NaNs and constant columns.
- Write tests in tests/unit/test_scaler.py for typical, NaN-heavy, and constant inputs.
- Run pytest and make all tests pass.
Starter snippet
# src/features/scaler.py
# Implement minmax_scale(x)
Exercise 2: Integration test a tiny training pipeline
- Create src/pipeline/train_small.py that generates tiny synthetic data, trains a lightweight model, and writes model.pkl and metrics.json.
- Write tests in tests/integration/test_pipeline_small.py that assert: training succeeds, accuracy is within [0.5, 1.0], and artifacts exist.
- Keep runtime under 10 seconds.
Quality checklist
- All tests are deterministic (fixed seeds).
- No network calls in unit/integration tests.
- Artifacts saved to a temporary directory.
- Numerical checks use tolerances where needed.
Common mistakes and how to self-check
- No seeding → flaky tests. Self-check: run tests 5 times; results must be identical.
- Exact float equality → brittle. Self-check: use allclose/tolerances.
- Oversized fixtures → slow CI. Self-check: count rows; target tens, not thousands.
- Testing external services live in unit tests → fragile. Self-check: replace with mocks/fakes; move live checks to dedicated environments.
- Smoke tests asserting strict accuracy → unnecessary. Self-check: smoke only checks run/shape/latency.
Learning path
- Start: Write 5–10 unit tests for core transforms and metrics.
- Next: One integration test for the smallest end-to-end path.
- Then: One smoke test per deployable (batch job, API).
- Finally: Add schema checks and artifact validation to the pipeline.
Practical projects
- Add a test pyramid to an existing ML repo: at least 10 unit, 2 integration, 1 smoke test. Measure and record total runtime.
- Create a reusable pytest fixture that generates a tiny tabular dataset and reuse it across 3 tests.
Next steps
- Introduce data schema validation in preprocessing tests.
- Add contract tests for your model API payloads (see the sketch after this list).
- Gate CI merges on passing unit and integration suites; run smoke after deploy.
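As a starting point for payload contract tests, a sketch that validates request and response shapes with plain assertions (the field names instances and predictions mirror the smoke example above; adapt them to your actual API):
# tests/contract/test_payload_contract.py (sketch; adapt field names to your API)
def validate_request(payload):
    assert isinstance(payload, dict) and "instances" in payload
    assert all(isinstance(row, list) and len(row) == 3 for row in payload["instances"])

def validate_response(body, n_instances):
    assert "predictions" in body
    assert len(body["predictions"]) == n_instances
    assert all(isinstance(p, (int, float)) for p in body["predictions"])

def test_contract_round_trip():
    request = {"instances": [[0.1, -0.2, 0.3], [0.0, 0.0, 0.1]]}
    validate_request(request)
    # Stub response; replace with a real call to your staging client.
    response = {"predictions": [0.12, 0.87]}
    validate_response(response, n_instances=len(request["instances"]))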
Mini challenge
Break your scaler (e.g., return None for NaN inputs) and watch the unit test fail. Fix the implementation until tests pass again. Add one extra edge-case test you did not have before.