
Reproducibility Principles

Learn Reproducibility Principles for free with explanations, exercises, and a quick test (for MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Who this is for

This lesson is for aspiring and practicing MLOps Engineers, ML engineers, and data scientists who need to make ML experiments and pipelines repeatable across teammates, machines, and time.

Prerequisites

  • Basic Python knowledge
  • Familiarity with virtual environments (venv/conda) or containers
  • Awareness of ML training workflows (data, preprocessing, model, metrics)

Why this matters

In real MLOps work you will:

  • Debug model regressions months later and must re-run the exact experiment
  • Hand off models to Ops and auditors who require evidence of data, code, and parameters used
  • Scale experiments across CI runners and cloud machines consistently
  • Patch security issues without changing model behavior unintentionally

Reproducibility prevents “works on my machine” surprises and enables trust in models.

Concept explained simply

Reproducibility means another person (or your future self) can re-run an ML pipeline and get the same results using the same inputs. It requires controlling four things:

  • Code: versioned and immutable at run time
  • Environment: same OS, libraries, and hardware settings
  • Data: versioned datasets and deterministic splits
  • Configuration: explicit hyperparameters, seeds, and feature flags

Mental model

Think of a model run as a recipe card. If the card precisely lists ingredients (data versions), utensils (environment), and steps with exact measurements and timing (config and code), any cook (engineer) can produce the same dish (model) repeatedly.

Core reproducibility principles

  • Immutable inputs: Pin versions for data and dependencies; never rely on “latest”.
  • Single source of truth: Store run config in a file (YAML/JSON). Treat CLI flags as overrides that are captured back into the run record.
  • Determinism: Set global seeds and control non-deterministic ops where possible.
  • Provenance tracking: Capture code commit, data version/hash, config, environment, and outputs together.
  • Isolation: Use virtual envs or containers so each run is insulated from host changes.

Tip: Minimal viable reproducibility (MVR) bundle
  • requirements.txt or environment.yml with pinned versions
  • config.yaml with all hyperparameters and seeds
  • git commit SHA
  • data fingerprint (version tag, path, or hash)
  • run log with metrics and artifact paths
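
As a sketch of how these pieces can be captured together, assuming hypothetical file names such as config.yaml and data_signature.json sitting next to the training script, a run record might be assembled like this:

import hashlib
import json
import pathlib
import subprocess
import sys

def sha256_of(path: str) -> str:
    # Hash a file's contents so the record proves exactly which bytes were used
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

run_record = {
    "git_sha": subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip(),
    "python": sys.version.split()[0],
    "config_sha256": sha256_of("config.yaml"),                  # hypothetical config file
    "data_signature_sha256": sha256_of("data_signature.json"),  # hypothetical data fingerprint
    "requirements_sha256": sha256_of("requirements.txt"),
}
pathlib.Path("run_record.json").write_text(json.dumps(run_record, indent=2))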

Worked examples

Example 1 — Reproducible environment

Goal: ensure the same libraries are installed every time.

  1. Create a virtual environment.
  2. Pin packages: numpy==1.26.4, pandas==2.1.4, scikit-learn==1.3.2 (example versions).
  3. Record Python version (e.g., 3.10.x).

Example files:

requirements.txt

numpy==1.26.4
pandas==2.1.4
scikit-learn==1.3.2

runtime.txt (optional)

python-3.10.13

Run: python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
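
To capture what actually got installed, not just what was requested, one option is to snapshot the resolved environment at run time. A minimal sketch, assuming the run writes into a hypothetical artifacts/ folder:

import pathlib
import subprocess
import sys

artifacts = pathlib.Path("artifacts")  # hypothetical run output folder
artifacts.mkdir(exist_ok=True)

# Record the interpreter version actually in use
(artifacts / "python_version.txt").write_text(sys.version + "\n")

# Freeze the resolved package set (exact versions after dependency resolution)
frozen = subprocess.check_output([sys.executable, "-m", "pip", "freeze"], text=True)
(artifacts / "pip_freeze.txt").write_text(frozen)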

Example 2 — Reproducible data

Goal: always train on the same data snapshot.

  1. Store datasets with a version tag (e.g., s3://bucket/data/2024-09-01/).
  2. Fingerprint the data with a hash of file contents or row count + checksum.
  3. Log the data version/hash in the run record.

Example data signature (JSON):
{
  "data_version": "2024-09-01",
  "files": [
    {"name": "train.csv", "sha256": "b3f..."},
    {"name": "valid.csv", "sha256": "a1d..."}
  ],
  "n_rows": 120345
}
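
One way to produce such a signature, sketched here under the assumption that the snapshot has been copied to a local data/ directory of CSV files, is to hash every file and count data rows:

import hashlib
import json
import pathlib

def sha256_of(path: pathlib.Path) -> str:
    # Stream the file in chunks so large datasets do not need to fit in memory
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

data_dir = pathlib.Path("data")  # hypothetical local copy of the snapshot
files = sorted(data_dir.glob("*.csv"))
signature = {
    "data_version": "2024-09-01",
    "files": [{"name": p.name, "sha256": sha256_of(p)} for p in files],
    # Total row count across all CSVs, excluding one header line per file
    "n_rows": sum(sum(1 for _ in p.open()) - 1 for p in files),
}
pathlib.Path("data_signature.json").write_text(json.dumps(signature, indent=2))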

Example 3 — Reproducible training run

Goal: deterministic model training with tracked config.

  1. Put all params in config.yaml (model type, hyperparameters, seed, features); see the example config below.
  2. Set seeds for Python, NumPy, and framework (when applicable).
  3. Record run metadata: git SHA, config hash, data hash, environment lockfile.
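
An illustrative config.yaml is shown below; every value is hypothetical, and the point is simply that nothing about the run lives only in someone's head or shell history:

model: random_forest
seed: 42
data_version: "2024-09-01"
features: [age, income, tenure_months]
hyperparameters:
  n_estimators: 200
  max_depth: 8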

Minimal Python seed pattern:
import os
import random

import numpy as np

SEED = 42
# Note: PYTHONHASHSEED only takes effect if set before the interpreter starts;
# setting it here records the value and passes it on to any subprocesses.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's global RNG
# For frameworks: e.g., torch.manual_seed(SEED) if used
# Load config.yaml, then serialize to canonical JSON and hash it for run_id
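
The run_id idea in the last comment could be sketched as follows, assuming PyYAML is available and the config lives in config.yaml:

import hashlib
import json
import yaml  # PyYAML, assumed to be among the pinned dependencies

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Canonical JSON (sorted keys, fixed separators) makes the same config always
# serialize to the same bytes, regardless of key order in the YAML file
canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
run_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
print(f"run_id={run_id} seed={config.get('seed')}")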

Quick reproducibility checklist

  • All dependencies are pinned to exact versions (no loose specifiers such as >=, ~=, or *).
  • Python and OS base image recorded.
  • Data snapshot/version and checksum stored.
  • Global seed(s) are set and logged.
  • Config file saved with run artifacts.
  • Git commit SHA recorded; dirty working tree disallowed or captured (see the pre-run check sketch after this list).
  • Artifacts stored in folders named with run identifiers.
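
A pre-run check along those lines might look like the sketch below; it shells out to plain git commands and assumes the training script runs from inside the repository:

import subprocess

def git_state() -> str:
    # The exact commit this run is built from
    sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    # Non-empty porcelain output means uncommitted changes (a dirty working tree)
    dirty = subprocess.check_output(["git", "status", "--porcelain"], text=True).strip()
    if dirty:
        raise RuntimeError("Refusing to run: uncommitted changes found. Commit or stash them first.")
    return sha

print("git_sha:", git_state())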

Exercises you can do now

Do these to solidify the habit. They mirror the graded exercises below.

Exercise 1 — Pin a minimal environment

  1. Start from unpinned dependencies: numpy, pandas, scikit-learn.
  2. Choose specific, recent stable versions and create a requirements.txt with exact pins.
  3. Add your Python version in a runtime.txt (optional but recommended).

Hint: Use the format package==X.Y.Z and avoid wildcards or loose specifiers such as >=.

Exercise 2 — Deterministic seed setup

  1. Create a snippet that sets PYTHONHASHSEED, random.seed, and numpy.random.seed to the same integer.
  2. Print the seed value at start of your script for traceability.
  3. Optionally, add comments for framework-specific seeds (e.g., torch, tensorflow).

Hint: PYTHONHASHSEED only affects Python's hash randomization if it is set before the interpreter starts (for example, exported in the shell or the CI job); setting it inside a running script mainly records the value and passes it to subprocesses.

Common mistakes and how to self-check

  • Unpinned deps: You see surprise behavior after a fresh install. Self-check: run pip freeze and verify exact versions are present in your lock file.
  • No data version: You cannot prove which rows trained the model. Self-check: your run record must include a data version/hash.
  • Missing seeds: Results change across runs. Self-check: ensure seeds are set once at process start.
  • Config drift: CLI overrides not captured. Self-check: write the final, merged config to the artifacts directory (see the sketch after this list).
  • Dirty code state: Local edits not committed. Self-check: block runs on dirty repos or record the diff in the run metadata.
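
One way to capture that final, merged config, sketched here with a single hypothetical --learning-rate override and PyYAML assumed available:

import argparse
import json
import pathlib
import yaml  # PyYAML, assumed pinned in requirements

parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", type=float, default=None)  # hypothetical override
args = parser.parse_args()

with open("config.yaml") as f:
    config = yaml.safe_load(f)
if args.learning_rate is not None:
    config["learning_rate"] = args.learning_rate  # the CLI override wins, and is recorded

# Write the exact config the run actually used next to its other artifacts
artifacts = pathlib.Path("artifacts")
artifacts.mkdir(exist_ok=True)
(artifacts / "resolved_config.json").write_text(json.dumps(config, indent=2, sort_keys=True))
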
Debugging tip: reproducibility triangle

If results differ, check in order: (1) environment, (2) data, (3) config/code. Most issues are environment mismatches or unnoticed data drift.

Practical projects

  • Reproducible baseline: Build a small classification pipeline that saves a run folder containing config.yaml, requirements.txt, git_sha.txt, data_signature.json, metrics.json, and model.pkl.
  • Data version demo: Create two dataset snapshots with minor differences. Show how changing only the data_version reproduces or changes results predictably.
  • Seed audit: Run the same training 5 times with seeds set and 5 times without. Record variance to demonstrate determinism benefits.
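
The seed audit can be prototyped quickly with scikit-learn; here is a sketch using a synthetic dataset, where every name and number is illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

def run_once(seed=None):
    # seed=None leaves the split and the model non-deterministic; an integer fixes both
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

seeded = [run_once(seed=42) for _ in range(5)]
unseeded = [run_once() for _ in range(5)]
print("seeded   std:", np.std(seeded))    # identical runs, so variance should be zero
print("unseeded std:", np.std(unseeded))  # typically small but nonzero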

Learning path

  • Start: Reproducibility Principles (this lesson)
  • Next: Environment management and containerization
  • Then: Data versioning and lineage
  • Finally: Experiment tracking and CI/CD integration

Next steps

  • Adopt a standard run folder structure across your team.
  • Add a pre-run check that validates pins, seeds, and clean git state.
  • Automate artifact collection in your training script.

Mini challenge

Take an old notebook that produces slightly different results each run. Convert it into a script that reads config.yaml, sets seeds, writes artifacts, and re-runs deterministically.
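
A skeleton for that conversion is sketched below; the train() body, file names, and metric are placeholders to swap for the notebook's real logic, and PyYAML is assumed available:

import json
import pathlib
import random

import numpy as np
import yaml  # PyYAML, assumed pinned

def set_seeds(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)

def train(config: dict) -> dict:
    # Placeholder: move the notebook's training code here
    return {"accuracy": 0.0}

def main() -> None:
    with open("config.yaml") as f:
        config = yaml.safe_load(f)
    set_seeds(config["seed"])

    metrics = train(config)

    artifacts = pathlib.Path("artifacts")
    artifacts.mkdir(exist_ok=True)
    (artifacts / "resolved_config.json").write_text(json.dumps(config, indent=2, sort_keys=True))
    (artifacts / "metrics.json").write_text(json.dumps(metrics, indent=2))

if __name__ == "__main__":
    main()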

Quick Test

Take the short test to check your understanding. Available to everyone for free; only logged-in users get saved progress.

Practice Exercises

2 exercises to complete

Instructions

You are given unpinned dependencies for a small scikit-learn project:

numpy
pandas
scikit-learn

  • Create a requirements.txt with exact versions (choose stable recent ones).
  • Optionally add a runtime.txt to specify Python 3.10.x.
  • Write one sentence explaining why exact pins matter for ML reproducibility.

Expected Output
A requirements.txt with exact version pins for numpy, pandas, and scikit-learn, and an optional runtime.txt specifying Python 3.10.x. A brief explanation sentence about why pinning ensures consistent behavior.

Reproducibility Principles — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

