Who this is for
This lesson is for aspiring and practicing MLOps Engineers, ML engineers, and data scientists who need to make ML experiments and pipelines repeatable across teammates, machines, and time.
Prerequisites
- Basic Python knowledge
- Familiarity with virtual environments (venv/conda) or containers
- Awareness of ML training workflows (data, preprocessing, model, metrics)
Why this matters
In real MLOps work you will:
- Debug model regressions months later and must re-run the exact experiment
- Hand off models to Ops and auditors who require evidence of data, code, and parameters used
- Scale experiments across CI runners and cloud machines consistently
- Patch security issues without unintentionally changing model behavior
Reproducibility prevents “works on my machine” surprises and enables trust in models.
Concept explained simply
Reproducibility means another person (or your future self) can re-run an ML pipeline and get the same results using the same inputs. It requires controlling four things:
- Code: versioned and immutable at run time
- Environment: same OS, libraries, and hardware settings
- Data: versioned datasets and deterministic splits
- Configuration: explicit hyperparameters, seeds, and feature flags
Mental model
Think of a model run as a recipe card. If the card precisely lists ingredients (data versions), utensils (environment), and steps with exact measurements and timing (config and code), any cook (engineer) can produce the same dish (model) repeatedly.
Core reproducibility principles
- Immutable inputs: Pin versions for data and dependencies; never rely on “latest”.
- Single source of truth: Store run config in a file (YAML/JSON). Treat CLI flags as overrides that are captured back into the run record (see the sketch after this list).
- Determinism: Set global seeds and control non-deterministic ops where possible.
- Provenance tracking: Capture code commit, data version/hash, config, environment, and outputs together.
- Isolation: Use virtual envs or containers so each run is insulated from host changes.
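To make the "single source of truth" and "provenance tracking" principles concrete, here is a minimal sketch of loading a config file, applying CLI overrides, and writing the merged result next to the run artifacts. The file names config.yaml and runs/example and the --override flag are illustrative assumptions; PyYAML is assumed to be installed.
Show config merge sketch
import argparse, json, pathlib
import yaml  # PyYAML, assumed installed

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="config.yaml")
parser.add_argument("--override", action="append", default=[],
                    help="key=value overrides, e.g. --override lr=0.01")
args = parser.parse_args()

config = yaml.safe_load(pathlib.Path(args.config).read_text())
for item in args.override:
    key, value = item.split("=", 1)
    config[key] = yaml.safe_load(value)  # parse numbers/bools the YAML way

run_dir = pathlib.Path("runs/example")  # illustrative run folder
run_dir.mkdir(parents=True, exist_ok=True)
# The merged config, not the original file, is what gets archived with the run.
(run_dir / "config_resolved.json").write_text(json.dumps(config, indent=2, sort_keys=True))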
Tip: Minimal viable reproducibility (MVR) bundle
- requirements.txt or environment.yml with pinned versions
- config.yaml with all hyperparameters and seeds
- git commit SHA
- data fingerprint (version tag, path, or hash)
- run log with metrics and artifact paths
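Two of the bundle items above (the git commit SHA and the pinned environment) can be captured automatically at the start of a run. A minimal sketch, assuming git and pip are on the PATH and the illustrative runs/example layout:
Show run metadata capture sketch
import pathlib, subprocess

run_dir = pathlib.Path("runs/example")  # illustrative run folder
run_dir.mkdir(parents=True, exist_ok=True)

# Record the exact commit the run was launched from.
git_sha = subprocess.run(["git", "rev-parse", "HEAD"],
                         capture_output=True, text=True, check=True).stdout.strip()
(run_dir / "git_sha.txt").write_text(git_sha + "\n")

# Freeze the installed package set as seen by this interpreter.
frozen = subprocess.run(["pip", "freeze"],
                        capture_output=True, text=True, check=True).stdout
(run_dir / "requirements.lock").write_text(frozen)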
Worked examples
Example 1 — Reproducible environment
Goal: ensure the same libraries are installed every time.
- Create a virtual environment.
- Pin packages: numpy==1.26.4, pandas==2.1.4, scikit-learn==1.3.2 (example versions).
- Record Python version (e.g., 3.10.x).
Show example files
requirements.txt
numpy==1.26.4
pandas==2.1.4
scikit-learn==1.3.2
runtime.txt (optional)
python-3.10.13
Run: python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
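Optionally, after installing, snapshot the fully resolved environment (including transitive dependencies) so a later rebuild uses exactly the same set:
Run: pip freeze > requirements.lock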
Example 2 — Reproducible data
Goal: always train on the same data snapshot.
- Store datasets with a version tag (e.g., s3://bucket/data/2024-09-01/).
- Fingerprint the data with a hash of file contents or row count + checksum.
- Log the data version/hash in the run record.
Show data signature idea
{
"data_version": "2024-09-01",
"files": [
{"name": "train.csv", "sha256": "b3f..."},
{"name": "valid.csv", "sha256": "a1d..."}
],
"n_rows": 120345
}
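A minimal sketch that produces a signature like the one above by hashing each file in a snapshot folder. The data/2024-09-01 path and the *.csv pattern are illustrative assumptions.
Show data fingerprint sketch
import hashlib, json, pathlib

def sha256_of(path: pathlib.Path) -> str:
    # Stream the file in chunks so large datasets do not need to fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

data_dir = pathlib.Path("data/2024-09-01")  # illustrative snapshot path
signature = {
    "data_version": data_dir.name,
    "files": [{"name": p.name, "sha256": sha256_of(p)}
              for p in sorted(data_dir.glob("*.csv"))],
}
pathlib.Path("data_signature.json").write_text(json.dumps(signature, indent=2))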
Example 3 — Reproducible training run
Goal: deterministic model training with tracked config.
- Put all params in config.yaml (model type, hyperparameters, seed, features).
- Set seeds for Python, NumPy, and framework (when applicable).
- Record run metadata: git SHA, config hash, data hash, environment lockfile.
Show minimal Python seed pattern
import os, random, numpy as np
SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)  # affects subprocesses; set it in the shell before launch to fix hashing in this process
random.seed(SEED)
np.random.seed(SEED)
# For frameworks: e.g., torch.manual_seed(SEED) if used
# Load config.yaml, then serialize to canonical JSON and hash it for run_id
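The last comment above can be implemented in a few lines. A minimal sketch, assuming PyYAML is installed and a config.yaml exists in the working directory:
Show config hash sketch
import hashlib, json
import yaml  # PyYAML, assumed installed

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Canonical JSON (sorted keys, no extra whitespace) so the same config always hashes identically.
canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
run_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
print(f"run_id={run_id} seed={config.get('seed')}")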
Quick reproducibility checklist
- All dependencies are pinned to exact versions (no loose specifiers like >= or ~=, no wildcards).
- Python and OS base image recorded.
- Data snapshot/version and checksum stored.
- Global seed(s) are set and logged.
- Config file saved with run artifacts.
- Git commit SHA recorded; dirty working tree disallowed or captured.
- Artifacts stored in folders named with run identifiers.
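Several of these checks can be automated as a pre-run guard that refuses to start training when the basics are missing. A minimal sketch covering two items, pinned requirements and a clean git working tree; the requirements.txt name is an assumption:
Show pre-run check sketch
import pathlib, subprocess, sys

# Fail if any requirement line is not pinned to an exact version.
lines = pathlib.Path("requirements.txt").read_text().splitlines()
unpinned = [line for line in lines
            if line.strip() and not line.startswith("#") and "==" not in line]
if unpinned:
    sys.exit(f"Unpinned dependencies found: {unpinned}")

# Fail if the working tree has uncommitted changes.
status = subprocess.run(["git", "status", "--porcelain"],
                        capture_output=True, text=True, check=True).stdout
if status.strip():
    sys.exit("Dirty working tree: commit or stash changes before running.")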
Exercises you can do now
Do these to solidify the habit. They mirror the graded exercises below.
Exercise 1 — Pin a minimal environment
- Start from unpinned dependencies: numpy, pandas, scikit-learn.
- Choose specific, recent stable versions and create a requirements.txt with exact pins.
- Add your Python version in a runtime.txt (optional but recommended).
Mini task hint
Use the format package==X.Y.Z. Avoid wildcards.
Exercise 2 — Deterministic seed setup
- Create a snippet that sets PYTHONHASHSEED, random.seed, and numpy.random.seed to the same integer.
- Print the seed value at the start of your script for traceability.
- Optionally, add comments for framework-specific seeds (e.g., torch, tensorflow).
Mini task hint
PYTHONHASHSEED only takes effect if it is set before the Python interpreter starts (for example, in the shell or the job launcher); setting it inside a running script affects only subprocesses.
Common mistakes and how to self-check
- Unpinned deps: You see surprise behavior after a fresh install. Self-check: run pip freeze and verify exact versions are present in your lock file.
- No data version: You cannot prove which rows trained the model. Self-check: your run record must include a data version/hash.
- Missing seeds: Results change across runs. Self-check: ensure seeds are set once at process start.
- Config drift: CLI overrides not captured. Self-check: write final, merged config to the artifacts directory.
- Dirty code state: Local edits not committed. Self-check: block runs on dirty repos or record the diff in the run metadata.
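For the last point, one option is to record the uncommitted diff alongside the run instead of blocking it. A minimal sketch, assuming the illustrative runs/example folder:
Show diff capture sketch
import pathlib, subprocess

run_dir = pathlib.Path("runs/example")  # illustrative run folder
run_dir.mkdir(parents=True, exist_ok=True)

# Save any uncommitted changes so the exact code state can be reconstructed later.
diff = subprocess.run(["git", "diff", "HEAD"],
                      capture_output=True, text=True, check=True).stdout
if diff:
    (run_dir / "uncommitted.diff").write_text(diff)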
Debugging tip: reproducibility triangle
If results differ, check in order: (1) environment, (2) data, (3) config/code. Most issues are environment mismatches or unnoticed data drift.
Practical projects
- Reproducible baseline: Build a small classification pipeline that saves a run folder containing config.yaml, requirements.txt, git_sha.txt, data_signature.json, metrics.json, and model.pkl.
- Data version demo: Create two dataset snapshots with minor differences. Show how changing only the data_version reproduces or changes results predictably.
- Seed audit: Run the same training 5 times with seeds set and 5 times without. Record variance to demonstrate determinism benefits.
Learning path
- Start: Reproducibility Principles (this lesson)
- Next: Environment management and containerization
- Then: Data versioning and lineage
- Finally: Experiment tracking and CI/CD integration
Next steps
- Adopt a standard run folder structure across your team.
- Add a pre-run check that validates pins, seeds, and clean git state.
- Automate artifact collection in your training script.
Mini challenge
Take an old notebook that produces slightly different results each run. Convert it into a script that reads config.yaml, sets seeds, writes artifacts, and re-runs deterministically.
Quick Test
Take the short test to check your understanding.