Who this is for
This lesson is for aspiring and practicing MLOps Engineers, ML engineers, and data scientists who need to make ML experiments and pipelines repeatable across teammates, machines, and time.
Prerequisites
- Basic Python knowledge
- Familiarity with virtual environments (venv/conda) or containers
- Awareness of ML training workflows (data, preprocessing, model, metrics)
Why this matters
In real MLOps work you will:
- Debug model regressions months later and must re-run the exact experiment
- Hand off models to Ops and auditors who require evidence of data, code, and parameters used
- Scale experiments across CI runners and cloud machines consistently
- Patch security issues without unintentionally changing model behavior
Reproducibility prevents “works on my machine” surprises and enables trust in models.
Concept explained simply
Reproducibility means another person (or your future self) can re-run an ML pipeline and get the same results using the same inputs. It requires controlling four things:
- Code: versioned and immutable at run time
- Environment: same OS, libraries, and hardware settings
- Data: versioned datasets and deterministic splits
- Configuration: explicit hyperparameters, seeds, and feature flags
Mental model
Think of a model run as a recipe card. If the card precisely lists ingredients (data versions), utensils (environment), and steps with exact measurements and timing (config and code), any cook (engineer) can produce the same dish (model) repeatedly.
Core reproducibility principles
- Immutable inputs: Pin versions for data and dependencies; never rely on “latest”.
- Single source of truth: Store run config in a file (YAML/JSON). Treat CLI flags as overrides that are captured back into the run record (see the sketch after this list).
- Determinism: Set global seeds and control non-deterministic ops where possible.
- Provenance tracking: Capture code commit, data version/hash, config, environment, and outputs together.
- Isolation: Use virtual envs or containers so each run is insulated from host changes.
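To make the "single source of truth" and "provenance tracking" principles concrete, here is a minimal sketch of loading a config file, applying CLI overrides, and writing the merged result next to the run artifacts. The file names config.yaml and runs/example and the --override flag are illustrative assumptions; PyYAML is assumed to be installed.
Show config merge sketch
import argparse, json, pathlib
import yaml  # PyYAML, assumed installed

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="config.yaml")
parser.add_argument("--override", action="append", default=[],
                    help="key=value overrides, e.g. --override lr=0.01")
args = parser.parse_args()

config = yaml.safe_load(pathlib.Path(args.config).read_text())
for item in args.override:
    key, value = item.split("=", 1)
    config[key] = yaml.safe_load(value)  # parse numbers/bools the YAML way

run_dir = pathlib.Path("runs/example")  # illustrative run folder
run_dir.mkdir(parents=True, exist_ok=True)
# The merged config, not the original file, is what gets archived with the run.
(run_dir / "config_resolved.json").write_text(json.dumps(config, indent=2, sort_keys=True))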
Tip: Minimal viable reproducibility (MVR) bundle
- requirements.txt or environment.yml with pinned versions
- config.yaml with all hyperparameters and seeds
- git commit SHA
- data fingerprint (version tag, path, or hash)
- run log with metrics and artifact paths
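Two of the bundle items above (the git commit SHA and the pinned environment) can be captured automatically at the start of a run. A minimal sketch, assuming git and pip are on the PATH and the illustrative runs/example layout:
Show run metadata capture sketch
import pathlib, subprocess

run_dir = pathlib.Path("runs/example")  # illustrative run folder
run_dir.mkdir(parents=True, exist_ok=True)

# Record the exact commit the run was launched from.
git_sha = subprocess.run(["git", "rev-parse", "HEAD"],
                         capture_output=True, text=True, check=True).stdout.strip()
(run_dir / "git_sha.txt").write_text(git_sha + "\n")

# Freeze the installed package set as seen by this interpreter.
frozen = subprocess.run(["pip", "freeze"],
                        capture_output=True, text=True, check=True).stdout
(run_dir / "requirements.lock").write_text(frozen)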
Worked examples
Example 1 — Reproducible environment
Goal: ensure the same libraries are installed every time.
- Create a virtual environment.
- Pin packages: numpy==1.26.4, pandas==2.1.4, scikit-learn==1.3.2 (example versions).
- Record Python version (e.g., 3.10.x).
Show example files
requirements.txt
numpy==1.26.4
pandas==2.1.4
scikit-learn==1.3.2
runtime.txt (optional)
python-3.10.13
Run: python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
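Optionally, after installing, snapshot the fully resolved environment (including transitive dependencies) so a later rebuild uses exactly the same set:
Run: pip freeze > requirements.lock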
Example 2 — Reproducible data
Goal: always train on the same data snapshot.
- Store datasets with a version tag (e.g., s3://bucket/data/2024-09-01/).
- Fingerprint the data with a hash of file contents or row count + checksum.
- Log the data version/hash in the run record.
Show data signature idea
{
"data_version": "2024-09-01",
"files": [
{"name": "train.csv", "sha256": "b3f..."},
{"name": "valid.csv", "sha256": "a1d..."}
],
"n_rows": 120345
}
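A minimal sketch that produces a signature like the one above by hashing each file in a snapshot folder. The data/2024-09-01 path and the *.csv pattern are illustrative assumptions.
Show data fingerprint sketch
import hashlib, json, pathlib

def sha256_of(path: pathlib.Path) -> str:
    # Stream the file in chunks so large datasets do not need to fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

data_dir = pathlib.Path("data/2024-09-01")  # illustrative snapshot path
signature = {
    "data_version": data_dir.name,
    "files": [{"name": p.name, "sha256": sha256_of(p)}
              for p in sorted(data_dir.glob("*.csv"))],
}
pathlib.Path("data_signature.json").write_text(json.dumps(signature, indent=2))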
Example 3 — Reproducible training run
Goal: deterministic model training with tracked config.
- Put all params in config.yaml (model type, hyperparameters, seed, features).
- Set seeds for Python, NumPy, and framework (when applicable).
- Record run metadata: git SHA, config hash, data hash, environment lockfile.
Show minimal Python seed pattern
import os, random, numpy as np
SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)  # affects subprocesses; set it in the shell before launch to fix hashing in this process
random.seed(SEED)
np.random.seed(SEED)
# For frameworks: e.g., torch.manual_seed(SEED) if used
# Load config.yaml, then serialize to canonical JSON and hash it for run_id
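The last comment above can be implemented in a few lines. A minimal sketch, assuming PyYAML is installed and a config.yaml exists in the working directory:
Show config hash sketch
import hashlib, json
import yaml  # PyYAML, assumed installed

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Canonical JSON (sorted keys, no extra whitespace) so the same config always hashes identically.
canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
run_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
print(f"run_id={run_id} seed={config.get('seed')}")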
Quick reproducibility checklist
- All dependencies are pinned to exact versions (no loose specifiers like >= or ~=, no wildcards).
- Python and OS base image recorded.
- Data snapshot/version and checksum stored.
- Global seed(s) are set and logged.
- Config file saved with run artifacts.
- Git commit SHA recorded; dirty working tree disallowed or captured.
- Artifacts stored in folders named with run identifiers.
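Several of these checks can be automated as a pre-run guard that refuses to start training when the basics are missing. A minimal sketch covering two items, pinned requirements and a clean git working tree; the requirements.txt name is an assumption:
Show pre-run check sketch
import pathlib, subprocess, sys

# Fail if any requirement line is not pinned to an exact version.
lines = pathlib.Path("requirements.txt").read_text().splitlines()
unpinned = [line for line in lines
            if line.strip() and not line.startswith("#") and "==" not in line]
if unpinned:
    sys.exit(f"Unpinned dependencies found: {unpinned}")

# Fail if the working tree has uncommitted changes.
status = subprocess.run(["git", "status", "--porcelain"],
                        capture_output=True, text=True, check=True).stdout
if status.strip():
    sys.exit("Dirty working tree: commit or stash changes before running.")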
Exercises you can do now
Do these to solidify the habit. They mirror the graded exercises below.
Exercise 1 — Pin a minimal environment
- Start from unpinned dependencies: numpy, pandas, scikit-learn.
- Choose specific, recent stable versions and create a requirements.txt with exact pins.
- Add your Python version in a runtime.txt (optional but recommended).
Mini task hint
Use the format package==X.Y.Z. Avoid wildcards.
Exercise 2 — Deterministic seed setup
- Create a snippet that sets PYTHONHASHSEED, random.seed, and numpy.random.seed to the same integer.
- Print the seed value at the start of your script for traceability.
- Optionally, add comments for framework-specific seeds (e.g., torch, tensorflow).
Mini task hint
PYTHONHASHSEED only takes effect if it is set before the Python interpreter starts (for example, in the shell or the job launcher); setting it inside a running script affects only subprocesses.
Common mistakes and how to self-check
- Unpinned deps: You see surprise behavior after a fresh install. Self-check: run pip freeze and verify exact versions are present in your lock file.
- No data version: You cannot prove which rows trained the model. Self-check: your run record must include a data version/hash.
- Missing seeds: Results change across runs. Self-check: ensure seeds are set once at process start.
- Config drift: CLI overrides not captured. Self-check: write final, merged config to the artifacts directory.
- Dirty code state: Local edits not committed. Self-check: block runs on dirty repos or record the diff in the run metadata.
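For the last point, one option is to record the uncommitted diff alongside the run instead of blocking it. A minimal sketch, assuming the illustrative runs/example folder:
Show diff capture sketch
import pathlib, subprocess

run_dir = pathlib.Path("runs/example")  # illustrative run folder
run_dir.mkdir(parents=True, exist_ok=True)

# Save any uncommitted changes so the exact code state can be reconstructed later.
diff = subprocess.run(["git", "diff", "HEAD"],
                      capture_output=True, text=True, check=True).stdout
if diff:
    (run_dir / "uncommitted.diff").write_text(diff)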
Debugging tip: reproducibility triangle
If results differ, check in order: (1) environment, (2) data, (3) config/code. Most issues are environment mismatches or unnoticed data drift.
Practical projects
- Reproducible baseline: Build a small classification pipeline that saves a run folder containing config.yaml, requirements.txt, git_sha.txt, data_signature.json, metrics.json, and model.pkl.
- Data version demo: Create two dataset snapshots with minor differences. Show how changing only the data_version reproduces or changes results predictably.
- Seed audit: Run the same training 5 times with seeds set and 5 times without. Record variance to demonstrate determinism benefits.
Learning path
- Start: Reproducibility Principles (this lesson)
- Next: Environment management and containerization
- Then: Data versioning and lineage
- Finally: Experiment tracking and CI/CD integration
Next steps
- Adopt a standard run folder structure across your team.
- Add a pre-run check that validates pins, seeds, and clean git state.
- Automate artifact collection in your training script.
Mini challenge
Take an old notebook that produces slightly different results each run. Convert it into a script that reads config.yaml, sets seeds, writes artifacts, and re-runs deterministically.
Quick Test
Take the short test to check your understanding.