Reproducible Training Runs for ML Frameworks: a free guide for Machine Learning Engineers

Why this matters

As a Machine Learning Engineer, you need to compare experiments, debug model issues, and hand off models to teammates or production. If your training runs can’t be repeated, you can’t trust your results or fix bugs efficiently.

Experiment review: rerun a baseline to verify a new idea actually helps.
Production parity: reproduce a dev result on a staging server before deployment.
Regulated settings: prove how a model was trained (data, code, config, seed).

Concept explained simply

Training is a function of five things: Code, Data, Config, Environment, and Hardware. Reproducibility means fully specifying these inputs so the function returns the same output each time.

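Mental model: Result = F(Code, Data, Config, Environment, Hardware). Hold all five fixed and a rerun should produce the same result; change any one of them (for example, a different GPU model or library version) and small numeric differences become possible.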

What commonly breaks reproducibility?

Randomness not seeded (Python random, NumPy, framework RNGs).
Non-deterministic GPU kernels (e.g., cuDNN/cuBLAS algorithm selection, atomic adds).
Data order differences (shuffling without a seed; multi-worker loaders).
Version drift (framework or CUDA versions change ops/algorithms).
Parallelism and thread counts (race conditions, reduction order).
Unlogged config changes (learning rate, batch size, augmentations).
Mixed precision or different hardware causing numeric drift.
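Several items in this list (reduction order, atomic adds, numeric drift) come down to one fact: floating-point addition is not associative. The following minimal sketch, using plain Python floats rather than any framework, shows that regrouping the same three numbers changes the last bits of the sum.

```python
# Floating-point addition is not associative, so the order in which
# partial sums are combined can change the low-order bits of the result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # the grouping a sequential loop would use
right_to_left = a + (b + c)   # a different grouping, e.g. from a parallel reduction

print(left_to_right, right_to_left)
print("identical:", left_to_right == right_to_left)
print("difference:", abs(left_to_right - right_to_left))
```

A parallel reduction on a GPU or across threads may group terms differently from run to run, which is how two otherwise identical runs can end up with slightly different losses.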

Worked examples

Example 1: Deterministic mini-run in PyTorch (CPU)

import os, random, numpy as np, torch

def set_seed(seed=123):
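    # PYTHONHASHSEED is read at interpreter startup, so this only affects child
    # processes; export it in the shell to cover the current process too.
    os.environ["PYTHONHASHSEED"] = str(seed)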
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_seed(123)
X = torch.randn(256, 10)
y = (torch.rand(256) > 0.5).long()
model = torch.nn.Sequential(
    torch.nn.Linear(10, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 2)
)
loss_fn = torch.nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(2):
    opt.zero_grad()
    logits = model(X)
    loss = loss_fn(logits, y)
    loss.backward()
    opt.step()
    print(f"epoch {epoch} loss {loss.item():.6f}")

Run this script twice; the printed losses should match exactly.
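The "run it twice and compare" check can be automated. The sketch below is one possible harness, not part of any framework: run_once and outputs_match are hypothetical helper names, and the demo command is a stand-in for a real training script. It launches the same command in fresh processes and compares their stdout byte for byte.

```python
import os
import subprocess
import sys


def run_once(cmd, env_overrides=None):
    """Run a command in a fresh process and return its stdout as text."""
    env = dict(os.environ)
    env.update(env_overrides or {})
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return result.stdout


def outputs_match(cmd, runs=2, env_overrides=None):
    """Return True if every run of cmd prints exactly the same stdout."""
    outputs = [run_once(cmd, env_overrides) for _ in range(runs)]
    return all(out == outputs[0] for out in outputs)


# Stand-in for a training script: prints a value that depends on Python's
# string hashing, which is randomized per process unless PYTHONHASHSEED is set.
demo_cmd = [sys.executable, "-c", "print(hash('reproducible'))"]

# PYTHONHASHSEED is passed through the child's environment because the
# interpreter reads it at startup.
stable = outputs_match(demo_cmd, env_overrides={"PYTHONHASHSEED": "123"})
print("identical output across runs:", stable)
```

To check a real run, replace demo_cmd with something like [sys.executable, "train.py"], where train.py is a placeholder for the script above saved to a file.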

GPU note

For CUDA runs, set the environment before any CUDA initialization:

import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # or ":16:8"

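Then move the tensors and model to the GPU with .cuda() or .to("cuda"). torch.manual_seed already seeds the CUDA generators in current PyTorch releases; calling torch.cuda.manual_seed_all(seed) as well is harmless and makes the intent explicit. Some ops have no deterministic implementation: with torch.use_deterministic_algorithms(True) enabled they raise a RuntimeError instead of silently running, so fall back to CPU when you need a bit-exact check.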

Example 2: Deterministic Keras training with tf.data

import os, random, numpy as np, tensorflow as tf
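# TF_DETERMINISTIC_OPS is the switch for older TensorFlow releases; TF 2.8+ uses
# tf.config.experimental.enable_op_determinism() (called below).
os.environ["TF_DETERMINISTIC_OPS"] = "1"
# PYTHONHASHSEED is read at interpreter startup, so this only affects child
# processes; export it in the shell to cover the current process too.
os.environ["PYTHONHASHSEED"] = "123"
seed = 123
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
tf.config.experimental.enable_op_determinism()  # requires TF 2.8 or newer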

X = np.random.randn(256, 10).astype("float32")
y = (np.random.rand(256) > 0.5).astype("int32")

ds = tf.data.Dataset.from_tensor_slices((X, y)) \
    .shuffle(buffer_size=256, seed=seed, reshuffle_each_iteration=False) \
    .batch(32, drop_remainder=True)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(2, activation="softmax")
])
model.compile(optimizer=tf.keras.optimizers.SGD(0.1),
              loss=tf.keras.losses.SparseCategoricalCrossentropy())

hist = model.fit(ds, epochs=1, verbose=0)
print(round(hist.history['loss'][-1], 6))

Run twice; the loss value should match exactly when using the same versions and environment.

Example 3: scikit-learn pipeline with fixed seeds and threads

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(123)
X = rng.randn(500, 5)
y = (rng.rand(500) > 0.5).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=123, stratify=y
)
clf = LogisticRegression(max_iter=200, random_state=123, n_jobs=1)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))

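The fixed random_state in train_test_split pins the split. In LogisticRegression, random_state only matters for the sag, saga, and liblinear solvers; the default lbfgs solver is already deterministic, so setting it here is a harmless safeguard. Keeping n_jobs=1 avoids parallel execution, which removes one more source of run-to-run variation.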

Reproducibility checklist

Seed all RNGs (Python, NumPy, framework) and record the seed.
Fix data order (seeded shuffles; disable reshuffle if needed).
Choose deterministic ops and set framework flags for determinism.
Pin versions (framework, CUDA, drivers) and record them.
Control parallelism (threads, workers) for stable ordering.
Log full config (hyperparameters, augmentations, preprocessing).
Snapshot or hash the dataset slices used for training/validation.
Record code version (e.g., Git commit) and hardware summary.
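For the dataset item in the checklist, a content hash is often enough to detect that a training or validation slice has changed. The sketch below assumes the slice can be read as an ordered sequence of byte records; hash_records is a hypothetical helper, and the three rows are placeholder data.

```python
import hashlib


def hash_records(records):
    """Return a SHA-256 hex digest over an ordered sequence of byte records."""
    digest = hashlib.sha256()
    for record in records:
        # Length-prefix each record so that record boundaries affect the hash:
        # [b"ab", b"c"] and [b"a", b"bc"] must not collide.
        digest.update(len(record).to_bytes(8, "big"))
        digest.update(record)
    return digest.hexdigest()


train_slice = [b"row-1", b"row-2", b"row-3"]  # placeholder data
print(hash_records(train_slice))
```

The hash is order-sensitive on purpose: if the order of examples matters for training, a reordered dataset should count as a different input.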

Exercises

Exercise 1: Deterministic mini-run (PyTorch or TensorFlow)
- Use the example for your framework.
- Run it three times; verify identical printed losses.
- Optional: switch to GPU and apply the deterministic settings; confirm stability.

Exercise 2: Create a "run manifest" JSON
- Programmatically capture: seeds, framework versions, Python version, OS, selected hyperparameters, and a quick dataset hash.
- Write to run_manifest.json before training starts.
- Repeat the run; confirm the manifest stays identical when nothing changes.

Self-check: Does re-running without changes produce the same metrics and the same manifest file?
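One way to approach the manifest exercise is sketched below using only the standard library. The function names, the package list, and the hyperparameter values are illustrative assumptions, and the dataset hash is a placeholder string. The manifest deliberately contains no timestamp, since that would make it differ on every run.

```python
import json
import platform
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(seed, hyperparams, dataset_hash, packages):
    """Collect the run's inputs; equal inputs produce equal manifests."""
    return {
        "seed": seed,
        "python": platform.python_version(),
        "os": platform.platform(),
        "packages": {name: installed_version(name) for name in packages},
        "hyperparams": hyperparams,
        "dataset_hash": dataset_hash,
    }


def manifest_to_json(manifest):
    # sort_keys gives a stable key order, so equal manifests serialize identically.
    return json.dumps(manifest, indent=2, sort_keys=True)


def write_manifest(manifest, path="run_manifest.json"):
    """Write the manifest to disk; call this before training starts."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(manifest_to_json(manifest))


manifest = build_manifest(
    seed=123,
    hyperparams={"lr": 0.1, "batch_size": 32},      # illustrative values
    dataset_hash="<sha256 of the training slice>",  # placeholder
    packages=["numpy", "torch"],
)
print(manifest_to_json(manifest))
```

Comparing the JSON text from two runs then answers the manifest half of the self-check directly.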

Common mistakes and how to self-check

Unseeded components

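Fix: Seed Python random, NumPy, and your framework. In TensorFlow 2.8 or newer, call tf.config.experimental.enable_op_determinism() (older releases use the TF_DETERMINISTIC_OPS=1 environment variable). In PyTorch, call torch.use_deterministic_algorithms(True).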

Data reshuffled differently each epoch

Fix: Use a fixed seed. If you need the same order in every epoch, set reshuffle_each_iteration=False (TF). In PyTorch, pass a seeded torch.Generator to the DataLoader through its generator argument so the shuffle sequence repeats across runs.
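The idea behind a seeded shuffle can be sketched without any framework: derive each epoch's order from the seed and the epoch number alone, never from global random state. The epoch_order function and its seed-mixing constant are assumptions made for this illustration, not an API of PyTorch or TensorFlow.

```python
import random


def epoch_order(num_samples, seed, epoch, reshuffle_each_epoch=True):
    """Return the sample order for an epoch, derived only from (seed, epoch)."""
    if reshuffle_each_epoch:
        # Mix the epoch into the seed so each epoch gets its own fixed order.
        epoch_seed = seed * 1_000_003 + epoch
    else:
        # Same order in every epoch.
        epoch_seed = seed
    # A dedicated RNG instance keeps this independent of random.seed() calls elsewhere.
    rng = random.Random(epoch_seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return indices


print(epoch_order(8, seed=123, epoch=0))
print(epoch_order(8, seed=123, epoch=1))
print(epoch_order(8, seed=123, epoch=1, reshuffle_each_epoch=False))
```

With this scheme the order in epoch 5 is the same whether training ran straight through or was resumed from a checkpoint at epoch 4.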

GPU nondeterminism

Fix: Prefer CPU for exact checks, or set deterministic flags and required environment variables. Validate by comparing metrics across repeated runs.

Version drift

Fix: Pin dependencies. Add versions to your manifest and fail fast if versions differ from expected.
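A fail-fast version check might look like the sketch below. It relies on importlib.metadata from the standard library; the helper names and the example package and versions are hypothetical. The lookup function is injectable so the check can be exercised without installing anything.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_version_drift(expected, lookup=installed_version):
    """Return {package: (expected, actual)} for every mismatch."""
    drift = {}
    for package, wanted in expected.items():
        actual = lookup(package)
        if actual != wanted:
            drift[package] = (wanted, actual)
    return drift


def assert_no_drift(expected, lookup=installed_version):
    """Raise before training starts if any pinned version differs."""
    drift = find_version_drift(expected, lookup)
    if drift:
        raise RuntimeError(f"Version drift detected: {drift}")


# Demo with a stub lookup so the output does not depend on this machine.
pinned = {"example-pkg": "1.2.0"}  # hypothetical pin
print(find_version_drift(pinned, lookup=lambda name: "1.3.0"))
```

In a training entry point, calling assert_no_drift with the pinned versions from the manifest stops the run before any compute is spent on a mismatched environment.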

Hidden parallelism

Fix: Control thread counts and data loader workers. For verification, set to 1 and scale up later if acceptable.
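Thread counts for the underlying math libraries are commonly controlled through environment variables. The sketch below sets three widely used ones (OpenMP, MKL, OpenBLAS); limit_threads is a hypothetical helper, and which variables actually take effect depends on how your NumPy or framework build was compiled.

```python
import os

THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_threads(count=1):
    """Set common thread-count variables and return what was applied."""
    applied = {}
    for var in THREAD_VARS:
        os.environ[var] = str(count)
        applied[var] = os.environ[var]
    return applied


# Call this before importing NumPy, PyTorch, or TensorFlow: the math
# libraries typically read these variables once, when they initialize.
print(limit_threads(1))
```

Framework-level settings complement this: in PyTorch, torch.set_num_threads(1) and a DataLoader with num_workers=0 give a single-threaded baseline for verification.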

Mini challenge

Take a small model you already have. Make it bit-for-bit repeatable for one epoch on your machine. Produce a manifest file and a short note describing exactly which settings you changed to achieve this. If a GPU op remains nondeterministic, document it and explain your mitigation (e.g., switching to CPU for verification).

Who this is for

ML engineers and researchers who need consistent experiments.
Data scientists moving models from notebooks to production.

Prerequisites

Basic Python and NumPy.
Hands-on experience with at least one ML framework (PyTorch or TensorFlow).

Learning path

Learn framework seeding and deterministic flags (PyTorch/TF basics).
Control data pipelines (seeded shuffles, worker settings).
Pin and record environments (versions, OS, hardware).
Automate manifests and sanity checks in your training entry-point.

Practical projects

Deterministic Baseline: Convert one of your recent experiments into a fully reproducible script with a manifest.
Repro Sanity CI: Add a small CI job that runs a 1-epoch training and asserts expected metric values within a tiny tolerance.
Dataset Snapshot: Write a utility to compute and save dataset hashes for your train/val splits.
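For the Repro Sanity CI project, the metric assertion can be a few lines built on math.isclose. The function name, tolerances, and metric values below are illustrative assumptions; choose tolerances that match how much drift your hardware and settings are expected to produce.

```python
import math


def assert_metric_close(name, actual, expected, rel_tol=1e-6, abs_tol=1e-8):
    """Raise AssertionError if a metric drifts beyond the allowed tolerance."""
    if not math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol):
        raise AssertionError(
            f"{name}: got {actual!r}, expected {expected!r} "
            f"(rel_tol={rel_tol}, abs_tol={abs_tol})"
        )


# Illustrative values: a recorded baseline loss and the loss from a fresh run.
baseline_loss = 0.69314718
fresh_loss = 0.6931472
assert_metric_close("loss", fresh_loss, baseline_loss)
print("metric within tolerance")
```

For a fully deterministic CPU run the tolerance can be zero (exact equality); a small nonzero tolerance is a pragmatic choice when GPU kernels are allowed some nondeterminism.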

Next steps

Integrate your reproducibility checklist into every new project template.
Automate run manifests and metric assertions in your training launcher.
Document deviations (e.g., allowed non-determinism for speed) and why.
