
Managing Random Seeds

Learn Managing Random Seeds for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Who this is for

  • Machine Learning Engineers and Data Scientists who need reproducible experiments.
  • Anyone shipping models where auditability, debugging, or fairness checks matter.
  • Teams aligning results across machines, runs, or colleagues.

Prerequisites

  • Comfortable with Python.
  • Basic use of NumPy, scikit-learn, and at least one deep learning framework (PyTorch or TensorFlow).
  • Know how to run training scripts from the command line.

Why this matters

In real ML work, you will:

  • Compare experiments confidently. If randomness changes, your comparisons are unreliable.
  • Debug training issues. Deterministic runs help isolate bugs versus noise.
  • Run fairness and regression checks. Uncontrolled seed changes between runs can mask or mimic real issues.
  • Collaborate across machines. Seed control reduces “works on my machine”.

Concept explained simply

A random seed is a starting number for a pseudo-random number generator (PRNG). Set the same seed, and you get the same sequence of random numbers. In ML, many steps use randomness: data splits, weight initialization, shuffling, augmentations, dropout, and more.

Mental model

Imagine each source of randomness is a separate deck of shuffled cards. Seeding is like deciding the shuffle order ahead of time. If you use the same seed and the same decks, you’ll draw the same cards in the same order.
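
A minimal sketch of both ideas with NumPy; the same pattern applies to any PRNG:
import numpy as np

# Same seed, same sequence: these two generators draw identical numbers.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
assert np.allclose(rng_a.random(3), rng_b.random(3))

# Separate "decks": one generator per source of randomness. Drawing from one
# deck does not change what the other will produce.
split_rng = np.random.default_rng(100)    # e.g. controls the data split
augment_rng = np.random.default_rng(200)  # e.g. controls augmentations
print(split_rng.integers(0, 10, size=5))
print(augment_rng.integers(0, 10, size=5))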

Where randomness appears in ML frameworks

  • NumPy: random sampling, initializations.
  • Python: the random module, plus hash randomization (PYTHONHASHSEED), which can affect set/dict iteration order for string keys.
  • scikit-learn: train/test split, model randomness (e.g., RandomForest, KMeans).
  • PyTorch: weight init, DataLoader shuffling, augmentations, CUDA kernels, cuDNN algorithms.
  • TensorFlow/Keras: weight init, dataset shuffles/augmentations, graph ops.
  • XGBoost/LightGBM/CatBoost: sampling, feature subsampling, tree building.

Determinism vs performance trade-offs

  • Deterministic GPU ops can be slower. Some fast kernels are nondeterministic; enabling determinism may switch to slower algorithms.
  • DataLoader workers: parallel data loading adds nondeterminism unless carefully seeded.
  • Distributed training: more moving parts to seed and sync; might still vary with different hardware or versions.

Worked examples

Example 1 — NumPy + scikit-learn: consistent split and model
# Deterministic split and model in scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

SEED = 42
np.random.seed(SEED)

# Synthetic data
data = np.random.randn(1000, 10)
labels = (data[:, 0] + 0.5 * data[:, 1] > 0).astype(int)

# Deterministic split
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=SEED
)

# Deterministic model (note random_state)
model = RandomForestClassifier(n_estimators=200, random_state=SEED)
model.fit(X_train, y_train)

preds = model.predict(X_test)
acc = accuracy_score(y_test, preds)
print("Accuracy:", acc)

Run this twice and you should see the exact same accuracy and predictions.

Example 2 — PyTorch: seeds, cuDNN, DataLoader workers
import os
import random
import numpy as np
import torch

SEED = 123

# 1) Seed all relevant RNGs
# Note: PYTHONHASHSEED only affects hash randomization if set before the
# Python process starts; setting it here mainly helps child processes.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# 2) cuDNN determinism (may slow down)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# 3) DataLoader worker seeding
def seed_worker(worker_id):
    worker_seed = SEED + worker_id
    np.random.seed(worker_seed)
    random.seed(worker_seed)

generator = torch.Generator()
generator.manual_seed(SEED)

# Example dataset/dataloader
from torch.utils.data import DataLoader, TensorDataset
X = torch.randn(1000, 20)
y = (X[:, 0] + 0.2 * X[:, 1] > 0).long()
dataset = TensorDataset(X, y)
loader = DataLoader(
    dataset, batch_size=32, shuffle=True, num_workers=2,
    worker_init_fn=seed_worker, generator=generator
)

# Simple model
model = torch.nn.Sequential(
    torch.nn.Linear(20, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2)
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

first_three_losses = []
model.train()
for i, (bx, by) in enumerate(loader):
    optimizer.zero_grad()
    out = model(bx)
    loss = loss_fn(out, by)
    loss.backward()
    optimizer.step()
    if i < 3:
        first_three_losses.append(float(loss.item()))
    if i == 20:
        break
print("First 3 batch losses:", first_three_losses)

Run the script twice; the first three batch losses should match exactly.

Example 3 — TensorFlow/Keras: tf.random.set_seed and dataset shuffle
import os
import random
import numpy as np
import tensorflow as tf

SEED = 2024
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Optional: request deterministic ops (may reduce speed). Ideally set this env
# var before TensorFlow executes any ops; newer TF versions also offer
# tf.config.experimental.enable_op_determinism().
os.environ["TF_DETERMINISTIC_OPS"] = "1"

# Data
data = np.random.randn(1000, 10).astype(np.float32)
labels = ((data[:, 0] + 0.5 * data[:, 1]) > 0).astype(np.int32)

ds = tf.data.Dataset.from_tensor_slices((data, labels))
ds = ds.shuffle(buffer_size=1000, seed=SEED, reshuffle_each_iteration=False)
ds = ds.batch(32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(2)
])
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

h = model.fit(ds, epochs=1, verbose=0)
print("First epoch loss:", float(h.history['loss'][0]))

Re-running should produce the same first epoch loss when environment, versions, and hardware are unchanged.

Global vs local seeding

  • Global seeding: set seeds for Python random, NumPy, and framework-level RNGs at the very start of your script.
  • Local seeding: pass seed/random_state to functions that accept it (e.g., train_test_split, DataLoader generator, tf.data shuffle).
  • Both are needed. Global reduces accidental randomness; local ensures specific operations are controlled.
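
A short sketch contrasting the two; here train_test_split's random_state is the "local" control:
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42

# Global: seeds NumPy's legacy global RNG, used by any np.random.* call
# that does not receive an explicit generator.
np.random.seed(SEED)

X = np.random.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

# Local: this specific operation is controlled regardless of global state.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)
print(X_train.shape, X_test.shape)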

Data pipelines and augmentations

  • Shuffling: control DataLoader/DataSet shuffles with a seed or generator.
  • Augmentations: many libraries accept a seed or a deterministic flag; if not, wrap the transform with your own seeded RNG (see the sketch after this list).
  • Resampling: weighted sampling or class-balanced batches require a seeded sampler.
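
If an augmentation library exposes no seed parameter, one option is to drive the augmentation from your own dedicated, seeded RNG. A sketch (random_flip is just an illustration, not a library function):
import numpy as np

AUG_SEED = 42
aug_rng = np.random.default_rng(AUG_SEED)  # dedicated RNG for augmentations

def random_flip(image, rng):
    # Flip the image left-right with probability 0.5, using the supplied RNG.
    if rng.random() < 0.5:
        return image[:, ::-1]
    return image

image = np.arange(12).reshape(3, 4)
augmented = random_flip(image, aug_rng)  # reproducible for a fixed AUG_SEED
print(augmented)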

Multiprocessing and distributed training

  • Each worker process needs a distinct, deterministic seed derived from a base seed.
  • In PyTorch, use worker_init_fn and a torch.Generator with manual_seed.
  • In TensorFlow, prefer tf.data with seed and reshuffle_each_iteration=False when appropriate.
  • Across nodes/GPUs, ensure each rank calculates seeds deterministically (e.g., base_seed + global_rank).
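
One way to derive distinct but deterministic seeds per rank and worker from a base seed; derive_seed is a hypothetical helper, not a framework API:
def derive_seed(base_seed: int, global_rank: int, worker_id: int = 0) -> int:
    # Space ranks far apart so several DataLoader workers fit per rank
    # without seed collisions.
    return base_seed + global_rank * 1000 + worker_id

BASE_SEED = 42
# e.g. rank 2, DataLoader worker 3 on that rank:
print(derive_seed(BASE_SEED, global_rank=2, worker_id=3))  # 2045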

GPU kernels and determinism

  • cuDNN and other GPU libraries may use nondeterministic algorithms by default.
  • PyTorch: set torch.backends.cudnn.deterministic=True and benchmark=False.
  • TensorFlow: set TF_DETERMINISTIC_OPS=1. Some ops may still be nondeterministic depending on versions.
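
In PyTorch, a stricter option than the cuDNN flags is torch.use_deterministic_algorithms, which raises an error whenever an op has no deterministic implementation. A sketch:
import os
import torch

# Required for deterministic cuBLAS on CUDA 10.2+; set before any CUDA work.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Fail loudly (instead of silently varying) if a nondeterministic op is hit.
torch.use_deterministic_algorithms(True)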

Framework-specific seed parameters

  • scikit-learn: random_state in splitters and models (e.g., RandomForest, KMeans, LogisticRegression with the saga solver).
  • XGBoost: set random_state (seed in the native API); determinism also depends on sampling parameters such as subsample and colsample_bytree and on the tree method, so keep those fixed too.
  • LightGBM: set seed (or random_state) plus deterministic=True for stricter determinism; this may slow training.
  • CatBoost: random_seed controls randomness; allow_writing_files=False keeps runs free of side-effect files.
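
A sketch of where these parameters go, assuming the xgboost, lightgbm, and catboost packages are installed; the other hyperparameters are illustrative only:
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

SEED = 42

xgb = XGBClassifier(random_state=SEED, subsample=0.8, colsample_bytree=0.8)
lgbm = LGBMClassifier(random_state=SEED, deterministic=True, force_row_wise=True)
cat = CatBoostClassifier(random_seed=SEED, allow_writing_files=False, verbose=0)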

Environment and versioning

  • Even with seeds, different library versions or hardware can change results slightly.
  • Record: Python version, CUDA/cuDNN, framework versions, CPU/GPU model, and key env variables.
  • Save configs with your experiment (seed, flags, params) so runs are traceable.
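
A sketch of capturing that metadata next to a run; the field names and file name are just examples, and you would swap in whichever frameworks you actually use:
import json
import platform
import sys

import numpy as np
import torch

run_info = {
    "seed": 42,
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    "torch": torch.__version__,
    "cuda": torch.version.cuda,               # None on CPU-only builds
    "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
    "cudnn_deterministic": torch.backends.cudnn.deterministic,
}

with open("run_info.json", "w") as f:
    json.dump(run_info, f, indent=2)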

Step-by-step: make a training run reproducible

  1. Pick a base SEED and put it in one place (config/env).
  2. Seed Python, NumPy, and your ML framework at the top of the main script.
  3. Control all shuffles and splits with explicit seeds/generators.
  4. Enable deterministic flags for GPU libraries if needed.
  5. Derive worker/distributed seeds from the base seed.
  6. Record versions and env variables used. Save them with results.
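
A minimal helper covering steps 2 and 4 for a PyTorch project; setup_seeds is a hypothetical name, and you would extend it for the frameworks you actually use:
import os
import random

import numpy as np
import torch

def setup_seeds(seed: int, deterministic_gpu: bool = True) -> None:
    # Seed Python, NumPy, and PyTorch from one base seed.
    os.environ["PYTHONHASHSEED"] = str(seed)  # mainly useful for child processes
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    if deterministic_gpu:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

setup_seeds(42)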

Exercises

Do these now. They mirror the graded exercises below.

Exercise 1 — Deterministic scikit-learn run

Goal: Make a split and model training reproducible across two runs.

  • Use NumPy + scikit-learn.
  • Set np.random.seed and model/train_test_split random_state.
  • Print a stable hash of predictions to confirm determinism.

Expected: identical hashes across two runs.
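
For the "stable hash" step, one option is to hash the raw prediction bytes, as in this sketch:
import hashlib
import numpy as np

preds = np.array([0, 1, 1, 0, 1])  # replace with model.predict(X_test)
digest = hashlib.md5(np.ascontiguousarray(preds).tobytes()).hexdigest()
print("Prediction hash:", digest)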

Exercise 2 — Deterministic PyTorch mini-training

Goal: Ensure the first 3 batch losses match across runs.

  • Seed Python, NumPy, PyTorch (CPU/GPU).
  • Set cuDNN deterministic, benchmark False.
  • Seed DataLoader via worker_init_fn and Generator.
  • Print first 3 batch losses and confirm they match across runs.

Expected: identical numbers across two runs.

  • Checklist you can use:
    • Global seeds set (Python, NumPy, framework).
    • Local seeds set (split, shuffle, augmentations).
    • GPU determinism flags (if using GPU).
    • Worker/distributed seeds derived from base seed.
    • Environment versions recorded.

Common mistakes and self-check

  • Only seeding NumPy but not the ML framework. Self-check: print a random number from each RNG at start to verify seeding (see the sketch after this list).
  • Forgetting DataLoader worker seeds. Self-check: run twice and compare first few batch indices; if different, fix worker_init_fn/generator.
  • Leaving cuDNN benchmark=True. Self-check: print the flag values; ensure deterministic=True, benchmark=False when needed.
  • Shuffling without a fixed seed in tf.data or scikit-learn. Self-check: set seed and reshuffle_each_iteration appropriately.
  • Changing library versions or hardware between runs. Self-check: log and compare versions in run metadata.
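
A sketch of the first self-check: print one value from each RNG immediately after seeding; any value that differs between two runs points at an RNG that was not seeded.
import random

import numpy as np
import torch

print("python :", random.random())
print("numpy  :", np.random.rand())
print("torch  :", torch.rand(1).item())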

Practical projects

  • Reproducible baseline: Build a small image classifier twice (PyTorch or Keras) and store a run.json with seed, flags, versions. Confirm identical metrics.
  • Seed audit tool: Write a tiny Python module with a function setup_seeds(seed) and use it across your projects.
  • Determinism switch: Add a CLI flag --deterministic to your training script that toggles all relevant settings.

Learning path

  • Start here: manage seeds in simple CPU-only experiments.
  • Add GPU determinism and measure speed vs reproducibility.
  • Handle multiprocessing DataLoaders and tf.data pipelines.
  • Advance to distributed training and cross-machine reproducibility.

Next steps

  • Integrate a reproducibility checklist into your project template.
  • Automate environment capture (versions, flags) on each run.
  • Apply the same rigor when moving to distributed or production training.

Mini challenge

Take an existing project of yours. Add a single configuration file that sets: SEED, deterministic flags, DataLoader/tf.data seeds, and logs versions. Re-run your best experiment twice and confirm identical initial batches and final metric to 3–5 decimal places.

Quick Test

Take the quick test to check your understanding.

Practice Exercises

2 exercises to complete

Instructions

Make a small scikit-learn classification pipeline reproducible across two runs.

  • Generate synthetic data with NumPy.
  • Use train_test_split with random_state.
  • Train RandomForestClassifier with random_state.
  • Print a stable hash (e.g., md5) of predictions.

Tips

  • Set np.random.seed at the top.
  • Keep all seeds as the same constant.

Expected Output
The printed prediction hash (and accuracy) matches exactly across two consecutive runs.

Managing Random Seeds — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

