
Managing Random Seeds

Learn Managing Random Seeds for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Who this is for

  • Machine Learning Engineers and Data Scientists who need reproducible experiments.
  • Anyone shipping models where auditability, debugging, or fairness checks matter.
  • Teams aligning results across machines, runs, or colleagues.

Prerequisites

  • Comfortable with Python.
  • Basic use of NumPy, scikit-learn, and at least one deep learning framework (PyTorch or TensorFlow).
  • Know how to run training scripts from the command line.

Why this matters

In real ML work, you will:

  • Compare experiments confidently. If randomness changes, your comparisons are unreliable.
  • Debug training issues. Deterministic runs help isolate bugs versus noise.
  • Run fairness and regression checks. Uncontrolled seed changes between runs can mask or mimic real issues.
  • Collaborate across machines. Seed control reduces “works on my machine”.

Concept explained simply

A random seed is a starting number for a pseudo-random number generator (PRNG). Set the same seed, and you get the same sequence of random numbers. In ML, many steps use randomness: data splits, weight initialization, shuffling, augmentations, dropout, and more.

Mental model

Imagine each source of randomness is a separate deck of shuffled cards. Seeding is like deciding the shuffle order ahead of time. If you use the same seed and the same decks, you’ll draw the same cards in the same order.
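
A minimal sketch of both ideas with NumPy; the same pattern applies to any PRNG:
import numpy as np

# Same seed, same sequence: these two generators draw identical numbers.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
assert np.allclose(rng_a.random(3), rng_b.random(3))

# Separate "decks": one generator per source of randomness. Drawing from one
# deck does not change what the other will produce.
split_rng = np.random.default_rng(100)    # e.g. controls the data split
augment_rng = np.random.default_rng(200)  # e.g. controls augmentations
print(split_rng.integers(0, 10, size=5))
print(augment_rng.integers(0, 10, size=5))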

Where randomness appears in ML frameworks

  • NumPy: random sampling, initializations.
  • Python: the random module, plus hash randomization (PYTHONHASHSEED), which can affect set/dict iteration order for string keys.
  • scikit-learn: train/test split, model randomness (e.g., RandomForest, KMeans).
  • PyTorch: weight init, DataLoader shuffling, augmentations, CUDA kernels, cuDNN algorithms.
  • TensorFlow/Keras: weight init, dataset shuffles/augmentations, graph ops.
  • XGBoost/LightGBM/CatBoost: sampling, feature subsampling, tree building.

Determinism vs performance trade-offs

  • Deterministic GPU ops can be slower. Some fast kernels are nondeterministic; enabling determinism may switch to slower algorithms.
  • DataLoader workers: parallel data loading adds nondeterminism unless carefully seeded.
  • Distributed training: more moving parts to seed and sync; might still vary with different hardware or versions.

Worked examples

Example 1 — NumPy + scikit-learn: consistent split and model
# Deterministic split and model in scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

SEED = 42
np.random.seed(SEED)

# Synthetic data
data = np.random.randn(1000, 10)
labels = (data[:, 0] + 0.5 * data[:, 1] > 0).astype(int)

# Deterministic split
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=SEED
)

# Deterministic model (note random_state)
model = RandomForestClassifier(n_estimators=200, random_state=SEED)
model.fit(X_train, y_train)

preds = model.predict(X_test)
acc = accuracy_score(y_test, preds)
print("Accuracy:", acc)

Run this twice and you should see the exact same accuracy and predictions.

Example 2 — PyTorch: seeds, cuDNN, DataLoader workers
import os
import random
import numpy as np
import torch

SEED = 123

# 1) Seed all relevant RNGs
# Note: PYTHONHASHSEED only affects hash randomization if set before the
# Python process starts; setting it here mainly helps child processes.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# 2) cuDNN determinism (may slow down)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# 3) DataLoader worker seeding
def seed_worker(worker_id):
    worker_seed = SEED + worker_id
    np.random.seed(worker_seed)
    random.seed(worker_seed)

generator = torch.Generator()
generator.manual_seed(SEED)

# Example dataset/dataloader
from torch.utils.data import DataLoader, TensorDataset
X = torch.randn(1000, 20)
y = (X[:, 0] + 0.2 * X[:, 1] > 0).long()
dataset = TensorDataset(X, y)
loader = DataLoader(
    dataset, batch_size=32, shuffle=True, num_workers=2,
    worker_init_fn=seed_worker, generator=generator
)

# Simple model
model = torch.nn.Sequential(
    torch.nn.Linear(20, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2)
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

first_three_losses = []
model.train()
for i, (bx, by) in enumerate(loader):
    optimizer.zero_grad()
    out = model(bx)
    loss = loss_fn(out, by)
    loss.backward()
    optimizer.step()
    if i < 3:
        first_three_losses.append(float(loss.item()))
    if i == 20:
        break
print("First 3 batch losses:", first_three_losses)

Run the script twice; the first three batch losses should match exactly.

Example 3 — TensorFlow/Keras: tf.random.set_seed and dataset shuffle
import os
import random
import numpy as np
import tensorflow as tf

SEED = 2024
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Optional: request deterministic ops (may reduce speed). Ideally set this env
# var before TensorFlow executes any ops; newer TF versions also offer
# tf.config.experimental.enable_op_determinism().
os.environ["TF_DETERMINISTIC_OPS"] = "1"

# Data
data = np.random.randn(1000, 10).astype(np.float32)
labels = ((data[:, 0] + 0.5 * data[:, 1]) > 0).astype(np.int32)

ds = tf.data.Dataset.from_tensor_slices((data, labels))
ds = ds.shuffle(buffer_size=1000, seed=SEED, reshuffle_each_iteration=False)
ds = ds.batch(32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(2)
])
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

h = model.fit(ds, epochs=1, verbose=0)
print("First epoch loss:", float(h.history['loss'][0]))

Re-running should produce the same first epoch loss when environment, versions, and hardware are unchanged.

Global vs local seeding

  • Global seeding: set seeds for Python random, NumPy, and framework-level RNGs at the very start of your script.
  • Local seeding: pass seed/random_state to functions that accept it (e.g., train_test_split, DataLoader generator, tf.data shuffle).
  • Both are needed. Global reduces accidental randomness; local ensures specific operations are controlled.
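
A short sketch contrasting the two; here train_test_split's random_state is the "local" control:
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42

# Global: seeds NumPy's legacy global RNG, used by any np.random.* call
# that does not receive an explicit generator.
np.random.seed(SEED)

X = np.random.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

# Local: this specific operation is controlled regardless of global state.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)
print(X_train.shape, X_test.shape)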

Data pipelines and augmentations

  • Shuffling: control DataLoader/DataSet shuffles with a seed or generator.
  • Augmentations: many libraries accept a seed or a deterministic flag; if not, wrap the transform with your own seeded RNG (see the sketch after this list).
  • Resampling: weighted sampling or class-balanced batches require a seeded sampler.
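
If an augmentation library exposes no seed parameter, one option is to drive the augmentation from your own dedicated, seeded RNG. A sketch (random_flip is just an illustration, not a library function):
import numpy as np

AUG_SEED = 42
aug_rng = np.random.default_rng(AUG_SEED)  # dedicated RNG for augmentations

def random_flip(image, rng):
    # Flip the image left-right with probability 0.5, using the supplied RNG.
    if rng.random() < 0.5:
        return image[:, ::-1]
    return image

image = np.arange(12).reshape(3, 4)
augmented = random_flip(image, aug_rng)  # reproducible for a fixed AUG_SEED
print(augmented)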

Multiprocessing and distributed training

  • Each worker process needs a distinct, deterministic seed derived from a base seed.
  • In PyTorch, use worker_init_fn and a torch.Generator with manual_seed.
  • In TensorFlow, prefer tf.data with seed and reshuffle_each_iteration=False when appropriate.
  • Across nodes/GPUs, ensure each rank calculates seeds deterministically (e.g., base_seed + global_rank).
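
One way to derive distinct but deterministic seeds per rank and worker from a base seed; derive_seed is a hypothetical helper, not a framework API:
def derive_seed(base_seed: int, global_rank: int, worker_id: int = 0) -> int:
    # Space ranks far apart so several DataLoader workers fit per rank
    # without seed collisions.
    return base_seed + global_rank * 1000 + worker_id

BASE_SEED = 42
# e.g. rank 2, DataLoader worker 3 on that rank:
print(derive_seed(BASE_SEED, global_rank=2, worker_id=3))  # 2045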

GPU kernels and determinism

  • cuDNN and other GPU libraries may use nondeterministic algorithms by default.
  • PyTorch: set torch.backends.cudnn.deterministic=True and benchmark=False.
  • TensorFlow: set TF_DETERMINISTIC_OPS=1. Some ops may still be nondeterministic depending on versions.
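
In PyTorch, a stricter option than the cuDNN flags is torch.use_deterministic_algorithms, which raises an error whenever an op has no deterministic implementation. A sketch:
import os
import torch

# Required for deterministic cuBLAS on CUDA 10.2+; set before any CUDA work.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Fail loudly (instead of silently varying) if a nondeterministic op is hit.
torch.use_deterministic_algorithms(True)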

Framework-specific seed parameters

  • scikit-learn: random_state in splitters and models (e.g., RandomForest, KMeans, LogisticRegression with the saga solver).
  • XGBoost: set random_state (seed in the native API); determinism also depends on sampling parameters such as subsample and colsample_bytree and on the tree method, so keep those fixed too.
  • LightGBM: set seed (or random_state) plus deterministic=True for stricter determinism; this may slow training.
  • CatBoost: random_seed controls randomness; allow_writing_files=False keeps runs free of side-effect files.
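
A sketch of where these parameters go, assuming the xgboost, lightgbm, and catboost packages are installed; the other hyperparameters are illustrative only:
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

SEED = 42

xgb = XGBClassifier(random_state=SEED, subsample=0.8, colsample_bytree=0.8)
lgbm = LGBMClassifier(random_state=SEED, deterministic=True, force_row_wise=True)
cat = CatBoostClassifier(random_seed=SEED, allow_writing_files=False, verbose=0)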

Environment and versioning

  • Even with seeds, different library versions or hardware can change results slightly.
  • Record: Python version, CUDA/cuDNN, framework versions, CPU/GPU model, and key env variables.
  • Save configs with your experiment (seed, flags, params) so runs are traceable.
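
A sketch of capturing that metadata next to a run; the field names and file name are just examples, and you would swap in whichever frameworks you actually use:
import json
import platform
import sys

import numpy as np
import torch

run_info = {
    "seed": 42,
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    "torch": torch.__version__,
    "cuda": torch.version.cuda,               # None on CPU-only builds
    "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
    "cudnn_deterministic": torch.backends.cudnn.deterministic,
}

with open("run_info.json", "w") as f:
    json.dump(run_info, f, indent=2)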

Step-by-step: make a training run reproducible

  1. Pick a base SEED and put it in one place (config/env).
  2. Seed Python, NumPy, and your ML framework at the top of the main script.
  3. Control all shuffles and splits with explicit seeds/generators.
  4. Enable deterministic flags for GPU libraries if needed.
  5. Derive worker/distributed seeds from the base seed.
  6. Record versions and env variables used. Save them with results.
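
A minimal helper covering steps 2 and 4 for a PyTorch project; setup_seeds is a hypothetical name, and you would extend it for the frameworks you actually use:
import os
import random

import numpy as np
import torch

def setup_seeds(seed: int, deterministic_gpu: bool = True) -> None:
    # Seed Python, NumPy, and PyTorch from one base seed.
    os.environ["PYTHONHASHSEED"] = str(seed)  # mainly useful for child processes
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    if deterministic_gpu:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

setup_seeds(42)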

Exercises

Do these now. They mirror the graded exercises below.

Exercise 1 — Deterministic scikit-learn run

Goal: Make a split and model training reproducible across two runs.

  • Use NumPy + scikit-learn.
  • Set np.random.seed and model/train_test_split random_state.
  • Print a stable hash of predictions to confirm determinism.

Expected: identical hashes across two runs.
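
For the "stable hash" step, one option is to hash the raw prediction bytes, as in this sketch:
import hashlib
import numpy as np

preds = np.array([0, 1, 1, 0, 1])  # replace with model.predict(X_test)
digest = hashlib.md5(np.ascontiguousarray(preds).tobytes()).hexdigest()
print("Prediction hash:", digest)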

Exercise 2 — Deterministic PyTorch mini-training

Goal: Ensure the first 3 batch losses match across runs.

  • Seed Python, NumPy, PyTorch (CPU/GPU).
  • Set cuDNN deterministic, benchmark False.
  • Seed DataLoader via worker_init_fn and Generator.
  • Print first 3 batch losses and confirm they match across runs.

Expected: identical numbers across two runs.

  • Checklist you can use:
    • Global seeds set (Python, NumPy, framework).
    • Local seeds set (split, shuffle, augmentations).
    • GPU determinism flags (if using GPU).
    • Worker/distributed seeds derived from base seed.
    • Environment versions recorded.

Common mistakes and self-check

  • Only seeding NumPy but not the ML framework. Self-check: print a random number from each RNG at start to verify seeding (see the sketch after this list).
  • Forgetting DataLoader worker seeds. Self-check: run twice and compare first few batch indices; if different, fix worker_init_fn/generator.
  • Leaving cuDNN benchmark=True. Self-check: print the flag values; ensure deterministic=True, benchmark=False when needed.
  • Shuffling without a fixed seed in tf.data or scikit-learn. Self-check: set seed and reshuffle_each_iteration appropriately.
  • Changing library versions or hardware between runs. Self-check: log and compare versions in run metadata.
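
A sketch of the first self-check: print one value from each RNG immediately after seeding; any value that differs between two runs points at an RNG that was not seeded.
import random

import numpy as np
import torch

print("python :", random.random())
print("numpy  :", np.random.rand())
print("torch  :", torch.rand(1).item())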

Practical projects

  • Reproducible baseline: Build a small image classifier twice (PyTorch or Keras) and store a run.json with seed, flags, versions. Confirm identical metrics.
  • Seed audit tool: Write a tiny Python module with a function setup_seeds(seed) and use it across your projects.
  • Determinism switch: Add a CLI flag --deterministic to your training script that toggles all relevant settings.

Learning path

  • Start here: manage seeds in simple CPU-only experiments.
  • Add GPU determinism and measure speed vs reproducibility.
  • Handle multiprocessing DataLoaders and tf.data pipelines.
  • Advance to distributed training and cross-machine reproducibility.

Next steps

  • Integrate a reproducibility checklist into your project template.
  • Automate environment capture (versions, flags) on each run.
  • Apply the same rigor when moving to distributed or production training.

Mini challenge

Take an existing project of yours. Add a single configuration file that sets: SEED, deterministic flags, DataLoader/tf.data seeds, and logs versions. Re-run your best experiment twice and confirm identical initial batches and final metric to 3–5 decimal places.

Quick Test

Take the quick test to check your understanding.

Practice Exercises

2 exercises to complete

Instructions

Make a small scikit-learn classification pipeline reproducible across two runs.

  • Generate synthetic data with NumPy.
  • Use train_test_split with random_state.
  • Train RandomForestClassifier with random_state.
  • Print a stable hash (e.g., md5) of predictions.

Tips

  • Set np.random.seed at the top.
  • Keep all seeds as the same constant.

Expected Output
The printed prediction hash (and accuracy) matches exactly across two consecutive runs.

Managing Random Seeds — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

