Why this matters
As an NLP Engineer, you will run many experiments: changing tokenizers, learning rates, batch sizes, and model architectures. Stakeholders expect you to explain differences in results. Without reproducibility, you cannot trust comparisons or debug issues.
- Fair A/B comparisons: Re-run the same experiment a week later and get the same numbers.
- Reliable collaboration: Teammates can replicate your training and review your findings.
- Auditability: You can trace any model artifact back to the exact config and seed.
- Faster debugging: If a run goes wrong, you can re-create it to inspect step-by-step.
Concept explained simply
Reproducibility means that if you repeat the same steps with the same data, code, settings, and random seeds, you should get the same results.
- Seeds: A seed is a number that initializes pseudo-random generators. With the same seed, a generator produces the same sequence of "random" choices, so runs become repeatable.
- Configs: A config is the full set of decisions for a run (hyperparameters, data paths, preprocessing flags, and seed). A saved config is the recipe to rebuild a model.
Mental model
Imagine an Experiment Box. Inputs: dataset, code, config (including seed), and environment. The box outputs: logs, metrics, and model. If you feed the box the same inputs again, you should get the same outputs. Seeds lock the randomness; configs lock the decisions.
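To see the seed idea in isolation, here is a tiny demo with Python's standard random module: re-seeding the generator replays exactly the same sequence.

import random

random.seed(42)
first = [random.random() for _ in range(3)]

random.seed(42)          # re-seeding resets the generator to the same state
second = [random.random() for _ in range(3)]

print(first == second)   # True: same seed, same "random" numbers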
Who this is for
- NLP engineers and MLEs running experiments on tokenization, pretraining/fine-tuning, and hyperparameter search.
- Data scientists who need repeatable pipelines for model comparisons and reports.
Prerequisites
- Comfort with Python and basic training loops (PyTorch or similar).
- Understanding of datasets, batching, and shuffling.
- Basic JSON handling in Python.
Learning path
- Seed everything: Python, NumPy, PyTorch, and DataLoader workers.
- Use deterministic settings (where possible) and understand trade-offs.
- Create a single JSON config per run and save it with outputs.
- Compute a stable run_id from the config to track artifacts.
- Verify reproducibility: compare metrics and model parameter hashes across runs.
- Automate: a script that accepts a config and produces identical results across reruns.
Worked examples
Example 1 — One function to seed everything (Python, NumPy, PyTorch)
Use one place to set seeds and deterministic flags.
import os, random, numpy as np, torch
def set_seed(seed: int):
    # Note: PYTHONHASHSEED mainly affects subprocesses; hash randomization
    # for the current process is fixed at interpreter startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN behavior: disable autotuning, force deterministic kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Flag nondeterministic ops; warn_only=True warns instead of raising
    torch.use_deterministic_algorithms(True, warn_only=True)

set_seed(42)
Tip: With warn_only=False, PyTorch raises an error whenever a nondeterministic op runs. Either switch to a deterministic alternative or keep warn_only=True while you track down the source.
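If you train on GPU, the PyTorch documentation for deterministic algorithms also calls for a cuBLAS workspace setting on CUDA 10.2+. A minimal sketch; set the variable before any CUDA work, ideally at process start:

import os

# Assumption: CUDA 10.2+ build of PyTorch; ":16:8" is the documented alternative value.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"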
Example 2 — Deterministic DataLoader with multiple workers
For reproducible shuffling and augmentations, seed worker processes and use a Generator.
from torch.utils.data import DataLoader, Dataset
import numpy as np, torch, random
class ToyDataset(Dataset):
    def __init__(self, n=100):
        self.x = np.arange(n)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        # Example of worker-local randomness:
        noise = np.random.randint(0, 2)  # 0 or 1
        return int(self.x[i] + noise)

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

seed = 123
set_seed(seed)

# Controls shuffling order deterministically
g = torch.Generator()
g.manual_seed(seed)

ds = ToyDataset(20)
dl = DataLoader(
    ds, batch_size=4, shuffle=True, num_workers=2,
    worker_init_fn=seed_worker, generator=g,
)

batches = [batch for batch in dl]
print(batches)
Run this block multiple times: you should get the same batch contents each time.
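You can also check this programmatically by rebuilding the loader from scratch and comparing batches. A minimal sketch, reusing set_seed, ToyDataset, and seed_worker from above:

def collect_batches(seed: int):
    # Rebuild everything from the seed so each call starts from the same state.
    set_seed(seed)
    g = torch.Generator()
    g.manual_seed(seed)
    dl = DataLoader(
        ToyDataset(20), batch_size=4, shuffle=True, num_workers=2,
        worker_init_fn=seed_worker, generator=g,
    )
    return [batch.tolist() for batch in dl]

print(collect_batches(123) == collect_batches(123))  # expected: True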
Example 3 — Config-driven run with stable run_id
Keep run decisions in a JSON config and derive a stable run_id to name outputs.
import json, hashlib, os

config = {
    "seed": 2024,
    "model": "distilbert-base-uncased",
    "training": {"epochs": 3, "batch_size": 32, "lr": 5e-5},
    "data": {"train_path": "data/train.csv", "val_path": "data/val.csv"},
}

# Canonical JSON string for stable hashing
cfg_str = json.dumps(config, sort_keys=True, separators=(",", ":"))
run_id = hashlib.sha1(cfg_str.encode()).hexdigest()[:8]
print("run_id:", run_id)

# Save alongside outputs
os.makedirs("outputs", exist_ok=True)
with open(f"outputs/{run_id}.json", "w") as f:
    f.write(cfg_str)
Any change to hyperparameters or data paths changes run_id. You always know which config produced which artifact.
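To rerun from a saved config, load it back and re-apply the seed before building anything else. A minimal sketch, assuming the set_seed function from Example 1 and the outputs/ layout above; train is a hypothetical entry point for your own code:

import json

def load_config(path: str) -> dict:
    # Reload the exact recipe for a run and re-apply its seed first.
    with open(path) as f:
        cfg = json.load(f)
    set_seed(cfg["seed"])
    return cfg

# cfg = load_config(f"outputs/{run_id}.json")
# model, metrics = train(cfg)  # hypothetical config-driven training entry point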
Example 4 — Verifying runs by hashing model parameters
import hashlib, torch
def state_hash(model: torch.nn.Module):
    # Hash all parameters and buffers in state_dict order.
    h = hashlib.sha256()
    for p in model.state_dict().values():
        h.update(p.detach().cpu().numpy().tobytes())
    return h.hexdigest()[:16]
# After training two times with the same config and seed:
# h1 = state_hash(model_run1)
# h2 = state_hash(model_run2)
# print("identical:", h1 == h2)
If hashes match and metrics are identical, your run is reproducible under the current environment.
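As a concrete end-to-end check, here is a minimal sketch that trains a tiny linear model twice from the same seed and compares hashes. It assumes set_seed from Example 1 and state_hash above:

import torch
import torch.nn as nn
import torch.nn.functional as F

def train_once(seed: int) -> nn.Module:
    # Reset all RNGs first so data, initialization, and updates all repeat.
    set_seed(seed)
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    model = nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(20):
        opt.zero_grad()
        F.mse_loss(model(x), y).backward()
        opt.step()
    return model

print("identical:", state_hash(train_once(7)) == state_hash(train_once(7)))  # expected: True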
Practical setup checklist
- [ ] Single set_seed function called at program start.
- [ ] DataLoader uses generator and worker_init_fn.
- [ ] One JSON config per run saved with outputs.
- [ ] Stable run_id derived from the config.
- [ ] Deterministic flags set; aware of speed trade-offs.
- [ ] Model parameter hash and metrics compared across reruns.
- [ ] Library versions pinned (requirements) and recorded with the run.
Exercises
Do these locally and verify determinism.
- Seed and verify: Build a tiny training loop and confirm identical losses and parameter hash across 3 reruns.
- Config + run_id: Create a JSON config, compute run_id, and show that changing a single hyperparameter changes run_id and results.
Common mistakes and self-check
- Forgetting DataLoader worker seeding: You set torch.manual_seed but use num_workers>0 without worker_init_fn or generator.
- Not saving the exact config: You tweak arguments interactively and forget which set produced the model.
- Changing library versions: Different tokenizers or framework versions can alter results. Record versions with each run.
- Nondeterministic ops: Some GPU ops are nondeterministic. Enable deterministic algorithms to catch them or accept speed trade-offs.
- Partial seeding: You seed Python but not NumPy or CUDA, leading to drift.
Self-check steps:
- Rerun the same config 3 times; verify identical metrics and model parameter hashes.
- Diff the saved configs; there should be zero differences.
- Record and compare package versions; keep them constant when testing determinism.
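One simple way to record versions alongside a run is a small environment manifest; a minimal sketch, assuming the outputs/ folder and run_id from Example 3 (the file name and field choices are just one option):

import sys, platform, json
import numpy as np
import torch

manifest = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "numpy": np.__version__,
    "torch": torch.__version__,
    "cuda": torch.version.cuda,  # None on CPU-only builds
}
with open(f"outputs/{run_id}_env.json", "w") as f:
    json.dump(manifest, f, indent=2)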
Mini challenge
Take a small text classification task. Create a config file that includes seed, tokenizer settings, batch size, learning rate, and epochs. Run training twice and report: (1) run_id, (2) validation accuracy sequence, (3) model state hash. If anything differs, identify the cause and fix it.
Practical projects
- Reproducible fine-tune: Train a small transformer twice with the same config and confirm identical metrics and state hash.
- Config-driven grid: Sweep 3 seeds × 2 learning rates using config overrides and save results under run_id-based folders (see the sketch after this list).
- Reproduction report: Attempt to reproduce your own baseline a week later on a clean machine using the saved config and environment notes.
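A minimal sketch of such a sweep, reusing the canonical-JSON run_id idea from Example 3; train(cfg, out_dir) is a hypothetical entry point for your own training code:

import copy, hashlib, json, os

base = {
    "seed": 0,
    "model": "distilbert-base-uncased",
    "training": {"epochs": 3, "batch_size": 32, "lr": 5e-5},
}

def run_id_of(cfg: dict) -> str:
    # Canonical JSON -> stable hash, as in Example 3.
    s = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(s.encode()).hexdigest()[:8]

for seed in (1, 2, 3):
    for lr in (5e-5, 3e-5):
        cfg = copy.deepcopy(base)
        cfg["seed"] = seed
        cfg["training"]["lr"] = lr
        out_dir = os.path.join("outputs", run_id_of(cfg))
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "config.json"), "w") as f:
            json.dump(cfg, f, sort_keys=True, indent=2)
        # train(cfg, out_dir)  # hypothetical config-driven training entry point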
Next steps
- Automate config loading and result logging in a single entry script.
- Add a simple environment manifest (Python version, key package versions) to every run output.
- Introduce seed sweeps (3–5 seeds) for robust model comparison where exact determinism is not required but stability is.