Why this matters
As an NLP Engineer, you will run many experiments: changing tokenizers, learning rates, batch sizes, and model architectures. Stakeholders expect you to explain differences in results. Without reproducibility, you cannot trust comparisons or debug issues.
- Fair A/B comparisons: Re-run the same experiment a week later and get the same numbers.
- Reliable collaboration: Teammates can replicate your training and review your findings.
- Auditability: You can trace any model artifact back to the exact config and seed.
- Faster debugging: If a run goes wrong, you can re-create it to inspect step-by-step.
Concept explained simply
Reproducibility means that if you repeat the same steps with the same data, code, settings, and random seeds, you should get the same results.
- Seeds: A seed is a number that initializes pseudo-random generators. With the same seed, a generator produces the same sequence of "random" choices, so runs become repeatable.
- Configs: A config is the full set of decisions for a run (hyperparameters, data paths, preprocessing flags, and seed). A saved config is the recipe to rebuild a model.
Mental model
Imagine an Experiment Box. Inputs: dataset, code, config (including seed), and environment. The box outputs: logs, metrics, and model. If you feed the box the same inputs again, you should get the same outputs. Seeds lock the randomness; configs lock the decisions.
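To see the seed idea in isolation, here is a tiny demo with Python's standard random module: re-seeding the generator replays exactly the same sequence.

import random

random.seed(42)
first = [random.random() for _ in range(3)]

random.seed(42)          # re-seeding resets the generator to the same state
second = [random.random() for _ in range(3)]

print(first == second)   # True: same seed, same "random" numbers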
Who this is for
- NLP engineers and MLEs running experiments on tokenization, pretraining/fine-tuning, and hyperparameter search.
- Data scientists who need repeatable pipelines for model comparisons and reports.
Prerequisites
- Comfort with Python and basic training loops (PyTorch or similar).
- Understanding of datasets, batching, and shuffling.
- Basic JSON handling in Python.
Learning path
- Seed everything: Python, NumPy, PyTorch, and DataLoader workers.
- Use deterministic settings (where possible) and understand trade-offs.
- Create a single JSON config per run and save it with outputs.
- Compute a stable run_id from the config to track artifacts.
- Verify reproducibility: compare metrics and model parameter hashes across runs.
- Automate: a script that accepts a config and produces identical results across reruns.
Worked examples
Example 1 — One function to seed everything (Python, NumPy, PyTorch)
Use one place to set seeds and deterministic flags.
import os, random, numpy as np, torch
def set_seed(seed: int):
    # Note: PYTHONHASHSEED mainly affects subprocesses; hash randomization
    # for the current process is fixed at interpreter startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN behavior: disable autotuning, force deterministic kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Flag nondeterministic ops; warn_only=True warns instead of raising
    torch.use_deterministic_algorithms(True, warn_only=True)

set_seed(42)
Tip: With warn_only=False, PyTorch raises an error whenever a nondeterministic op runs. Either switch to a deterministic alternative or keep warn_only=True while you track down the source.
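If you train on GPU, the PyTorch documentation for deterministic algorithms also calls for a cuBLAS workspace setting on CUDA 10.2+. A minimal sketch; set the variable before any CUDA work, ideally at process start:

import os

# Assumption: CUDA 10.2+ build of PyTorch; ":16:8" is the documented alternative value.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"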
Example 2 — Deterministic DataLoader with multiple workers
For reproducible shuffling and augmentations, seed worker processes and use a Generator.
from torch.utils.data import DataLoader, Dataset
import numpy as np, torch, random
class ToyDataset(Dataset):
    def __init__(self, n=100):
        self.x = np.arange(n)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        # Example of worker-local randomness:
        noise = np.random.randint(0, 2)  # 0 or 1
        return int(self.x[i] + noise)

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

seed = 123
set_seed(seed)

# Controls shuffling order deterministically
g = torch.Generator()
g.manual_seed(seed)

ds = ToyDataset(20)
dl = DataLoader(
    ds, batch_size=4, shuffle=True, num_workers=2,
    worker_init_fn=seed_worker, generator=g,
)

batches = [batch for batch in dl]
print(batches)
Run this block multiple times: you should get the same batch contents each time.
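You can also check this programmatically by rebuilding the loader from scratch and comparing batches. A minimal sketch, reusing set_seed, ToyDataset, and seed_worker from above:

def collect_batches(seed: int):
    # Rebuild everything from the seed so each call starts from the same state.
    set_seed(seed)
    g = torch.Generator()
    g.manual_seed(seed)
    dl = DataLoader(
        ToyDataset(20), batch_size=4, shuffle=True, num_workers=2,
        worker_init_fn=seed_worker, generator=g,
    )
    return [batch.tolist() for batch in dl]

print(collect_batches(123) == collect_batches(123))  # expected: True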
Example 3 — Config-driven run with stable run_id
Keep run decisions in a JSON config and derive a stable run_id to name outputs.
import json, hashlib, os

config = {
    "seed": 2024,
    "model": "distilbert-base-uncased",
    "training": {"epochs": 3, "batch_size": 32, "lr": 5e-5},
    "data": {"train_path": "data/train.csv", "val_path": "data/val.csv"},
}

# Canonical JSON string for stable hashing
cfg_str = json.dumps(config, sort_keys=True, separators=(",", ":"))
run_id = hashlib.sha1(cfg_str.encode()).hexdigest()[:8]
print("run_id:", run_id)

# Save alongside outputs
os.makedirs("outputs", exist_ok=True)
with open(f"outputs/{run_id}.json", "w") as f:
    f.write(cfg_str)
Any change to hyperparameters or data paths changes run_id. You always know which config produced which artifact.
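To rerun from a saved config, load it back and re-apply the seed before building anything else. A minimal sketch, assuming the set_seed function from Example 1 and the outputs/ layout above; train is a hypothetical entry point for your own code:

import json

def load_config(path: str) -> dict:
    # Reload the exact recipe for a run and re-apply its seed first.
    with open(path) as f:
        cfg = json.load(f)
    set_seed(cfg["seed"])
    return cfg

# cfg = load_config(f"outputs/{run_id}.json")
# model, metrics = train(cfg)  # hypothetical config-driven training entry point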
Example 4 — Verifying runs by hashing model parameters
import hashlib, torch
def state_hash(model: torch.nn.Module):
    # Hash all parameters and buffers in state_dict order.
    h = hashlib.sha256()
    for p in model.state_dict().values():
        h.update(p.detach().cpu().numpy().tobytes())
    return h.hexdigest()[:16]
# After training two times with the same config and seed:
# h1 = state_hash(model_run1)
# h2 = state_hash(model_run2)
# print("identical:", h1 == h2)
If hashes match and metrics are identical, your run is reproducible under the current environment.
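As a concrete end-to-end check, here is a minimal sketch that trains a tiny linear model twice from the same seed and compares hashes. It assumes set_seed from Example 1 and state_hash above:

import torch
import torch.nn as nn
import torch.nn.functional as F

def train_once(seed: int) -> nn.Module:
    # Reset all RNGs first so data, initialization, and updates all repeat.
    set_seed(seed)
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    model = nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(20):
        opt.zero_grad()
        F.mse_loss(model(x), y).backward()
        opt.step()
    return model

print("identical:", state_hash(train_once(7)) == state_hash(train_once(7)))  # expected: True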
Practical setup checklist
- [ ] Single set_seed function called at program start.
- [ ] DataLoader uses generator and worker_init_fn.
- [ ] One JSON config per run saved with outputs.
- [ ] Stable run_id derived from the config.
- [ ] Deterministic flags set; aware of speed trade-offs.
- [ ] Model parameter hash and metrics compared across reruns.
- [ ] Library versions pinned (requirements) and recorded with the run.
Exercises
Do these locally and verify determinism.
- Seed and verify: Build a tiny training loop and confirm identical losses and parameter hash across 3 reruns.
- Config + run_id: Create a JSON config, compute run_id, and show that changing a single hyperparameter changes run_id and results.
Common mistakes and self-check
- Forgetting DataLoader worker seeding: You set torch.manual_seed but use num_workers>0 without worker_init_fn or generator.
- Not saving the exact config: You tweak arguments interactively and forget which set produced the model.
- Changing library versions: Different tokenizers or framework versions can alter results. Record versions with each run.
- Nondeterministic ops: Some GPU ops are nondeterministic. Enable deterministic algorithms to catch them or accept speed trade-offs.
- Partial seeding: You seed Python but not NumPy or CUDA, leading to drift.
Self-check steps:
- Rerun the same config 3 times; verify identical metrics and model parameter hashes.
- Diff the saved configs; there should be zero differences.
- Record and compare package versions; keep them constant when testing determinism.
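One simple way to record versions alongside a run is a small environment manifest; a minimal sketch, assuming the outputs/ folder and run_id from Example 3 (the file name and field choices are just one option):

import sys, platform, json
import numpy as np
import torch

manifest = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "numpy": np.__version__,
    "torch": torch.__version__,
    "cuda": torch.version.cuda,  # None on CPU-only builds
}
with open(f"outputs/{run_id}_env.json", "w") as f:
    json.dump(manifest, f, indent=2)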
Mini challenge
Take a small text classification task. Create a config file that includes seed, tokenizer settings, batch size, learning rate, and epochs. Run training twice and report: (1) run_id, (2) validation accuracy sequence, (3) model state hash. If anything differs, identify the cause and fix it.
Practical projects
- Reproducible fine-tune: Train a small transformer twice with the same config and confirm identical metrics and state hash.
- Config-driven grid: Sweep 3 seeds × 2 learning rates using config overrides and save results under run_id-based folders (see the sketch after this list).
- Reproduction report: Attempt to reproduce your own baseline a week later on a clean machine using the saved config and environment notes.
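A minimal sketch of such a sweep, reusing the canonical-JSON run_id idea from Example 3; train(cfg, out_dir) is a hypothetical entry point for your own training code:

import copy, hashlib, json, os

base = {
    "seed": 0,
    "model": "distilbert-base-uncased",
    "training": {"epochs": 3, "batch_size": 32, "lr": 5e-5},
}

def run_id_of(cfg: dict) -> str:
    # Canonical JSON -> stable hash, as in Example 3.
    s = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(s.encode()).hexdigest()[:8]

for seed in (1, 2, 3):
    for lr in (5e-5, 3e-5):
        cfg = copy.deepcopy(base)
        cfg["seed"] = seed
        cfg["training"]["lr"] = lr
        out_dir = os.path.join("outputs", run_id_of(cfg))
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "config.json"), "w") as f:
            json.dump(cfg, f, sort_keys=True, indent=2)
        # train(cfg, out_dir)  # hypothetical config-driven training entry point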
Next steps
- Automate config loading and result logging in a single entry script.
- Add a simple environment manifest (Python version, key package versions) to every run output.
- Introduce seed sweeps (3–5 seeds) for robust model comparison where exact determinism is not required but stability is.