Why this matters
As a Computer Vision Engineer, you will train models, iterate on data and augmentations, and deploy pipelines. If your results cannot be reproduced, you cannot trust improvements, debug regressions, or collaborate reliably. Reproducibility lets teammates rerun your experiment and get the same metrics, artifacts, and decisions.
- Hiring/peer review: Share a run ID and let others match your results.
- Production: Roll back to a known-good model and dataset snapshot.
- Research: Prove that a change (augmentation, loss) truly helps.
Who this is for
- Beginners who have trained a few vision models and want consistent results.
- Engineers moving from notebooks to collaborative, traceable work.
- Researchers who need deterministic baselines and auditable experiments.
Prerequisites
- Basic Python and Git.
- Familiarity with PyTorch or a similar deep learning framework.
- Comfort with the command line.
Concept explained simply
Reproducibility means someone else can run your code later and get the same result. It requires freezing four things:
- Code: exact commit and configuration.
- Data: the same files, content, and order.
- Environment: pinned packages, OS/GPU settings that affect math.
- Randomness: fixed seeds and deterministic algorithms.
Mental model: The 4-box lock
Imagine four lockboxes labeled Code, Data, Environment, and Randomness. Your experiment is secure only when all four are locked. If any box is open, results can drift.
Core components of reproducible vision workflows
1) Data immutability and versioning
- Create a dataset snapshot folder (e.g., data/cats-dogs-v1/) that never changes.
- Store a manifest file (paths, sizes, hashes) to prove the snapshot content.
- Never auto-download data at train time without pinning the exact version and verifying its size/hash.
2) Environment pinning
- Freeze package versions (e.g., requirements.txt with exact versions).
- Record CUDA/cuDNN versions from your environment.
- Optional but helpful: containerize for consistent OS and drivers.
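A minimal sketch of capturing environment details next to a run, assuming PyTorch is installed; the env.json and requirements.lock.txt file names are just illustrative choices.
import json, platform, subprocess, sys
import torch

# Record interpreter, OS, and GPU/CUDA details that can affect numerical results.
env = {
    "python": sys.version,
    "platform": platform.platform(),
    "torch": torch.__version__,
    "cuda": torch.version.cuda,  # None on CPU-only builds
    "cudnn": torch.backends.cudnn.version() if torch.cuda.is_available() else None,
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
}

# Freeze exact package versions alongside the run metadata.
freeze = subprocess.run([sys.executable, "-m", "pip", "freeze"],
                        capture_output=True, text=True).stdout

with open("env.json", "w") as f:
    json.dump(env, f, indent=2)
with open("requirements.lock.txt", "w") as f:
    f.write(freeze)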
3) Randomness control
- Set seeds for Python, NumPy, PyTorch.
- Enable deterministic algorithms in your framework when needed.
- Seed data loaders and augmentation RNGs; avoid time-based randomness.
4) Determinism vs performance
Deterministic settings can be slower. Use deterministic mode for baselines, debugging, and comparisons. You can later relax some settings for speed, but document the change.
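One way to make that trade-off explicit is a single switch that is always logged. A sketch, assuming a recent PyTorch (warn_only requires 1.11+); the configure_determinism name and the three mode labels are ours.
import torch

def configure_determinism(mode: str = "strict") -> None:
    # "strict": error on non-deterministic ops (baselines, debugging, comparisons).
    # "warn":   request deterministic algorithms but only warn when unavailable.
    # "fast":   allow cuDNN autotuning and non-deterministic kernels for speed.
    if mode == "fast":
        torch.backends.cudnn.benchmark = True
        torch.backends.cudnn.deterministic = False
        torch.use_deterministic_algorithms(False)
    else:
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
        torch.use_deterministic_algorithms(True, warn_only=(mode == "warn"))
    print(f"determinism mode: {mode}")  # log the choice so any relaxation is documented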
5) Config-driven pipelines
- Use a single config file (YAML/JSON) that declares data paths, transforms, model/loss, training hyperparameters, and seeds.
- Log the exact config with each run; never rely on hidden defaults.
6) Experiment tracking and metadata
- For each run, save: commit hash, config, dataset manifest ID, environment lock file, metrics, and artifacts (model weights).
- Give each run a unique, human-readable name (e.g., 2026-01-05_resnet18_augA_v1).
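A minimal sketch of writing that metadata into a self-contained run folder; the runs/ layout, start_run name, and field names are illustrative, not a required schema.
import json, subprocess, time
from pathlib import Path

def start_run(name: str, config: dict, manifest_id: str) -> Path:
    # One folder per run, named so it sorts chronologically and reads at a glance.
    run_dir = Path("runs") / f"{time.strftime('%Y-%m-%d')}_{name}"
    run_dir.mkdir(parents=True, exist_ok=True)

    # Capture the exact code version; fails loudly if not run inside a git repo.
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True, check=True).stdout.strip()

    metadata = {
        "commit": commit,
        "config": config,
        "dataset_manifest": manifest_id,
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with open(run_dir / "metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return run_dir

# Usage: run_dir = start_run("resnet18_augA_v1", cfg, manifest_id="cats-dogs-v1")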
7) Checkpoints and artifacts
- Save model weights, training/eval logs, confusion matrices, and sample predictions.
- Keep the "best" checkpoint and the last checkpoint; record the selection metric.
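A sketch of a best/last checkpointing helper in PyTorch; the file names and the higher-is-better assumption for the selection metric (e.g., accuracy) are ours.
import torch
from pathlib import Path

def save_checkpoints(model, optimizer, epoch, metric, best_metric, run_dir: Path) -> float:
    # Always overwrite the rolling "last" checkpoint so training can resume.
    state = {
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "metric": metric,  # record the selection metric together with the weights
    }
    torch.save(state, run_dir / "last.pt")

    # Keep a separate "best" checkpoint; here a higher metric is better.
    if metric > best_metric:
        torch.save(state, run_dir / "best.pt")
        return metric
    return best_metric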
8) CI smoke tests and invariants
- Run a 1-epoch or 50-step smoke test on a tiny data subset for each commit.
- Track invariants: training loss decreases at least slightly, and metrics stay within expected ranges.
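A minimal smoke-test sketch a CI job could run on each commit (e.g., via pytest); the tiny random dataset and the 5% threshold stand in for your real subset and invariants.
import torch
import torch.nn as nn

def test_training_smoke():
    torch.manual_seed(0)
    # Tiny stand-in dataset: 100 fake images, 10 classes.
    X, y = torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,))
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    first = last = None
    for step in range(50):  # 50 steps, full batch for simplicity
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        first = loss.item() if first is None else first
        last = loss.item()

    # Invariant: loss decreased by at least 5% over the smoke run.
    assert last < 0.95 * first, f"loss did not decrease enough: {first:.4f} -> {last:.4f}"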
Worked examples
Example 1: Deterministic PyTorch training loop
Minimal setup that yields the same loss curve across runs on the same machine.
import os, random, torch, numpy as np

def set_seed(seed: int = 42):
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Required so cuBLAS can satisfy torch.use_deterministic_algorithms(True) on GPU
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed = 42
set_seed(seed)

# Example: deterministic DataLoader
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(512, 3, 32, 32)
y = torch.randint(0, 10, (512,))

def seed_worker(worker_id):
    # Derive each worker's seed from the base seed set via the generator below
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(seed)

ds = TensorDataset(X, y)
loader = DataLoader(ds, batch_size=64, shuffle=True, num_workers=0,
                    worker_init_fn=seed_worker, generator=g)

# Tiny model
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

losses = []
for epoch in range(2):
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        losses.append(float(loss.detach().cpu()))

print(round(sum(losses[:5]), 6))  # Use this number to compare runs
Re-run this script twice; the printed sum should match exactly if everything is deterministic.
Example 2: Dataset manifest with hashes
Create and verify a manifest so your code knows exactly which files it trained on.
import hashlib, json, os
from pathlib import Path

root = Path("data/cats-dogs-v1")
files = sorted(root.rglob("*.jpg"))

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = [{
    "relpath": str(p.relative_to(root)),
    "bytes": os.path.getsize(p),
    "sha256": sha256_of(p),
} for p in files]

with open(root / "manifest.json", "w") as f:
    json.dump({"count": len(manifest), "files": manifest}, f, indent=2)
print(f"Wrote manifest for {len(manifest)} files")
Verification step before training:
import json, hashlib
from pathlib import Path

root = Path("data/cats-dogs-v1")
with open(root / "manifest.json") as f:
    m = json.load(f)

for rec in m["files"]:
    p = root / rec["relpath"]
    assert p.is_file(), f"Missing: {p}"
    assert p.stat().st_size == rec["bytes"], f"Size mismatch: {p}"
    # Optional: recompute the sha256 for full integrity, using sha256_of as shown above
print("Dataset verified.")
Example 3: Config-driven augmentation pipeline
Declare transforms and hyperparameters in a single config file.
# config.yaml
seed: 1337
dataset: data/cats-dogs-v1
train:
  batch_size: 64
  epochs: 10
  lr: 0.001
augment:
  resize: [224, 224]
  hflip_prob: 0.5
  color_jitter: {brightness: 0.1, contrast: 0.1, saturation: 0.1, hue: 0.05}
import yaml, random, numpy as np, torch
from torchvision import transforms

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Seed everything declared in the config
seed = cfg["seed"]
random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)

a_cfg = cfg["augment"]
train_tf = transforms.Compose([
    transforms.Resize(a_cfg["resize"]),
    transforms.RandomHorizontalFlip(p=a_cfg["hflip_prob"]),
    transforms.ColorJitter(**a_cfg["color_jitter"]),
    transforms.ToTensor()
])
print("Transforms locked from config. Diff the YAML for changes.")
Checklist (self-audit)
- I can re-run the same experiment twice and get identical metrics on the same machine.
- My data snapshot is immutable and verified by a manifest/hash.
- My environment is pinned (exact versions recorded).
- All randomness (training, data loading, augmentations) is seeded.
- Every run logs: commit, config, dataset ID, environment lock, metrics, and artifacts.
- I have a tiny smoke test that runs quickly and enforces invariants.
Exercises
- Exercise 1 — Deterministic training mini-run (ID: ex1)
  Make a 2-epoch training run fully deterministic (same loss numbers on two runs). Log the seed, config, and sum of the first 5 losses. Compare run A vs B and confirm identical values.
- Exercise 2 — Dataset manifest and verify gate (ID: ex2)
  Create a manifest.json with relpath, size, and sha256 for all images in a snapshot. Add a pre-train verification step that fails if any file is missing or changed.
Common mistakes and self-check
- Forgetting to seed data loader workers. Self-check: set num_workers=0 and see if results stabilize; then add worker_init_fn and generator.
- Leaving cudnn.benchmark=True. Self-check: print the flag and ensure it is False when you need determinism.
- Augmentations with hidden RNG. Self-check: pass the same seed and verify that transform outputs on the same image are identical across runs (see the sketch after this list).
- Auto-updating datasets. Self-check: compare current files against a stored manifest.
- Unpinned dependencies. Self-check: rebuild the environment from your lock file on a clean machine or virtual environment.
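A sketch of the augmentation self-check mentioned above, using torchvision transforms on a tensor image (supported in recent torchvision versions); seeding torch before each call is what makes the two outputs comparable.
import torch
from torchvision import transforms

tf = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
])

img = torch.rand(3, 224, 224)  # stand-in for a real image tensor in [0, 1]

torch.manual_seed(123)
out_a = tf(img)
torch.manual_seed(123)
out_b = tf(img)

# If the transforms draw from an uncontrolled RNG, this assertion will fail.
assert torch.equal(out_a, out_b), "augmentation RNG is not controlled by the seed"
print("Augmentations are reproducible under a fixed seed.")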
Practical projects
- Reproducible CIFAR-10 baseline: deterministic training, config file, manifest of class indices, saved artifacts and logs.
- Augmentation ablation suite: 3 configs (no aug, light, heavy) with identical seeds and a comparison report.
- Tiny CI smoke test: run 50 steps on 100 images per commit, asserting that the loss decreases by at least 5%.
Mini challenge
Take any previous project of yours and convert it into a fully reproducible run: lock data snapshot, pin environment, seed everything, and export a single run folder containing config, logs, and artifacts. Ask a friend to run it and match your metrics.
Learning path
- Start: Deterministic basics (seeds, cudnn flags, config files).
- Next: Data versioning with manifests and immutability.
- Then: Environment pinning and optional containerization.
- Finally: Experiment tracking, artifact management, and CI smoke tests.
Next steps
- Turn your current notebook into a script that reads a config and writes a run folder.
- Add an automated data verify step and a fast smoke test.
- Take the quick test below to confirm understanding.