
Experiment Tracking And Reproducibility

Learn Experiment Tracking And Reproducibility for free with explanations, exercises, and a quick test (for Computer Vision Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

As a Computer Vision Engineer, you will run many training jobs: tuning augmentations, trying new backbones, swapping loss functions, or updating datasets. Without disciplined experiment tracking, it is easy to lose the run that actually worked, misreport results, or fail to reproduce a model for deployment or audits.

  • Product reality: You must explain why model A beat model B and be able to re-train it later.
  • Teamwork: Colleagues should be able to re-run your best configuration.
  • Governance: Some domains require traceable experiments with data and code versions.
Real tasks you will face
  • Compare a new augmentation policy vs. baseline and decide if it ships.
  • Reproduce a production model after updating CUDA drivers.
  • Roll back to a previous dataset version following a label policy change.
  • Audit a result months later: which data split, commit, and seed produced it?

Concept explained simply

An experiment is a single training run with a defined configuration. Tracking means recording everything that can change results so you can compare and reproduce.

  • Parameters: hyperparameters, architecture choices, augmentations.
  • Data: dataset version, split definitions, preprocessing steps.
  • Code: commit hash, scripts used.
  • Environment: OS, Python, CUDA, library versions, hardware.
  • Outputs: metrics, artifacts (models, plots, predictions).

Mental model

Think of a lab notebook for ML: each run has a title, a recipe (config), ingredients (data/code/env), the oven (hardware), and the results (metrics/artifacts). If someone follows the notebook, they should get the same or statistically equivalent result.

Levels of reproducibility

  • Exact: bitwise-identical results. Hard with GPUs due to nondeterministic kernels.
  • Statistical: results within small variance across seeds. Usually the practical goal.
  • Functional: performance good enough for the product even if not identical.
Where randomness sneaks in
  • Data shuffling and augmentation randomness.
  • GPU algorithms that are nondeterministic for speed.
  • Parallel data loading order.

Mitigations: set seeds, enable deterministic flags where available, seed data loader workers and keep their count fixed, and run multiple seeds so you can report mean±std (see the sketch below).
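
A minimal PyTorch sketch of the data-loader part (per-worker seeding plus a fixed shuffle generator); the dataset here is a random stand-in, and the exact recipe may differ for other frameworks:

import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Derive each worker's NumPy/random seed from the base torch seed
    # so augmentations applied inside workers are repeatable.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)  # pins the shuffle order across runs

dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))  # stand-in data
loader = DataLoader(dataset, batch_size=16, shuffle=True,
                    num_workers=2, worker_init_fn=seed_worker, generator=g)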

Minimal tracking kit (works anywhere)

Track these for every run:

  • Run ID and human-friendly name.
  • Config: YAML/JSON with all parameters.
  • Code: git commit SHA and changed files (if any).
  • Environment: OS, Python, CUDA/cuDNN, key libs with versions.
  • Data: dataset version ID, data preprocessing hash, and exact split indices or RNG seed used to create splits.
  • Randomness: global seed(s), deterministic flags used.
  • Hardware: GPU model(s), CPU, RAM.
  • Metrics: train/val curves, best epoch, final metrics with seed-aggregated stats if applicable.
  • Artifacts: model weights, prediction samples, confusion matrix/PR curves, training logs.
  • Timing: start/end timestamps, training duration.
Example local run folder layout
runs/
  2026-01-05_1432_resnet50_augA/
    config.yaml
    code.txt            # git SHA and status
    env.txt             # python, torch, cuda, drivers
    data.json           # dataset version, split, hashes
    metrics.json        # best_acc, mAP, loss curves
    notes.txt           # short intent and observations
    artifacts/
      model.pt
      cm.png
      val_predictions.csv
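
One way to fill the "hashes" field in data.json (an illustration, not the only option) is to hash a sorted manifest of file paths and sizes, which changes whenever files are added, removed, renamed, or resized:

import hashlib
import os

def manifest_hash(data_dir):
    # Hash sorted relative paths plus file sizes: cheap to compute and
    # sensitive to added, removed, renamed, or resized files.
    h = hashlib.sha256()
    for root, _, files in sorted(os.walk(data_dir)):
        for name in sorted(files):
            path = os.path.join(root, name)
            rel = os.path.relpath(path, data_dir)
            h.update(f"{rel}:{os.path.getsize(path)}".encode())
    return h.hexdigest()

# data.json could then include e.g. {"file_list_sha256": manifest_hash("data/defects/v1.2")}
# (the dataset path is a placeholder)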

Worked examples

Example 1 — Hyperparameter sweep

Task: Tune learning rate for ResNet-50 on a defect classification dataset.

  • Runs: lr in {1e-4, 3e-4, 1e-3}, seed=42, same data split.
  • Outcome: 1e-3 overfits early; 1e-4 underfits; 3e-4 gives best val accuracy.
  • Tracking win: Because config, seed, and data version are fixed, differences are attributable to lr.
Interpretation

Compare validation curves and best epoch. Label the winning run as baseline_v2 and save its weights and metrics.

Example 2 — Data version change

Task: Add 10% new images with revised labels.

  • Dataset v1.2 → v1.3, same code/params.
  • Result: mAP drops from 0.54 to 0.51.
  • Tracking win: data.json shows label policy changed; confusion matrix highlights specific class regressions. Decision: relabel a subset and re-run.

Example 3 — Randomness and multiple seeds

Task: Report robust performance for paper-level quality.

  • Runs: 5 seeds with same config.
  • Result: mIoU mean 0.702 ± 0.006.
  • Tracking win: statistical reproducibility documented; any single run is expected near this range (see the aggregation sketch below).
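
A minimal aggregation sketch for this pattern, assuming each seeded run wrote the metrics.json layout shown earlier and that run folders carry a _seed<N> suffix (swap val_acc for mIoU or whatever metric you track):

import glob
import json
import statistics

accs = []
for path in sorted(glob.glob("runs/*_seed*/metrics.json")):
    with open(path) as f:
        accs.append(json.load(f)["best"]["val_acc"])

summary = {"runs": len(accs), "accuracies": accs,
           "mean": statistics.mean(accs),
           "std": statistics.stdev(accs)}  # needs at least two runs
with open("runs/summary.json", "w") as f:
    json.dump(summary, f, indent=2)
print(summary)
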
GPU determinism note

Even with seeds, some kernels are nondeterministic. Enable deterministic options when available and avoid algorithms that break determinism if exact reproducibility is required.
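
In PyTorch, the strictest setting looks like the sketch below; it raises an error whenever an op lacks a deterministic implementation, so expect to adapt some layers or accept a slowdown:

import os
import torch

# Needed for deterministic cuBLAS matmuls on CUDA 10.2+; set before any CUDA work runs.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.use_deterministic_algorithms(True)  # error out on nondeterministic ops
torch.backends.cudnn.benchmark = False    # avoid autotuned kernel selection that can vary between runs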

Practical setup — step by step

  1. Name your run: concise, intent-focused (e.g., resnet50_augA_lr3e-4_seed42).
  2. Create run folder: runs/YYYYMMDD-HHMMSS_runname.
  3. Save config: config.yaml capturing ALL params (model, optimizer, augmentations, scheduler, batch size, epochs).
  4. Record code: write git SHA and dirty status to code.txt.
  5. Freeze environment: write the output of python -m pip freeze (or conda env export) to env.txt, plus CUDA/driver info.
  6. Log data: dataset version ID, split seed, class counts, and checksum of file lists to data.json.
  7. Seed everything: set seeds for Python, NumPy, and framework; consider deterministic flags and dataloader settings.
  8. Stream metrics: append JSON lines per epoch with train/val metrics; save best model and final artifacts.
  9. Write notes: brief purpose and observations in notes.txt.
Minimal Python pseudocode
import os, json, random, numpy as np, torch
from datetime import datetime

run_name = "resnet50_augA_lr3e-4_seed42"
run_dir = os.path.join("runs", datetime.now().strftime("%Y-%m-%d_%H%M%S_") + run_name)
os.makedirs(os.path.join(run_dir, "artifacts"), exist_ok=True)

# 1) Save config
config = {"model":"resnet50","lr":3e-4,"batch_size":64,"epochs":50,
          "aug":"policyA","optimizer":"adamw","seed":42}
open(os.path.join(run_dir, "config.yaml"), "w").write("\n".join([f"{k}: {v}" for k,v in config.items()]))

# 2) Seed
seed = config["seed"]
random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# Determinism options (may slow training)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# 3) Data/meta
data_meta = {"dataset":"defects","version":"v1.2","split_seed":123,
             "train_count":12000, "val_count":1500}
json.dump(data_meta, open(os.path.join(run_dir, "data.json"), "w"))

# 4) Training loop (pseudo)
metrics = []
best = {"val_acc":0, "epoch":-1}
for epoch in range(config["epochs"]):
    train_loss, train_acc = 0.45, 0.90  # pretend
    val_loss, val_acc = 0.40, 0.91      # pretend
    rec = {"epoch":epoch,"train_loss":train_loss,"train_acc":train_acc,
           "val_loss":val_loss,"val_acc":val_acc}
    metrics.append(rec)
    if val_acc > best["val_acc"]:
        best = {"val_acc":val_acc, "epoch":epoch}
        torch.save({"state_dict":"..."}, os.path.join(run_dir, "artifacts", "model.pt"))

json.dump({"best":best, "history":metrics}, open(os.path.join(run_dir, "metrics.json"), "w"))
open(os.path.join(run_dir, "notes.txt"), "w").write("Tuned lr with aug policy A. Best at epoch %d" % best["epoch"]) 

Exercises

Do these to build muscle memory, then take the Quick Test at the end to check yourself.

Exercise 1 — Create a reproducible run template

Goal: Build a run folder for a small classification experiment with all required files stubbed.

  • Create runs/YYYYMMDD-HHMMSS_smallcnn_baseline.
  • Add config.yaml, code.txt, env.txt, data.json, metrics.json (empty array), notes.txt, and artifacts/.
  • Write 2–3 lines in notes.txt on experiment intent.
Checklist
  • Run name is descriptive.
  • All files exist, even if placeholders.
  • Config includes model, optimizer, lr, batch size, epochs, augmentations, seed.
  • Data file has dataset name, version, split definition.
  • Environment lists Python and core libs.

Exercise 2 — Reproduce a result with seeds

Goal: Train the same small model 3 times with seeds 0, 1, 2 and record mean±std of val accuracy.

  • Keep config identical except seed; distinct run folders per seed.
  • Aggregate metrics into a single summary.json with mean and std.
Checklist
  • Three separate run directories created.
  • Seeds logged in config.
  • Aggregation file includes list of accuracies, mean, std.
  • Notes mention whether variation seems acceptable.

Common mistakes and self-check

  • Missing seeds: If you cannot rerun within ± small variance, you likely forgot to seed or used nondeterministic ops unchecked.
  • Changing multiple factors at once: You cannot attribute improvements. Change one dimension per experiment or run controlled grids.
  • Unversioned data splits: If your val set changes, comparisons are invalid. Save split indices or the RNG seed plus a data manifest (see the sketch after this list).
  • Not logging code state: A dirty working tree means someone else cannot re-run the exact code. Record commit and diffs.
  • Ignoring environment: CUDA/driver mismatches can break reproducibility. Log versions.
  • No artifacts: Without saved weights and predictions, later analysis is blocked.
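
A minimal sketch of pinning a split (counts and paths are placeholders): generate it once from a fixed seed, save the indices, and have every experiment load the same file.

import json
import os
import numpy as np

n_samples, val_frac, split_seed = 13500, 0.1, 123  # placeholders
rng = np.random.default_rng(split_seed)
perm = rng.permutation(n_samples)
n_val = int(n_samples * val_frac)

os.makedirs("splits", exist_ok=True)
with open("splits/defects_v1.2_split.json", "w") as f:
    json.dump({"split_seed": split_seed,
               "val_indices": perm[:n_val].tolist(),
               "train_indices": perm[n_val:].tolist()}, f)
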
Self-check
  • Can a teammate run your config and reach similar metrics this week? Next month?
  • Can you explain the exact cause of a performance change with evidence?
  • Can you regenerate the model file that shipped?

Practical projects

  • Image classification baseline: CIFAR-10 or a similar small dataset. Track a baseline, an augmentation change, and a learning rate sweep. Report mean±std over 3 seeds.
  • Object detection mini: Train a small detector on a subset of COCO-like data. Track dataset version, anchor settings, NMS thresholds. Save PR curves and model weights.
  • Segmentation lite: Train a tiny U-Net on medical slices. Track preprocessing (normalize/clip), loss function choice, and seed control. Save mIoU per class and confusion matrix.

Who this is for

  • Junior to mid-level Computer Vision Engineers who train and compare models regularly.
  • MLOps-minded practitioners formalizing their workflow.

Prerequisites

  • Basic Python and a deep learning framework (PyTorch or similar).
  • Familiarity with Git commits and branches.
  • Understanding of training/validation splits and core CV metrics.

Learning path

  • Learn to structure configs and seed runs.
  • Add data and environment versioning.
  • Introduce multi-seed evaluation and ablations.
  • Automate logging and artifact saving.
  • Adopt team conventions for naming and documentation.

Next steps

  • Automate your template: a small script that creates run folders and files.
  • Add plotting of curves and automatic best-model saving.
  • Introduce cross-validation or bootstrapping where appropriate.

Mini challenge

Run a 2x2 ablation (two learning rates × with/without color jitter) over 3 seeds each. Produce a single table with mean±std val accuracy. Write a two-sentence conclusion on which factor mattered most and whether the interaction effect is meaningful.
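
A sketch of the aggregation step only (training is up to you): group best validation accuracy by (lr, color jitter) across seeds, assuming run folders hold the config.yaml and metrics.json used earlier and that color_jitter is a key you add to the config for this ablation.

import collections
import glob
import json
import os
import statistics

cells = collections.defaultdict(list)
for run in sorted(glob.glob("runs/*ablation*/")):  # adjust the pattern to your run naming
    # config.yaml was written as flat "key: value" lines in the pseudocode above
    with open(os.path.join(run, "config.yaml")) as f:
        cfg = dict(line.split(": ", 1) for line in f.read().splitlines())
    with open(os.path.join(run, "metrics.json")) as f:
        acc = json.load(f)["best"]["val_acc"]
    cells[(cfg["lr"], cfg["color_jitter"])].append(acc)

print(f"{'lr':>8} {'jitter':>8} {'mean':>8} {'std':>8}")
for (lr, jitter), accs in sorted(cells.items()):
    print(f"{lr:>8} {jitter:>8} {statistics.mean(accs):8.3f} {statistics.stdev(accs):8.3f}")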

Quick Test

Everyone can take the test for free. If you log in, we save your progress so you can resume later.

Practice Exercises

2 exercises to complete

Instructions

Build a run folder for a small classification experiment with placeholders for all critical tracking files.

  • Create directory: runs/YYYYMMDD-HHMMSS_smallcnn_baseline
  • Add files: config.yaml, code.txt, env.txt, data.json, metrics.json (empty array), notes.txt, and artifacts/ folder
  • Write 2–3 lines in notes.txt describing intent
Expected Output
A run directory containing the listed files. config.yaml includes model, optimizer, lr, batch size, epochs, augmentations, seed. data.json has dataset name, version, split info. env.txt lists Python and library versions.

Experiment Tracking And Reproducibility — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

