Why this matters
As an Applied Scientist, you will run many experiments: model training, feature ablations, hyperparameter sweeps, and online A/B tests. Reproducible tracking ensures that you (and teammates) can: (1) re-run a result to verify it, (2) debug issues months later, (3) meet compliance/audit needs, and (4) make confident product decisions. Typical situations where this pays off:
- Validate a model improvement before rollout.
- Explain why metric X changed after a library upgrade.
- Compare two runs fairly with identical data and code states.
- Trace A/B test outcomes back to exact assignment rules and metric definitions.
Concept explained simply
Reproducible tracking means every experiment has a complete, frozen record: what code, data, configuration, environment, and randomness produced which results and artifacts.
Mental model: The "Run Container"
Treat each experiment run as a sealed container with all ingredients listed and labeled. If you hand the container to a teammate, they should be able to recreate the dish (the result) without guessing.
What to capture for every run
Minimum Run Record (start here; a small capture sketch follows this list)
- Run metadata: unique run_id, timestamp, author, project name.
- Config: hyperparameters and options (frozen JSON/YAML).
- Data snapshot: dataset version or hash; split indices or split seed.
- Code state: commit hash or version tag; list of changed files if any.
- Environment: OS, Python version, library versions, hardware notes (CPU/GPU).
- Randomness control: global seeds used; determinism flags if applicable.
- Metrics: final and per-epoch metrics with exact definitions.
- Artifacts: model weights, a sample of predictions, plots (e.g., PR curve, confusion matrix).
- Notes: brief rationale and anomalies observed.
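A minimal sketch of capturing this record at the start of a run, assuming a local experiments/ folder and a Git checkout. The function names (new_run_record, git_commit), the JSON layout, and the "your-name" placeholder are illustrative, not any particular tracking tool's API.

import json
import platform
import subprocess
import uuid
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path

def git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a Git repo."""
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"

def _installed(pkg: str) -> bool:
    """True if the package is importable in this environment."""
    try:
        metadata.version(pkg)
        return True
    except metadata.PackageNotFoundError:
        return False

def new_run_record(project: str, config: dict, seeds: dict, root: str = "experiments") -> Path:
    """Create a run folder and write the minimum run record as JSON files."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    run_id = f"{stamp}_run-{uuid.uuid4().hex[:6]}"   # timestamp + short hash
    run_dir = Path(root) / project / run_id
    run_dir.mkdir(parents=True, exist_ok=False)

    record = {
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "author": "your-name",          # fill in, or read from config/env
        "project": project,
        "code_commit": git_commit(),
        "environment": {
            "os": platform.platform(),
            "python": platform.python_version(),
            # Pin only the libraries that actually affect results.
            "libraries": {p: metadata.version(p) for p in ("numpy",) if _installed(p)},
        },
        "seeds": seeds,                 # e.g. {"global": 1234, "data_split": 42}
    }
    (run_dir / "run_record.json").write_text(json.dumps(record, indent=2))
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))   # frozen config
    return run_dir

run_dir = new_run_record(
    project="fraud-model",
    config={"lr": 3e-4, "batch_size": 256, "epochs": 10},
    seeds={"global": 1234, "data_split": 42},
)
print("Run record written to", run_dir)

Metrics and artifacts get written into the same folder as the run progresses; the point is that the identifying metadata is frozen before training starts.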
Nice-to-have (for teams and audits)
- Full training/validation/test indices or their hashes (a hashing sketch follows this list).
- Data schema snapshot (column names, dtypes), and feature generation configs.
- Exact SQL/query used to build the dataset.
- Metric computation code snippet or pseudocode with version.
- Compute cost/time and resource usage summary.
- Links to related runs (parent sweep, ablation runs) as IDs.
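For split indices and the data schema, a cheap option is to store content hashes and a small dtype map so large ID lists do not have to live inside the run folder. A sketch using only the standard library; the IDs and column names are made up for illustration.

import hashlib
import json

def ids_fingerprint(ids) -> str:
    """Hash a list of row/user IDs so the exact split can be verified later."""
    joined = "\n".join(str(i) for i in sorted(ids))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical splits; in practice these come from your data pipeline.
train_ids = [101, 102, 105, 130]
val_ids = [103, 140]

split_record = {
    "train_ids_sha256": ids_fingerprint(train_ids),
    "val_ids_sha256": ids_fingerprint(val_ids),
    # A schema snapshot can be as simple as column name -> dtype string.
    "schema": {"user_id": "int64", "amount": "float64", "label": "int8"},
}
print(json.dumps(split_record, indent=2))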
Worked examples
Example 1: Single model training run
Goal: Compare a new learning rate schedule.
experiments/fraud-model/2025-01-07_14-32-11_run-023/
├── config.yaml          # lr, batch_size, scheduler=cosine, epochs
├── code_commit.txt      # 1a2b3c4 (commit hash)
├── env.txt              # Python 3.x, numpy x.y, torch x.y
├── seeds.txt            # global=1234, data_split=42
├── data_snapshot.txt    # dataset_v23 parquet hash=0x9af2...
├── splits/              # train_ids.txt, val_ids.txt
├── metrics.json         # per-epoch + final AUC, PR-AUC, F1 (threshold=0.5)
├── artifacts/
│   ├── model.pt
│   ├── pr_curve.png
│   └── confusion_matrix.png
└── notes.md             # why we tried cosine; anomalies, next ideas
Outcome: Anyone can re-run with the same commit, config, seeds, and data snapshot.
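If the run uses PyTorch, the seeds recorded in seeds.txt are typically applied with a block roughly like this. The exact determinism flags you need depend on the ops and hardware involved, so treat this as a starting point rather than a guarantee of bit-identical results.

import os
import random

import numpy as np
import torch

def set_global_seed(seed: int = 1234) -> None:
    """Seed the common sources of randomness used during training."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)      # silently ignored if no GPU is present
    # Trades some speed for repeatable kernels; not every op supports it.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Only affects Python processes launched after this point; set it before
    # launch if you need it for the current process.
    os.environ["PYTHONHASHSEED"] = str(seed)

set_global_seed(1234)   # matches seeds.txt: global=1234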
Example 2: Online A/B test (checkout flow)
Goal: Test a new address validation UX.
- Experiment definition: experiment_id, variants A/B with exact eligibility rules, start/end timestamps, exposure unit (user_id), traffic split, and holdout logic.
- Assignment log: for each user, the assignment time, variant, and experiment_id.
- Metric definitions: conversion_v2 = purchases/eligible_users within 7 days; window and filters recorded.
- Event schema version: page_view v3, purchase v5 noted.
- Analysis plan: primary metric, guardrails, stopping rules frozen before launch.
- Query snapshot: SQL used for final analysis saved as text.
Outcome: Weeks later, you can reproduce the exact effect size and understand discrepancies if schemas changed.
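A sketch of how the saved assignment log and events could be replayed to recompute conversion_v2, assuming pandas and the hypothetical column names shown. The real analysis should run the frozen query snapshot and metric definition from the repro kit; this only illustrates the shape of the calculation.

import pandas as pd

# Hypothetical frozen inputs saved with the experiment.
assignments = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "variant": ["A", "A", "B", "B"],
    "assigned_at": pd.to_datetime(["2025-01-01", "2025-01-01", "2025-01-02", "2025-01-02"]),
})
purchases = pd.DataFrame({
    "user_id": [2, 3],
    "purchased_at": pd.to_datetime(["2025-01-03", "2025-01-20"]),
})

# conversion_v2 = purchases / eligible_users within 7 days of assignment.
window = pd.Timedelta(days=7)
# Assumes at most one purchase per user for brevity; deduplicate in a real analysis.
joined = assignments.merge(purchases, on="user_id", how="left")
joined["converted"] = (
    joined["purchased_at"].notna()
    & (joined["purchased_at"] >= joined["assigned_at"])
    & (joined["purchased_at"] - joined["assigned_at"] <= window)
)
conversion_v2 = joined.groupby("variant")["converted"].mean()
print(conversion_v2)   # conversion within the 7-day window, per variant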
Example 3: Hyperparameter sweep (grid or Bayesian)
experiments/nlp-intent/2025-01-07_sweep-07/
├── sweep.yaml              # search space and budget
├── baseline_config.yaml    # shared defaults
├── runs/
│   ├── run-101/ ...        # each a full run container as in Example 1
│   ├── run-102/ ...
│   └── run-103/ ...
└── leaderboard.csv         # run_id, val_F1, train_time, params_signature
Outcome: You can re-train the winning run and verify the leaderboard sorting by the same metrics.
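A sketch of rebuilding leaderboard.csv from the per-run folders, assuming each run stored a metrics.json whose layout includes a final val_F1 value; the path and JSON keys are illustrative and should match whatever your runs actually write.

import csv
import json
from pathlib import Path

sweep_dir = Path("experiments/nlp-intent/2025-01-07_sweep-07")

rows = []
for run_dir in sorted((sweep_dir / "runs").glob("run-*")):
    metrics = json.loads((run_dir / "metrics.json").read_text())
    # Assumed layout: {"final": {"val_F1": 0.87, ...}, "per_epoch": [...]}
    rows.append({"run_id": run_dir.name, "val_F1": metrics["final"]["val_F1"]})

# Highest validation F1 first; re-training the top run should reproduce it.
rows.sort(key=lambda r: r["val_F1"], reverse=True)
with open(sweep_dir / "leaderboard.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["run_id", "val_F1"])
    writer.writeheader()
    writer.writerows(rows)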
Step-by-step: Set up a reproducible experiment record
- Create a consistent folder template for runs (see examples above).
- Assign a unique run_id: combine date-time and a short counter or hash.
- Freeze the config early: save config.yaml before training starts.
- Set and record seeds: one for global libs; one for data splitting.
- Snapshot data: save dataset version/hash and split indices or deterministic split rules.
- Record environment: Python and library versions; hardware notes.
- Log metrics as you train/test: include definitions and thresholds.
- Save artifacts: model, plots, sample predictions, confusion/PR curves.
- Write notes: what changed, why, and anything unusual.
- Automate gradually: once the template works, script the creation of files and folders (a starter sketch follows).
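Once the template is stable, a small helper can stamp out the folder skeleton so no file is forgotten. This sketch follows the layout from Worked Example 1; the file list and function name are just the template above, so adjust them to your own.

from datetime import datetime, timezone
from pathlib import Path

RUN_FILES = [
    "config.yaml", "code_commit.txt", "env.txt",
    "seeds.txt", "data_snapshot.txt", "metrics.json", "notes.md",
]

def create_run_folder(project: str, counter: int, root: str = "experiments") -> Path:
    """Create a run directory with empty placeholders for every required file."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    run_dir = Path(root) / project / f"{stamp}_run-{counter:03d}"
    (run_dir / "splits").mkdir(parents=True)
    (run_dir / "artifacts").mkdir()
    for name in RUN_FILES:
        (run_dir / name).touch()     # empty placeholder to fill in during the run
    return run_dir

print(create_run_folder("fraud-model", counter=24))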
Exercises
Do these hands-on tasks. They mirror the graded exercises below. Keep your outputs simple and readable.
Exercise 1: Design a Run Record Template
Create a minimal yet complete directory layout for a supervised learning experiment. Include file placeholders for config, environment, data snapshot, seeds, metrics, and artifacts. Add one note that explains the primary metric and threshold.
- Deliverable: A folder tree sketch and short notes (see an example structure in Worked Example 1).
- Checklist:
- Unique run_id and timestamp
- config.yaml frozen
- data snapshot/version and split indices or seed
- code commit recorded
- metrics with definitions
- artifacts (model + 1 plot)
- notes with rationale
Exercise 2: A/B Test Repro Kit
Draft a textual "repro kit" for an A/B test: list every field you would save to reconstruct the analysis a month later. Include assignment details, metric definitions, event schema versions, and the final query text placeholder.
- Deliverable: A bullet list of fields and a short paragraph on how to use them to re-run the analysis.
- Checklist:
- experiment_id and variant rules
- assignment logs (who, when, which variant)
- metric formulas and windows
- event schema versions
- analysis query snapshot
- analysis plan (primary/guardrail metrics, stopping)
Common mistakes and self-check
- Only saving final metrics. Fix: log per-epoch/step and metric definitions.
- Forgetting split determinism. Fix: save indices or the exact seed and rule.
- Not recording code state. Fix: save commit hash and note local changes.
- Ignoring environment. Fix: capture library versions and OS.
- Inconsistent run naming. Fix: timestamp+counter or UUID with project prefix.
- Changing metrics mid-stream. Fix: version metric definitions and note the version in each run (see the sketch after this list).
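One lightweight way to version metrics is to keep each definition as a frozen, named entry and store the chosen version alongside a run's results. The registry below is a sketch with hypothetical definitions and dates, not a prescribed format.

# Each metric version is frozen once used; changed behavior gets a new key.
METRICS = {
    "conversion_v1": {
        "formula": "purchases / eligible_users (no time window)",
        "added": "2024-06-01",
    },
    "conversion_v2": {
        "formula": "purchases / eligible_users within 7 days of assignment",
        "added": "2025-01-02",
    },
}

def metric_definition(name: str) -> dict:
    """Look up the frozen definition to store alongside a run's metrics."""
    return {"name": name, **METRICS[name]}

print(metric_definition("conversion_v2"))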
Self-check prompts
- Can a teammate recreate your dataset split without contacting you?
- Does your run folder include the exact threshold used for F1/Recall?
- If a library updates next week, can you re-run with the original versions?
- Could you explain a 0.3% AUC change using your saved artifacts and notes?
Practical projects
- Upgrade Repro: Take an old model run you did. Re-create it with a full Run Container. Document gaps and fix them.
- Metric Versioning: Implement v1 and v2 of a metric on the same predictions and show the impact. Save both definitions.
- Mini A/B Replay: Using a synthetic assignment log and events, recompute a conversion lift analysis and compare to a pre-saved result.
Learning path
- Start: Adopt the Minimum Run Record template.
- Next: Add data split indices and environment capture.
- Then: Automate run folder creation and metric logging.
- Team: Standardize naming, metric versions, and checklists.
- Advanced: Add regression tests for metrics and experiment registry conventions.
Who this is for
- Applied Scientists running offline experiments and online A/B tests.
- ML Engineers who need reliable audit trails.
- Data Scientists comparing variants or maintaining dashboards.
Prerequisites
- Basic Python or similar scripting experience.
- Understanding of train/validation/test splits and common metrics.
- Familiarity with version control concepts (commits/tags).
Next steps
- Automate your Run Container creation with a small script in your project template.
- Adopt consistent metric versioning and document metric formulas in each run.
- Set team norms: where runs are stored, naming conventions, and review checklists.
Mini challenge
Pick one recent experiment and try to fully reproduce it from scratch using only your saved materials. Note anything you had to "guess". Update your template so guesses aren't needed next time.
How progress works
The quick test and exercises are available to everyone. If you're logged in, your progress and answers are saved so you can continue later.