Why this matters
As a Machine Learning Engineer, you make many small changes—hyperparameters, features, data slices, seeds, and code tweaks. Without systematic experiment tracking, you cannot answer basic questions like: What changed? Did it help? Can we reproduce it? Good tracking turns guesses into decisions and lets teams collaborate safely.
- Real tasks: compare model versions across datasets, audit a production incident, hand over experiments to teammates, reproduce a result from months ago.
- Benefits: faster iteration, fewer regressions, reliable handoffs, and clear evidence for stakeholders.
Concept explained simply
Experiment tracking is a structured logbook for ML work. Each run records what you tried, what happened, and the files produced.
Mental model
Think of a lab notebook: each page is a run. You write down inputs (parameters, data version, code commit) and observations (metrics), and you attach artifacts (plots, models). You label pages so you can find and compare them later.
Core terms
- Experiment: a collection of related runs (e.g., "ResNet sweep").
- Run: one execution with a defined configuration and results.
- Parameters (params): inputs you set (learning_rate, max_depth, seed, feature_set).
- Metrics: outcomes you measure (accuracy, F1, AUC, loss).
- Artifacts: files produced (model.pkl, confusion_matrix.png, requirements.txt).
- Tags/Notes: labels and free text for filtering ("baseline", "ablation:drop_color").
- Lineage/Context: code version (commit hash), data version/snapshot, environment (Python, libs), hardware.
- Reproducibility: ability to rerun and obtain identical or statistically consistent results.
Minimal workflow to track reliably
- Create an experiment: choose a clear name and purpose.
- Define parameters: log all knobs you control (including seed and data split config).
- Log metrics: track per-epoch and final metrics; include standard deviation for multiple seeds.
- Save artifacts: save model, plots, and the exact training/validation indices if possible.
- Capture context: log git commit, data version/hash, and environment (package list).
- Tag and review: tag best candidates, add a short rationale note, and compare runs side by side.
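A minimal sketch of this workflow, assuming MLflow as the tracking tool (any tracker that records params, metrics, artifacts, and tags follows the same pattern); the experiment name, values, and file names below are placeholders:

import mlflow

mlflow.set_experiment("cnn_lr_sweep")          # 1) create/choose an experiment
with mlflow.start_run(run_name="lr=0.001_seed=42"):
    # 2) define parameters: all knobs you control, including seed and split config
    mlflow.log_params({"learning_rate": 1e-3, "seed": 42, "split": "stratified_80_20"})
    # 3) log metrics: per-epoch via step=, plus final values
    for epoch, loss in enumerate([0.9, 0.6, 0.5]):
        mlflow.log_metric("train_loss", loss, step=epoch)
    mlflow.log_metric("val_accuracy", 0.87)
    # 4) save artifacts (the file is assumed to have been written by your plotting code)
    mlflow.log_artifact("loss_curve.png")
    # 5) capture context and 6) tag for later filtering and review
    mlflow.set_tags({"git_commit": "abc1234", "data_version": "v1.2", "purpose": "baseline"})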
What to log at minimum (practical checklist)
- Params: model_type, learning_rate, batch_size, epochs, seed
- Data: dataset_name, split_strategy, data_version or hash
- Code: git_commit, script_name
- Metrics: train_loss_last, val_metric_main (e.g., f1_macro), evaluation_timestamp
- Artifacts: model file, metrics.csv, key plots
- Environment: python_version, pip/conda freeze
- Tags: purpose (baseline/sweep/ablation), owner, ticket or task id
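Even without a tracking tool, a plain JSON record per run can cover this checklist; a sketch with placeholder values:

import json, platform, sys, time

# All values below are placeholders; fill them from the real run's configuration and results.
run_record = {
    "params": {"model_type": "xgboost", "learning_rate": 0.1, "batch_size": 256, "epochs": 20, "seed": 42},
    "data": {"dataset_name": "dataset_X", "split_strategy": "stratified_80_20", "data_version": "v1.2"},
    "code": {"git_commit": "abc1234", "script_name": "train.py"},
    "metrics": {"train_loss_last": 0.41, "val_f1_macro": 0.83,
                "evaluation_timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())},
    "artifacts": ["model.pkl", "metrics.csv", "confusion_matrix.png"],
    "environment": {"python_version": sys.version.split()[0], "os": platform.platform()},
    "tags": {"purpose": "baseline", "owner": "your_name", "ticket": "ML-123"},
}

with open("run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)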
Worked examples
1) Hyperparameter sweep for learning rate
Goal: find the best learning rate for a fixed architecture.
- Experiment: "cnn_lr_sweep"
- Params per run: learning_rate in [1e-4, 3e-4, 1e-3], seed=42, architecture=cnn_v3
- Metrics: val_accuracy, val_loss (log per epoch + final)
- Artifacts: loss_curve.png, best_model.pt
- Tags: sweep, lr
Compare runs by final val_accuracy; cross-check if the best run also has a stable loss curve. Record a short note: "1e-3 diverged after epoch 8."
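A sketch of how this sweep could be logged and compared, again assuming MLflow; train_and_evaluate is a stand-in for real training code, and the comparison assumes the default pandas output of mlflow.search_runs:

import mlflow

def train_and_evaluate(lr, seed):
    # Stand-in for real training: returns fake per-epoch (val_accuracy, val_loss) pairs.
    return [(0.70 + 0.02 * epoch, 0.90 - 0.05 * epoch) for epoch in range(5)]

mlflow.set_experiment("cnn_lr_sweep")
for lr in [1e-4, 3e-4, 1e-3]:
    with mlflow.start_run(run_name=f"lr={lr}_seed=42"):
        mlflow.log_params({"learning_rate": lr, "seed": 42, "architecture": "cnn_v3"})
        mlflow.set_tags({"purpose": "sweep", "knob": "lr"})
        history = train_and_evaluate(lr=lr, seed=42)
        for epoch, (val_acc, val_loss) in enumerate(history):
            mlflow.log_metric("val_accuracy", val_acc, step=epoch)
            mlflow.log_metric("val_loss", val_loss, step=epoch)
        mlflow.log_metric("val_accuracy_final", history[-1][0])

# Compare runs by final validation accuracy, best first.
exp = mlflow.get_experiment_by_name("cnn_lr_sweep")
runs = mlflow.search_runs(experiment_ids=[exp.experiment_id],
                          order_by=["metrics.val_accuracy_final DESC"])
print(runs[["params.learning_rate", "metrics.val_accuracy_final"]])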
2) Ablation study: removing a feature group
Goal: understand contribution of color features.
- Experiment: "ablation_color_features"
- Params: feature_set=[all, no_color], seed in [0,1,2]
- Metrics: f1_macro (report mean±std across seeds)
- Artifacts: confusion_matrix.png per variant; table.csv summarizing seed runs
- Tags: ablation
Decision: if mean F1 drops by >1.5 points when removing color, keep color features.
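A sketch of the aggregation and decision step, assuming the per-seed f1_macro scores have already been collected (the numbers are placeholders):

import statistics

# Placeholder f1_macro scores per seed for each feature_set variant.
results = {
    "all":      [0.842, 0.848, 0.845],
    "no_color": [0.821, 0.818, 0.826],
}

summary = {variant: (statistics.mean(scores), statistics.stdev(scores))
           for variant, scores in results.items()}
for variant, (mean_f1, std_f1) in summary.items():
    print(f"{variant}: f1_macro = {mean_f1:.3f} ± {std_f1:.3f}")

# Decision rule from the text: keep color features if removing them costs > 1.5 points.
drop_in_points = (summary["all"][0] - summary["no_color"][0]) * 100
print("keep color features" if drop_in_points > 1.5 else "color features optional")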
3) Data preprocessing change
Goal: switch normalization from per-feature to global.
- Params: normalizer={per_feature|global}, seed=123, model=logreg
- Context: data_version=v2025-05-01, git_commit=abc1234
- Metrics: auc_pr, calibration_error
- Artifacts: calibration_plot.png, model.pkl
Outcome: global normalization slightly improves calibration; note rationale and tag candidate for deployment.
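A sketch of the normalizer switch as a single logged parameter, using NumPy; the data is a random placeholder and the training and metric logging are left as comments:

import numpy as np

def normalize(X, normalizer):
    # per_feature: standardize each column separately; global: one mean/std for the whole matrix.
    if normalizer == "per_feature":
        return (X - X.mean(axis=0)) / X.std(axis=0)
    if normalizer == "global":
        return (X - X.mean()) / X.std()
    raise ValueError(f"unknown normalizer: {normalizer}")

X = np.random.default_rng(123).normal(size=(100, 5))  # placeholder feature matrix
for normalizer in ["per_feature", "global"]:
    X_norm = normalize(X, normalizer)
    # Train the logreg on X_norm here; log normalizer, seed=123, data_version, and
    # git_commit as params/context, and auc_pr and calibration_error as metrics.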
Data and code lineage
Reproducibility depends on knowing exactly which code and data were used.
- Code: record the git commit and a summary of any uncommitted diffs.
- Data: record dataset name, version, or content hash; for sampled/split data, save indices used.
- Environment: record exact package versions and OS/CPU/GPU info.
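A sketch of capturing this lineage automatically at the start of a run; it assumes git and pip are available on the machine, and the data path in the example is a placeholder:

import hashlib, json, platform, subprocess, sys

def capture_lineage(data_path):
    # Code: current commit plus whether the working tree has uncommitted changes.
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True).stdout.strip()
    dirty = bool(subprocess.run(["git", "status", "--porcelain"],
                                capture_output=True, text=True).stdout.strip())
    # Data: content hash of the dataset file.
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    # Environment: interpreter, OS, and exact package versions.
    packages = subprocess.run([sys.executable, "-m", "pip", "freeze"],
                              capture_output=True, text=True).stdout.splitlines()
    return {
        "git_commit": commit,
        "git_dirty": dirty,
        "data_sha256": data_hash,
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "packages": packages,
    }

# Example: json.dump(capture_lineage("data/train.csv"), open("lineage.json", "w"), indent=2)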
Minimal lineage template you can paste into your run notes
{
"git_commit": "",
"script": "train.py",
"data": {"name": "dataset_X", "version": "v1.2", "split_seed": 2024},
"env": {"python": "3.10.12", "framework": "torch==2.2.1", "cuda": "12.1"}
}
Naming conventions and metadata
- Experiment names: purpose_prefix + topic (e.g., "sweep_lr_cnn_v3").
- Run names: short param key-values (e.g., "lr=0.001_seed=0").
- Metric names: consistent, lowercase with underscores ("val_f1_macro").
- Tags: lifecycle (baseline/sweep/ablation/candidate), area (data/model/eval), owner.
Quick naming do/don't
- Do: val_f1_macro, test_auc_roc
- Don't: F1, AUCtest (ambiguous)
- Do: data_version=v2025-05-01
- Don't: "latest"
Common mistakes and how to self-check
- Not logging seed or split config. Self-check: can you recreate the exact train/val split?
- Inconsistent metric naming. Self-check: can a script aggregate runs without manual renaming?
- Forgetting data version. Self-check: is the dataset version/hash present for every run?
- Manual copy-paste of results. Self-check: can the run be parsed programmatically from a single source?
- Keeping only the best model. Self-check: are intermediate artifacts and learning curves preserved?
- Mixing offline/online metrics. Self-check: label metrics with context (offline_cv, shadow_prod).
Self-audit mini checklist
- Each run has params, metrics, artifacts, code, data, env, tags
- Metric names are consistent across runs
- There is a short note explaining purpose and outcome
- Runs can be grouped and compared by a clear experiment name
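A sketch of automating this audit over exported run records (dicts shaped like the run_record example earlier); the required keys and the "note" field are assumptions that mirror the checklist:

# Required sections per the checklist; "environment" matches the run_record sketch above.
REQUIRED_KEYS = ("params", "metrics", "artifacts", "code", "data", "environment", "tags")

def audit_run(run, reference_metric_names=None):
    # run: a dict shaped like the run_record shown earlier; returns a list of findings.
    findings = [f"missing or empty: {key}" for key in REQUIRED_KEYS if not run.get(key)]
    if reference_metric_names is not None and set(run.get("metrics", {})) != set(reference_metric_names):
        findings.append("metric names differ from the reference run")
    if not run.get("note"):
        findings.append("no note explaining purpose and outcome")
    return findings

# Example: audit every run against the metric names of the first one.
# runs = [json.load(open(p)) for p in sorted(glob.glob("runs/*/run_record.json"))]
# print({i: audit_run(r, runs[0]["metrics"].keys()) for i, r in enumerate(runs)})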
Exercises
Use these to practice. A quick test is available at the end.
- Exercise 1: Design a tracking plan for a new project (see details below).
- Exercise 2: Convert messy logs into structured run records.
- Exercise 3: Reproducibility drill: find missing lineage.
Exercise 1 details
Your team will train a gradient boosting model on tabular data with heavy feature engineering. Propose what to track under params, metrics, artifacts, and context. Include naming conventions and at least three useful tags.
Exercise 2 details
You receive a text log with intermixed prints. Normalize it into params, metrics (with steps), and artifacts entries. Identify any missing context you would add.
Exercise 3 details
You must reproduce a run that reports improved ROC AUC but no one can match it. Inspect a provided metadata snippet and list what is missing or ambiguous. Propose a corrected run record.
Practical projects
- Personal benchmark: pick a small dataset and track a baseline, a sweep, and an ablation. Deliver a one-page summary with decisions.
- Repro kit: a script that loads a run id and rebuilds the environment, data split, and model to re-evaluate test metrics.
- Metric standardizer: a small utility that enforces naming and logs derived metrics (mean±std across seeds).
Mini challenge
In 20 minutes, design an experiment to compare three data augmentations. Define:
- Experiment name and tags
- Run params (including seed strategy)
- Metrics to compare (primary and secondary)
- Artifacts to save
- Decision rule to pick a winner
Sample answer
experiment: aug_compare_v1
params: model=resnet18, aug in [flip, color_jitter, cutout], seeds=[0,1,2]
metrics: val_f1_macro (primary), val_accuracy, train_time
artifacts: confusion_matrix.png, loss_curve.png, best_model.pt
context: git_commit, data_version, env freeze
rule: select the highest mean val_f1_macro with std <= 0.5; tie-break by val_accuracy
Who this is for
- Machine Learning Engineers who want reliable, collaborative experimentation.
- Data Scientists preparing results for production.
- Students building repeatable ML projects.
Prerequisites
- Basic ML training workflow knowledge (fit, evaluate, save model).
- Comfort with Python and version control basics.
- Understanding of common metrics for your problem type.
Learning path
- Start: Experiment tracking concepts (this page).
- Next: Data versioning and lineage in practice.
- Then: Reproducible environments (conda/pip, containers).
- Finally: Model registry and deployment promotion criteria.
Next steps
- Complete the exercises below and check solutions.
- Take the quick test to confirm understanding.
- Apply tracking to one active project this week and run a small sweep with proper lineage.