Why this matters
As a Machine Learning Engineer, you will retrain models, debug drifting metrics, and answer audit questions like “Which data and code produced this model?” Without solid data and model versioning, you cannot reliably reproduce results, roll back quickly, or collaborate safely.
- Reproducibility: Re-run a past training job bit-for-bit.
- Traceability: Show exactly which data, code, parameters, and environment produced a model.
- Rollback: Swap a bad production model to a known good one fast.
- Experiments: Compare runs fairly when inputs and code are locked.
Concept explained simply
Versioning means giving every important thing in ML—a dataset snapshot, a feature set, a model artifact, even an inference image—a unique, immutable identity plus metadata. You can always retrieve and reproduce it later.
Mental model
Imagine a library. Each book (dataset/model) has a unique identifier (hash/tag), a card with details (metadata), and a shelf location (storage). You can check out any exact edition again and again. No surprises, no silent changes.
Core building blocks
- Identifiers: content hashes (e.g., SHA256), semantic version tags (e.g., v1.2.0), or commit IDs.
- Storage: artifact stores (local folder, S3/GCS/Azure), Git-LFS/DVC-like remotes for large files.
- Metadata: parameters, metrics, lineage, timestamps, owners, and purpose (staging/production).
- Lineage: a simple graph linking data version + code version + params → model version.
- Immutability: past versions are read-only; changes create new versions.
- Repro recipe: code commit + data snapshot + params + environment (deps/container) = deterministic run.
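The identifier and immutability ideas above can be sketched in a few lines. This is an illustrative helper, not any specific tool's API; the function name is an assumption for this lesson:

```python
# Minimal sketch: give an artifact an immutable identity by hashing its content.
# Two files with the same bytes get the same ID; any change produces a new ID.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large artifacts never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The same digest can then serve as both an integrity check and a stable lookup key in your manifests.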
Quick glossary
- Snapshot: a frozen view of data at a point in time.
- Registry: a catalog that knows model versions, stages, and metadata.
- Artifact: any file saved from an ML workflow (dataset, model, metrics, plots, requirements).
Worked examples
Example 1: Version a dataset with hashes and a manifest
Goal: Create a dataset snapshot and lock it with a manifest that captures size and SHA256.
# Folder layout (manifest lives inside the snapshot, matching the template below)
.
└── data/
    └── snapshots/
        └── iris_v0.1/
            ├── iris.csv
            └── manifest.json
# Compute SHA256 (any language/tool that gives SHA256 is fine)
# macOS/Linux example:
shasum -a 256 data/snapshots/iris_v0.1/iris.csv
# → <sha256> data/snapshots/iris_v0.1/iris.csv
# Write manifest.json (minimal example)
{
  "dataset": "iris",
  "version": "v0.1",
  "files": [
    {
      "path": "data/snapshots/iris_v0.1/iris.csv",
      "size_bytes": 4608,
      "sha256": "<sha256>"
    }
  ],
  "created_utc": "2026-01-01T00:00:00Z",
  "notes": "Cleaned, normalized sepal/petal features"
}
Result: You can verify integrity at any time by re-computing the SHA256 and comparing with the manifest.
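The manual steps above can be automated. A minimal sketch, assuming the manifest fields from the example; the function name and signature are illustrative, not a specific tool:

```python
# Sketch: walk a snapshot folder, hash every file, and freeze the result
# in a manifest.json next to the data.
import hashlib
import json
import time
from pathlib import Path

def write_manifest(snapshot_dir: str, dataset: str, version: str,
                   notes: str = "") -> dict:
    root = Path(snapshot_dir)
    files = []
    for p in sorted(root.rglob("*")):
        # Skip the manifest itself so re-running stays deterministic.
        if p.is_file() and p.name != "manifest.json":
            files.append({
                "path": str(p),
                "size_bytes": p.stat().st_size,
                "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
            })
    manifest = {
        "dataset": dataset,
        "version": version,
        "files": files,
        "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "notes": notes,
    }
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Sorting the walk keeps the file order stable, so two runs over identical data produce identical manifests.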
Example 2: Model versions with stages (dev → staging → production)
Goal: Keep multiple model versions and mark one as production without deleting old ones.
# Folder layout
models/
  iris_clf/
    v0.1.0/
      model.pkl
      metrics.json   # {"f1": 0.91, "timestamp": "..."}
      params.json    # {"seed": 42, "C": 1.0}
    v0.2.0/
      model.pkl
      metrics.json   # {"f1": 0.94}
      params.json
    registry.json
# registry.json (minimal)
{
  "name": "iris_clf",
  "versions": [
    {"version": "v0.1.0", "stage": "archived"},
    {"version": "v0.2.0", "stage": "production"}
  ]
}
Result: Your serving system reads the production tag from registry.json and loads that version. Rollback = change the stage mapping.
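Reading the production pointer and performing a rollback can be sketched like this, assuming the registry.json shape above; both function names are illustrative:

```python
# Sketch: resolve and update the production pointer in registry.json.
# Rollback = call set_stage with an older version; nothing is deleted.
import json
from pathlib import Path

def production_version(registry_path: str) -> str:
    """Return the single version currently marked as production."""
    registry = json.loads(Path(registry_path).read_text())
    prod = [v["version"] for v in registry["versions"]
            if v["stage"] == "production"]
    assert len(prod) == 1, "expected exactly one production version"
    return prod[0]

def set_stage(registry_path: str, version: str, stage: str) -> None:
    """Promote or roll back a version by rewriting stage labels."""
    path = Path(registry_path)
    registry = json.loads(path.read_text())
    if stage == "production":
        for v in registry["versions"]:
            if v["stage"] == "production":
                v["stage"] = "archived"  # demote the old production version
    for v in registry["versions"]:
        if v["version"] == version:
            v["stage"] = stage
    path.write_text(json.dumps(registry, indent=2))
```

The assertion enforces the "single source of truth" rule: a registry with zero or two production versions is a bug, not a state to serve from.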
Example 3: Repro recipe ties it all together
Goal: Fully reproduce a training run.
# Repro record (yaml/json)
run_id: 2026-01-01-iris-001
code_commit: 9f2e1c4
data_snapshot: iris_v0.1   # points to manifest with SHA256
params:
  seed: 42
  C: 1.0
  penalty: l2
env:
  python: "3.10"   # quoted so YAML keeps it as a string, not the float 3.1
  requirements_lock: sha256:7b1f...
  container_image: ghcr.io/org/ml:1.2.3
outputs:
  model_version: v0.2.0
  metrics:
    f1: 0.94
Result: Anyone with access to code commit 9f2e1c4, the iris_v0.1 snapshot, and this environment can recreate model v0.2.0 and the same metrics.
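A training script can emit this record automatically. A minimal JSON sketch (same fields as the YAML above; the function name and argument order are assumptions for this lesson):

```python
# Sketch: freeze a training run's inputs and outputs into one record file.
import json
import time
from pathlib import Path

def write_repro_record(run_id: str, code_commit: str, data_snapshot: str,
                       params: dict, env: dict, outputs: dict,
                       out_dir: str = "runs") -> Path:
    record = {
        "run_id": run_id,
        "code_commit": code_commit,      # e.g. `git rev-parse --short HEAD`
        "data_snapshot": data_snapshot,  # must resolve to a manifest with SHA256s
        "params": params,
        "env": env,
        "outputs": outputs,
        "recorded_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{run_id}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Writing the record at the end of every run, rather than by hand, is what makes the lineage trustworthy.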
How to implement in a small team
- Pick identifiers: semantic versions for releases (v0.1.0), hashes for integrity.
- Choose storage: a shared folder or object store; keep old versions immutable.
- Create manifests: for each data snapshot and model version.
- Record lineage: code commit + data snapshot + params + env → model version.
- Automate checks: verify SHA256 before training and before deployment.
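The "automate checks" step above might look like this sketch, which a training or deployment script can call and abort on any mismatch (manifest fields follow Example 1; the function name is illustrative):

```python
# Sketch: re-hash every file listed in a manifest and report problems.
# An empty return value means the snapshot is intact.
import hashlib
import json
from pathlib import Path

def verify_manifest(manifest_path: str) -> list:
    manifest = json.loads(Path(manifest_path).read_text())
    problems = []
    for entry in manifest["files"]:
        p = Path(entry["path"])
        if not p.exists():
            problems.append(f"missing: {p}")
            continue
        actual = hashlib.sha256(p.read_bytes()).hexdigest()
        if actual != entry["sha256"]:
            problems.append(f"hash mismatch: {p}")
    return problems
```

A one-line guard such as `assert not verify_manifest("data/snapshots/iris_v0.1/manifest.json")` at the top of a training script is often enough to catch silent data changes.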
Minimal folder template you can copy
project/
  data/
    snapshots/
      <name>_vX.Y/
        ...files
        manifest.json
  models/
    <model_name>/
      vX.Y.Z/
        model.pkl
        metrics.json
        params.json
      registry.json
  runs/
    <timestamp-id>.yaml   # repro record
  code/
    ...
  env/
    requirements.lock
Exercises
The hands-on tasks below mirror the graded exercises. Do them locally, then check off your list.
- [ ] Exercise 1: Create a dataset snapshot and manifest with SHA256.
- [ ] Exercise 2: Produce two model versions and mark one as production.
- [ ] Exercise 3: Write a complete repro record and verify it end-to-end.
Common mistakes and self-check
- Overwriting data files in place. Self-check: Are old versions still retrievable?
- Missing environment locks. Self-check: Do you have a requirements.lock or container tag?
- Unclear production pointer. Self-check: Is there a single source of truth (registry) for which model serves?
- No integrity verification. Self-check: Do you re-hash files before training/deploying?
- Mixing dev experiments with releases. Self-check: Do only tagged versions reach staging/production?
How to recover if you already overwrote files
Create a new snapshot from your current state, hash it, and freeze it. Then update your processes to forbid in-place edits.
Mini tasks (5–10 minutes)
- Create a semantic versioning policy: what changes bump major/minor/patch?
- Add a checksum verification step to your training script.
- Write a short model card template (inputs, training data version, metrics, intended use).
Practical projects
- Build a lightweight model registry using JSON files and folders. Support stages and comments.
- Create a data snapshotter script that walks a folder, computes SHA256, writes/updates manifest.json, and verifies integrity.
- Automate a CI step: on new tag vX.Y.Z, verify data snapshot hash, run training, produce artifacts, and update registry.json.
Who this is for
Engineers and data scientists who train, evaluate, or ship ML models and need reproducibility and safe rollbacks.
Prerequisites
- Basic command line usage.
- Familiarity with Git concepts (commits, tags).
- Ability to run Python and install packages.
Learning path
- Before: Source control fundamentals, clean data pipelines.
- Now: Data and model versioning (this lesson).
- Next: Experiment tracking, deployment workflows, monitoring and rollback procedures.
Next steps
- Finish the exercises and verify your manifests and registry work.
- Take the Quick Test to check your understanding.
- Integrate version checks into your team’s training and deployment scripts.
Mini challenge
Given a failing production model, demonstrate a rollback in under 5 minutes using your registry.json. Document the exact steps and verify that monitoring reflects the new (old) version.