Why this matters
In computer vision, your model is only as reliable as your data. Versioning and lineage let you reproduce past results, debug mislabeled samples, audit changes, and collaborate safely across teams.
- Reproduce a training set exactly when a regression appears.
- Trace a mislabeled image back to its source and annotation step.
- Branch datasets for experiments without breaking production.
- Audit what changed between v1.2 and v1.3 (e.g., new classes, fixes, filters).
- Meet compliance needs: prove what data trained a shipped model.
Concept explained simply
Think of a dataset like code:
- Version: a frozen, immutable snapshot of data and metadata.
- Lineage: a record of how a version was produced: parents, steps, parameters, and tools.
- Manifest: an index of files with stable IDs (content hashes), labels, and splits.
- Metadata: schema, label map, stats (counts, class balance), quality checks, license.
- Branching: derive new versions from a parent for experiments or releases.
Mental model
Picture a directed acyclic graph (DAG). Each node is a dataset version. Edges are operations (filter duplicates, relabel, augment). You never edit old nodes; you create a new node with a clear recipe. This makes your experiments explainable and repeatable.
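The graph can be sketched in a few lines of Python. This is a minimal illustration, not any tool's API: the `DatasetVersion` record, the `registry` dictionary, and the parent chosen for each version are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable node in the lineage graph (hypothetical structure)."""
    name: str
    parents: tuple = ()   # names of the versions this one was derived from
    ops: tuple = ()       # operations applied to the parents, in order


def ancestry(registry, name):
    """Return every ancestor of `name`, nearest first, by walking parent edges."""
    seen, order, queue = set(), [], list(registry[name].parents)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        # Parents that are not registered (e.g. a raw dump) are treated as roots.
        if current in registry:
            queue.extend(registry[current].parents)
    return order


# Assumed chain for illustration: each release derives from the previous one.
registry = {
    "v1.0.0": DatasetVersion("v1.0.0", ("raw-dump",), ("dedupe", "blur_filter", "label_normalize", "split")),
    "v1.0.1": DatasetVersion("v1.0.1", ("v1.0.0",), ("relabel",)),
    "v1.1.0": DatasetVersion("v1.1.0", ("v1.0.1",), ("add_images", "add_class")),
}

print(ancestry(registry, "v1.1.0"))  # ['v1.0.1', 'v1.0.0', 'raw-dump']
```

Because the dataclass is frozen, attempting to assign to a field of an existing node raises an error, which mirrors the rule that old nodes are never edited.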
Naming and versioning rules (SemVer for datasets)
- MAJOR (2.0.0): breaking changes to labels or schema (e.g., merged/renamed classes, different annotation format).
- MINOR (1.1.0): backward-compatible additions (more images, extra classes without removing old ones, added attributes).
- PATCH (1.0.1): small, compatible fixes (corrected labels, removed obvious duplicates, metadata typo fixes).
Tag releases with immutable names, e.g., cv-cars-v1.3.2. Store the label map and split definitions with the version.
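The rules above can be encoded as a small helper. This is a rough sketch: the function names `bump` and `classify` and the coarse change flags are hypothetical, and real releases will need a human to confirm the bump type.

```python
def bump(version, change):
    """Return the next version string for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


def classify(old_labels, new_labels, schema_changed=False, images_added=False):
    """Pick a bump type from a coarse description of what changed."""
    old, new = set(old_labels), set(new_labels)
    if schema_changed or not old <= new:
        return "major"   # a class was removed or renamed, or the format changed
    if new > old or images_added:
        return "minor"   # additions only; every old class is still present
    return "patch"       # same classes, same schema: fixes only


change = classify(["car", "truck", "bus"], ["car", "truck", "bus", "van"], images_added=True)
print(change, bump("1.0.1", change))  # minor 1.1.0
```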
Worked examples
Example 1: From raw frames to v1.0.0
- Input: 12,000 raw frames + initial annotations.
- Ops: deduplicate (perceptual hash), drop blurry frames (focus score below 20), standardize labels, stratified 80/10/10 split.
- Output: 10,800 images (12,000 raw, minus 600 duplicates and 600 blurry frames). Stats recorded: per-class counts, mean brightness, duplicate rate.
- Version: v1.0.0 (first stable release; schema defined).
- Lineage: parent = raw dump; steps = [dedupe@p=0.92, blur_filter@thr=20, label_normalize, split@seed=42].
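The split step is reproducible only if the seed fully determines the outcome. One way to sketch a seeded, stratified 80/10/10 split with the standard library is shown below; the function name `stratified_split` and the synthetic image IDs are assumptions for illustration, not part of the original pipeline.

```python
import random
from collections import defaultdict


def stratified_split(items, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Split (item_id, label) pairs into train/val/test, class by class."""
    by_class = defaultdict(list)
    for item_id, label in items:
        by_class[label].append(item_id)

    rng = random.Random(seed)          # local RNG: global random state is untouched
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):     # sorted so class order never affects the result
        ids = sorted(by_class[label])  # sorted so input order never affects the result
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits


# Synthetic data: 200 "car" and 100 "truck" items.
items = [(f"img_{i:04d}", "truck" if i % 3 == 0 else "car") for i in range(300)]
first = stratified_split(items)
second = stratified_split(list(reversed(items)))
print(first == second)  # True
print({name: len(ids) for name, ids in first.items()})  # {'train': 240, 'val': 30, 'test': 30}
```

Sorting before shuffling is the design choice that matters here: without it, the same seed could give different splits when files are listed in a different order.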
Example 2: Fixing mislabels → v1.0.1
- Issue: 150 sedan images mislabeled as hatchback.
- Action: correct labels only; data files unchanged.
- Version: v1.0.1 (PATCH). Manifest file hashes are unchanged; only annotation entries are updated; the changelog explains the fix.
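A label-only fix can be checked mechanically by comparing the two manifests by content hash. The sketch below assumes the JSONL manifest layout used in this lesson; `diff_manifests` is a hypothetical helper and the short hashes are placeholders.

```python
import json


def diff_manifests(old_lines, new_lines):
    """Compare two JSONL manifests keyed by content hash."""
    old = {entry["sha256"]: entry for entry in map(json.loads, old_lines)}
    new = {entry["sha256"]: entry for entry in map(json.loads, new_lines)}
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "relabeled": sorted(h for h in shared if old[h]["labels"] != new[h]["labels"]),
    }


# Hypothetical two-image manifests; real hashes would be 64 hex characters.
v1_0_0 = [
    '{"path": "train/a.jpg", "sha256": "aaa", "labels": [{"category": "hatchback"}]}',
    '{"path": "train/b.jpg", "sha256": "bbb", "labels": [{"category": "truck"}]}',
]
v1_0_1 = [
    '{"path": "train/a.jpg", "sha256": "aaa", "labels": [{"category": "sedan"}]}',
    '{"path": "train/b.jpg", "sha256": "bbb", "labels": [{"category": "truck"}]}',
]

report = diff_manifests(v1_0_0, v1_0_1)
files_unchanged = not report["added"] and not report["removed"]
print(report["relabeled"], files_unchanged)  # ['aaa'] True
```

If `added` or `removed` is non-empty, the data files changed, and the release is probably more than a label-only fix.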
Example 3: New images + new class → v1.1.0
- Add: 3,000 more images and a new class “van”.
- Backward-compatible? Yes, existing classes intact.
- Version: v1.1.0 (MINOR). Update label map, stats, and splits.
Example 4: Breaking schema change → v2.0.0
Switch from VOC XML to COCO JSON and merge classes “sedan” and “hatchback” → “car”. That breaks old models and tooling. Bump MAJOR.
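The class merge itself is a simple remapping of categories. The following is a sketch only: `remap_labels` and `MERGE_MAP` are hypothetical names, the second bounding box is made up, and the VOC-to-COCO format conversion is not shown.

```python
def remap_labels(entry, mapping):
    """Return a copy of a manifest entry with each category renamed via `mapping`."""
    labels = [
        {**label, "category": mapping.get(label["category"], label["category"])}
        for label in entry["labels"]
    ]
    return {**entry, "labels": labels}


MERGE_MAP = {"sedan": "car", "hatchback": "car"}  # the merge described in this example

entry = {"path": "train/img_000123.jpg", "labels": [
    {"bbox": [120, 220, 300, 380], "category": "sedan"},
    {"bbox": [10, 20, 90, 80], "category": "truck"},
]}
migrated = remap_labels(entry, MERGE_MAP)
print([label["category"] for label in migrated["labels"]])  # ['car', 'truck']
```

The function returns a new entry rather than mutating the old one, so the v1.x record stays intact while the v2.0.0 manifest is built.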
Start versioning today (no special tools)
- Folder layout
  dataset/
    releases/
      v1.0.0/
        manifest.jsonl
        meta.json
        splits.json
        LABEL_MAP.json
        DATASET.md
        data/   (immutable copy or content-addressed storage)
      v1.0.1/
        ...
    workspaces/   (scratch, not published)
- manifest.jsonl (one JSON per line)
  {"path":"train/img_000123.jpg","sha256":"3b7f...","width":1920,"height":1080,"labels":[{"bbox":[120,220,300,380],"category":"car"}]}
  Use a content hash (e.g., SHA-256) as the stable ID. If the file changes, the hash changes.
- meta.json
  {
    "name": "cv-cars",
    "version": "1.0.0",
    "parent": "raw-2025-12-12",
    "created": "2026-01-05T00:00:00Z",
    "ops": [
      {"op": "dedupe", "method": "phash", "threshold": 0.92},
      {"op": "blur_filter", "metric": "variance_of_laplacian", "threshold": 20},
      {"op": "label_normalize", "map": "LABEL_MAP.json"},
      {"op": "split", "strategy": "stratified", "seed": 42, "ratios": {"train": 0.8, "val": 0.1, "test": 0.1}}
    ],
    "stats": {
      "images": 10800,
      "duplicates_removed": 600,
      "blur_removed": 600,
      "class_counts": {"car": 7200, "truck": 2100, "bus": 1500}
    }
  }
- splits.json
  {"train":8640,"val":1080,"test":1080,"seed":42,"method":"stratified"}
- DATASET.md
- Purpose, licensing, data sources, known issues, QA steps, how to cite.
- Changelog: why this release exists and what changed.
- Immutability
- Never edit a published release. Create a new one with a new version.
- Keep work-in-progress in workspaces/; only promote to releases/ after checks pass.
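A manifest like the one described above can be generated with the Python standard library alone. This is a minimal sketch under stated assumptions: `sha256_of` and `build_manifest` are hypothetical helpers, only path and hash are filled in, and labels, width, and height would come from your own annotation source.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir, out_path, suffixes=(".jpg", ".jpeg", ".png")):
    """Write one JSON object per image under data_dir; return the entry count."""
    data_dir = Path(data_dir)
    files = sorted(
        p for p in data_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )
    with open(out_path, "w", encoding="utf-8") as out:
        for path in files:
            entry = {
                "path": path.relative_to(data_dir).as_posix(),
                "sha256": sha256_of(path),
                "labels": [],  # placeholder: fill in from your annotations
            }
            out.write(json.dumps(entry, sort_keys=True) + "\n")
    return len(files)
```

A call such as `build_manifest("dataset/releases/v1.0.0/data", "dataset/releases/v1.0.0/manifest.jsonl")` would follow the folder layout above. Sorting the file list keeps the manifest byte-identical between runs on the same data.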
Release checklist
- [ ] Version follows MAJOR.MINOR.PATCH.
- [ ] Manifest has content hashes for 100% of files.
- [ ] Splits stored with seed and method.
- [ ] Label map frozen and saved.
- [ ] Stats and known issues documented.
- [ ] Changelog explains why the bump type is correct.
- [ ] Lineage lists parents and operations with parameters.
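The mechanical items on this checklist can be automated; the documentation items (known issues, changelog reasoning) still need a human reviewer. A rough sketch, assuming the `meta.json`, manifest, and `splits.json` shapes used in this lesson and a hypothetical `check_release` helper:

```python
import re

SEMVER = re.compile(r"\d+\.\d+\.\d+")
SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def check_release(meta, manifest, splits):
    """Return a list of problems; an empty list means the automated checks passed."""
    problems = []
    if not SEMVER.fullmatch(str(meta.get("version", ""))):
        problems.append("version is not MAJOR.MINOR.PATCH")
    unhashed = [
        entry.get("path") for entry in manifest
        if not SHA256_HEX.fullmatch(str(entry.get("sha256", "")))
    ]
    if unhashed:
        problems.append(f"{len(unhashed)} manifest entries lack a full SHA-256 hash")
    for key in ("seed", "method"):
        if key not in splits:
            problems.append(f"splits.json is missing '{key}'")
    if not meta.get("parent"):
        problems.append("lineage has no parent")
    ops = meta.get("ops", [])
    if not ops:
        problems.append("lineage has no operations")
    # An op with only its name recorded is a recipe without quantities.
    bare = [op.get("op") for op in ops if len(op) < 2]
    if bare:
        problems.append(f"operations without parameters: {bare}")
    return problems
```

Running this as a gate before promoting a workspace to releases/ keeps incomplete versions from being published.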
Exercises
Do these now, then open the Quick Test at the end.
- Exercise 1: Create a manifest entry and lineage record for a small release. See the Exercises panel for instructions. Use SHA-256 as file ID.
- Exercise 2: Decide whether each change is PATCH, MINOR, or MAJOR and justify.
Common mistakes and self-check
- Editing old releases: breaks reproducibility. Self-check: Can someone re-run an old experiment bit-for-bit? If not, you likely edited an old tag.
- No content hashes: paths change whenever files are renamed or moved, so they make poor IDs. Self-check: If the file moves, can you still match it by hash?
- Undocumented splits: you cannot reproduce results. Self-check: Is the split seed stored?
- Silent label map changes: lead to training drift. Self-check: Did you bump MAJOR or MINOR and update LABEL_MAP.json?
- Missing parameters in lineage: a recipe without quantities. Self-check: Are filter thresholds and seeds recorded?
Practical projects
Project 1: Productionize a small image dataset
- Scope: 1,000 images, 3 classes.
- Deliverables: v1.0.0 release folder with manifest.jsonl, meta.json, splits.json, LABEL_MAP.json, DATASET.md.
- Acceptance: Recompute stats from manifest and match those in meta.json within 1% tolerance.
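The acceptance check for this project can be sketched as follows. It assumes that class counts are counted per label instance in the manifest and that the 1% tolerance is relative to the recorded value; both are interpretations, and `stats_mismatches` is a hypothetical helper.

```python
from collections import Counter


def class_counts(manifest):
    """Count label instances per category across all manifest entries."""
    return Counter(
        label["category"] for entry in manifest for label in entry["labels"]
    )


def stats_mismatches(manifest, recorded, tolerance=0.01):
    """Return {category: (recomputed, recorded)} for counts outside the tolerance."""
    actual = class_counts(manifest)
    mismatches = {}
    for name in set(actual) | set(recorded):
        got, want = actual.get(name, 0), recorded.get(name, 0)
        if abs(got - want) > tolerance * max(want, 1):
            mismatches[name] = (got, want)
    return mismatches


# Synthetic manifest: 100 "car" images and 50 "truck" images, one label each.
manifest = (
    [{"labels": [{"category": "car"}]} for _ in range(100)]
    + [{"labels": [{"category": "truck"}]} for _ in range(50)]
)
print(stats_mismatches(manifest, {"car": 101, "truck": 60}))  # {'truck': (50, 60)}
```

In the example, the recorded car count of 101 is within 1% of the recomputed 100 and passes, while the truck count is off by far more and is reported.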
Project 2: Bugfix release
- Scope: Correct 50 mislabeled samples in v1.0.0.
- Deliverables: v1.0.1 with changelog and diff report (counts by class before/after).
- Acceptance: Manifest file hashes unchanged; only labels differ.
Project 3: Experimental branch with augmentations
- Scope: Create v1.0.0-aug-a1 (workspace only) with flips and color jitter.
- Deliverables: meta.json lists augmentation params; stats highlight distribution shifts.
- Acceptance: Clear lineage linking back to v1.0.0; not promoted to releases/ until approved.
Mini challenge
Your model trained on v1.1.0 underperforms on nighttime images. Propose a lineage plan to produce a new version that improves nighttime recall without breaking daytime performance. List:
- Which data to add or reweight.
- Ops and parameters (e.g., threshold for brightness filter, augmentation settings).
- Expected bump type and why.
- Three metrics you will compare across versions.
Learning path
- Up next: Data quality and label QA checks.
- Then: Annotation guidelines and reviewer calibration.
- Later: Reproducible pipelines and experiment tracking (MLOps basics).
Who this is for
- Computer Vision Engineers and MLEs working with image/video datasets.
- Data/Annotation leads who manage releases and quality.
- Researchers who need reproducible baselines.
Prerequisites
- Comfort with filesystems and basic command line.
- Understanding of labels/annotations (e.g., bounding boxes, masks).
- Basic JSON reading/writing.
Next steps
- Complete the Exercises.
- Take the Quick Test to check understanding.
- Apply the checklist to your current dataset and publish a clean v1.0.0.