Who this is for
MLOps engineers, data engineers, and ML practitioners who maintain datasets and need reliable, auditable ways to update labels without breaking reproducibility.
Why this matters
- You will fix mislabels, update class names, and merge/split classes as projects evolve.
- Models trained on corrected labels must be reproducible and comparable to previous versions.
- Auditors and teammates will ask: what changed, why, and can we roll it back?
Concept explained simply
Label versioning tracks how labels change over time, just like code versioning. A label correction is a small, documented edit to the truth you train on. Together, these let you reproduce any past model and safely improve labels.
Mental model
Think of labels as a layered cake: base labels at the bottom, small correction patches stacked on top. Each layer is recorded. You can rebuild the cake to any layer (version), compare slices (metrics), and keep the recipe (metadata) for each change.
Key terms
- Label dataset version: a named snapshot of labels (e.g., labels-v1.3).
- Patch: a small, reviewable set of label edits applied over a base version.
- Schema: the set of classes and rules defining how to label.
- Lineage: pointers to the exact raw data, label version, and training config.
A safe label-correction workflow
- Freeze a base: choose a label version (e.g., labels-v1.2) to patch.
- Propose a patch: collect issues/mislabels, prepare a small change set.
- Review: a second person checks changes; document rationale.
- Apply and version: apply patch, run checks, tag new version (labels-v1.3).
- Retrain and compare: train with v1.3, compare metrics to v1.2.
- Decide: keep or roll back; update changelog.
What to store per version
- Version name, timestamp, author, reason.
- Base version (e.g., v1.2), patch file hash, tool versions.
- Counts of changed items by class/split.
- Any schema changes and the mapping used.
Worked examples
Example 1 — Fix a handful of mislabels with a patch
Suppose labels-v1.2 has 12 images of class "cat" mislabeled as "dog".
- Create a patch file edits.jsonl where each line is a small JSON instruction:
{"id":"img_0102.jpg","from":"dog","to":"cat","reason":"tail/ear shape"}
- Apply the patch to the base labels.jsonl to produce labels-v1.3.jsonl (tooling can be custom; the key idea is to apply the operations deterministically).
- Record metadata: changed=12, reviewer=Alex, guideline=G-2024-07.
- Train model with v1.3 and compare accuracy/F1 to v1.2.
Example 2 — Rename a class (schema change)
Rename "automobile" to "car" without altering meaning.
- Create a schema_map.json:
{"rename": {"automobile": "car"}}
- Apply the mapping to labels-v2.0 to create labels-v2.1.
- Store map, note that metrics can be compared directly (one-to-one rename).
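Applying a rename map is a one-pass rewrite. A minimal sketch, assuming the schema_map.json format shown above and flat {"id", "label"} records:

```python
import json

def apply_rename(labels_path, map_path, out_path):
    # Load the rename map, e.g. {"rename": {"automobile": "car"}}.
    with open(map_path) as f:
        rename = json.load(f)["rename"]

    # Rewrite every label through the map; unmapped classes pass through unchanged.
    with open(labels_path) as src, open(out_path, "w") as dst:
        for line in src:
            rec = json.loads(line)
            rec["label"] = rename.get(rec["label"], rec["label"])
            dst.write(json.dumps(rec, sort_keys=True) + "\n")
```

Because unmapped classes pass through untouched, the same map file can be reused on any later version without side effects.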
Example 3 — Split a class into two (schema change)
Split "dog" into "dog_small" and "dog_large" using a threshold (e.g., bbox area).
- Create a split rule file split_rules.json:
{"split": {"dog": {"dog_small": "area < 8000", "dog_large": "area >= 8000"}}}
- Apply the rules to create labels-v3.0; keep the rule file with the version.
- Note: old-to-new metrics require remapping to compare fairly; define an evaluation mapping for historical comparisons.
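A deterministic split can be driven directly by the rule file. The sketch below parses the simple "field op threshold" rule strings from the example; it assumes each record carries a numeric "area" field, and a real tool would need a more robust rule grammar:

```python
import json
import operator

# Supported comparison operators for rule strings like "area < 8000".
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def parse_rule(rule):
    # Parse "area < 8000" into (field, comparison function, threshold).
    field, op, value = rule.split()
    return field, OPS[op], float(value)

def apply_split(labels_path, rules_path, out_path):
    with open(rules_path) as f:
        split = json.load(f)["split"]
    # Pre-parse each class's rules once; rule order follows the file.
    parsed = {
        cls: [(new_cls, parse_rule(rule)) for new_cls, rule in rules.items()]
        for cls, rules in split.items()
    }
    with open(labels_path) as src, open(out_path, "w") as dst:
        for line in src:
            rec = json.loads(line)
            # First matching rule wins; classes with no rules pass through.
            for new_cls, (field, op, threshold) in parsed.get(rec["label"], []):
                if op(rec[field], threshold):
                    rec["label"] = new_cls
                    break
            dst.write(json.dumps(rec, sort_keys=True) + "\n")
```

Keeping parse_rule strict (it raises on anything but "field op value") is deliberate: a malformed rule should fail loudly rather than silently leave items unsplit.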
Directory and files (one practical structure)
dataset/
images/...
labels/
labels-v1.2.jsonl
labels-v1.3.jsonl
patches/
edits-v1.2-to-v1.3.jsonl
schema/
schema-v1.yaml
schema-v2.yaml
mappings/
rename-automobile-to-car.json
split-dog-small-large.json
splits/
train.txt
val.txt
test.txt
meta/
changelog.md
versions.csv
Minimal changelog entry template
version: labels-v1.3
base: labels-v1.2
date: 2026-01-04
author: Sam
reviewer: Alex
reason: Corrected 12 dog->cat errors found by triage query
artifacts:
patch: patches/edits-v1.2-to-v1.3.jsonl
guideline: G-2024-07
checks:
changed_items: 12
class_counts_delta: {"dog": -12, "cat": +12}
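The class_counts_delta field above should be computed, not typed by hand. A small sketch, again assuming flat JSONL records with a "label" key:

```python
import json
from collections import Counter

def class_counts_delta(base_path, new_path):
    # Count labels per class in each version and return new - base per class,
    # omitting classes whose counts did not change.
    def counts(path):
        with open(path) as f:
            return Counter(json.loads(line)["label"] for line in f)
    base, new = counts(base_path), counts(new_path)
    return {cls: new[cls] - base[cls]
            for cls in sorted(set(base) | set(new))
            if new[cls] != base[cls]}
```

Generating the delta from the actual files catches mistakes where the patch touched more (or fewer) items than the changelog claims.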
Handling schema changes safely
- Renames: store a rename map; metrics comparable 1:1.
- Merges (A+B -> C): keep a mapping file and note that historic metrics may only be comparable after remapping.
- Splits (A -> A1, A2): define deterministic rules; record them; maintain an evaluation mapping to aggregate A1+A2 to old A when comparing to historic models.
Example mapping file for merge
{
"merge": {
"sedan": "car",
"hatchback": "car"
}
}
Quality checks and safeguards
- Frozen splits: do not silently move items across train/val/test when changing labels. If you must correct test labels, create a new test version (test-v2) and never compare v1 to v2 without noting it.
- Sanity checks: ensure no orphan classes, no empty polygons, valid bbox coordinates, and class distribution deltas make sense.
- Inter-annotator agreement: sample 50 items; double-annotate and compute agreement (e.g., percent agreement, Cohen's kappa) to validate corrections.
- Lineage stamp: for each model, record raw data hash, label version, split version, and training config hash.
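The agreement check above is quick to compute for two annotators. A minimal sketch of Cohen's kappa (percent agreement corrected for chance), assuming two equal-length label lists over the sampled items:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Observed agreement: fraction of items where the two annotators match.
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's class frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators used one identical class
    return (p_o - p_e) / (1 - p_e)
```

As a sanity bound, kappa is 1.0 for perfect agreement and near 0 when agreement is no better than chance; a low kappa on your 50-item sample is a signal to revisit the correction guideline before publishing.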
Common mistakes
- Silent schema change: renaming classes without a recorded mapping.
- Overwriting labels in place: losing the ability to reproduce past models.
- Mixing data and label changes: not isolating label-only versions, making debugging harder.
- Correcting test labels but comparing metrics to the old test set.
- Non-deterministic patching: unordered or ambiguous patch rules.
Self-check before publishing a new label version
- Is the base version frozen and recorded, and was the patch applied deterministically?
- Is there a changelog entry with author, reason, changed counts, and the patch file hash?
- Are train/val/test splits unchanged, or is a new split version tagged?
- Did sanity checks pass (no orphan classes, valid geometry, sensible class distribution deltas)?
- If the schema changed, are the mapping files and an evaluation remap stored with the version?
Exercises
Do these hands-on tasks.
Exercise 1 — Build and apply a small label patch
Goal: create a patch that flips 5 mislabeled items and outputs a new label version with a mini changelog.
Instructions
- Create a base labels file base.jsonl with 10 items. Make at least 5 intentionally mislabeled (e.g., dog vs cat).
- Create a patch file patch.jsonl with per-line objects: {"id":"...","from":"...","to":"...","reason":"..."}.
- Apply the patch to produce labels-v1.1.jsonl (write a simple deterministic script or process in your tool of choice).
- Write a short changelog entry noting counts before/after.
Expected output
- A file labels-v1.1.jsonl where exactly 5 ids changed.
- Changelog text with changed=5 and correct class count deltas.
Hints
- Sort by id before applying to ensure deterministic results.
- Validate that each patch "from" matches the current label before changing.
Solution
1) base.jsonl (snippet)
{"id":"img01.jpg","label":"dog"}
{"id":"img02.jpg","label":"dog"}
...
2) patch.jsonl
{"id":"img02.jpg","from":"dog","to":"cat","reason":"ear shape"}
...
3) Apply:
- Read base into dict by id
- For each patch: assert dict[id].label == from; then set to
- Write out labels-v1.1.jsonl sorted by id
4) Changelog:
version: labels-v1.1
base: labels-v1.0
changed_items: 5
class_counts_delta: {"dog": -5, "cat": +5}
Exercise 2 — Rename and split with mappings
Goal: perform a rename (automobile -> car) and a split (dog -> dog_small, dog_large) using rule files, and produce a plan for comparing metrics across schemas.
Instructions
- Create rename.json: {"rename": {"automobile":"car"}}.
- Create split.json: {"split": {"dog": {"dog_small":"area < 8000","dog_large":"area >= 8000"}}}.
- Apply both to base labels to yield labels-v2.0.
- Write an evaluation remap for comparing old "dog" metrics: dog_small + dog_large -> dog.
Expected output
- labels-v2.0 with updated class names and split dog items.
- evaluation_remap.json documenting how to compare to the old schema.
Hints
- Apply rename before split to avoid mismatches.
- Keep both mapping files alongside the new version.
Solution
Order:
1) Rename
2) Split with area rule
Artifacts:
- schema/mappings/rename-automobile-to-car.json
- schema/mappings/split-dog-small-large.json
- evaluation_remap.json:
{"aggregate_for_old": {"dog": ["dog_small","dog_large"]}}
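The evaluation remap above can be applied mechanically when comparing new-schema, per-class counts (or per-class metric tallies) against a model evaluated on the old schema. A minimal sketch:

```python
def remap_counts(new_counts, remap):
    # remap follows evaluation_remap.json:
    # {"aggregate_for_old": {"dog": ["dog_small", "dog_large"]}}
    agg = remap["aggregate_for_old"]
    out = {}
    for cls, count in new_counts.items():
        # Map each new class back to its old parent class, if one is defined.
        old = next((o for o, members in agg.items() if cls in members), cls)
        out[old] = out.get(old, 0) + count
    return out
```

For example, {"dog_small": 3, "dog_large": 5, "cat": 2} aggregates to {"dog": 8, "cat": 2}, which is directly comparable to old-schema numbers.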
Practical projects
- Build a label patch CLI that validates "from" before applying and outputs a delta report.
- Create a schema migration tool that supports rename, merge, and split with a dry-run mode and a class distribution preview.
- Automate a label QA pipeline that runs sanity checks and produces a versioned HTML report per label release.
Learning path
- Start with deterministic patch application and changelogs.
- Add schema migration (rename/merge/split) with mapping files.
- Introduce automated checks and lineage stamps for each model training run.
- Scale to larger datasets with storage-efficient diffs and CI checks.
Prerequisites
- Basic understanding of dataset splits (train/val/test).
- Comfort with JSON/CSV and simple scripting.
- Familiarity with version control concepts (commits, tags).
Next steps
- Integrate label version tags into your training pipelines.
- Add a mandatory review step before publishing new label versions.
- Track evaluation remaps for fair historical comparisons.
Mini challenge
Given labels-v1.5 and a patch that fixes 20 items only in validation, produce labels-v1.6 and a one-page QA summary including: changed count per split, class deltas, and a note on whether comparisons to past validation results remain fair. Keep it deterministic and reproducible.