
Dataset Versioning And Lineage

Learn Dataset Versioning And Lineage for free with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Why this matters

Mental model: Think of your dataset like a published book.

  • Versioning is the book's edition (1st, 2nd, 3rd) and reprints (minor fixes).
  • Lineage is the bibliography: the sources and edits that produced this edition.
Jargon to know
  • Snapshot: a read-only cut of data at a point in time.
  • Incremental update: adding or modifying a portion of data.
  • Hash/checksum: fingerprint to detect changes (e.g., SHA-256).
  • Manifest: machine-readable list of files, hashes, counts, and metadata.
  • Schema version: version identifier for data fields and label taxonomies.
  • Lineage graph: nodes (sources/transforms) and edges (operations).
  • Provenance: where data and labels came from (who, when, how).
  • Data card: human-readable overview of what the dataset is and is not.
  • Release notes: change log explaining what changed and why.

A practical 90-minute baseline system

  1. Choose an ID and versioning scheme
    • dataset_id: short, stable name (e.g., fraud_txn, cityscapes_small).
    • Use semantic versions: MAJOR.MINOR.PATCH.
      • MAJOR: breaking label/schema change.
      • MINOR: new validated data added; model comparability may change.
      • PATCH: bugfixes only (no distribution shift).
  2. Lay out your storage
    • /datasets/{dataset_id}/{version}/
    • Inside: data/, labels/, manifest.json, data_card.md, release_notes.md
  3. Create the manifest
    • Include: version, time window, source URIs, schema version, file list with hashes, counts, label stats, known issues, creation timestamp, created_by, random seed.
  4. Hash everything important
    • Per-file checksums (e.g., SHA-256) and an overall tree hash in the manifest (see the sketch after this list).
    • Hash label files separately to detect label-only changes.
  5. Write a short data card and release notes
    • Data card: intended use, out-of-scope, collection method, privacy notes, fairness caveats.
    • Release notes: what changed since previous version and why.
  6. Lineage capture
    • Record pipeline steps: extraction, cleaning, splitting, labeling, QA, dedup.
    • For each step: code version (git commit), parameters, and input/output hashes.
  7. Naming conventions
    • Tags like: ds=fraud_txn, v=1.4.0, split=train|val|test, slice=region=APAC
    • File example: fraud_txn_v1.4.0_train_region=APAC.parquet
  8. Retention and rollback
    • Keep at least: last 3 MINOR versions and all MAJOR versions.
    • Block deletion of any dataset currently used by a deployed model.
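
To make steps 3 and 4 concrete (plus the pipeline fields from step 6), here is a minimal Python sketch that hashes every file under a version directory, derives an overall tree hash, and writes manifest.json. It assumes the /datasets/{dataset_id}/{version}/ layout above; the schema name, commit, parameters, and the helpers sha256_of and build_manifest are illustrative placeholders, not a standard tool.

import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root: Path, dataset_id: str, version: str,
                   schema_version: str, code_commit: str, params: dict) -> dict:
    """Hash everything under data/ and labels/, then assemble a manifest dict."""
    files = []
    for sub in ("data", "labels"):
        for p in sorted((root / sub).rglob("*")):
            if p.is_file():
                files.append({"path": str(p.relative_to(root)),
                              "sha256": sha256_of(p)})
    # Tree hash over the sorted (path, hash) pairs: detects a change or partial
    # corruption anywhere in the version without comparing files one by one.
    tree = hashlib.sha256(
        "".join(f"{f['path']}:{f['sha256']}" for f in files).encode()
    ).hexdigest()
    return {
        "dataset_id": dataset_id,
        "version": version,
        "schema_version": schema_version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "pipeline": {"code_commit": code_commit, "parameters": params},
        "files": files,
        "tree_hash": tree,
    }

if __name__ == "__main__":
    root = Path("/datasets/fraud_txn/1.4.0")
    manifest = build_manifest(root, "fraud_txn", "1.4.0",
                              schema_version="fraud_txn_schema_v2",  # assumed name
                              code_commit="a1b2c3d",
                              params={"dedup_threshold": 0.92, "seed": 2026})
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2))

Sorting the file list before hashing keeps the tree hash stable even when different filesystems enumerate files in a different order.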

Worked examples

Example 1: New images + a small label fix
  • Current version: road_signs 1.2.0 (50k images).
  • Add 5k new images, labels follow same schema. Release 1.3.0.
  • Find 120 labels swapped in 1.3.0; relabel them only. Release 1.3.1 (PATCH).
  • Manifest shows identical file hashes except the corrected label files; distribution report changes minimally.
Example 2: Label schema change (breaking)
  • Old classes: {cat, dog}.
  • New taxonomy: {cat: {short_hair, long_hair}, dog}.
  • This breaks comparability. Bump to 2.0.0. Archive mapping rules in release notes (a mapping sketch follows this example).
  • Lineage records the taxonomy spec version and the mapping script commit.
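
As an illustration of what those mapping rules could look like in code, a small sketch that collapses the 2.0.0 taxonomy back to the 1.x classes so older results remain comparable; the dotted class names are one possible encoding, not prescribed by the example.

# Backward mapping from the 2.0.0 taxonomy to the 1.x classes.
NEW_TO_OLD = {
    "cat.short_hair": "cat",
    "cat.long_hair": "cat",
    "dog": "dog",
}

def to_old_label(new_label: str) -> str:
    """Collapse a 2.0.0 label to its 1.x equivalent; raises KeyError on unknown classes."""
    return NEW_TO_OLD[new_label]
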
Example 3: Slice-stable test set
  • Business wants stable fairness metrics by region.
  • Create test slices per region from 1.4.0 and freeze them: 1.4.0-test-APAC, 1.4.0-test-EMEA.
  • When training on 1.5.0, still evaluate on 1.4.0 test slices for trend continuity.
  • Lineage ties each slice to the parent version and selection query.

Exercises

Do these in writing or your team's notebook. You can check solutions below.

  1. Exercise 1: Design a dataset versioning scheme

    Project: binary classification for click fraud. Weekly data, occasional guideline updates.

    Deliverables:

    • Versioning rules for MAJOR/MINOR/PATCH.
    • Folder layout and file naming example.
    • Retention and rollback policy.
    • Two example release notes lines.
  2. Exercise 2: Draft a manifest and checksum plan

    Create a manifest outline for dataset_id: clickspam, version: 1.2.1 including fields for file hashes and label stats. Include a plan to verify integrity on load.

Hints
  • Use semantic versioning with clear triggers (schema vs. data vs. fixes).
  • Keep test slices stable across MINOR bumps to track trends.
  • Hashes per file plus an overall tree hash help detect partial corruption.
  • Record pipeline git commit and parameters for lineage.
Exercise solutions

Exercise 1 Solution (example)

Versioning rules:

  • MAJOR: label guideline change (e.g., new class), schema change.
  • MINOR: new weekly data added after QA.
  • PATCH: dedup or relabel fixes without distribution shift.

Layout:

  • /datasets/clickspam/1.2.1/{data,labels}/
  • Files like: clickspam_v1.2.1_train_region=NA.parquet

Retention:

  • Keep last 3 MINOR versions and all MAJOR versions; keep any version tied to deployed models.

Release notes:

  • 1.2.0 -> 1.2.1: Fixed 342 mislabeled rows in week-42; no new data.
  • 1.1.0 -> 1.2.0: Added weeks 41-42; updated dedup threshold from 0.9 to 0.92.

Exercise 2 Solution (example)

manifest:
  dataset_id: clickspam
  version: 1.2.1
  created_at: 2026-01-07T10:00:00Z
  created_by: dataops@example
  sources:
    - s3://ads/clicks/2026/week41
    - s3://ads/clicks/2026/week42
  time_window: 2026-10-05..2026-10-18
  schema_version: clickspam_schema_v3
  label_guideline_version: LGL_v5
  pipeline:
    code_commit: a1b2c3d
    parameters: { dedup_threshold: 0.92, seed: 2026 }
  files:
    - path: data/train.parquet
      sha256: abc...
    - path: labels/train.parquet
      sha256: def...
  tree_hash: 9f7...
  counts: { train: 120000, val: 20000, test: 20000 }
  label_stats: { positive_rate: { train: 0.034, val: 0.035, test: 0.035 } }
  known_issues: []
  verification:
    - on load: recompute per-file sha256 and compare to manifest
    - abort if any mismatch; log offending files
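
A minimal sketch of that verification step, assuming the manifest format above and the /datasets/clickspam/1.2.1/ layout; verify_dataset and sha256_of are illustrative helpers, not an existing library call.

import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_dataset(root: Path) -> None:
    """Recompute per-file checksums on load and abort, logging offenders, on any mismatch."""
    manifest = json.loads((root / "manifest.json").read_text())
    bad = [f["path"] for f in manifest["files"]
           if sha256_of(root / f["path"]) != f["sha256"]]
    if bad:
        raise RuntimeError(f"Checksum mismatch, refusing to load: {bad}")

verify_dataset(Path("/datasets/clickspam/1.2.1"))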

Exercise completion checklist

  • You defined clear MAJOR/MINOR/PATCH triggers.
  • You proposed a storage layout and naming convention.
  • Your manifest includes per-file hashes and a tree hash.
  • You captured pipeline commit and parameters for lineage.
  • You wrote concise release notes.

Common mistakes and self-checks

  • Mistake: Bumping MINOR for label schema changes. Self-check: If labels or fields change meaning, bump MAJOR.
  • Mistake: No frozen test set. Self-check: Maintain stable test slices; only update with a MAJOR release.
  • Mistake: Not hashing label files separately. Self-check: Store label-only checksums and report label deltas in release notes.
  • Mistake: Missing pipeline provenance. Self-check: Record code commit, environment, parameters, and seeds for each build.
  • Mistake: Silent data drift between versions. Self-check: Include summary stats in manifest and diff them in release notes.
  • Mistake: Deleting versions used in production. Self-check: Block deletion when referenced by deployed artifacts.

Who this is for

  • Applied Scientists training models that must be reproducible and auditable.
  • Data/ML Engineers maintaining data pipelines and feature stores.
  • Labeling leads organizing guidelines and QA.

Prerequisites

  • Basic understanding of datasets, train/val/test splits.
  • Comfort with filesystems or object storage conventions.
  • Familiarity with semantic versioning concepts.

Learning path

  1. Adopt semantic versioning for your current dataset.
  2. Create a minimal manifest and data card for the latest version.
  3. Backfill lineage for the last two releases (sources, commits, params).
  4. Introduce stable test slices and enforce retention rules.
  5. Automate checksums and manifest validation in your data pipeline.

Practical projects

  • Build a dataset diff report generator that compares two versions and outputs changes in counts, label balance, and file lists.
  • Create a lineage report for a trained model, resolving back to exact dataset version, pipeline commit, and parameters used.
  • Implement an integrity verifier that recomputes checksums on load and blocks training if mismatches are found.

Quick test

Take the quick test below to check your understanding.

Mini challenge

You discover a spike in false positives in LATAM after the last release.

  • Identify which dataset version trained the deployed model.
  • Verify whether the LATAM slice changed between the current and previous version.
  • Decide whether this warrants a PATCH, MINOR, or MAJOR bump and justify.
Suggested approach
  • Read the model's metadata to find dataset_id/version.
  • Open both manifests and diff counts and slice definitions for LATAM (a diff sketch follows this list).
  • If only labels were fixed: PATCH. If new LATAM data was added: MINOR. If the label taxonomy changed: MAJOR.
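
One possible shape for that manifest diff, assuming both versions carry the counts and label_stats fields from the manifest outline above; the dataset path, version numbers, and any slice-level keys are placeholders.

import json
from pathlib import Path

def load_manifest(version: str) -> dict:
    """Read the manifest for one released version."""
    return json.loads(Path(f"/datasets/clickspam/{version}/manifest.json").read_text())

old, new = load_manifest("1.1.0"), load_manifest("1.2.0")

# Compare summary stats between the two releases; slice-level entries
# (e.g., per-region counts) would be diffed the same way if present.
for section in ("counts", "label_stats"):
    for name, new_val in new.get(section, {}).items():
        old_val = old.get(section, {}).get(name)
        if old_val != new_val:
            print(f"{section}.{name}: {old_val} -> {new_val}")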

Next steps

  • Automate manifest generation as part of your data build job.
  • Add a pre-training validation step that checks dataset version pins and hashes.
  • Start a lightweight data card template and require it for each release.

Practice Exercises


Instructions

Project: binary classification for click fraud. Weekly data arrives; label guidelines may change quarterly. Produce:

  • MAJOR/MINOR/PATCH rules tailored to this project.
  • Folder layout and filename example.
  • Retention and rollback policy.
  • Two example release notes lines.
Expected Output
Clear semantic version rules; a storage layout under /datasets/clickspam/{version}/ with data/, labels/, manifest.json; retention rules; two concise release notes entries.

Dataset Versioning And Lineage — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

