Why this matters
In MLOps, you will be asked to reproduce a training run, compare model performance across data versions, audit what changed, or roll back quickly. Dataset snapshots and manifests give you:
- Reproducibility: exact file set and hashes captured at a point in time.
- Auditability: a record of who changed the data, when, and what changed.
- Safety: immutable tags to roll back or pin training jobs.
- Efficiency: deterministic caching and fast diffs between versions.
Concept explained simply
A dataset snapshot is a frozen, named point-in-time version of your dataset. A manifest is a file (often JSON or YAML) that lists every file in the snapshot with its path, size, and checksum hash. Together they make your dataset reproducible and verifiable.
Mental model
Think of a snapshot as a photo album (immutable tag), and the manifest as the album index listing every photo with a fingerprint (hash). If any photo changes, its fingerprint changes, and you immediately see the difference.
Core definitions
- Snapshot: Immutable tag pointing to an exact set of files and their checksums.
- Manifest: Machine-readable inventory of files with fields like path, size_bytes, checksum (e.g., sha256), optional metadata (schema version, splits, stats).
- Checksum/Hash: A deterministic fingerprint of file contents (e.g., sha256). Any change alters the hash; see the sketch after this list.
- Content-addressing: Using the hash to identify content. Same content, same hash.
- Lineage: Links from a snapshot to its data sources, transforms, and code commit.
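A quick Python sketch makes the checksum and content-addressing properties concrete (the byte strings are made up for illustration):

import hashlib

# Identical content always yields the identical fingerprint; any change alters it.
a = hashlib.sha256(b"2024-09,txn,12.50\n").hexdigest()
b = hashlib.sha256(b"2024-09,txn,12.50\n").hexdigest()
c = hashlib.sha256(b"2024-09,txn,12.51\n").hexdigest()
assert a == b  # same content, same hash (content-addressing)
assert a != c  # one changed byte, different hash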
Worked examples
Example 1: Minimal manifest for a tabular dataset
Suppose you have two partitions:
data/raw/transactions_2024-09.csv
data/raw/transactions_2024-10.csv
Create manifest.json:
{
  "dataset_name": "transactions",
  "schema_version": "1.0",
  "created_at": "2024-10-15T10:20:00Z",
  "created_by": "mlops@example",
  "files": [
    {"path": "data/raw/transactions_2024-09.csv", "size_bytes": 10485760, "sha256": "<hash1>"},
    {"path": "data/raw/transactions_2024-10.csv", "size_bytes": 15728640, "sha256": "<hash2>"}
  ],
  "stats": {"file_count": 2, "total_size_bytes": 26214400}
}
Now anyone can verify the dataset by recomputing sizes and hashes and comparing with the manifest.
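A minimal sketch of that verification step in Python, assuming the manifest layout above (verify_manifest is an illustrative name, not a standard API):

import hashlib, json
from pathlib import Path

def verify_manifest(manifest_path: str, root: str = ".") -> list[str]:
    """Return a list of problems; an empty list means the dataset matches."""
    manifest = json.loads(Path(manifest_path).read_text())
    problems = []
    for entry in manifest["files"]:
        p = Path(root) / entry["path"]
        if not p.exists():
            problems.append(f"missing: {entry['path']}")
        elif p.stat().st_size != entry["size_bytes"]:
            problems.append(f"size mismatch: {entry['path']}")
        # read_bytes() keeps the sketch short; stream large files (see below)
        elif hashlib.sha256(p.read_bytes()).hexdigest() != entry["sha256"]:
            problems.append(f"hash mismatch: {entry['path']}")
    return problems

print(verify_manifest("manifest.json") or "dataset verified")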
Example 2: Snapshot naming and metadata
Use a clear, immutable tag like transactions-2024-10-15.v1. Store snapshot.json:
{
  "snapshot": "transactions-2024-10-15.v1",
  "manifest_path": "manifests/transactions-2024-10-15.v1.json",
  "source_code_commit": "a1b2c3d",
  "data_source": ["s3://prod-raw/transactions/2024-09", "s3://prod-raw/transactions/2024-10"],
  "notes": "Initial October merge"
}
Keep the manifest immutable; any change produces a new snapshot tag (e.g., .v2).
Example 3: Manifest diff to see what changed
Given two manifests A and B, diff them by path and sha256:
{
  "added": ["data/raw/transactions_2024-10.csv"],
  "removed": [],
  "modified": [
    {"path": "data/raw/transactions_2024-09.csv", "old_sha256": "<h1>", "new_sha256": "<h2>"}
  ],
  "unchanged": ["readme.md"],
  "summary": {"added": 1, "removed": 0, "modified": 1, "unchanged": 1}
}
This summary is perfect for PR review and audit logs.
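A small Python sketch that produces this report from two manifest files (diff_manifests is an illustrative helper, assuming the manifest format from Example 1):

import json
from pathlib import Path

def diff_manifests(path_a: str, path_b: str) -> dict:
    # Index each manifest by path -> sha256, then compare the two indexes.
    a = {f["path"]: f["sha256"] for f in json.loads(Path(path_a).read_text())["files"]}
    b = {f["path"]: f["sha256"] for f in json.loads(Path(path_b).read_text())["files"]}
    added = sorted(set(b) - set(a))
    removed = sorted(set(a) - set(b))
    common = sorted(set(a) & set(b))
    modified = [{"path": p, "old_sha256": a[p], "new_sha256": b[p]}
                for p in common if a[p] != b[p]]
    unchanged = [p for p in common if a[p] == b[p]]
    return {"added": added, "removed": removed, "modified": modified,
            "unchanged": unchanged,
            "summary": {"added": len(added), "removed": len(removed),
                        "modified": len(modified), "unchanged": len(unchanged)}}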
How to compute hashes safely
- Use a strong hash (sha256). Avoid weak hashes for audit-grade workflows.
- Hash file contents only, not file names.
- Stream large files to avoid running out of memory.
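A minimal streaming hasher in Python, reading fixed-size chunks so even multi-gigabyte files use constant memory (the 1 MiB chunk size is an arbitrary choice):

import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    # Read 1 MiB at a time instead of loading the whole file.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()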
Step-by-step: Build a reliable snapshot + manifest
Step 1 — Normalize structure
Organize files in stable directories (e.g., data/train, data/val, data/test). Avoid spaces in paths and unstable temp folders.
Step 2 — Compute checksums
For each file: compute sha256 and record size in bytes. Ignore transient files (.DS_Store, logs).
Step 3 — Write manifest
Create manifest.json with dataset_name, schema_version, created_at, created_by, files[], optional splits[] and stats.
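A sketch of Steps 2–3 together in Python, assuming the manifest fields above (build_manifest and the exclude rules are illustrative; adapt them to your own conventions):

import hashlib, json
from datetime import datetime, timezone
from pathlib import Path

EXCLUDE = {".DS_Store"}  # assumed transient files; extend to match your rules

def build_manifest(root: str, dataset_name: str, schema_version: str = "1.0") -> dict:
    files = []
    for p in sorted(Path(root).rglob("*")):  # sorted walk keeps output deterministic
        if p.is_file() and p.name not in EXCLUDE and not p.name.startswith("."):
            files.append({
                "path": p.relative_to(root).as_posix(),  # relative paths stay portable
                "size_bytes": p.stat().st_size,
                # read_bytes() for brevity; stream large files as shown earlier
                "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
            })
    return {
        "dataset_name": dataset_name,
        "schema_version": schema_version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "files": files,
        "stats": {"file_count": len(files),
                  "total_size_bytes": sum(f["size_bytes"] for f in files)},
    }

Path("manifest.json").write_text(json.dumps(build_manifest("data", "transactions"), indent=2))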
Step 4 — Tag the snapshot
Choose a clear tag: <name>-YYYY-MM-DD.vN (e.g., images-2024-10-15.v1). Never mutate it later.
Step 5 — Store and lock
Store manifest alongside data or in a registry folder. Mark as read-only or protect with reviews.
Step 6 — Verify
Re-hash a random 5–10% of files and compare against the manifest. In CI, verify all files for release candidates.
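A spot-check sketch in Python, assuming the manifest layout above (the 10% sample is one way to hit the 5–10% range):

import hashlib, json, random
from pathlib import Path

manifest = json.loads(Path("manifest.json").read_text())
# Pick at least one file, otherwise roughly 10% of the manifest.
sample = random.sample(manifest["files"], max(1, len(manifest["files"]) // 10))
for entry in sample:
    digest = hashlib.sha256(Path(entry["path"]).read_bytes()).hexdigest()
    print("ok" if digest == entry["sha256"] else "MISMATCH", entry["path"])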
Step 7 — Document lineage
Record source URIs and code commit that produced the snapshot for traceability and audits.
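One way to capture lineage automatically, assuming the snapshot is built from inside the Git repository that produced it (field names mirror Example 2; the URIs are placeholders):

import json, subprocess

# Ask Git for the short commit hash of the code that built this snapshot.
commit = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                        capture_output=True, text=True, check=True).stdout.strip()

snapshot_meta = {
    "snapshot": "transactions-2024-10-15.v1",
    "manifest_path": "manifests/transactions-2024-10-15.v1.json",
    "source_code_commit": commit,
    "data_source": ["s3://prod-raw/transactions/2024-09"],  # record every input URI
}
print(json.dumps(snapshot_meta, indent=2))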
Validation and self-check
- Does every listed file exist, with exact size and sha256 match?
- Are there any extra files on disk not listed in the manifest?
- Is the snapshot tag immutable and descriptive?
- Can another person reproduce the manifest on a fresh machine?
Quick manual verification recipe
# Shell recipe (assumes GNU coreutils sha256sum and jq)
# 1) List files deterministically
find data -type f | LC_ALL=C sort > files.txt
# 2) Compute sha256 for each
while IFS= read -r f; do sha256sum "$f"; done < files.txt > checksums.txt
# 3) Compare against manifest entries (path + sha256)
jq -r '.files[] | "\(.sha256)  \(.path)"' manifest.json | LC_ALL=C sort > expected.txt
LC_ALL=C sort checksums.txt | diff - expected.txt && echo "manifest verified"
Exercises
Complete these in your local environment. You can use any scripting language or command-line tools you prefer.
Exercise 1 — Build a deterministic dataset manifest (sha256 + sizes)
- Target: create manifest.json for a dataset with folders data/train and data/val.
- Ignore temporary files like .DS_Store and hidden thumbnails.
- Include dataset_name, schema_version, created_at, files[], and stats.
Exercise 2 — Create two immutable snapshots and produce a diff report
- Make two manifests (v1 and v2) where v2 adds one file and modifies one file.
- Produce a JSON diff with added/removed/modified/unchanged and a summary.
Checklist before you submit (both exercises)
- All file paths are relative and consistent.
- No untracked files left in the dataset directories.
- sha256 hashes recompute to the same values.
- Versions are immutable (no editing of v1 after creating v2).
Common mistakes and how to self-check
- Mutating a snapshot after publishing. Fix: treat snapshots as read-only; create a new version for any change.
- Relying on timestamps instead of hashes. Fix: always use content hashes for determinism.
- Mixing absolute paths in manifests. Fix: store relative paths from dataset root to keep portability.
- Ignoring hidden/transient files. Fix: define explicit include/exclude rules and document them.
- Hashing compressed archives that are re-packed. Fix: prefer hashing actual file contents or use deterministic compression settings.
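For the last point, a sketch of hashing the members of a tar archive rather than the archive file itself, so a repack alone does not invalidate recorded hashes:

import hashlib, tarfile

def hash_archive_members(archive_path: str) -> dict:
    # Hash each member's contents; repack-only changes (timestamps,
    # ordering, compression settings) leave these hashes untouched.
    hashes = {}
    with tarfile.open(archive_path, "r:*") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if member.isfile():
                data = tar.extractfile(member).read()
                hashes[member.name] = hashlib.sha256(data).hexdigest()
    return hashes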
Self-check routine
- Re-run your hashing on a different machine or container and compare outputs.
- Run a manifest linter: check required fields, types, and that counts match reality.
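A minimal manifest linter sketch in Python (lint_manifest and the required-field table are illustrative, matching the fields used in this lesson):

import json
from pathlib import Path

REQUIRED = {"dataset_name": str, "schema_version": str, "created_at": str, "files": list}

def lint_manifest(path: str) -> list[str]:
    m = json.loads(Path(path).read_text())
    errors = [f"bad or missing field: {k}" for k, t in REQUIRED.items()
              if not isinstance(m.get(k), t)]
    for i, f in enumerate(m.get("files", [])):
        for k, t in (("path", str), ("size_bytes", int), ("sha256", str)):
            if not isinstance(f.get(k), t):
                errors.append(f"files[{i}]: bad or missing field: {k}")
    stats = m.get("stats", {})
    if stats and stats.get("file_count") != len(m.get("files", [])):
        errors.append("stats.file_count does not match len(files)")
    return errors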
Practical projects
- Small: Create a manifest for a 1,000-image classification dataset with train/val/test splits.
- Medium: Add basic stats to your manifest (row_count, missing_values per column) and validate them in CI.
- Large: Build a snapshot registry folder with 5 sequential versions and a script that generates a human-readable CHANGELOG from manifest diffs.
Who this is for, prerequisites, and next steps
Who this is for
- MLOps Engineers ensuring reproducible ML pipelines.
- Data Engineers curating datasets for training and evaluation.
- ML Researchers needing reliable data version control.
Prerequisites
- Comfort with filesystems and command-line basics.
- Basic understanding of hashing and JSON/YAML.
- Familiarity with versioning mindset (immutability, tagging).
Next steps
- Integrate manifest verification into CI before training runs.
- Add lineage fields (source URIs, code commit) to every snapshot.
- Automate diff reporting and gate merges on review of data changes.
Mini challenge
Design a manifest schema extension to capture data splits and label distribution for a classification dataset. Implement it on a sample dataset and show a short JSON snippet with per-split counts and label histogram. Acceptance: schema is documented in the manifest, and totals across splits match the overall count.