Why this matters
In MLOps, you will be asked to reproduce a training run, compare model performance across data versions, audit what changed, or roll back quickly. Dataset snapshots and manifests give you:
- Reproducibility: exact file set and hashes captured at a point in time.
- Auditability: a record of who changed the data, when, and what changed.
- Safety: immutable tags to roll back or pin training jobs.
- Efficiency: deterministic caching and fast diffs between versions.
Concept explained simply
A dataset snapshot is a frozen, named point-in-time version of your dataset. A manifest is a file (often JSON or YAML) that lists every file in the snapshot with its path, size, and checksum hash. Together they make your dataset reproducible and verifiable.
Mental model
Think of a snapshot as a photo album (immutable tag), and the manifest as the album index listing every photo with a fingerprint (hash). If any photo changes, its fingerprint changes, and you immediately see the difference.
Core definitions
- Snapshot: Immutable tag pointing to an exact set of files and their checksums.
- Manifest: Machine-readable inventory of files with fields like path, size_bytes, checksum (e.g., sha256), optional metadata (schema version, splits, stats).
- Checksum/Hash: A deterministic fingerprint of file contents (e.g., sha256). Any change alters the hash; see the sketch after this list.
- Content-addressing: Using the hash to identify content. Same content, same hash.
- Lineage: Links from a snapshot to its data sources, transforms, and code commit.
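A quick Python sketch makes the checksum and content-addressing properties concrete (the byte strings are made up for illustration):

import hashlib

# Identical content always yields the identical fingerprint; any change alters it.
a = hashlib.sha256(b"2024-09,txn,12.50\n").hexdigest()
b = hashlib.sha256(b"2024-09,txn,12.50\n").hexdigest()
c = hashlib.sha256(b"2024-09,txn,12.51\n").hexdigest()
assert a == b  # same content, same hash (content-addressing)
assert a != c  # one changed byte, different hash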
Worked examples
Example 1: Minimal manifest for a tabular dataset
Suppose you have two partitions:
data/raw/transactions_2024-09.csv
data/raw/transactions_2024-10.csv
Create manifest.json:
{
  "dataset_name": "transactions",
  "schema_version": "1.0",
  "created_at": "2024-10-15T10:20:00Z",
  "created_by": "mlops@example",
  "files": [
    {"path": "data/raw/transactions_2024-09.csv", "size_bytes": 10485760, "sha256": "<hash1>"},
    {"path": "data/raw/transactions_2024-10.csv", "size_bytes": 15728640, "sha256": "<hash2>"}
  ],
  "stats": {"file_count": 2, "total_size_bytes": 26214400}
}
Now anyone can verify the dataset by recomputing sizes and hashes and comparing with the manifest.
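A minimal sketch of that verification step in Python, assuming the manifest layout above (verify_manifest is an illustrative name, not a standard API):

import hashlib, json
from pathlib import Path

def verify_manifest(manifest_path: str, root: str = ".") -> list[str]:
    """Return a list of problems; an empty list means the dataset matches."""
    manifest = json.loads(Path(manifest_path).read_text())
    problems = []
    for entry in manifest["files"]:
        p = Path(root) / entry["path"]
        if not p.exists():
            problems.append(f"missing: {entry['path']}")
        elif p.stat().st_size != entry["size_bytes"]:
            problems.append(f"size mismatch: {entry['path']}")
        # read_bytes() keeps the sketch short; stream large files (see below)
        elif hashlib.sha256(p.read_bytes()).hexdigest() != entry["sha256"]:
            problems.append(f"hash mismatch: {entry['path']}")
    return problems

print(verify_manifest("manifest.json") or "dataset verified")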
Example 2: Snapshot naming and metadata
Use a clear, immutable tag like transactions-2024-10-15.v1. Store snapshot.json:
{
  "snapshot": "transactions-2024-10-15.v1",
  "manifest_path": "manifests/transactions-2024-10-15.v1.json",
  "source_code_commit": "a1b2c3d",
  "data_source": ["s3://prod-raw/transactions/2024-09", "s3://prod-raw/transactions/2024-10"],
  "notes": "Initial October merge"
}
Keep the manifest immutable; any change produces a new snapshot tag (e.g., .v2).
Example 3: Manifest diff to see what changed
Given two manifests A and B, diff them by path and sha256:
{
  "added": ["data/raw/transactions_2024-10.csv"],
  "removed": [],
  "modified": [
    {"path": "data/raw/transactions_2024-09.csv", "old_sha256": "<h1>", "new_sha256": "<h2>"}
  ],
  "unchanged": ["readme.md"],
  "summary": {"added": 1, "removed": 0, "modified": 1, "unchanged": 1}
}
This summary is perfect for PR review and audit logs.
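A small Python sketch that produces this report from two manifest files (diff_manifests is an illustrative helper, assuming the manifest format from Example 1):

import json
from pathlib import Path

def diff_manifests(path_a: str, path_b: str) -> dict:
    # Index each manifest by path -> sha256, then compare the two indexes.
    a = {f["path"]: f["sha256"] for f in json.loads(Path(path_a).read_text())["files"]}
    b = {f["path"]: f["sha256"] for f in json.loads(Path(path_b).read_text())["files"]}
    added = sorted(set(b) - set(a))
    removed = sorted(set(a) - set(b))
    common = sorted(set(a) & set(b))
    modified = [{"path": p, "old_sha256": a[p], "new_sha256": b[p]}
                for p in common if a[p] != b[p]]
    unchanged = [p for p in common if a[p] == b[p]]
    return {"added": added, "removed": removed, "modified": modified,
            "unchanged": unchanged,
            "summary": {"added": len(added), "removed": len(removed),
                        "modified": len(modified), "unchanged": len(unchanged)}}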
How to compute hashes safely
- Use a strong hash (sha256). Avoid weak hashes for audit-grade workflows.
- Hash file contents only, not file names.
- Stream large files to avoid running out of memory.
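A minimal streaming hasher in Python, reading fixed-size chunks so even multi-gigabyte files use constant memory (the 1 MiB chunk size is an arbitrary choice):

import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    # Read 1 MiB at a time instead of loading the whole file.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()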
Step-by-step: Build a reliable snapshot + manifest
Step 1 — Normalize structure
Organize files in stable directories (e.g., data/train, data/val, data/test). Avoid spaces in paths and unstable temp folders.
Step 2 — Compute checksums
For each file: compute sha256 and record size in bytes. Ignore transient files (.DS_Store, logs).
Step 3 — Write manifest
Create manifest.json with dataset_name, schema_version, created_at, created_by, files[], optional splits[] and stats.
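A sketch of Steps 2–3 together in Python, assuming the manifest fields above (build_manifest and the exclude rules are illustrative; adapt them to your own conventions):

import hashlib, json
from datetime import datetime, timezone
from pathlib import Path

EXCLUDE = {".DS_Store"}  # assumed transient files; extend to match your rules

def build_manifest(root: str, dataset_name: str, schema_version: str = "1.0") -> dict:
    files = []
    for p in sorted(Path(root).rglob("*")):  # sorted walk keeps output deterministic
        if p.is_file() and p.name not in EXCLUDE and not p.name.startswith("."):
            files.append({
                "path": p.relative_to(root).as_posix(),  # relative paths stay portable
                "size_bytes": p.stat().st_size,
                # read_bytes() for brevity; stream large files as shown earlier
                "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
            })
    return {
        "dataset_name": dataset_name,
        "schema_version": schema_version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "files": files,
        "stats": {"file_count": len(files),
                  "total_size_bytes": sum(f["size_bytes"] for f in files)},
    }

Path("manifest.json").write_text(json.dumps(build_manifest("data", "transactions"), indent=2))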
Step 4 — Tag the snapshot
Choose a clear tag: <name>-YYYY-MM-DD.vN (e.g., images-2024-10-15.v1). Never mutate it later.
Step 5 — Store and lock
Store manifest alongside data or in a registry folder. Mark as read-only or protect with reviews.
Step 6 — Verify
Re-hash a random 5–10% of files and compare against the manifest. In CI, verify all files for release candidates.
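A spot-check sketch in Python, assuming the manifest layout above (the 10% sample is one way to hit the 5–10% range):

import hashlib, json, random
from pathlib import Path

manifest = json.loads(Path("manifest.json").read_text())
# Pick at least one file, otherwise roughly 10% of the manifest.
sample = random.sample(manifest["files"], max(1, len(manifest["files"]) // 10))
for entry in sample:
    digest = hashlib.sha256(Path(entry["path"]).read_bytes()).hexdigest()
    print("ok" if digest == entry["sha256"] else "MISMATCH", entry["path"])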
Step 7 — Document lineage
Record source URIs and code commit that produced the snapshot for traceability and audits.
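One way to capture lineage automatically, assuming the snapshot is built from inside the Git repository that produced it (field names mirror Example 2; the URIs are placeholders):

import json, subprocess

# Ask Git for the short commit hash of the code that built this snapshot.
commit = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                        capture_output=True, text=True, check=True).stdout.strip()

snapshot_meta = {
    "snapshot": "transactions-2024-10-15.v1",
    "manifest_path": "manifests/transactions-2024-10-15.v1.json",
    "source_code_commit": commit,
    "data_source": ["s3://prod-raw/transactions/2024-09"],  # record every input URI
}
print(json.dumps(snapshot_meta, indent=2))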
Validation and self-check
- Does every listed file exist, with exact size and sha256 match?
- Are there any extra files on disk not listed in the manifest?
- Is the snapshot tag immutable and descriptive?
- Can another person reproduce the manifest on a fresh machine?
Quick manual verification recipe
# Shell recipe (assumes GNU coreutils sha256sum and jq)
# 1) List files deterministically
find data -type f | LC_ALL=C sort > files.txt
# 2) Compute sha256 for each
while IFS= read -r f; do sha256sum "$f"; done < files.txt > checksums.txt
# 3) Compare against manifest entries (path + sha256)
jq -r '.files[] | "\(.sha256)  \(.path)"' manifest.json | LC_ALL=C sort > expected.txt
LC_ALL=C sort checksums.txt | diff - expected.txt && echo "manifest verified"
Exercises
Complete these in your local environment. You can use any scripting language or command-line tools you prefer.
Exercise 1 — Build a deterministic dataset manifest (sha256 + sizes)
- Target: create manifest.json for a dataset with folders data/train and data/val.
- Ignore temporary files like .DS_Store and hidden thumbnails.
- Include dataset_name, schema_version, created_at, files[], and stats.
Exercise 2 — Create two immutable snapshots and produce a diff report
- Make two manifests (v1 and v2) where v2 adds one file and modifies one file.
- Produce a JSON diff with added/removed/modified/unchanged and a summary.
Checklist before you submit (both exercises)
- All file paths are relative and consistent.
- No untracked files left in the dataset directories.
- sha256 hashes recompute to the same values.
- Versions are immutable (no editing of v1 after creating v2).
Common mistakes and how to self-check
- Mutating a snapshot after publishing. Fix: treat snapshots as read-only; create a new version for any change.
- Relying on timestamps instead of hashes. Fix: always use content hashes for determinism.
- Mixing absolute paths in manifests. Fix: store relative paths from dataset root to keep portability.
- Ignoring hidden/transient files. Fix: define explicit include/exclude rules and document them.
- Hashing compressed archives that are re-packed. Fix: prefer hashing actual file contents or use deterministic compression settings.
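For the last point, a sketch of hashing the members of a tar archive rather than the archive file itself, so a repack alone does not invalidate recorded hashes:

import hashlib, tarfile

def hash_archive_members(archive_path: str) -> dict:
    # Hash each member's contents; repack-only changes (timestamps,
    # ordering, compression settings) leave these hashes untouched.
    hashes = {}
    with tarfile.open(archive_path, "r:*") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if member.isfile():
                data = tar.extractfile(member).read()
                hashes[member.name] = hashlib.sha256(data).hexdigest()
    return hashes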
Self-check routine
- Re-run your hashing on a different machine or container and compare outputs.
- Run a manifest linter: check required fields, types, and that counts match reality.
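A minimal manifest linter sketch in Python (lint_manifest and the required-field table are illustrative, matching the fields used in this lesson):

import json
from pathlib import Path

REQUIRED = {"dataset_name": str, "schema_version": str, "created_at": str, "files": list}

def lint_manifest(path: str) -> list[str]:
    m = json.loads(Path(path).read_text())
    errors = [f"bad or missing field: {k}" for k, t in REQUIRED.items()
              if not isinstance(m.get(k), t)]
    for i, f in enumerate(m.get("files", [])):
        for k, t in (("path", str), ("size_bytes", int), ("sha256", str)):
            if not isinstance(f.get(k), t):
                errors.append(f"files[{i}]: bad or missing field: {k}")
    stats = m.get("stats", {})
    if stats and stats.get("file_count") != len(m.get("files", [])):
        errors.append("stats.file_count does not match len(files)")
    return errors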
Practical projects
- Small: Create a manifest for a 1,000-image classification dataset with train/val/test splits.
- Medium: Add basic stats to your manifest (row_count, missing_values per column) and validate them in CI.
- Large: Build a snapshot registry folder with 5 sequential versions and a script that generates a human-readable CHANGELOG from manifest diffs.
Who this is for, prerequisites, and next steps
Who this is for
- MLOps Engineers ensuring reproducible ML pipelines.
- Data Engineers curating datasets for training and evaluation.
- ML Researchers needing reliable data version control.
Prerequisites
- Comfort with filesystems and command-line basics.
- Basic understanding of hashing and JSON/YAML.
- Familiarity with versioning mindset (immutability, tagging).
Next steps
- Integrate manifest verification into CI before training runs.
- Add lineage fields (source URIs, code commit) to every snapshot.
- Automate diff reporting and gate merges on review of data changes.
Mini challenge
Design a manifest schema extension to capture data splits and label distribution for a classification dataset. Implement it on a sample dataset and show a short JSON snippet with per-split counts and label histogram. Acceptance: schema is documented in the manifest, and totals across splits match the overall count.