
Data And Model Versioning

Learn Data And Model Versioning for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Machine Learning Engineer, you will retrain models, debug drifting metrics, and answer audit questions like “Which data and code produced this model?” Without solid data and model versioning, you cannot reliably reproduce results, roll back quickly, or collaborate safely.

  • Reproducibility: Re-run a past training job bit-for-bit.
  • Traceability: Show exactly which data, code, parameters, and environment produced a model.
  • Rollback: Swap a bad production model to a known good one fast.
  • Experiments: Compare runs fairly when inputs and code are locked.

Concept explained simply

Versioning means giving every important thing in ML (a dataset snapshot, a feature set, a model artifact, even an inference image) a unique, immutable identity plus metadata, so you can always retrieve and reproduce it later.

Mental model

Imagine a library. Each book (dataset/model) has a unique identifier (hash/tag), a card with details (metadata), and a shelf location (storage). You can check out any exact edition again and again. No surprises, no silent changes.

Core building blocks

  • Identifiers: content hashes (e.g., SHA256), semantic version tags (e.g., v1.2.0), or commit IDs.
  • Storage: artifact stores (local folder, S3/GCS/Azure), Git-LFS/DVC-like remotes for large files.
  • Metadata: parameters, metrics, lineage, timestamps, owners, and purpose (staging/production).
  • Lineage: a simple graph linking data version + code version + params → model version.
  • Immutability: past versions are read-only; changes create new versions.
  • Repro recipe: code commit + data snapshot + params + environment (deps/container) = deterministic run.

Quick glossary

  • Snapshot: a frozen view of data at a point in time.
  • Registry: a catalog that knows model versions, stages, and metadata.
  • Artifact: any saved file from the ML workflow (dataset, model, metrics, plots, requirements).

Worked examples

Example 1: Version a dataset with hashes and a manifest

Goal: Create a dataset snapshot and lock it with a manifest that captures size and SHA256.

# Folder layout
.
└── data/
    └── snapshots/
        ├── iris_v0.1/
        │   └── iris.csv
        └── manifest.json

# Compute SHA256 (any language/tool that gives SHA256 is fine)
# macOS/Linux example:
shasum -a 256 data/snapshots/iris_v0.1/iris.csv
# → <sha256>  data/snapshots/iris_v0.1/iris.csv

# Write manifest.json (minimal example)
{
  "dataset": "iris",
  "version": "v0.1",
  "files": [
    {
      "path": "data/snapshots/iris_v0.1/iris.csv",
      "size_bytes": 4608,
      "sha256": "<sha256>"
    }
  ],
  "created_utc": "2026-01-01T00:00:00Z",
  "notes": "Cleaned, normalized sepal/petal features"
}

Result: You can verify integrity at any time by re-computing the SHA256 and comparing with the manifest.
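
If you would rather script this step than run shasum by hand, here is a minimal Python sketch that hashes each file and writes the manifest. The helper names sha256_of and build_manifest are illustrative, not part of any standard tool, and the paths assume the layout above.

# Sketch: build and write manifest.json for a snapshot
import hashlib
import json
import os
from datetime import datetime, timezone

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large datasets never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(dataset: str, version: str, files: list[str], notes: str = "") -> dict:
    return {
        "dataset": dataset,
        "version": version,
        "files": [
            {"path": p, "size_bytes": os.path.getsize(p), "sha256": sha256_of(p)}
            for p in files
        ],
        "created_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "notes": notes,
    }

manifest = build_manifest(
    dataset="iris",
    version="v0.1",
    files=["data/snapshots/iris_v0.1/iris.csv"],
    notes="Cleaned, normalized sepal/petal features",
)
with open("data/snapshots/manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)

Hashing in fixed-size chunks keeps memory use flat even for multi-gigabyte snapshots.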

Example 2: Model versions with stages (dev → staging → production)

Goal: Keep multiple model versions and mark one as production without deleting old ones.

# Folder layout
models/
  iris_clf/
    v0.1.0/
      model.pkl
      metrics.json  # {"f1": 0.91, "timestamp": "..."}
      params.json   # {"seed": 42, "C": 1.0}
    v0.2.0/
      model.pkl
      metrics.json  # {"f1": 0.94}
      params.json
    registry.json

# registry.json (minimal)
{
  "name": "iris_clf",
  "versions": [
    {"version": "v0.1.0", "stage": "archived"},
    {"version": "v0.2.0", "stage": "production"}
  ]
}

Result: Your serving system reads the production tag from registry.json and loads that version. Rollback = change the stage mapping.
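
One way to automate promotion and rollback is a thin wrapper around registry.json. The sketch below assumes the folder layout above; resolve_stage and set_stage are illustrative helper names, not an existing registry API.

# Sketch: resolve and change the production pointer in registry.json
import json
from pathlib import Path

REGISTRY = Path("models/iris_clf/registry.json")

def resolve_stage(stage: str = "production") -> Path:
    """Return the model path for whichever version currently holds the stage."""
    registry = json.loads(REGISTRY.read_text())
    for entry in registry["versions"]:
        if entry["stage"] == stage:
            return REGISTRY.parent / entry["version"] / "model.pkl"
    raise LookupError(f"No version in stage {stage!r}")

def set_stage(version: str, stage: str = "production") -> None:
    """Promotion or rollback: move the stage label to an existing version."""
    registry = json.loads(REGISTRY.read_text())
    if version not in {e["version"] for e in registry["versions"]}:
        raise LookupError(f"Unknown version {version!r}")
    for entry in registry["versions"]:
        if entry["stage"] == stage:
            entry["stage"] = "archived"  # demote the current holder
        if entry["version"] == version:
            entry["stage"] = stage
    REGISTRY.write_text(json.dumps(registry, indent=2))

model_path = resolve_stage("production")  # e.g. models/iris_clf/v0.2.0/model.pkl
set_stage("v0.1.0")                       # rollback: v0.1.0 now serves production

Because old versions stay on disk and are never overwritten, rollback is just a pointer change followed by a reload of the serving process.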

Example 3: Repro recipe ties it all together

Goal: Fully reproduce a training run.

# Repro record (yaml/json)
run_id: 2026-01-01-iris-001
code_commit: 9f2e1c4
data_snapshot: iris_v0.1  # points to manifest with SHA256
params:
  seed: 42
  C: 1.0
  penalty: l2
env:
  python: 3.10
  requirements_lock: sha256:7b1f...
  container_image: ghcr.io/org/ml:1.2.3
outputs:
  model_version: v0.2.0
  metrics:
    f1: 0.94

Result: Anyone with access to code commit 9f2e1c4, the iris_v0.1 snapshot, and this environment can recreate model v0.2.0 and the same metrics.
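
Before re-running a record like this, it helps to confirm that the referenced data snapshot is still intact. Here is a minimal sketch of that check, assuming the manifest format from Example 1 (verify_snapshot is an illustrative name):

# Sketch: verify a data snapshot against its manifest before training or deploying
import hashlib
import json

def verify_snapshot(manifest_path: str = "data/snapshots/manifest.json") -> None:
    with open(manifest_path) as f:
        manifest = json.load(f)
    for entry in manifest["files"]:
        h = hashlib.sha256()
        with open(entry["path"], "rb") as data:
            h.update(data.read())
        if h.hexdigest() != entry["sha256"]:
            raise ValueError(f"Checksum mismatch for {entry['path']}")
    print(f"Snapshot {manifest['dataset']} {manifest['version']} verified")

verify_snapshot()  # call this at the start of training and again before deployment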

How to implement in a small team

  1. Pick identifiers: semantic versions for releases (v0.1.0), hashes for integrity.
  2. Choose storage: a shared folder or object store; keep old versions immutable.
  3. Create manifests: for each data snapshot and model version.
  4. Record lineage: code commit + data snapshot + params + env → model version.
  5. Automate checks: verify SHA256 before training and before deployment.

Minimal folder template you can copy

project/
  data/
    snapshots/
      <name>_vX.Y/
        ...files
      manifest.json
  models/
    <model_name>/
      vX.Y.Z/
        model.pkl
        metrics.json
        params.json
      registry.json
  runs/
    <timestamp-id>.yaml  # repro record
  code/
    ...
  env/
    requirements.lock

Exercises

Hands-on tasks mirror the graded exercises below. Do them locally, then mark your checklist.

  • [ ] Exercise 1: Create a dataset snapshot and manifest with SHA256.
  • [ ] Exercise 2: Produce two model versions and mark one as production.
  • [ ] Exercise 3: Write a complete repro record and verify it end-to-end.

Note: The Quick Test is available to everyone. If you log in, your exercise and test progress will be saved to your profile.

Common mistakes and self-check

  • Overwriting data files in place. Self-check: Are old versions still retrievable?
  • Missing environment locks. Self-check: Do you have a requirements.lock or container tag?
  • Unclear production pointer. Self-check: Is there a single source of truth (registry) for which model serves?
  • No integrity verification. Self-check: Do you re-hash files before training/deploying?
  • Mixing dev experiments with releases. Self-check: Do only tagged versions reach staging/production?

How to recover if you already overwrote files

Create a new snapshot from your current state, hash it, and freeze it. Then update your processes to forbid in-place edits.

Mini tasks (5–10 minutes)

  1. Create a semantic versioning policy: what changes bump major/minor/patch?
  2. Add a checksum verification step to your training script.
  3. Write a short model card template (inputs, training data version, metrics, intended use).

Practical projects

  • Build a lightweight model registry using JSON files and folders. Support stages and comments.
  • Create a data snapshotter script that walks a folder, computes SHA256, writes/updates manifest.json, and verifies integrity.
  • Automate a CI step: on new tag vX.Y.Z, verify data snapshot hash, run training, produce artifacts, and update registry.json.

Who this is for

Engineers and data scientists who train, evaluate, or ship ML models and need reproducibility and safe rollbacks.

Prerequisites

  • Basic command line usage.
  • Familiarity with Git concepts (commits, tags).
  • Ability to run Python and install packages.

Learning path

  • Before: Source control fundamentals, clean data pipelines.
  • Now: Data and model versioning (this lesson).
  • Next: Experiment tracking, deployment workflows, monitoring and rollback procedures.

Next steps

  • Finish the exercises and verify your manifests and registry work.
  • Take the Quick Test to check your understanding.
  • Integrate version checks into your team’s training and deployment scripts.

Mini challenge

Given a failing production model, demonstrate a rollback in under 5 minutes using your registry.json. Document the exact steps and verify that monitoring reflects the new (old) version.

Practice Exercises

3 exercises to complete

Instructions

  1. Create a folder data/snapshots/iris_v0.1 and place a CSV file named iris.csv inside (use any small CSV if you do not have iris).
  2. Compute the SHA256 of data/snapshots/iris_v0.1/iris.csv using a tool available on your OS (e.g., shasum -a 256 on macOS/Linux or a short Python script; a sketch of such a script follows this list).
  3. Create data/snapshots/manifest.json with keys: dataset, version, files (list of objects with path, size_bytes, sha256), created_utc, notes.
  4. Re-compute the checksum and confirm it matches the manifest.
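
If shasum is not available on your system, a short Python script like this sketch works for step 2 (the default path simply mirrors the layout from step 1):

# Sketch: print the SHA256 of a file, as a stand-in for shasum -a 256
import hashlib
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "data/snapshots/iris_v0.1/iris.csv"
with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print(digest, path)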

Expected Output

A manifest.json file that lists iris.csv with correct size and SHA256. Re-hashing the file produces exactly the same checksum recorded in the manifest.

Data And Model Versioning — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
