Why this matters
As a Machine Learning Engineer, you will retrain models, debug drifting metrics, and answer audit questions like “Which data and code produced this model?” Without solid data and model versioning, you cannot reliably reproduce results, roll back quickly, or collaborate safely.
- Reproducibility: Re-run a past training job bit-for-bit.
- Traceability: Show exactly which data, code, parameters, and environment produced a model.
- Rollback: Swap a bad production model to a known good one fast.
- Experiments: Compare runs fairly when inputs and code are locked.
Concept explained simply
Versioning means giving every important thing in ML—a dataset snapshot, a feature set, a model artifact, even an inference image—a unique, immutable identity plus metadata. You can always retrieve and reproduce it later.
Mental model
Imagine a library. Each book (dataset/model) has a unique identifier (hash/tag), a card with details (metadata), and a shelf location (storage). You can check out any exact edition again and again. No surprises, no silent changes.
Core building blocks
- Identifiers: content hashes (e.g., SHA256), semantic version tags (e.g., v1.2.0), or commit IDs.
- Storage: artifact stores (local folder, S3/GCS/Azure), Git-LFS/DVC-like remotes for large files.
- Metadata: parameters, metrics, lineage, timestamps, owners, and purpose (staging/production).
- Lineage: a simple graph linking data version + code version + params → model version.
- Immutability: past versions are read-only; changes create new versions.
- Repro recipe: code commit + data snapshot + params + environment (deps/container) = deterministic run.
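The identifier and immutability ideas above can be sketched in a few lines. This is an illustrative helper, not any specific tool's API; the function name is an assumption for this lesson:

```python
# Minimal sketch: give an artifact an immutable identity by hashing its content.
# Two files with the same bytes get the same ID; any change produces a new ID.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large artifacts never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The same digest can then serve as both an integrity check and a stable lookup key in your manifests.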
Quick glossary
- Snapshot: a frozen view of data at a point in time.
- Registry: a catalog that knows model versions, stages, and metadata.
- Artifact: any file saved from an ML workflow (dataset, model, metrics, plots, requirements).
Worked examples
Example 1: Version a dataset with hashes and a manifest
Goal: Create a dataset snapshot and lock it with a manifest that captures size and SHA256.
# Folder layout (manifest lives inside the snapshot, matching the template below)
.
└── data/
    └── snapshots/
        └── iris_v0.1/
            ├── iris.csv
            └── manifest.json
# Compute SHA256 (any language/tool that gives SHA256 is fine)
# macOS/Linux example:
shasum -a 256 data/snapshots/iris_v0.1/iris.csv
# → <sha256> data/snapshots/iris_v0.1/iris.csv
# Write manifest.json (minimal example)
{
  "dataset": "iris",
  "version": "v0.1",
  "files": [
    {
      "path": "data/snapshots/iris_v0.1/iris.csv",
      "size_bytes": 4608,
      "sha256": "<sha256>"
    }
  ],
  "created_utc": "2026-01-01T00:00:00Z",
  "notes": "Cleaned, normalized sepal/petal features"
}
Result: You can verify integrity at any time by re-computing the SHA256 and comparing with the manifest.
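The manual steps above can be automated. A minimal sketch, assuming the manifest fields from the example; the function name and signature are illustrative, not a specific tool:

```python
# Sketch: walk a snapshot folder, hash every file, and freeze the result
# in a manifest.json next to the data.
import hashlib
import json
import time
from pathlib import Path

def write_manifest(snapshot_dir: str, dataset: str, version: str,
                   notes: str = "") -> dict:
    root = Path(snapshot_dir)
    files = []
    for p in sorted(root.rglob("*")):
        # Skip the manifest itself so re-running stays deterministic.
        if p.is_file() and p.name != "manifest.json":
            files.append({
                "path": str(p),
                "size_bytes": p.stat().st_size,
                "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
            })
    manifest = {
        "dataset": dataset,
        "version": version,
        "files": files,
        "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "notes": notes,
    }
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Sorting the walk keeps the file order stable, so two runs over identical data produce identical manifests.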
Example 2: Model versions with stages (dev → staging → production)
Goal: Keep multiple model versions and mark one as production without deleting old ones.
# Folder layout
models/
  iris_clf/
    v0.1.0/
      model.pkl
      metrics.json   # {"f1": 0.91, "timestamp": "..."}
      params.json    # {"seed": 42, "C": 1.0}
    v0.2.0/
      model.pkl
      metrics.json   # {"f1": 0.94}
      params.json
    registry.json
# registry.json (minimal)
{
  "name": "iris_clf",
  "versions": [
    {"version": "v0.1.0", "stage": "archived"},
    {"version": "v0.2.0", "stage": "production"}
  ]
}
Result: Your serving system reads the production tag from registry.json and loads that version. Rollback = change the stage mapping.
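Reading the production pointer and performing a rollback can be sketched like this, assuming the registry.json shape above; both function names are illustrative:

```python
# Sketch: resolve and update the production pointer in registry.json.
# Rollback = call set_stage with an older version; nothing is deleted.
import json
from pathlib import Path

def production_version(registry_path: str) -> str:
    """Return the single version currently marked as production."""
    registry = json.loads(Path(registry_path).read_text())
    prod = [v["version"] for v in registry["versions"]
            if v["stage"] == "production"]
    assert len(prod) == 1, "expected exactly one production version"
    return prod[0]

def set_stage(registry_path: str, version: str, stage: str) -> None:
    """Promote or roll back a version by rewriting stage labels."""
    path = Path(registry_path)
    registry = json.loads(path.read_text())
    if stage == "production":
        for v in registry["versions"]:
            if v["stage"] == "production":
                v["stage"] = "archived"  # demote the old production version
    for v in registry["versions"]:
        if v["version"] == version:
            v["stage"] = stage
    path.write_text(json.dumps(registry, indent=2))
```

The assertion enforces the "single source of truth" rule: a registry with zero or two production versions is a bug, not a state to serve from.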
Example 3: Repro recipe ties it all together
Goal: Fully reproduce a training run.
# Repro record (yaml/json)
run_id: 2026-01-01-iris-001
code_commit: 9f2e1c4
data_snapshot: iris_v0.1   # points to manifest with SHA256
params:
  seed: 42
  C: 1.0
  penalty: l2
env:
  python: "3.10"   # quoted so YAML keeps it as a string, not the float 3.1
  requirements_lock: sha256:7b1f...
  container_image: ghcr.io/org/ml:1.2.3
outputs:
  model_version: v0.2.0
  metrics:
    f1: 0.94
Result: Anyone with access to code commit 9f2e1c4, the iris_v0.1 snapshot, and this environment can recreate model v0.2.0 and the same metrics.
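A training script can emit this record automatically. A minimal JSON sketch (same fields as the YAML above; the function name and argument order are assumptions for this lesson):

```python
# Sketch: freeze a training run's inputs and outputs into one record file.
import json
import time
from pathlib import Path

def write_repro_record(run_id: str, code_commit: str, data_snapshot: str,
                       params: dict, env: dict, outputs: dict,
                       out_dir: str = "runs") -> Path:
    record = {
        "run_id": run_id,
        "code_commit": code_commit,      # e.g. `git rev-parse --short HEAD`
        "data_snapshot": data_snapshot,  # must resolve to a manifest with SHA256s
        "params": params,
        "env": env,
        "outputs": outputs,
        "recorded_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{run_id}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Writing the record at the end of every run, rather than by hand, is what makes the lineage trustworthy.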
How to implement in a small team
- Pick identifiers: semantic versions for releases (v0.1.0), hashes for integrity.
- Choose storage: a shared folder or object store; keep old versions immutable.
- Create manifests: for each data snapshot and model version.
- Record lineage: code commit + data snapshot + params + env → model version.
- Automate checks: verify SHA256 before training and before deployment.
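The "automate checks" step above might look like this sketch, which a training or deployment script can call and abort on any mismatch (manifest fields follow Example 1; the function name is illustrative):

```python
# Sketch: re-hash every file listed in a manifest and report problems.
# An empty return value means the snapshot is intact.
import hashlib
import json
from pathlib import Path

def verify_manifest(manifest_path: str) -> list:
    manifest = json.loads(Path(manifest_path).read_text())
    problems = []
    for entry in manifest["files"]:
        p = Path(entry["path"])
        if not p.exists():
            problems.append(f"missing: {p}")
            continue
        actual = hashlib.sha256(p.read_bytes()).hexdigest()
        if actual != entry["sha256"]:
            problems.append(f"hash mismatch: {p}")
    return problems
```

A one-line guard such as `assert not verify_manifest("data/snapshots/iris_v0.1/manifest.json")` at the top of a training script is often enough to catch silent data changes.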
Minimal folder template you can copy
project/
  data/
    snapshots/
      <name>_vX.Y/
        ...files
        manifest.json
  models/
    <model_name>/
      vX.Y.Z/
        model.pkl
        metrics.json
        params.json
      registry.json
  runs/
    <timestamp-id>.yaml   # repro record
  code/
    ...
  env/
    requirements.lock
Exercises
The hands-on tasks below mirror the graded exercises. Do them locally, then check off your list.
- [ ] Exercise 1: Create a dataset snapshot and manifest with SHA256.
- [ ] Exercise 2: Produce two model versions and mark one as production.
- [ ] Exercise 3: Write a complete repro record and verify it end-to-end.
Common mistakes and self-check
- Overwriting data files in place. Self-check: Are old versions still retrievable?
- Missing environment locks. Self-check: Do you have a requirements.lock or container tag?
- Unclear production pointer. Self-check: Is there a single source of truth (registry) for which model serves?
- No integrity verification. Self-check: Do you re-hash files before training/deploying?
- Mixing dev experiments with releases. Self-check: Do only tagged versions reach staging/production?
How to recover if you already overwrote files
Create a new snapshot from your current state, hash it, and freeze it. Then update your processes to forbid in-place edits.
Mini tasks (5–10 minutes)
- Create a semantic versioning policy: what changes bump major/minor/patch?
- Add a checksum verification step to your training script.
- Write a short model card template (inputs, training data version, metrics, intended use).
Practical projects
- Build a lightweight model registry using JSON files and folders. Support stages and comments.
- Create a data snapshotter script that walks a folder, computes SHA256, writes/updates manifest.json, and verifies integrity.
- Automate a CI step: on new tag vX.Y.Z, verify data snapshot hash, run training, produce artifacts, and update registry.json.
Who this is for
Engineers and data scientists who train, evaluate, or ship ML models and need reproducibility and safe rollbacks.
Prerequisites
- Basic command line usage.
- Familiarity with Git concepts (commits, tags).
- Ability to run Python and install packages.
Learning path
- Before: Source control fundamentals, clean data pipelines.
- Now: Data and model versioning (this lesson).
- Next: Experiment tracking, deployment workflows, monitoring and rollback procedures.
Next steps
- Finish the exercises and verify your manifests and registry work.
- Take the Quick Test to check your understanding.
- Integrate version checks into your team’s training and deployment scripts.
Mini challenge
Given a failing production model, demonstrate a rollback in under 5 minutes using your registry.json. Document the exact steps and verify that monitoring reflects the new (old) version.