Why this matters
In production ML, models must be reproducible, auditable, and safe to roll back. Linking each model to the exact data, code, environment, and configuration that produced it gives you:
- Reproducibility: rebuild the same model when needed (bug fixes, audits, or regulatory requests).
- Diagnostics: trace performance drops to data or code changes quickly.
- Governance: show lineage from business input to deployed artifact.
- Operational speed: automate promotions and rollbacks with confidence.
Real tasks you will face
- Attach dataset version, Git commit, and Docker image digest to each registry entry.
- Record feature store snapshot IDs for offline training and online serving.
- Create a provenance manifest and store it next to the model artifact.
- Block promotion if lineage is incomplete (e.g., missing schema hash).
Concept explained simply
Linking means storing unambiguous identifiers that connect a model to all ingredients used to create it. Think of it as a recipe card pinned to the dish.
Mental model
Imagine a "model box" with labels on all sides:
- Front: model version and metrics.
- Left: data snapshot ID and schema hash.
- Right: Git commit SHA and training script path.
- Back: Docker image digest and dependency lockfile hash.
- Bottom: random seeds, hyperparameters, and hardware.
Anyone can pick up the box, read the labels, and rebuild the exact model.
What to link for strong lineage
- Data lineage:
  - Dataset version ID or content hash (e.g., DVC hash, snapshot timestamp).
  - Source URIs (e.g., S3/GCS paths) and row counts.
  - Schema hash and feature list (a hashing sketch follows this list); for feature stores, feature view names and their versions.
  - Time range for time-based snapshots to avoid leakage.
- Code lineage:
  - Git commit SHA (immutable) and optionally a tag or release name.
  - Training entry point (script/module) and config file version.
- Environment:
  - Container image digest (sha256), not just the tag.
  - Dependency lock (e.g., requirements.txt or conda-lock) checksum.
- Run context:
  - Random seeds, library/framework versions, hardware/accelerator type.
  - Training/eval dataset split identifiers.
- Outputs:
  - Model artifact checksum, eval metrics, and model signature (input/output schema).
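To make the schema hash concrete, here is a minimal sketch, assuming the dataset is a pandas DataFrame loaded from an illustrative Parquet path; the schema_hash helper name is my own, and a file content-hash helper appears in Example 1 below.

import hashlib
import json
import pandas as pd

def schema_hash(df: pd.DataFrame) -> str:
    # Hash the ordered column names and dtypes so any schema change is detectable.
    schema = [(col, str(dtype)) for col, dtype in df.dtypes.items()]
    return hashlib.sha256(json.dumps(schema).encode()).hexdigest()

df = pd.read_parquet("data/train.parquet")  # illustrative path
print({"schema_hash": schema_hash(df), "row_count": len(df)})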
Worked examples
Example 1: MLflow run with complete linkage
import os, json, hashlib, mlflow
from datetime import datetime
def file_sha256(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1_048_576), b''):
            h.update(chunk)
    return h.hexdigest()
data_version = os.environ.get("DATA_VERSION", "dvc:3d2f9c7")
git_commit = os.environ.get("GIT_COMMIT", "9f1a2b7c4d...")
docker_digest = os.environ.get("DOCKER_DIGEST", "sha256:ab12...")
conda_lock = "conda-lock.yml"
conda_hash = file_sha256(conda_lock)
mlflow.set_experiment("fraud_model")
with mlflow.start_run(run_name="xgboost_v21") as run:
    mlflow.log_params({"max_depth": 6, "eta": 0.15, "seed": 42})
    # train() ... save model to model.pkl
    mlflow.log_artifact("model.pkl")
    mlflow.log_artifact(conda_lock)
    mlflow.log_metrics({"auc": 0.924, "f1": 0.81})
    mlflow.set_tags({
        "git_commit": git_commit,
        "data_version": data_version,
        "docker_image_digest": docker_digest,
        "conda_lock_sha256": conda_hash,
        "train_start": datetime.utcnow().isoformat() + "Z",
        "feature_views": "transactions_v5, customers_v3",
        "schema_hash": "c4b7a10..."
    })
    # Register
    model_uri = f"runs:/{run.info.run_id}/model.pkl"
    mlflow.register_model(model_uri, "fraud_classifier")
What this gives you
- Registry entry referencing the exact run (contains commit, data, and env).
- Artifacts include model and environment lock file.
- Tags act as quick lineage fields for dashboards and checks.
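Once these tags exist, downstream tooling can read them back from the tracking server; a minimal sketch using MlflowClient, where the version number "1" is a placeholder:

from mlflow.tracking import MlflowClient

client = MlflowClient()
# Look up a registered model version, then the run that produced it.
mv = client.get_model_version("fraud_classifier", "1")  # version "1" is illustrative
run = client.get_run(mv.run_id)

# Lineage tags logged during training are available on the run.
for key in ("git_commit", "data_version", "docker_image_digest", "conda_lock_sha256"):
    print(key, "=", run.data.tags.get(key))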
Example 2: DVC data version linked into the registry
# Version your dataset
$ dvc add data/train.parquet
$ dvc push
$ git add data/train.parquet.dvc .gitignore
$ git commit -m "Version training dataset v57"
$ dvc status # should be up-to-date
# Extract DVC hash from .dvc file and pass to your run as DATA_VERSION
import yaml, mlflow
with open("data/train.parquet.dvc") as f:
dvc_meta = yaml.safe_load(f)
# e.g., md5: "a1b2c3..."
DATA_VERSION = f"dvc:{dvc_meta['outs'][0]['md5']}"
with mlflow.start_run():
# ... training
mlflow.set_tag("data_version", DATA_VERSION)
Outcome
The model's registry record links to an immutable data snapshot via DVC hash.
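You can also verify later that the data on disk still matches what the registry recorded; a minimal sketch, assuming the same .dvc file and the data_version tag set above (the run ID is a placeholder):

import yaml
from mlflow.tracking import MlflowClient

with open("data/train.parquet.dvc") as f:
    local_version = f"dvc:{yaml.safe_load(f)['outs'][0]['md5']}"

run = MlflowClient().get_run("<run_id>")  # fill in the run you want to verify
recorded_version = run.data.tags.get("data_version")

if recorded_version != local_version:
    raise RuntimeError(f"Data drifted: registry has {recorded_version}, disk has {local_version}")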
Example 3: Capture container and dependency immutability
# Build image and record its digest
$ docker build -t fraud-train:21 .
$ docker inspect --format='{{index .RepoDigests 0}}' fraud-train:21
# Output: your-repo/fraud-train@sha256:7e9c...
# Save in provenance.json alongside the model
{
  "docker_image_digest": "sha256:7e9c...",
  "requirements_lock_sha256": "fb12c...",
  "python": "3.11.6",
  "cuda": "12.2"
}
Outcome
Anyone can recreate the environment exactly by pulling the digest and using the same lockfile.
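One way to produce this manifest is to write it during training and log it next to the model artifact; a minimal sketch using the illustrative values above:

import json
import mlflow

provenance = {
    "docker_image_digest": "sha256:7e9c...",
    "requirements_lock_sha256": "fb12c...",
    "python": "3.11.6",
    "cuda": "12.2",
}

with open("provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)

with mlflow.start_run():
    # Stored alongside model.pkl so the environment record travels with the artifact.
    mlflow.log_artifact("provenance.json")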
Who this is for
- MLOps Engineers setting up registries and CI/CD for ML.
- Data Scientists who need reliable experiment-to-production traceability.
- Platform Engineers building ML platforms and compliance workflows.
Prerequisites
- Basic Git usage (commits, branches).
- Familiarity with containers (Docker) and Python packaging.
- Understanding of your model registry tool (e.g., concepts like model, run, version, stage).
Learning path
- Version your data (DVC or snapshot IDs) and compute schema hashes.
- Standardize run metadata: define required tags/fields for all training jobs.
- Automate provenance manifest generation in training pipelines.
- Enforce checks in CI/CD: block registry promotion when lineage is incomplete (see the gate sketch after this list).
- Dashboards: surface lineage fields for debugging and compliance.
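As a sketch of that CI/CD gate, assuming lineage was recorded as MLflow run tags with the names used in Example 1 (the run ID is a placeholder):

import sys
from mlflow.tracking import MlflowClient

REQUIRED_TAGS = ("git_commit", "data_version", "docker_image_digest", "conda_lock_sha256")

def gate(run_id: str) -> None:
    # Read the run's tags and refuse promotion if any lineage field is absent or empty.
    tags = MlflowClient().get_run(run_id).data.tags
    missing = [t for t in REQUIRED_TAGS if not tags.get(t)]
    if missing:
        sys.exit(f"Promotion blocked: missing lineage tags {missing}")
    print("Lineage complete; promotion allowed.")

gate("<run_id>")  # called from the CI job before registry promotion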
Hands-on exercises
Do these to build muscle memory; a quick test at the end of the section lets you check your understanding.
Exercise 1 — MLflow run with full lineage
Goal: Create an MLflow run that links to data, code, and environment using tags and artifacts.
- Set environment variables: GIT_COMMIT, DATA_VERSION, DOCKER_DIGEST.
- Run the provided script to train a trivial model, log metrics, and register it.
- Verify the run contains tags: git_commit, data_version, docker_image_digest, conda_lock_sha256.
Hints
- Use mlflow.set_tags to set lineage fields.
- Compute file hash with hashlib.sha256.
- Register using runs:/<run_id>/model.pkl URI.
Solution
import os, mlflow, hashlib

def sha256f(p):
    h = hashlib.sha256()
    with open(p, 'rb') as f:
        for b in iter(lambda: f.read(1_048_576), b''):
            h.update(b)
    return h.hexdigest()

os.environ.setdefault("GIT_COMMIT", "demo-commit-1234567")
os.environ.setdefault("DATA_VERSION", "dvc:a1b2c3d4")
os.environ.setdefault("DOCKER_DIGEST", "sha256:deadbeef...")

mlflow.set_experiment("demo_linkage")
with mlflow.start_run() as r:
    # pretend training
    open("model.pkl", "wb").write(b"demo")
    open("conda-lock.yml", "w").write("demo-lock")
    mlflow.log_artifact("model.pkl")
    mlflow.log_artifact("conda-lock.yml")
    mlflow.log_metrics({"auc": 0.9})
    mlflow.set_tags({
        "git_commit": os.environ["GIT_COMMIT"],
        "data_version": os.environ["DATA_VERSION"],
        "docker_image_digest": os.environ["DOCKER_DIGEST"],
        "conda_lock_sha256": sha256f("conda-lock.yml")
    })
    mlflow.register_model(f"runs:/{r.info.run_id}/model.pkl", "demo_model")
print("Done")
Exercise 2 — Write a provenance manifest
Goal: Author a provenance.json that captures all critical lineage fields, then store it next to your model artifact.
- Create provenance.json with keys: model, run, data, code, env, training, evaluation, approvals.
- Populate with realistic values from your environment.
- Add it to registry artifacts or the same storage location as the model file.
Hints
- Use sha256 of artifacts and lockfiles.
- Include time ranges and snapshot IDs for datasets.
- Use commit SHA for code; tags are optional.
Solution
{
  "model": {"name": "fraud_classifier", "version": "21", "artifact_sha256": "9abc..."},
  "run": {"id": "bd12...", "started": "2026-01-04T12:00:00Z"},
  "data": {
    "source": "s3://ml/data/fraud/train_v57.parquet",
    "version": "dvc:a1b2c3d4",
    "rows": 1250043,
    "schema_hash": "c4b7a10...",
    "time_range": {"start": "2025-11-01", "end": "2025-12-31"}
  },
  "features": {
    "views": [{"name": "transactions", "version": "v5"}, {"name": "customers", "version": "v3"}],
    "transforms_commit": "7f9e2c1..."
  },
  "code": {"repo": "internal", "commit": "9f1a2b7c4d...", "entry_point": "train.py", "config_file": "config.yaml"},
  "env": {
    "docker_image_digest": "sha256:7e9c...",
    "python": "3.11.6",
    "requirements_lock_sha256": "fb12c..."
  },
  "training": {"seed": 42, "framework": {"name": "xgboost", "version": "2.0.3"}, "hardware": "A10G x1"},
  "evaluation": {"datasets": ["validation_v21", "test_v9"], "metrics": {"auc": 0.924, "f1": 0.81}},
  "approvals": {"owner": "risk-ml", "approved": true, "approved_by": "ml-lead"}
}
Checklist (verify before promoting a model; a validator sketch follows the list)
- Data: version/hash, schema hash, row count, and time range are present.
- Code: commit SHA and training entry point recorded.
- Environment: container digest and dependency lock checksum recorded.
- Run context: seeds, framework versions, and hardware noted.
- Outputs: model checksum, signature, and metrics stored.
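A minimal validator for this checklist, assuming the provenance.json layout from Exercise 2's solution (the required-key map below is one reasonable choice, not a standard):

import json

REQUIRED = {
    "data": ["version", "schema_hash", "rows", "time_range"],
    "code": ["commit", "entry_point"],
    "env": ["docker_image_digest", "requirements_lock_sha256"],
    "training": ["seed", "framework", "hardware"],
    "model": ["artifact_sha256"],
    "evaluation": ["metrics"],
}

def validate(path: str = "provenance.json") -> list[str]:
    # Return a list of missing "section.key" entries; empty means the manifest is complete.
    with open(path) as f:
        manifest = json.load(f)
    problems = []
    for section, keys in REQUIRED.items():
        block = manifest.get(section, {})
        problems += [f"{section}.{k}" for k in keys if k not in block]
    return problems

issues = validate()
if issues:
    raise SystemExit(f"Do not promote: missing {issues}")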
Common mistakes and self-check
- Using mutable identifiers: tags like latest instead of commit SHAs or image digests. Self-check: Can the ID change without notice? If yes, replace it.
- Missing schema/version info: Only pointing to a bucket path. Self-check: If the file changed in place, could you detect it? Add content hashes.
- Partial lineage: Logging data but not environment. Self-check: Could others install identical deps? If not, add lockfile hash and image digest.
- No split IDs: Not recording which rows belonged to validation/test. Self-check: Can you rebuild the same splits? Save split seeds or snapshot IDs (see the sketch after this list).
- Forgetting feature store versions: Logging computed features without view versions. Self-check: Are feature definitions versioned and captured?
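For the split-ID point above, a minimal sketch that makes validation membership verifiable, assuming scikit-learn and a stable transaction_id column (both names are illustrative):

import hashlib
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet("data/train.parquet")  # illustrative path
SPLIT_SEED = 42

train_df, valid_df = train_test_split(df, test_size=0.2, random_state=SPLIT_SEED)

# Hash the sorted validation IDs so the exact split membership can be verified later.
valid_ids = ",".join(sorted(valid_df["transaction_id"].astype(str)))
split_record = {
    "split_seed": SPLIT_SEED,
    "validation_ids_sha256": hashlib.sha256(valid_ids.encode()).hexdigest(),
}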
Practical projects
- Provenance policy: Define a JSON schema for lineage and a validator that blocks promotion if fields are missing.
- CI enforcement: A script that reads MLflow run tags and fails the build if commit/image/data hashes are absent.
- Drift triage notebook: Given a model version, auto-load its lineage, fetch the exact data snapshot, and rerun evaluation.
Mini challenge
Pick one of your existing models. In 20 minutes, produce a minimal provenance.json that includes data_version, git_commit, docker_image_digest, schema_hash, and the AUC metric. Store it next to the model artifact and record its checksum as a registry tag.
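For that last step, a minimal sketch of recording the manifest checksum as a registry tag with MlflowClient.set_model_version_tag (the model name and version are placeholders):

import hashlib
from mlflow.tracking import MlflowClient

with open("provenance.json", "rb") as f:
    manifest_sha = hashlib.sha256(f.read()).hexdigest()

# Attach the checksum to the specific model version in the registry.
MlflowClient().set_model_version_tag(
    name="fraud_classifier",  # placeholder model name
    version="21",             # placeholder version
    key="provenance_sha256",
    value=manifest_sha,
)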
Next steps
- Automate provenance generation inside training pipelines.
- Enable a registry rule: models missing data_version, git_commit, or docker_image_digest cannot be promoted.
- Proceed to the quick test below to assess your readiness.
Progress & test
The quick test below checks whether you are ready to move on.