Why this skill matters for MLOps Engineers
Reliable ML systems depend on knowing exactly which data and which model produced a result. Data and model versioning lets you reproduce experiments, audit decisions, roll back safely, and collaborate across teams. As an MLOps Engineer, you’ll design the guardrails: snapshotting datasets, tracking schema changes, linking lineage from source to features to models, and ensuring training is reproducible even as data drifts and labels get corrected.
What you’ll be able to do
- Create immutable dataset snapshots with manifests and checksums.
- Version labels and apply corrections without breaking past experiments.
- Define and enforce schema contracts so pipelines fail fast on breaking changes.
- Trace lineage from raw sources to features to model versions for auditability.
- Reproduce training sets on demand, including late-arriving data reprocessing.
- Version, register, and compare models with clear metadata and metrics.
Who this is for
- MLOps Engineers enabling reliable ML delivery.
- Data/ML Engineers building pipelines and feature stores.
- Data Scientists who want reproducible experiments and audit-ready workflows.
Prerequisites
- Git basics (branches, commits, tags).
- Python fundamentals and virtual environments.
- Familiarity with data files (CSV/Parquet) and basic ML training.
Learning path (roadmap)
- Snapshot data: Create dataset manifests with hashes; store alongside code.
- Pipeline stages: Define reproducible steps (preprocess, train) and lock dependencies.
- Model artifact tracking: Log parameters, metrics, and model files per run.
- Schema contracts: Validate input/feature schemas; version them.
- Label versioning: Track label corrections and impact on metrics.
- Lineage: Link raw data commit → feature job version → model run.
- Late data and reprocessing: Backfill safely and tag new dataset/model versions.
Why teams adopt versioning
It reduces outages from silent data changes, speeds up debugging, enables audits/compliance, and lets you safely compare experiments across time.
Choosing tools
Pick tools that fit your stack and scale. Common choices include Git/Git-LFS for code/artifacts, data versioning tools (e.g., DVC-like workflows), and run tracking/registries (e.g., MLflow-like approaches). The practices below are tool-agnostic.
Worked examples
1) Dataset snapshot with manifest and checksum
Create a manifest capturing files, sizes, and checksums so the snapshot is immutable.
{
  "dataset": "customers_2024_09_01",
  "created_at": "2024-09-01T00:00:00Z",
  "files": [
    {"path": "data/raw/customers.csv", "rows": 124532, "sha256": "a1b2..."},
    {"path": "data/raw/transactions.parquet", "rows": 982341, "sha256": "9f8e..."}
  ],
  "notes": "Initial September snapshot"
}
Compute checksums in Python:
import hashlib

def sha256_file(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_file("data/raw/customers.csv"))
Tip: Keep manifests small
Store per-file hashes and high-level stats. Avoid bloated manifests by excluding volatile, non-deterministic fields.
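To tie hashing and the manifest together, here is a minimal sketch that walks a snapshot directory and writes a manifest. The build_manifest helper, the data/raw layout, and the bytes field (row counts would require parsing each file) are illustrative assumptions, not part of any specific tool.

    # Sketch: generate a manifest entry for every file under a snapshot directory.
    import hashlib
    import json
    import os
    from datetime import datetime, timezone

    def build_manifest(dataset_name, root="data/raw"):
        files = []
        for dirpath, _, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                # Hash the file in chunks so large files do not exhaust memory
                h = hashlib.sha256()
                with open(path, "rb") as fh:
                    for chunk in iter(lambda: fh.read(8192), b""):
                        h.update(chunk)
                files.append({
                    "path": path,
                    "bytes": os.path.getsize(path),
                    "sha256": h.hexdigest(),
                })
        return {
            "dataset": dataset_name,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "files": files,
        }

    with open("manifest.json", "w") as f:
        json.dump(build_manifest("customers_2024_09_01"), f, indent=2)

Commit manifest.json alongside the code that produced it, so the snapshot can be verified later by recomputing the hashes.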
2) Reproducible pipeline stages (data → features → model)
Define stages that specify dependencies and outputs. Example DVC-like YAML:
stages:
  preprocess:
    cmd: python scripts/preprocess.py --in data/raw/customers.csv --out data/processed/customers_clean.csv
    deps: [data/raw/customers.csv, scripts/preprocess.py]
    outs: [data/processed/customers_clean.csv]
  train:
    cmd: python scripts/train.py --data data/processed/customers_clean.csv --model models/model.pkl --metrics metrics.json
    deps: [data/processed/customers_clean.csv, scripts/train.py]
    outs: [models/model.pkl]
    metrics: [metrics.json]
Re-run when inputs change to regenerate deterministic outputs.
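For intuition about what such tools do, here is a tool-agnostic sketch of stage skipping: a stage re-runs only when the combined hash of its dependencies changes. The run_stage helper and the .stage_hashes.json cache file are illustrative assumptions, not any tool's real mechanism.

    # Sketch: re-run a stage only when the hash of its dependencies changes.
    import hashlib
    import json
    import os
    import subprocess

    CACHE = ".stage_hashes.json"  # assumed local cache of dependency hashes

    def deps_hash(paths):
        h = hashlib.sha256()
        for p in sorted(paths):
            h.update(p.encode())
            with open(p, "rb") as f:
                h.update(f.read())
        return h.hexdigest()

    def run_stage(name, cmd, deps):
        cache = json.load(open(CACHE)) if os.path.exists(CACHE) else {}
        current = deps_hash(deps)
        if cache.get(name) == current:
            print(f"{name}: dependencies unchanged, skipping")
            return
        subprocess.run(cmd, shell=True, check=True)
        cache[name] = current
        json.dump(cache, open(CACHE, "w"))

    run_stage(
        "preprocess",
        "python scripts/preprocess.py --in data/raw/customers.csv --out data/processed/customers_clean.csv",
        ["data/raw/customers.csv", "scripts/preprocess.py"],
    )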
3) Model versioning with run tracking and registry
Track every training run and register a model version.
import json
import os
import pickle

import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real, versioned training set
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_tracking_uri("file:./mlruns")  # local file-based store
mlflow.set_experiment("churn")

os.makedirs("models", exist_ok=True)
with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=200)
    model.fit(Xtr, ytr)
    acc = model.score(Xte, yte)
    mlflow.log_param("model", "logreg")
    mlflow.log_metric("accuracy", acc)
    with open("models/model.pkl", "wb") as f:
        pickle.dump(model, f)
    mlflow.log_artifact("models/model.pkl", artifact_path="model")
    # Optional: "register" by tagging the run
    mlflow.set_tag("registered_as", "churn-model")
    mlflow.set_tag("version", "1.0.0")
    with open("metrics.json", "w") as f:
        json.dump({"accuracy": acc}, f)
Semantic versions for models
Use MAJOR.MINOR.PATCH. Bump MAJOR for breaking feature or schema changes, MINOR for improvements with the same inputs, and PATCH for bug fixes or hyperparameter tweaks.
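Once a few runs carry version tags, you can compare them straight from the tracking store. A minimal sketch, assuming the local mlruns store and the tags set above; MLflow's search_runs returns a pandas DataFrame in which metrics and tags appear as prefixed columns.

    # Sketch: compare tagged model versions from the local tracking store.
    import mlflow

    mlflow.set_tracking_uri("file:./mlruns")
    runs = mlflow.search_runs(experiment_names=["churn"])

    # Metrics and tags become columns like "metrics.accuracy" and "tags.version"
    cols = ["run_id", "tags.version", "metrics.accuracy"]
    print(runs[cols].sort_values("tags.version"))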
4) Schema contracts and validation
Define expected schema for your features. Fail fast on breaking changes.
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "CustomerFeatures v2",
  "type": "object",
  "properties": {
    "age": {"type": "integer", "minimum": 0},
    "tenure_months": {"type": "integer", "minimum": 0},
    "avg_txn_amount": {"type": "number"},
    "is_premium": {"type": "boolean"}
  },
  "required": ["age", "tenure_months", "avg_txn_amount"],
  "additionalProperties": false
}
Validate before training and serving.
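A minimal validation sketch, assuming the contract above is saved as schema.json and that the jsonschema package is installed (any schema validator works the same way):

    # Sketch: fail fast when a feature row violates the schema contract.
    import json
    from jsonschema import ValidationError, validate

    with open("schema.json") as f:
        schema = json.load(f)

    row = {"age": 42, "tenure_months": 18, "avg_txn_amount": 57.3, "is_premium": True}

    try:
        validate(instance=row, schema=schema)
    except ValidationError as err:
        # Stop the pipeline instead of training on malformed features
        raise SystemExit(f"Schema contract violated: {err.message}")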
5) Label versioning and corrections
Track label source and correction maps so you can reproduce past metrics or update them consistently.
{
  "label_set": "churn_labels",
  "version": "2024.09.01",
  "source": "crm_export_2024_09_01.csv",
  "corrections": [
    {"id": 12345, "old": 0, "new": 1, "reason": "ticket_3491"},
    {"id": 67890, "old": 1, "new": 0, "reason": "appeal_556"}
  ]
}
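A sketch of applying such a correction file on top of the original label export with pandas; the file names and the churned column are illustrative assumptions:

    # Sketch: apply versioned label corrections to the original export.
    import json
    import pandas as pd

    labels = pd.read_csv("crm_export_2024_09_01.csv")  # assumed columns: id, churned

    with open("churn_labels_2024.09.01.json") as f:
        label_set = json.load(f)

    for fix in label_set["corrections"]:
        mask = labels["id"] == fix["id"]
        # Guard against applying corrections to the wrong label version
        assert (labels.loc[mask, "churned"] == fix["old"]).all(), "unexpected old label"
        labels.loc[mask, "churned"] = fix["new"]

    labels.to_csv("labels_v2024.09.01.csv", index=False)
    print(f"Applied {len(label_set['corrections'])} corrections")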
Impact analysis
Train with labels v2024.09.01 and compare to v2024.09.15. Record metric deltas with run tags so stakeholders know why metrics changed.
6) Handling late data and controlled reprocessing
Keep the original snapshot immutable; produce a revision for late arrivals.
{
  "snapshot_id": "customers_2024_09_01",
  "revision": 2,
  "late_files": ["delta/2024-09-02.parquet"],
  "note": "Late data backfill; no schema changes"
}
Re-run the pipeline to generate training set customers_2024_09_01_rev2, then train model v1.1.0.
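A sketch of materializing revision 2 by appending the late partition to the immutable base snapshot; the paths and the pandas-based approach are illustrative assumptions:

    # Sketch: build revision 2 without mutating the original snapshot files.
    import os
    import pandas as pd

    base = pd.read_parquet("data/raw/transactions.parquet")  # snapshot revision 1
    late = pd.read_parquet("delta/2024-09-02.parquet")       # late arrivals

    rev2 = pd.concat([base, late], ignore_index=True).drop_duplicates()

    out_dir = "data/revisions/customers_2024_09_01_rev2"
    os.makedirs(out_dir, exist_ok=True)
    rev2.to_parquet(os.path.join(out_dir, "transactions.parquet"))
    print(f"rev2 rows: {len(rev2)} (base {len(base)} + late {len(late)})")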
Drills and quick exercises
- Create a manifest for a small CSV dataset with row counts and SHA-256 hashes.
- Add a preprocessing stage that depends on both raw data and a script file; verify it re-runs when either changes.
- Write a JSON schema for 5 features and validate a sample batch; try breaking it and confirm it fails.
- Tag two runs (v1.0.0 and v1.1.0) and record metrics.json for each; compare accuracy delta.
- Simulate a label correction file and re-train; note how many records changed and the metric impact.
- Create a lineage file linking raw snapshot → feature job version → run_id; confirm each artifact's checksum is referenced.
Common mistakes and debugging tips
Storing large data directly in Git
Symptom: slow clones and huge repos. Fix: store large files in external storage or use artifact stores; keep pointers/metadata in Git.
Changing data in place
Symptom: runs become unreproducible. Fix: treat datasets as immutable; create new snapshots or revisions with manifests.
Implicit schema drift
Symptom: unexplained drops in model quality after upstream changes. Fix: enforce schema validation, version your schemas, fail fast, and alert.
Untracked randomness
Symptom: different results from the same inputs. Fix: set random seeds, fix library versions, and record them in run metadata.
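A minimal sketch of pinning seeds and recording them, along with library versions, in run metadata; the library list and the run_metadata.json name are illustrative:

    # Sketch: fix and record sources of randomness so identical inputs
    # give identical outputs (extend the list to whatever libraries you use).
    import json
    import platform
    import random

    import numpy as np
    import sklearn

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)

    run_metadata = {
        "seed": SEED,
        "python": platform.python_version(),
        "numpy": np.__version__,
        "scikit_learn": sklearn.__version__,
    }
    with open("run_metadata.json", "w") as f:
        json.dump(run_metadata, f, indent=2)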
Missing lineage
Symptom: cannot explain a model decision. Fix: store lineage linking data snapshot, feature job version, and model run/commit.
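As a concrete shape for such a record, here is a sketch that writes a minimal lineage.json; the field names and placeholder IDs are illustrative assumptions:

    # Sketch: write a lineage record linking snapshot, feature job, and model run.
    import json

    lineage = {
        "data_snapshot": {"id": "customers_2024_09_01", "manifest_sha256": "a1b2..."},
        "feature_job": {"name": "preprocess", "version": "v1", "code_commit": "abc1234"},
        "model_run": {"run_id": "<mlflow-run-id>", "model_version": "1.0.0"},
    }
    with open("lineage.json", "w") as f:
        json.dump(lineage, f, indent=2)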
Mini project: Versioned churn pipeline
Goal: Build a small but realistic churn model with full versioning.
- Snapshot: Save raw customers.csv and transactions.parquet with a manifest (checksums, rows).
- Preprocess stage: Create deterministic feature table from raw; record schema v1.
- Train stage: Train logistic regression; log params, metrics, and model artifact; tag model v1.0.0.
- Label correction: Apply a small correction file; tag label set v1.1; retrain model v1.1.0.
- Lineage: Write lineage.json mapping snapshot → preprocess v1 → run_id for both models.
- Late data: Add a partition of late transactions; create snapshot revision 2; reprocess features and train v1.2.0; compare metrics across versions.
Deliverables checklist
- manifest.json for each snapshot
- schema.json (v1, v2 if changed)
- DVC-like pipeline YAML or equivalent
- metrics.json per run
- lineage.json linking data → features → run
Practical project ideas
- Fraud detection: Hourly partitions with late-arriving corrections; strict schema and rolling backfills.
- Recommendation engine: Version user-item interactions by week; compare model v1.0 vs v2.0 after feature additions.
- Demand forecasting: Version holiday calendars and promotions; demonstrate how corrections affect backtests.
Subskills
- Dataset Snapshots And Manifests: Immutable, checksummed snapshots enable exact reproducibility.
- Label Versioning And Corrections: Track label sources and corrections; quantify metric impact.
- Schema Versioning And Contracts: Enforce input/feature schemas; version and validate.
- Lineage Source To Feature To Model: Link artifacts end-to-end for audits and debugging.
- Reproducible Training Sets: Deterministic pipelines, locked deps, and recorded seeds.
- Handling Late Data And Reprocessing: Backfill safely via revisions and clear tags.
Next steps
- Implement versioning on a current project: start with manifests and schema validation.
- Add run tracking and a simple model registry process.
- Practice late-data backfills and label corrections in a sandbox before production.