What is MLOps for Vision Systems?
MLOps for vision is the discipline of reliably training, evaluating, deploying, and operating computer vision models at scale. It connects data pipelines, training, evaluation gates, deployment, monitoring, and incident response so your detectors/classifiers/segmenters stay accurate, fast, and compliant in production.
Why it matters for a Computer Vision Engineer
- Ship models faster with repeatable pipelines and reproducibility.
- Prevent silent performance drops with automated evaluation and drift monitoring.
- Keep latency, GPU costs, and accuracy in balance for real-time workloads.
- Meet safety, privacy, and governance requirements across data and models.
Who this is for
- Computer Vision Engineers moving from notebooks to production.
- ML Engineers responsible for training and deployment of vision models.
- Data/Platform Engineers building CV pipelines and services.
Prerequisites
- Python and basic CV (e.g., classification/detection/segmentation).
- Familiarity with PyTorch or TensorFlow.
- Comfort with Git and simple CI concepts.
- Basic Linux, containers, and shell scripting.
Learning path (roadmap)
- Data and model versioning — track datasets, labels, and models; ensure full reproducibility.
- Continuous training (CT) basics — automate data ingestion, training, and packaging with CI/CD.
- Automated evaluation gates — block bad models using metrics and guardrails (accuracy, latency, bias, size).
- Monitoring drift and data quality — detect when inputs or outputs change in ways that harm performance.
- Observability for inference — log latency, errors, throughput, hot paths, and resource usage.
- Incident response — alerting, rollback, canaries, and RCAs for performance degradation.
- Documentation & governance — model cards, lineage, approvals, and auditability.
Milestone check: Are you production-ready?
- Can you reproduce a model (code+data+params) from a tag?
- Is training automated and gated by quality checks?
- Do you have runtime metrics and drift alerts?
- Can you roll back safely within minutes?
- Is documentation up to date with a recorded lineage?
Worked examples
1) Versioning data and models together
Goal: Ensure you can rebuild a model exactly, including dataset and hyperparameters.
# Example: lightweight DVC-like workflow with Git
# 1) Track data
!git init
!git add src/ train.py params.yaml
!git commit -m "Init code"
# Assume dataset in data/raw, preprocessed to data/processed
# Use a data versioning tool or hashes you compute yourself
# 2) Save params and model artifacts
# params.yaml
train:
  lr: 0.001
  batch_size: 32
  epochs: 12
# train.py (excerpt)
import json, hashlib, os
from datetime import datetime
run_id = datetime.utcnow().strftime('%Y%m%d%H%M%S')
with open('params.yaml') as f:
    params = f.read()
run_hash = hashlib.sha1(params.encode()).hexdigest()[:8]
# ... after training
os.makedirs(f"artifacts/{run_id}_{run_hash}", exist_ok=True)
with open(f"artifacts/{run_id}_{run_hash}/metrics.json", 'w') as f:
    json.dump({"mAP": 0.41, "latency_ms": 17.4}, f)
# Save model weights and a data manifest alongside the metrics:
# artifacts/{run_id}_{run_hash}/model.pt, data_manifest.json, and a copy of params.yaml
Tip: Always save a data manifest (paths, commits, or hashes) with the model, plus params and environment info.
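One possible shape for that manifest, as a minimal sketch: it walks data/processed, hashes every file with SHA-256, and writes the result next to the model. The file layout and hashing choice are assumptions, not a required format.
# manifest.py (illustrative sketch; paths and hashing choice are assumptions)
import hashlib, json, os

def build_manifest(data_dir="data/processed"):
    # Record a relative path and a content hash for every file in the dataset
    entries = []
    for root, _, files in os.walk(data_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            entries.append({"path": os.path.relpath(path, data_dir), "sha256": digest})
    return {"root": data_dir, "num_files": len(entries), "files": entries}

out_dir = "artifacts/example_run"   # in practice: artifacts/{run_id}_{run_hash} from the excerpt above
os.makedirs(out_dir, exist_ok=True)
with open(os.path.join(out_dir, "data_manifest.json"), "w") as f:
    json.dump(build_manifest(), f, indent=2)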
2) A minimal continuous training pipeline
Goal: Automate training on new labeled data and publish a candidate model.
# ci-train.yaml (conceptual)
name: train_pipeline
on: [push]
stages:
  - name: prepare
    run: |
      python scripts/prepare_data.py --input data/raw --out data/processed
  - name: train
    run: |
      python train.py --data data/processed --out artifacts/
  - name: evaluate
    run: |
      python eval.py --model artifacts/model.pt --report artifacts/report.json
  - name: package
    if: metrics.mAP >= 0.45 and metrics.latency_ms <= 25
    run: |
      python package.py --model artifacts/model.pt --out dist/
Key idea: The pipeline produces a candidate only if evaluation gates pass.
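A sketch of what the package step might do, assuming package.py simply bundles the weights with their lineage files and a code version; the dist/ layout and version.json contents are illustrative choices, not a prescribed format.
# package.py (illustrative sketch)
import argparse, json, os, shutil, subprocess

parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--out", default="dist/")
args = parser.parse_args()

os.makedirs(args.out, exist_ok=True)
# Copy the weights plus the lineage files saved during training
shutil.copy(args.model, os.path.join(args.out, "model.pt"))
for name in ("params.yaml", "data_manifest.json"):
    if os.path.exists(name):
        shutil.copy(name, args.out)

# Record the code version so the bundle can be traced back to a commit
commit = subprocess.run(["git", "rev-parse", "HEAD"],
                        capture_output=True, text=True).stdout.strip()
with open(os.path.join(args.out, "version.json"), "w") as f:
    json.dump({"git_commit": commit, "model_file": "model.pt"}, f)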
3) Automated evaluation gates (quality + latency)
# eval.py (excerpt)
import json, time
from metrics import compute_map, compute_latency
mAP = compute_map("artifacts/preds.json", "data/val_labels.json")
latency_ms = compute_latency("dist/model_bundle.onnx", batch=1, reps=50)
# Simple gate
thresholds = {"mAP": 0.45, "latency_ms": 25}
passed = mAP >= thresholds["mAP"] and latency_ms <= thresholds["latency_ms"]
report = {
    "mAP": round(mAP, 3),
    "latency_ms": round(latency_ms, 2),
    "passed": passed,
}
with open("artifacts/report.json", "w") as f:
    json.dump(report, f)
if not passed:
    raise SystemExit("Gate failed: quality/latency thresholds not met")
Add additional gates as needed: model size, bias/fairness slices, memory footprint, cold-start time.
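Each extra gate follows the same pattern: measure, compare to a threshold, fail the run if breached. For example, a size gate is only a few lines; the 50 MB budget and the artifacts/model.pt path below are placeholders.
# Additional gate: model size on disk (sketch; budget and path are placeholders)
import os
size_mb = os.path.getsize("artifacts/model.pt") / 1e6
if size_mb > 50:
    raise SystemExit(f"Gate failed: model is {size_mb:.1f} MB, over the 50 MB budget")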
4) Monitoring input drift with PSI
Goal: Detect when production inputs drift from training data (e.g., lighting shifts).
# psi.py
import numpy as np
def psi(expected: np.ndarray, actual: np.ndarray, bins=10):
    # expected/actual are 1D feature arrays (e.g., per-image brightness)
    # Bin edges come from the expected (reference) distribution and are reused
    # for the actual (production) data so both histograms are comparable.
    quantiles = np.linspace(0, 1, bins + 1)
    edges = np.quantile(expected, quantiles)
    e_hist, _ = np.histogram(expected, bins=edges)
    a_hist, _ = np.histogram(actual, bins=edges)
    e_pct = (e_hist + 1e-6) / (len(expected) + 1e-6)
    a_pct = (a_hist + 1e-6) / (len(actual) + 1e-6)
    return np.sum((a_pct - e_pct) * np.log((a_pct + 1e-6) / (e_pct + 1e-6)))
# Usage: compute drift on brightness feature each hour
Heuristic: a PSI around 0.1 suggests a moderate shift; 0.2 or higher suggests a significant shift. Investigate, and consider retraining or gating traffic.
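A sketch of feeding psi() in practice, assuming Pillow is available and mean grayscale brightness is the monitored feature; the directory globs are placeholders for your reference set and the last hour of production images.
# brightness_drift.py (illustrative sketch)
import glob
import numpy as np
from PIL import Image
from psi import psi

def mean_brightness(paths):
    # Mean grayscale intensity per image, used as a simple 1D drift feature
    return np.array([np.asarray(Image.open(p).convert("L"), dtype=np.float32).mean()
                     for p in paths])

train_feat = mean_brightness(glob.glob("data/processed/val/**/*.jpg", recursive=True))
prod_feat = mean_brightness(glob.glob("logs/last_hour/*.jpg"))
print(f"brightness PSI over last hour: {psi(train_feat, prod_feat):.3f}")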
5) Observability for an inference API
# server.py (conceptual FastAPI-like)
from time import perf_counter
from collections import defaultdict
metrics = defaultdict(int)
def infer(image_bytes):
    t0 = perf_counter()
    preds = model.predict(image_bytes)                  # model is loaded once at startup
    dt = (perf_counter() - t0) * 1000
    metrics["requests_total"] += 1
    metrics["latency_ms_sum"] += dt
    metrics.setdefault("lat_samples", []).append(dt)    # keep raw samples for percentiles
    metrics["latency_ms_p95"] = update_p95(metrics["lat_samples"])  # helper: p95 over recent samples
    metrics["errors_total"] += int(preds is None)
    metrics["gpu_mem_mb"] = read_gpu_mem()              # helper: e.g., query GPU memory via NVML
    return preds
# Expose /metrics endpoint for scraping
Track at least: request rate, p50/p95 latency, error rate, model version, device type, GPU memory, and top failure reasons.
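One minimal way to turn the collected samples into p50/p95 and a scrape-friendly text body, assuming plain numpy; the metric names, label format, and window size are illustrative, and a real service would likely use a metrics client library instead.
# metrics_endpoint.py (illustrative sketch)
import numpy as np

MODEL_VERSION = "resnet18-2024-01-a"   # placeholder version tag
WINDOW = 1000                          # keep only the most recent latency samples

def render_metrics(metrics):
    # Compute percentiles over a rolling window of latency samples
    samples = metrics.get("lat_samples", [])[-WINDOW:]
    p50 = float(np.percentile(samples, 50)) if samples else 0.0
    p95 = float(np.percentile(samples, 95)) if samples else 0.0
    lines = [
        f'requests_total {metrics["requests_total"]}',
        f'errors_total {metrics["errors_total"]}',
        f'latency_ms_p50{{model_version="{MODEL_VERSION}"}} {p50:.2f}',
        f'latency_ms_p95{{model_version="{MODEL_VERSION}"}} {p95:.2f}',
    ]
    return "\n".join(lines) + "\n"
# Return render_metrics(metrics) as plain text from the /metrics route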
Debugging tip: Is it the model or the service?
- Replay a sample of raw images through the model offline. If offline metrics are good, the issue is likely preprocessing, hardware, or concurrency.
- Compare batch=1 vs batch=N latency; if per-image latency stops improving (or worsens) as the batch grows, you may be saturating memory bandwidth or the accelerator.
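A rough harness for that comparison; predict_batch stands in for whatever inference call your service makes, and the batch sizes and input shape are arbitrary.
# batch_latency.py (sketch; predict_batch is a stand-in for your model's inference call)
from time import perf_counter
import numpy as np

def time_per_image_ms(predict_batch, batch_size, reps=20, shape=(224, 224, 3)):
    batch = np.random.rand(batch_size, *shape).astype(np.float32)
    predict_batch(batch)                      # warm-up run
    t0 = perf_counter()
    for _ in range(reps):
        predict_batch(batch)
    total_ms = (perf_counter() - t0) * 1000
    return total_ms / reps / batch_size       # mean latency per image

# for bs in (1, 4, 16):
#     print(bs, time_per_image_ms(model_predict, bs))   # model_predict: your own inference function
# If ms/image stops improving (or worsens) as batch grows, suspect memory bandwidth or device saturation.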
Drills and exercises
- Turn a notebook into a reproducible script that reads params from a YAML and writes a metrics.json (see the sketch after this list).
- Add a latency gate to your evaluation step and fail the run if it is above a threshold you pick.
- Compute a simple drift stat (PSI or KL divergence) on an input feature from a rolling 24h window.
- Expose a /metrics endpoint with p95 latency and current model version.
- Simulate a bad model (lower accuracy) and practice rolling back quickly.
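For the first drill, a minimal sketch assuming PyYAML is installed and the params.yaml layout from the versioning example; train_and_eval is a placeholder for your own training and evaluation code.
# run.py (illustrative sketch for the first drill)
import json, yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]            # {"lr": ..., "batch_size": ..., "epochs": ...}

results = train_and_eval(lr=params["lr"],          # train_and_eval: your own code, not shown here
                         batch_size=params["batch_size"],
                         epochs=params["epochs"])

with open("metrics.json", "w") as f:
    json.dump(results, f, indent=2)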
Common mistakes and how to avoid them
- No data lineage: Save a manifest (paths, versions, labeler info) with every model.
- Only accuracy gates: Add latency, memory, size, and bias gates.
- Ignoring runtime variance: Measure cold-start, warm latency, and p95 under realistic concurrency.
- Drift alerts without action: document a playbook, e.g., when PSI > 0.2, who gets paged and what happens next.
- Overfitting to stale validation sets: Maintain time-sliced or recently-sampled eval sets.
- Poor observability: Always log model version and input schema hashes.
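One lightweight way to cover that last point is a JSON line per request carrying the model version and a hash of the input schema (shape and dtype here, as an example); the field names and MODEL_VERSION are placeholders.
# request_log.py (illustrative sketch; assumes numpy-array inputs)
import hashlib, json, time

MODEL_VERSION = "detector-v1.3.0"  # placeholder

def schema_hash(image_array):
    # Hash the parts of the input contract you care about (shape, dtype), not the pixel values
    desc = json.dumps({"shape": list(image_array.shape), "dtype": str(image_array.dtype)})
    return hashlib.sha1(desc.encode()).hexdigest()[:12]

def log_request(image_array, latency_ms, error=None, path="requests.log"):
    record = {
        "ts": time.time(),
        "model_version": MODEL_VERSION,
        "input_schema": schema_hash(image_array),
        "latency_ms": round(latency_ms, 2),
        "error": error,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")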
Quick debugging playbook
- Identify scope: accuracy drop, latency spike, or errors?
- Check version: what changed (model, preprocessing, hardware, config)?
- Reproduce offline with logged inputs.
- Drill down by slice: which classes, scenes, or conditions regressed?
- Rollback if SLOs are breached; continue RCA on a shadow copy.
Mini project: Vision model with gates, deploy, and monitor
Build a small pipeline for an image classifier and serve it with basic observability.
- Prepare data: sample 2–3 classes, split train/val; compute and save a data manifest.
- Train: script reads params.yaml and outputs artifacts/{run_tag}/model and metrics.json.
- Evaluate: compute accuracy and single-image latency; fail if below thresholds.
- Package: export to ONNX or TorchScript; record model version and manifest.
- Serve: simple HTTP service with /predict and /metrics (p50/p95, errors, version).
- Monitor drift: compute brightness or color histogram PSI hourly; log to a file.
- Incident drill: introduce synthetic low-light images, watch the drift signal and accuracy change, then roll back.
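For the rollback part of the drill, one simple pattern is a version pointer that the serving process reads; the models/ registry layout and current.txt file below are assumptions, not a required structure.
# rollback.py (illustrative sketch)
import os

REGISTRY = "models"            # e.g., models/v12/, models/v13/, each a packaged bundle
POINTER = os.path.join(REGISTRY, "current.txt")

def current_version():
    with open(POINTER) as f:
        return f.read().strip()

def rollback(to_version):
    # The serving process re-reads current.txt (or is restarted) to pick up the change
    assert os.path.isdir(os.path.join(REGISTRY, to_version)), f"unknown version {to_version}"
    with open(POINTER, "w") as f:
        f.write(to_version)
    print(f"now serving {to_version}")

# Example: roll back from v13 to the last known-good v12
# rollback("v12")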
Success criteria checklist
- Reproducible run tag that rebuilds the same model.
- Automated evaluation gate blocks bad models.
- p95 latency and error rate visible in /metrics.
- Drift signal reacts to lighting changes.
- Rollback completes within minutes.
Practical projects
- Edge detector lifecycle: Train a small object detector, export quantized model, deploy on a low-power device profile, log power and latency, add a size gate.
- Safety camera drift lab: Build a day/night dataset, set separate gates per condition, trigger retraining when night-time PSI exceeds a threshold.
- Batch scoring pipeline: Process a large image folder nightly, compute embeddings, store top-k anomalies, and alert when anomaly rate doubles.
Subskills
- Data And Model Versioning: Track datasets, labels, manifests, params, and artifacts together for full reproducibility.
- Continuous Training Pipelines Basics: Automate data prep, training, testing, and packaging with clear, reproducible steps.
- Automated Evaluation Gates: Enforce accuracy, latency, size, fairness, and resource thresholds before deployment.
- Monitoring Data Drift And Quality: Detect shifts in input distributions and validate schema/quality in near real time.
- Observability For Inference Services: Expose latency, errors, throughput, and resource metrics with model version tags.
- Incident Response For Degradation: Alerts, canary/shadow, rollback, and root-cause analysis playbooks.
- Documentation And Governance: Model cards, lineage, approvals, and audits for safe and compliant operations.
Next steps
- Work through each subskill in order and implement the drills.
- Complete the mini project with real thresholds and logs.
- When ready, take the skill exam below to validate your understanding. The exam is available to everyone; only logged-in users have their progress saved.
Optional: Tooling choices and trade-offs
- Versioning: pick a tool that can store large binary artifacts and link to Git commits.
- Pipelines: start simple with a single YAML and grow into a scheduler if needed.
- Serving: choose CPU/GPU based on latency/cost; measure before optimizing.
- Monitoring: prefer consistent labels (model_version, dataset_tag) for joins later.