Who this is for
Computer Vision Engineers and ML practitioners who deploy models to production and need safe, observable rollouts without disrupting users or downstream systems.
Prerequisites
- [ ] Basic understanding of model serving (REST/gRPC endpoints)
- [ ] Comfort with containers and environment configuration
- [ ] Familiarity with metrics (latency, error rate) and model quality (precision/recall)
Learning path
- Learn version identifiers: semantic, timestamp, and data-aware versions.
- Set up a registry or naming convention for models and Docker images.
- Add observability: request IDs, model version tags, and key metrics.
- Practice canary strategies: shadow traffic, weighted routing, and rollback.
- Automate checks and rollout gates.
Why this matters
In real computer vision systems (content moderation, defect detection, OCR, retail analytics), a model update can improve accuracy but also risk outages or quality regressions. Versioning makes changes traceable and reversible. Canary releases let you test new models with a small slice of traffic, measure impact, and roll forward or back safely.
Concept explained simply
Model versioning means giving every model build a unique label and storing its exact recipe (code, data, parameters, and artifacts). A canary release means rolling out a new model to a small percentage of users first, checking health and quality, and only then increasing the share.
Mental model
Think of a dimmer switch, not an on/off switch. You gradually turn up the brightness (traffic to the new model) while watching gauges (metrics). If any gauge goes red, you turn the brightness down instantly (rollback).
Versioning strategies
Semantic versioning (e.g., 2.1.0)
Use MAJOR.MINOR.PATCH for API-breaking, feature-level, and bug-fix changes. Good for human communication and release notes.
Build/timestamp versioning (e.g., 2026-01-05-1420)
Monotonic IDs from CI runs ensure uniqueness and traceability to pipelines and artifacts.
Data-aware versioning
Include a data hash or dataset tag (e.g., data=b4f7a1). Crucial in CV, where data shifts change performance. Helps reproduce exact training conditions.
At minimum, label: model version, code commit, dataset tag, training config hash, and serving image tag.
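A minimal sketch of such a release record, assuming a plain JSON manifest rather than a full registry; the field names and example values are illustrative.

# Python sketch (conceptual): record the minimum labels for a release
import hashlib
import json
from datetime import datetime, timezone

def build_manifest(model_version: str, code_commit: str, dataset_tag: str,
                   training_config: dict, serving_image: str) -> dict:
    """Collect the minimum labels needed to trace and reproduce a model release."""
    config_blob = json.dumps(training_config, sort_keys=True).encode()
    return {
        "model_version": model_version,          # e.g., semantic or timestamp version
        "code_commit": code_commit,              # git SHA of the training code
        "dataset_tag": dataset_tag,              # data hash or dataset label
        "training_config_hash": hashlib.sha256(config_blob).hexdigest()[:12],
        "serving_image": serving_image,          # Docker image tag used for serving
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    manifest = build_manifest(
        model_version="2.0.0",
        code_commit="9f2c1ab",                   # hypothetical commit
        dataset_tag="data=b4f7a1",
        training_config={"lr": 1e-4, "epochs": 50, "img_size": 640},
        serving_image="cv-detector:serve-2026-01-05",
    )
    print(json.dumps(manifest, indent=2))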
Canary release patterns
- Shadow (mirror) testing: The new model receives a copy of live requests but does not affect user responses; outputs are compared offline.
- Weighted routing: X% of live traffic hits the new model; users see its results. Increase the weight if metrics stay healthy (see the routing sketch after this list).
- Sticky cohorts: Keep specific users/devices on one version to avoid flip-flop outputs affecting UX.
- Feature gating: Use flags to enable/disable new model features without redeploy.
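A minimal sketch of weighted routing with sticky cohorts, assuming a stable cohort key (user or device ID) is hashed into 100 buckets so the same caller always sees the same version at a given weight; the version names are illustrative.

# Python sketch (conceptual): sticky, weighted version routing
import hashlib

def route_version(cohort_id: str, canary_weight_pct: int,
                  stable: str = "v1-8-2", canary: str = "v2-0-0") -> str:
    """Deterministically map a cohort to a model version."""
    # Hash the cohort ID into one of 100 buckets; buckets below the canary
    # weight go to the new model, everything else stays on the stable one.
    bucket = int(hashlib.sha256(cohort_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < canary_weight_pct else stable

if __name__ == "__main__":
    # At a 5% canary weight, roughly 5 in 100 cohorts hit the new model.
    for cid in ("device-001", "device-002", "device-003"):
        print(cid, "->", route_version(cid, canary_weight_pct=5))

Hashing a stable ID, rather than sampling randomly per request, is what keeps the cohort sticky: the routing decision is reproducible from the ID alone, with no shared state.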
Metrics and guardrails
- System: p50/p95 latency, error rate, timeouts, GPU/CPU utilization, memory.
- Model: precision/recall/F1, IoU (segmentation), mAP (detection), OCR CER/WER.
- Business: false reject/accept rates, throughput, cost per 1k inferences.
Set guardrails, e.g.: "p95 latency ≤ baseline + 10%", "mAP drop < 1 point", "error rate < 0.5%". Automate alerts and stop-the-world rollback if violated.
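A minimal sketch of an automated guardrail check, assuming baseline and canary metrics are already collected elsewhere; the metric names and thresholds mirror the examples above but are illustrative.

# Python sketch (conceptual): evaluate guardrails before advancing a stage
def check_guardrails(baseline: dict, canary: dict) -> list[str]:
    """Return the list of violated guardrails (empty means the stage may advance)."""
    violations = []
    if canary["p95_latency_ms"] > baseline["p95_latency_ms"] * 1.10:
        violations.append("p95 latency exceeds baseline + 10%")
    if baseline["map"] - canary["map"] >= 1.0:
        violations.append("mAP dropped by 1 point or more")
    if canary["error_rate_pct"] >= 0.5:
        violations.append("error rate at or above 0.5%")
    return violations

if __name__ == "__main__":
    baseline = {"p95_latency_ms": 120.0, "map": 41.2, "error_rate_pct": 0.1}
    canary = {"p95_latency_ms": 128.0, "map": 42.9, "error_rate_pct": 0.2}
    problems = check_guardrails(baseline, canary)
    print("ROLL BACK:" if problems else "OK to proceed", problems)

A check like this can gate each rollout stage so it either advances or triggers rollback without waiting on a human.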
Worked examples
Example 1: Object detection upgrade (v1.8.2 → v2.0.0)
Plan: shadow for 24h, then 5% → 25% → 50% → 100%.
- Shadow: Compare mAP on a labeled canary subset. If mAP improves by at least 2 points and p95 latency rises by less than 8%, proceed.
- 5% weighted: Monitor live error rate and regression alerts on critical classes (person, vehicle).
- Rollback trigger: Any class-wise precision drop > 3 points for 15 min.
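A minimal sketch of that rollback trigger, assuming per-class precision is sampled once a minute against the v1.8.2 baseline; the sampling cadence and class names are illustrative.

# Python sketch (conceptual): fire rollback after a sustained precision drop
from collections import deque

class PrecisionDropTrigger:
    """Fire when a class stays more than drop_points below baseline for a full window."""

    def __init__(self, baseline: dict, drop_points: float = 3.0, window: int = 15):
        self.baseline = baseline                  # per-class precision of the stable model
        self.drop_points = drop_points
        self.windows = {cls: deque(maxlen=window) for cls in baseline}

    def observe(self, cls: str, precision: float) -> bool:
        """Record one sample; return True when rollback should fire for this class."""
        breached = (self.baseline[cls] - precision) > self.drop_points
        window = self.windows[cls]
        window.append(breached)
        return len(window) == window.maxlen and all(window)

if __name__ == "__main__":
    trigger = PrecisionDropTrigger(baseline={"person": 91.0, "vehicle": 88.5})
    for minute in range(15):
        if trigger.observe("person", precision=87.0):   # 4 points below baseline
            print(f"Rollback at minute {minute + 1}")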
Example 2: Segmentation model with stricter post-processing
New thresholding reduces false positives but risks missing small defects.
- Use a sticky cohort of 500 devices from 3 factories.
- Guardrail: IoU for "micro-crack" class must not drop more than 1 point.
- If cost per defect caught improves by ≥ 5% with stable IoU, expand the rollout.
Example 3: OCR model with larger tokenizer
Concern: higher memory and cold-start time.
- Pre-warm two replicas before shifting 10% traffic (see the pre-warm sketch after this example).
- Guardrail: p95 cold-start < 1.5x baseline, steady-state latency within +10%.
- Rollback on OOM events or error spikes > 0.3% for 10 min.
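A minimal sketch of the pre-warm step, assuming each new replica exposes a plain HTTP warm-up endpoint; the URLs and the readiness heuristic are illustrative and would depend on your serving stack.

# Python sketch (conceptual): warm new replicas before shifting traffic
import time
import urllib.request

# Hypothetical replica endpoints for the new OCR model.
REPLICAS = [
    "http://ocr-v2-replica-0:8080/warmup",
    "http://ocr-v2-replica-1:8080/warmup",
]

def prewarm(url: str, attempts: int = 10, timeout_s: float = 5.0) -> bool:
    """Send warm-up requests until the replica answers quickly, or give up."""
    for _ in range(attempts):
        try:
            start = time.monotonic()
            with urllib.request.urlopen(url, timeout=timeout_s):
                pass
            if time.monotonic() - start < 1.0:    # fast response: weights are loaded
                return True
        except OSError:
            pass                                  # replica still loading; retry
        time.sleep(2)
    return False

if __name__ == "__main__":
    ready = all(prewarm(url) for url in REPLICAS)
    print("Shift 10% traffic" if ready else "Hold rollout: replicas not warm")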
Step-by-step: Implement a canary for a CV inference service
- Tag everything: Model: cv-detector:2.0.0+data=b4f7a1; Image: cv-detector:serve-2026-01-05.
- Expose metrics: Add model_version to logs/metrics. Include request_id and cohort_id.
- Shadow stage: Mirror 100% of traffic to v2; store predictions and latency. Compare to v1 on a labeled slice (a comparison sketch follows these steps).
- Define guardrails: e.g., mAP ≥ v1−1pt; p95 latency ≤ v1+10%.
- Weighted rollout: 5% → 25% → 50% → 100%, each for 30–120 min with automated checks.
- Rollback plan: One-click route-to-v1; keep v2 pods warm for quick retry later.
- Post-release: Freeze v1 as stable; archive artifacts and training manifest.
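A minimal sketch of the shadow-stage comparison, assuming mirrored v1/v2 predictions and latencies have already been logged per request; the field names are illustrative.

# Python sketch (conceptual): summarize v1-vs-v2 deltas from shadow traffic logs
from statistics import quantiles

def compare_shadow(records: list[dict]) -> dict:
    """Summarize agreement and latency deltas between the stable and shadow models."""
    agreement = sum(r["v1_label"] == r["v2_label"] for r in records) / len(records)
    p95_v1 = quantiles((r["v1_latency_ms"] for r in records), n=20)[18]
    p95_v2 = quantiles((r["v2_latency_ms"] for r in records), n=20)[18]
    return {
        "label_agreement_pct": round(100 * agreement, 1),
        "p95_latency_increase_pct": round(100 * (p95_v2 - p95_v1) / p95_v1, 1),
    }

if __name__ == "__main__":
    logs = [
        {"v1_label": "ok", "v2_label": "ok", "v1_latency_ms": 90, "v2_latency_ms": 95},
        {"v1_label": "defect", "v2_label": "ok", "v1_latency_ms": 110, "v2_latency_ms": 118},
        {"v1_label": "ok", "v2_label": "ok", "v1_latency_ms": 100, "v2_latency_ms": 104},
    ]
    print(compare_shadow(logs))

Agreement rate alone is not a quality signal; pair it with labeled-slice metrics (mAP, IoU) before deciding to advance.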
Readiness checklist
- [ ] Version tags for model, code, data, and image are recorded
- [ ] Telemetry includes model_version and request_id
- [ ] Shadow results reviewed and signed off
- [ ] Guardrails and rollback triggers documented
- [ ] Rollout stages and owners assigned
- [ ] Post-release archiving plan in place
Exercises
Do the tasks below.
Exercise 1: Design a canary plan for a segmentation model
You are upgrading a factory floor segmentation model from v1.4.3 to v1.5.0. Draft a 3-stage rollout with guardrails and a conceptual routing config. Include what to monitor and when to roll back.
Input hint: Sample conceptual routing config
# Pseudo YAML for weighted routing (conceptual)
service: seg-service
versions:
  - name: v1-4-3
    weight: 95
  - name: v1-5-0
    weight: 5
metrics_guardrails:
  p95_latency_increase_pct: "<= 10"
  error_rate_pct: "<= 0.5"
  iou_drop_points: "<= 1"
rollout_stages:
  - {weight: 5, duration: 60m}
  - {weight: 25, duration: 120m}
  - {weight: 50, duration: 180m}
Solution
Stages: 5% for 60 min → 25% for 120 min → 50% for 180 min → then 100% if stable for 24h.
Guardrails: p95 latency ≤ +10% vs v1.4.3; error rate ≤ 0.5%; class-wise IoU drop ≤ 1 point for "spill" and "person" classes.
Monitoring: system (latency, errors), model (IoU by class), business (missed hazard rate). Use sticky cohort of high-traffic lines.
Rollback: Immediate if any guardrail breached for 10 consecutive minutes. Keep v1 pods warm; annotate incident and freeze rollout.
# Conceptual routing change per stage (weights change only)
versions:
  - {name: v1-4-3, weight: 95}
  - {name: v1-5-0, weight: 5}
# ... then 75/25, then 50/50
Common mistakes and self-check
- No data version recorded: Self-check: Can you reproduce training with the exact dataset tag/hash?
- Only system metrics monitored: Self-check: Do you track class-wise metrics and business KPIs?
- Skipping shadow tests: Self-check: Do you have evidence the new model performs on key cohorts before user-facing traffic?
- Unclear rollback: Self-check: Can you revert traffic in one action and within one minute?
- Non-sticky canary users: Self-check: Are canary users consistently routed to the same version?
Practical projects
- Build a mini model registry: Store model artifact, dataset tag, and metrics in a simple manifest (JSON) and serve it via your API.
- Shadow inference pipeline: Duplicate incoming requests to a second model and compute delta metrics offline.
- Guardrail dashboard: Visualize p50/p95 latency, error rate, and two model metrics with per-version filters.
Mini challenge
Write a one-page runbook for canarying a new defect detection model in a GPU-constrained environment. Include pre-warm strategy, rollout stages, guardrails, and rollback steps. Aim for clear, actionable bullet points.
Next steps
- Automate rollout gates with metric thresholds.
- Add cohort-level evaluation reports to every release.
- Standardize naming: model, data, and image tags in every log line.