5) Follow-up
Create an incident note, snapshot logs/telemetry, and plan a safe roll-forward.
Monitoring and rollback triggers
- Latency: p95 and p99 compared to baseline (e.g., a sustained +20% over baseline triggers rollback).
- Error rate: 5xx responses or model timeouts above a threshold (e.g., >0.5%).
- Quality proxies: false-positive/false-negative rate on sampled labeled checks; business KPIs like “false alerts per 1k frames”.
- Resources: CPU/GPU utilization spikes that cause queue buildup.
- Edge health: FPS drops, thermal events, battery drain beyond limits.
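The triggers above can be encoded as an automated gate that compares canary metrics to baseline. A minimal sketch; the metric names and thresholds are illustrative assumptions, not from any specific monitoring stack:

```python
# Hypothetical automated rollback gate. Metric names and thresholds are
# illustrative placeholders; tune them to your own SLOs.

def should_rollback(baseline: dict, canary: dict) -> list:
    """Return the list of breached gates; a non-empty list means roll back."""
    breaches = []
    # Latency: p95/p99 more than 20% over baseline
    for pct in ("p95_ms", "p99_ms"):
        if canary[pct] > baseline[pct] * 1.20:
            breaches.append(f"latency {pct}: {canary[pct]}ms vs baseline {baseline[pct]}ms")
    # Error rate: 5xx or model timeouts above 0.5%
    if canary["error_rate"] > 0.005:
        breaches.append(f"error rate {canary['error_rate']:.3%} > 0.5%")
    # Quality proxy: FP rate on sampled labeled checks, vs. baseline
    if canary["fp_rate"] > baseline["fp_rate"] * 1.10:
        breaches.append("FP rate more than 10% over baseline")
    return breaches

baseline = {"p95_ms": 80, "p99_ms": 120, "error_rate": 0.001, "fp_rate": 0.02}
canary = {"p95_ms": 110, "p99_ms": 130, "error_rate": 0.002, "fp_rate": 0.021}
print(should_rollback(baseline, canary))  # p95 breached: 110 > 80 * 1.2
```

In practice this check would run on a schedule against your metrics store, and a breach would flip traffic back in one step rather than waiting on a human.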
Exercises
Work through these to apply what you learned; you can compare your answers with the solutions below each exercise.
Exercise 1: Design a canary plan
Draft a canary release plan for a face detection API moving from v2.1 to v2.2, with gates for latency, error rate, and false-positive (FP) rate. Include steps, thresholds, time windows, and rollback actions.
- Tip: Include shadowing, percent steps, and auto-promotion rules.
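As a starting point for your plan, the percent steps and auto-promotion rule might look like this. All values are placeholders to adjust against your own SLOs:

```python
# Hypothetical canary schedule for v2.1 -> v2.2; every number is a placeholder.
CANARY_PLAN = [
    # traffic %, observation window, and the gates that must hold to promote
    {"percent": 0,   "window_min": 60,  "note": "shadow only: mirror traffic, discard responses"},
    {"percent": 5,   "window_min": 30,  "gates": ["p95 <= baseline*1.2", "5xx <= 0.5%", "FP <= baseline*1.1"]},
    {"percent": 25,  "window_min": 60,  "gates": ["same gates as the 5% step"]},
    {"percent": 100, "window_min": 120, "gates": ["same gates as the 5% step"]},
]

def next_step(current_percent: int, gates_pass: bool) -> int:
    """Auto-promotion rule: advance one step if all gates held for the full
    observation window; otherwise drop to 0% (all traffic back to v2.1)."""
    if not gates_pass:
        return 0  # rollback action: a single flag flip back to v2.1
    steps = [s["percent"] for s in CANARY_PLAN]
    i = steps.index(current_percent)
    return steps[min(i + 1, len(steps) - 1)]

print(next_step(5, gates_pass=True))    # 25
print(next_step(25, gates_pass=False))  # 0
```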
Exercise 2: Safe schema change
You need to add a new JSON column to store extra detection metadata while keeping consumers working. Outline a backward-compatible migration and rollback plan.
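One common answer to this exercise is the expand/dual-write/contract pattern: add the column additively, write both shapes until consumers have migrated, and only later remove anything. A hedged sketch using SQLite for illustration; the table and column names are invented:

```python
import json
import sqlite3

# Expand/dual-write/contract migration sketch. Table and column names
# are invented for illustration; SQLite stands in for your real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detections (id INTEGER PRIMARY KEY, label TEXT)")

# Phase 1 (expand): additive change only -- old consumers ignore the new
# nullable column, so nothing breaks when this ships.
conn.execute("ALTER TABLE detections ADD COLUMN extra_meta TEXT DEFAULT NULL")

# Phase 2 (dual-write): new code writes the new field while old write
# paths keep working, until all consumers are stable.
def insert_detection(label, extra=None):
    conn.execute(
        "INSERT INTO detections (label, extra_meta) VALUES (?, ?)",
        (label, json.dumps(extra) if extra else None),
    )

insert_detection("face", {"blur_score": 0.12})
insert_detection("face")  # old-style call still works: the column is nullable

# Rollback plan: stop writing extra_meta and leave the column in place;
# dropping it is a separate "contract" step deferred until nothing reads it.
rows = conn.execute("SELECT label, extra_meta FROM detections").fetchall()
print(rows)
```

The key property is that rollback never requires a destructive schema change: reverting the code is enough, because the schema change itself was additive.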
Checklist: Did you cover these?
- Baseline metrics documented
- Promotion steps and durations
- Automated gates with exact thresholds
- Single-command or single-flag rollback
- Verification after rollback
Common mistakes and how to self-check
- No clear rollback owner: Assign a primary and a backup on-call.
- Non-backward compatible change: Use additive migrations and dual-write until stable.
- Only infra metrics: Add model quality proxies and business KPIs.
- Skipping warm-up: Pre-warm models to avoid cold-start latency spikes.
- Overfitting to offline tests: Shadow or canary with real traffic before full rollout.
Self-check prompts
- Can I roll back in under 2 minutes?
- Exactly which metric breaches trigger rollback?
- How do I verify that rollback fixed the issue?
Practical projects
- Build a mock CV microservice with two model versions and implement canary routing controlled by a config file and environment variables.
- Create dashboards that compare baseline vs canary for latency and a simple FP proxy using a labeled test slice.
- Prototype a kill switch for an edge app that swaps models locally when a remote flag is set.
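For the kill-switch project, the core loop is small. A minimal sketch assuming the remote flag is fetched as JSON; `fetch_flags` and the model-handle names are illustrative stubs:

```python
import json

# Hypothetical kill switch: the edge app polls a remote flag and swaps
# models locally. fetch_flags() is stubbed here; in practice it would be
# an HTTP call with a short timeout and a cached last-known-good fallback.

FLAGS = {"ocr_model": "v2"}  # stands in for remote flag state

def fetch_flags() -> dict:
    # Stub for a network call; on failure, fall back to the last known flags.
    return json.loads(json.dumps(FLAGS))

# Keep both versions loaded so the swap is instant (no cold start).
LOADED_MODELS = {"v1": "model-v1-handle", "v2": "model-v2-handle"}
active = LOADED_MODELS["v2"]

def maybe_swap(active_handle: str) -> str:
    """Return the handle the remote flag asks for, or keep the current one
    if the flag names an unknown version."""
    wanted = fetch_flags().get("ocr_model", "v1")
    return LOADED_MODELS.get(wanted, active_handle)

FLAGS["ocr_model"] = "v1"   # operator flips the kill switch remotely
active = maybe_swap(active)
print(active)  # "model-v1-handle"
```

Note the fail-safe choices: unknown flag values keep the current model, and a fetch failure should reuse cached flags rather than crash the app.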
Mini challenge
Your new OCR model is 10% faster but increases false positives on small receipts. Propose a safe-release approach that keeps speed for large receipts while protecting small ones. Hint: conditional routing with a feature flag keyed by image size, with separate canary gates.
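One shape the hinted routing could take is sketched below; the size threshold, flag name, and model names are all assumptions for illustration:

```python
# Hypothetical size-gated routing for the OCR mini challenge.
# Large receipts get the faster new model immediately; small receipts stay
# on the old model until the new one passes its own FP-rate canary gate.

SMALL_RECEIPT_MAX_PX = 640          # assumed cutoff on the longer image edge
FLAG_NEW_MODEL_FOR_SMALL = False    # separate feature flag / canary gate

def route_ocr(width: int, height: int) -> str:
    if max(width, height) > SMALL_RECEIPT_MAX_PX:
        return "ocr-v2"  # 10% faster; FP regression only affects small receipts
    return "ocr-v2" if FLAG_NEW_MODEL_FOR_SMALL else "ocr-v1"

print(route_ocr(1200, 800))  # ocr-v2
print(route_ocr(300, 500))   # ocr-v1 (protected until the flag flips)
```

Rolling back is then a single flag flip per segment, and the small-receipt segment can graduate to v2 independently once its FP gate holds.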
Who this is for
- Computer Vision Engineers deploying models to production
- ML Engineers and MLOps practitioners
- Backend engineers owning ML-driven APIs
Prerequisites
- Basic model serving knowledge (REST/gRPC or batch pipelines)
- Familiarity with metrics and alerting
- Comfort with version control and CI/CD concepts
Learning path
- Start: Model packaging and versioning
- Next: Monitoring and alerting for CV services
- Then: Release strategies (shadow, canary, blue-green)
- Now: Rollback and safe releases (this lesson)
- After: Incident response and postmortems; roll-forward strategies
Next steps
- Turn one worked example into a small demo in your environment.
- Write a 1–2 page rollback playbook that fits your stack.
- Run a game day: simulate a failure and practice rollback timing.
Ready? Take the Quick Test below.