Who this is for
This lesson is for MLOps Engineers and ML practitioners who deploy and operate models in production and need safe, measurable ways to roll out new model versions without hurting users or KPIs.
Prerequisites
- Basic understanding of REST/gRPC model serving and versioning
- Familiarity with metrics (latency, error rate, business KPIs) and logging
- High-level knowledge of how traffic routing works (load balancer, gateway, feature flag)
Why this matters
Real MLOps tasks you’ll face:
- Release a new recommender model to 5% of users and grow to 100% only if CTR improves and latency stays within SLOs.
- Run an A/B test for a fraud model to prove lift in catch rate without increasing false positives beyond budget.
- Automatically roll back if 5xx rate or p95 latency spikes for the new model during canary.
Concept explained simply
Two core ideas:
- Canary release: Gradually shift a small and growing portion of traffic to the new model while watching key metrics. If it’s healthy, increase; if not, pause or roll back.
- A/B test: Randomly split users or requests into groups (A=current model, B=new model) to estimate impact on business metrics with statistical confidence.
Mental model
Think of canary as a safety ramp and A/B as evidence. Canary reduces risk during rollout; A/B provides proof the new version is better (or at least not worse) for users and KPIs.
When to use which?
- Canary only: Minor change, clear guardrails, low risk; your goal is safety.
- A/B (often with small canary first): Significant change; your goal is learning and evidence.
Quick compare
- Canary: focuses on stability/SLOs; traffic ramps (e.g., 1%→5%→20%→50%→100%)
- A/B: focuses on KPI impact; fixed split (e.g., 50/50) until significance or timebox
- Together: Start with a tiny canary to ensure stability, then run A/B to measure uplift
Rollout plan pattern (safe default)
- Shadow traffic (optional): Send copies of requests to the new model; don’t return its output. Check correctness and latency.
- Canary start: 1% traffic, 30–60 min. Guardrails: p50/p95 latency, 5xx rate, error budget burn, critical KPI no worse than X%.
- Ramp steps: 5% → 20% → 50% → 100%. Hold and observe for 30–120 min (or N requests) at each step.
- Automated rollback: If any guardrail fails, roll back to the previous healthy percentage and open an incident ticket.
- A/B (if needed): Run 50/50 or 30/70 for a pre-defined duration or until reaching required sample size. Decide ship/keep based on primary KPI and guardrails.
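A minimal sketch of the ramp logic above, in Python; set_traffic_split, observe_window, and guardrails_ok are hypothetical helpers standing in for your gateway and metrics stack.

RAMP_STEPS = [0.01, 0.05, 0.20, 0.50, 1.00]   # fraction of traffic to the new model
HOLD_MINUTES = [30, 60, 60, 120, 0]           # observation window per step

def run_canary(set_traffic_split, observe_window, guardrails_ok):
    for weight, hold in zip(RAMP_STEPS, HOLD_MINUTES):
        set_traffic_split(v2=weight)            # e.g., update gateway weights
        metrics = observe_window(minutes=hold)  # canary vs control metrics for the hold window
        if not guardrails_ok(metrics):
            set_traffic_split(v2=0.0)           # automated rollback (here: all traffic back to v1)
            return f"rolled back at {int(weight * 100)}%"
    return "promoted to 100%"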
Example guardrails
- Latency: p95 < 300 ms and within 10% of control
- Error rate: < 0.5% and not worse than control by more than 0.2 pp
- Business KPI: No drop > 1% during canary; A/B requires uplift or a neutral effect within the confidence interval
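A sketch of how these guardrails could be evaluated in code; the field names (latency_p95_ms, error_rate, kpi) are assumptions about what your metrics pipeline exposes.

def guardrails_ok(canary: dict, control: dict) -> bool:
    # p95 < 300 ms and within 10% of control
    latency_ok = (canary["latency_p95_ms"] < 300
                  and canary["latency_p95_ms"] <= 1.10 * control["latency_p95_ms"])
    # error rate < 0.5% and at most 0.2 pp above control
    errors_ok = (canary["error_rate"] < 0.005
                 and canary["error_rate"] - control["error_rate"] <= 0.002)
    # business KPI drop no larger than 1% relative to control
    kpi_ok = canary["kpi"] >= 0.99 * control["kpi"]
    return latency_ok and errors_ok and kpi_ok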
Worked examples
Example 1: Recommender CTR canary
- Context: Model v2 claims +2% CTR vs v1.
- Plan: Shadow 24h → Canary 1% (30 min), 5% (60 min), 20%, 50%, 100%.
- Guardrails: p95 latency < 250 ms, 5xx < 0.2%, CTR not worse by >1% in any step.
- Outcome: At 20%, CTR +1.5%, latency stable; proceed to 50%, then 100%.
Sample routing rule (conceptual)
model_versions:
  v1: 80%
  v2: 20%  # canary step
criteria:
  latency_p95_ms: <= 250
  error_rate: <= 0.2%
  ctr_drop_vs_v1: <= 1%
Example 2: Fraud model A/B with fairness guardrail
- Context: v2 increases catch rate but might raise false positives for a segment.
- Plan: A/B 50/50 for 7 days; primary metric: fraud detected; guardrails: FP rate overall and by segment.
- Outcome: Overall +3% catch with insignificant FP delta, but one segment shows +0.8 pp FP increase. Decision: ship with a rule to cap risk for that segment, then iterate.
Segment guardrail template
guardrails:
  - metric: false_positive_rate
    by: [region, customer_tier]
    threshold: +0.5 pp vs control
    action: pause_if_exceeded
Example 3: NLP search relevance with holdout
- Context: v2 improves offline NDCG but may be slower.
- Plan: Canary to 10% max due to latency risk; then A/B 40/60 (v1/v2). KPI: clicks per search; SLO: p95 < 350 ms.
- Outcome: v2 +1.2% clicks, p95 +6% but within SLO; proceed to ramp.
Metrics and guardrails that matter
- Reliability: p50/p95/p99 latency, error/timeout rate, CPU/Memory/GPU utilization
- Business: CTR, conversion, revenue per session, fraud caught, support tickets
- Quality: online proxies for precision/recall/AUC, calibration error
- Data/behavior shifts: feature drift, segment performance, traffic mix changes
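A small sketch of how the reliability metrics above can be derived from raw request records; the record shape is an assumption.

def reliability_metrics(records: list[dict]) -> dict:
    # records are assumed to look like {"latency_ms": 120, "status": 200}
    latencies = sorted(r["latency_ms"] for r in records)
    def pct(p):  # nearest-rank percentile, good enough for monitoring
        return latencies[int(p * (len(latencies) - 1))]
    errors = sum(1 for r in records if r["status"] >= 500)
    return {
        "latency_p50_ms": pct(0.50),
        "latency_p95_ms": pct(0.95),
        "latency_p99_ms": pct(0.99),
        "error_rate": errors / len(records),
    }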
Good decision rule examples
- Proceed to the next canary step if: latency p95 is within 10% of control AND error rate is within 0.2 pp AND the KPI is not worse by more than 1%.
- Stop or roll back if any guardrail is breached for 10 consecutive minutes or N=1,000 requests.
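A sketch of the sustained-breach trigger described above, so a single noisy minute does not cause a rollback; the one-minute windowing is an assumption.

class BreachTracker:
    def __init__(self, max_breached_minutes=10, max_breached_requests=1000):
        self.max_minutes = max_breached_minutes
        self.max_requests = max_breached_requests
        self.breached_minutes = 0
        self.breached_requests = 0

    def update(self, window_breached: bool, window_requests: int) -> bool:
        # Call once per one-minute window; returns True when rollback should fire.
        if window_breached:
            self.breached_minutes += 1
            self.breached_requests += window_requests
        else:
            self.breached_minutes = 0
            self.breached_requests = 0
        return (self.breached_minutes >= self.max_minutes
                or self.breached_requests >= self.max_requests)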
Safe routing patterns
- Shadow mode: Validate outputs/latency without user impact.
- Sticky sessions: Keep a user in A or B to avoid cross-contamination.
- Random bucketing: Stable hash of user_id/request_id for reproducible splits (see the sketch after this list).
- Feature flags: Kill switches and instant rollbacks.
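A sketch of stable, reproducible bucketing with a hash of user_id; the salt and the 20% canary weight are illustrative.

import hashlib

def assign_variant(user_id: str, v2_weight: float = 0.20, salt: str = "reco-v2-rollout") -> str:
    # The same user_id always maps to the same bucket: sticky and auditable.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # map to [0, 1)
    return "v2" if bucket < v2_weight else "v1"

# Example: deterministically route ~20% of users to the canary.
print(assign_variant("user-12345"))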
Common mistakes and how to self-check
- Peeking at A/B too early: Decide minimum duration/sample size first; avoid daily flip-flops.
- Metric mismatch: Watching only latency and forgetting business KPIs. Include both.
- Biased splits: Not hashing on stable IDs, so sessions switch buckets. Use stable bucketing.
- Ignoring segments: The overall KPI looks flat while a segment is harmed. Always run segment-level checks.
- No rollback: Canary without auto-rollback increases incident time. Add guardrail-driven rollback.
Self-check checklist
- I have clear primary KPI and guardrails.
- I know exact canary steps and hold times.
- Rollbacks are automatic on guardrail breach.
- Traffic split is stable and auditable.
- Segment-level monitoring is configured.
Exercises
These mirror the exercises section below. Try them now, then compare your answers.
Exercise 1 (ex1): Design a safe canary ramp
- New model v2 may add +2% to KPI. Current SLO: p95 <= 300 ms, error <= 0.3%.
- Create a 5-step canary plan with hold times and exact pass/rollback criteria.
Tip
Use 1%→5%→20%→50%→100%. Include both reliability and KPI checks per step.
Exercise 2 (ex2): A/B decision rule
- You run a 50/50 A/B for 14 days. Define primary KPI, guardrails, required sample size or minimum duration, and the final ship/keep/iterate decision rule.
Tip
Include segment checks and a pre-registered analysis plan (no mid-test scope creep).
Practical projects
- Project 1: Implement a mock gateway config that supports weighted routing for two model versions and a kill switch. Validate via logs.
- Project 2: Build a small A/B evaluator that reads two CSVs of request-level outcomes (A vs B), computes uplift with CIs, and outputs a decision summary (a starting sketch follows this list).
- Project 3: Add segment guardrails (e.g., by country) and automatically flag a breach in a dashboard panel.
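For Project 2, a starting sketch under the assumption that each CSV has a binary converted column (1 = success); the 95% CI uses a normal approximation for the difference of two proportions.

import csv
import math

def conversion_rate(path: str):
    with open(path, newline="") as f:
        outcomes = [int(row["converted"]) for row in csv.DictReader(f)]
    return sum(outcomes) / len(outcomes), len(outcomes)

def ab_summary(path_a: str, path_b: str) -> str:
    p_a, n_a = conversion_rate(path_a)  # control (A)
    p_b, n_b = conversion_rate(path_b)  # treatment (B)
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    verdict = "ship" if lo > 0 else ("iterate" if hi < 0 else "keep testing")
    return f"uplift={diff:.4f}, 95% CI=({lo:.4f}, {hi:.4f}) -> {verdict}"

# Example: print(ab_summary("control.csv", "treatment.csv"))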
Mini challenge
You see a +1.5% KPI uplift in B but a 0.4 pp increase in error rate at 20% canary. p95 latency is stable. Do you proceed to 50%? Explain your decision and what you will monitor in the next step.
Learning path
- Before: Containerize and version models; set up observability (logs/metrics/traces).
- Now: Canary and A/B releases for controlled rollouts.
- Next: Automated rollbacks, progressive delivery pipelines, online evaluation frameworks.
Next steps
- Write a standard operating procedure (SOP) template for rollouts, including guardrails and rollback rules.
- Create reusable routing configs for canary and A/B with stable bucketing.
- Automate a nightly report comparing A vs B on key metrics and segments.
Progress & test
Take the quick test to check your understanding. The test is available to everyone. If you are logged in, your progress will be saved automatically.