Why this matters
A bad model release can degrade latency, predictions, or business KPIs within minutes. A practiced rollback plan limits user impact and buys time for root-cause analysis instead of live firefighting.
Rollback toolkit
- Deployment styles:
  - Shadow: send a copy of traffic to the new model without affecting users. Safest pre-check.
  - Canary: gradually shift traffic (e.g., 1% → 5% → 25% → 50% → 100%). Rollback just shifts traffic back.
  - Blue/Green: two environments (blue = live, green = new). Flip the router; rollback is flipping back.
- Versioning: immutable image tags (e.g., model:v1.3.2), model registry entries, feature pipeline versions.
- Guardrails: SLOs, anomaly alerts, business KPIs, input-data quality checks, and circuit breakers.
- Traffic controls: weighted routing, session stickiness, per-region rollouts (see the routing sketch after this list).
- Runbooks: a one-page step list with owner, triggers, commands, and verification checks.
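To make weighted routing with session stickiness concrete, here is a minimal sketch in Python; the `route` function, version names, and hashing scheme are illustrative assumptions, not any specific router's API.

```python
import hashlib

# Illustrative weights: the fraction of traffic each model version receives.
# During a canary you would step model:v2 through 0.01 -> 0.05 -> 0.25 -> ...
WEIGHTS = {"model:v1": 0.95, "model:v2": 0.05}

def route(session_id: str) -> str:
    """Pick a model version for a session with sticky, weighted routing.

    Hashing the session id pins a session to one version until the
    weights change; rollback is just setting v2's weight to 0.0.
    """
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 10_000
    return "model:v2" if bucket < WEIGHTS["model:v2"] * 10_000 else "model:v1"
```

Hash-based stickiness means the same user keeps seeing the same version mid-rollout, which keeps comparisons clean and avoids flip-flopping behavior.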
Common rollback triggers
- Latency: P95 or P99 up by > X% for Y minutes.
- Error rate: 5xx or timeouts above threshold.
- Prediction quality proxy: business metric drop (e.g., CTR, conversion, acceptance rate).
- Data issues: schema mismatch, null spikes, distribution shift beyond bounds. (A trigger-check sketch follows this list.)
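As a rough sketch, these triggers can be encoded as a single check. The `MetricWindow` shape is an assumption about what your monitoring client returns, and the thresholds mirror the runbook below; both are placeholders to tune.

```python
from dataclasses import dataclass

@dataclass
class MetricWindow:
    values: list[float]   # one reading per minute, newest last
    minutes: int          # lookback length the values cover

def should_rollback(p95_change: MetricWindow, error_rate: MetricWindow,
                    kpi_change: MetricWindow, kpi_significant: bool) -> bool:
    """Return True when any trigger condition from the list above holds."""
    # P95 latency up more than 30% for the whole 5-minute window.
    latency_breach = (p95_change.minutes >= 5
                      and all(v > 0.30 for v in p95_change.values))
    # Error rate above 5% for the whole 3-minute window.
    error_breach = (error_rate.minutes >= 3
                    and all(v > 0.05 for v in error_rate.values))
    # KPI down more than 2% for 15 minutes, with statistical significance.
    kpi_breach = (kpi_change.minutes >= 15 and kpi_significant
                  and all(v < -0.02 for v in kpi_change.values))
    return latency_breach or error_breach or kpi_breach
```

Requiring the breach to hold for the whole window guards against rolling back on a single noisy reading.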
Worked examples
Example 1: Canary rollback
- Start at 5% traffic to v2 while 95% stays on v1.
- Monitors show P95 latency +40% for 6 minutes (SLO breach).
- Action: set traffic to 0% v2, 100% v1 (instant rollback).
- Post-rollback: freeze the v2 image and inputs; open an incident; capture logs and sample requests; schedule root-cause analysis. (The rollback action is sketched as code below.)
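Sketched as code, the rollback action is a single weight update plus a verification read-back; `set_weights` and `get_weights` are hypothetical stand-ins for whatever your traffic router exposes.

```python
def canary_rollback(router) -> None:
    """Instant rollback: send all traffic back to the last-known-good version."""
    router.set_weights({"model:v1": 1.0, "model:v2": 0.0})  # hypothetical router API
    # Read back the weights to confirm the change took effect before
    # declaring the rollback complete.
    assert router.get_weights() == {"model:v1": 1.0, "model:v2": 0.0}
```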
Example 2: Blue/Green with feature change
- Green has model v2 and new feature encoder v5.
- Flip the router to Green. The business KPI drops 3% within 10 minutes, with high statistical confidence.
- Rollback steps:
  - Flip back to Blue (model v1 + encoder v4).
  - Disable the feature flag for encoder v5 to prevent accidental reuse.
- Note: the model and features are rolled back together to maintain compatibility (see the sketch below).
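A minimal sketch of that paired rollback, with `set_active_environment` and the flag client as hypothetical stand-ins for your own tooling; the point is that the router flip and the encoder gate happen together.

```python
def rollback_blue_green(router, flags) -> None:
    """Flip back to Blue and gate the new encoder in one step."""
    # Blue still runs the last-known-good pair: model v1 + encoder v4.
    router.set_active_environment("blue")   # hypothetical router call
    # Disable the flag so nothing silently re-enables encoder v5.
    flags.disable("feature-encoder-v5")     # hypothetical feature-flag client
```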
Example 3: Shadow before risky rollout
- Run v2 in shadow mode for 24 hours, mirroring 10% of traffic.
- Compare v2 outputs against v1. Output drift spikes for a key segment at night.
- Decision: do not canary yet; fix the night-segment feature before any user-facing rollout (no rollback needed, since users were never exposed). A comparison sketch follows.
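One way to surface a segment-level problem like this is a per-segment output diff over the mirrored traffic. A sketch, assuming shadow logs carry both versions' predictions and a segment label; the field names are illustrative.

```python
from collections import defaultdict

def disagreement_by_segment(records) -> dict[str, float]:
    """Mean |v2 - v1| prediction gap per segment.

    records: iterable of dicts with keys 'segment', 'pred_v1', 'pred_v2',
    e.g. mirrored shadow-traffic logs joined on request id.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for r in records:
        total[r["segment"]] += abs(r["pred_v2"] - r["pred_v1"])
        count[r["segment"]] += 1
    return {seg: total[seg] / count[seg] for seg in count}
```

An aggregate average would have hidden the night-time spike; slicing by segment (and by hour) is what makes shadow mode a useful gate.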
Minimal rollback runbook
1. Trigger
When: P95 latency up 30% for 5 minutes, OR error rate above 5% for 3 minutes, OR KPI down 2% for 15 minutes with statistical significance.
2. Action
Route 100% traffic to last-known-good (v1). Keep v1 warm to avoid cold starts.
3. Verify
Confirm SLOs recover within 10 minutes. Check dashboards, logs, and sample predictions.
4. Communicate
Post an incident update in the team channel; note start/end times and an impact summary.
5. Preserve evidence
Freeze v2 artifacts (image digest, model registry version, feature versions) and copy key request/response samples.
6. Follow-up
Find the root cause, add a test or alert to catch it earlier, and plan a safer re-release. (Steps 2–5 are sketched as code below.)
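Tying steps 2 through 5 together, a minimal automation sketch; `router`, `metrics`, `notify`, and `archive` are hypothetical stand-ins for your own tooling, and a real controller would also escalate when verification fails.

```python
import time

def execute_rollback(router, metrics, notify, archive) -> None:
    """Runbook steps 2-5; the triggers from step 1 fire this function."""
    started = time.strftime("%Y-%m-%d %H:%M:%S")
    # Step 2, Action: route everything back to last-known-good.
    router.set_weights({"model:v1": 1.0, "model:v2": 0.0})
    # Step 3, Verify: poll for SLO recovery, once a minute for 10 minutes.
    recovered = False
    for _ in range(10):
        time.sleep(60)
        if metrics.slos_healthy():   # hypothetical health check
            recovered = True
            break
    # Step 4, Communicate: post an incident update either way.
    notify(f"Rollback to v1 started {started}; SLOs recovered: {recovered}")
    # Step 5, Preserve evidence: freeze v2 artifacts for root-cause analysis.
    archive.freeze(["v2-image-digest", "v2-registry-entry", "v2-feature-versions"])
```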
Operational tips
- Always keep the previous version deployed and healthy (warm standby).
- Use immutable digests, not mutable tags.
- Tie model and feature versions together in config (see the sketch after these tips).
- Automate rollback on severe SLO breaches; require manual approval for mild ones.
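One lightweight way to keep model and feature versions paired is a single config object per release, so nothing can roll back one piece without the other. A sketch with placeholder identifiers:

```python
# One release = one version-paired unit: if any piece rolls back, all do.
RELEASES = {
    "v2": {
        "model": "registry://my-model/2.0.0",   # placeholder registry URI
        "feature_pipeline": "encoder:v5",
        # Pin by immutable digest, never by a mutable tag like :latest.
        "image": "model@sha256:...",            # placeholder digest
    },
    "last_known_good": {
        "model": "registry://my-model/1.4.0",
        "feature_pipeline": "encoder:v4",
        "image": "model@sha256:...",
    },
}
```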
Common mistakes and self-check
- No clear trigger thresholds → Self-check: can someone new to the team run the rollback without guessing?
- Rolling back model but not features → Self-check: list current model, feature pipeline, and configs; are they version-paired?
- Cold rollback → Self-check: is last-known-good instance warm and health-checked?
- Missing post-rollback verification → Self-check: what specific metrics prove recovery?
- Deleting evidence → Self-check: are artifacts and samples archived after incident?
Exercises
- Exercise 1 — Design a canary rollback plan
  Create a 5% → 25% → 50% → 100% rollout with explicit rollback triggers and verification. Include which metrics you track and their thresholds.
  Deliverable: a short runbook snippet.
  Time: 15–20 minutes.
- Exercise 2 — Write a rollback runbook snippet
  Write the exact commands or steps your router/orchestrator would need to shift traffic back to v1, and what to check after rollback.
  Deliverable: a step-by-step list.
Self-check checklist
- [ ] I set clear triggers based on SLOs and KPIs
- [ ] I keep last-known-good warm
- [ ] I verify with both system and business metrics
- [ ] I freeze artifacts and capture evidence
- [ ] I define owner and communication steps
Practical projects
- Project A: Build a canary controller config that shifts 1%→5%→25% traffic and auto-rolls back on SLO breach.
- Project B: Create a blue/green setup for a model service with a one-click flip and a post-rollback verification job.
- Project C: Implement shadow mode replay of 5% traffic and an output diff report to block risky rollouts.
Learning path
- Before: Containerizing models; health checks; monitoring basics.
- Now: Rollback strategy (this lesson).
- Next: Automated promotion policies; chaos testing for ML services; drift detection.
Next steps
- Templatize your rollback runbook.
- Add alert rules that map 1:1 to rollback triggers.
- Practice a game-day: simulate a bad release and execute the rollback.
Mini challenge
Your new model needs a new categorical encoding. Design a rollout where you can roll back the model without breaking due to an encoder mismatch. Include how you would version and gate the encoder, and how you would verify after rollback.