Rollback Strategy

Learn Rollback Strategy for free with explanations, exercises, and a quick test (for MLOps Engineer).

Published: January 4, 2026 | Updated: January 4, 2026

Why this matters

Rollback toolkit

  • Deployment styles:
    • Shadow: send a copy of traffic to the new model, but don’t affect users. Safest pre-check.
    • Canary: gradually shift traffic (e.g., 1% → 5% → 25% → 50% → 100%). Rollback just sends traffic back.
    • Blue/Green: two environments (blue=live, green=new). Flip the router. Rollback is flipping back.
  • Versioning: immutable image references (e.g., model:v1.3.2 pinned by digest), model registry entries, feature pipeline versions.
  • Guardrails: SLOs, anomaly alerts, business KPIs, input-data quality checks, and circuit breakers.
  • Traffic controls: weighted routing, session stickiness, per-region rollouts.
  • Runbooks: one-page step list, owner, triggers, commands, and verification checks.
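
The traffic controls above (weighted routing with session stickiness) can be sketched in a few lines. This is a minimal illustration, not a production router: the function name `route`, the 10,000-bucket scheme, and the use of SHA-256 for bucketing are all assumptions made for the example. Hashing the session ID keeps each session on the same version for a given weight, and setting the weight to 0 is the rollback.

```python
import hashlib


def route(session_id: str, canary_percent: float) -> str:
    """Return "v2" for roughly canary_percent of sessions, otherwise "v1".

    Hypothetical helper: the same session_id always lands in the same
    bucket, which gives session stickiness without storing any state.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in 0..9999
    return "v2" if bucket < canary_percent * 100 else "v1"


# Rollback is just a weight change: every session goes back to v1.
print(route("session-123", canary_percent=0))
```

Because buckets are fixed per session, raising the weight only adds sessions to v2; it never moves an existing v2 session back to v1 mid-ramp.
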
Common rollback triggers
  • Latency: P95 or P99 more than X% above baseline for Y consecutive minutes.
  • Error rate: 5xx or timeouts above threshold.
  • Prediction quality proxy: business metric drop (e.g., CTR, conversion, acceptance-rate).
  • Data issues: schema mismatch, null spikes, distribution shift beyond bounds.
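
A trigger of the form "up by more than X% for Y minutes" can be checked as a sustained breach over a window of samples. The sketch below assumes one metric sample per minute and a known baseline; the function name `sustained_breach` and its parameters are illustrative, not taken from any monitoring tool.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], baseline: float,
                     max_increase_pct: float, window: int) -> bool:
    """True if each of the last `window` samples exceeds the baseline
    by more than max_increase_pct percent.

    Assumes one sample per minute, so `window` is the Y in "for Y minutes".
    """
    if window <= 0 or len(samples) < window:
        return False  # not enough data to call it a sustained breach
    limit = baseline * (1 + max_increase_pct / 100)
    return all(value > limit for value in samples[-window:])


# P95 latency in ms, baseline 100 ms, trigger at +30% for 5 minutes.
p95 = [100, 140, 142, 139, 145, 141]
print(sustained_breach(p95, baseline=100, max_increase_pct=30, window=5))
```

Requiring every sample in the window to breach avoids rolling back on a single noisy minute, at the cost of reacting a little later.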

Worked examples

Example 1: Canary rollback

  1. Start at 5% traffic to v2 while 95% stays on v1.
  2. Monitors show P95 latency +40% for 6 minutes (SLO breach).
  3. Action: set traffic to 0% v2, 100% v1 (instant rollback).
  4. Post-rollback: freeze v2 image and inputs; open incident; capture logs and sample requests; schedule root-cause analysis.
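
The canary flow in Example 1 can be simulated as a ramp that drops to 0% on the first breach. This is a hedged sketch: `run_rollout`, the ramp values, and the `breached_at` callback are hypothetical stand-ins for a real controller and its monitors.

```python
from typing import Callable, List, Sequence


def run_rollout(breached_at: Callable[[int], bool],
                ramp: Sequence[int] = (5, 25, 50, 100)) -> List[int]:
    """Walk the ramp and return the sequence of v2 traffic weights applied.

    breached_at(weight) is a hypothetical monitor check that reports
    whether an SLO breach occurred while v2 held that share of traffic.
    """
    history: List[int] = []
    for weight in ramp:
        history.append(weight)
        if breached_at(weight):
            history.append(0)  # instant rollback: 0% v2, 100% v1
            break
    return history


# Example 1: the breach happens at the 5% step, so v2 never goes further.
print(run_rollout(lambda weight: weight == 5))  # [5, 0]
```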

Example 2: Blue/Green with feature change

  1. Green has model v2 and new feature encoder v5.
  2. Flip router to Green. Business KPI drops 3% within 10 minutes, with high statistical confidence.
  3. Rollback steps:
    • Flip back to Blue (model v1 + encoder v4).
    • Disable feature flag for encoder v5 to prevent accidental reuse.
    • Note: model and features are rolled back together to maintain compatibility.
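
One way to make sure the model and encoder roll back together is to treat them as a single release bundle that the router flips as a unit. The classes below are an illustrative sketch; `Release` and `BlueGreenRouter` are hypothetical names, not part of any specific orchestrator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """A model and the feature encoder it was validated with."""
    model: str
    encoder: str


class BlueGreenRouter:
    """Hypothetical router that serves exactly one environment at a time."""

    def __init__(self, blue: Release, green: Release) -> None:
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"

    def active(self) -> Release:
        return self.environments[self.live]

    def flip(self) -> Release:
        # Switching environments swaps model and encoder together,
        # so a rollback can never pair model v1 with encoder v5.
        self.live = "green" if self.live == "blue" else "blue"
        return self.active()


router = BlueGreenRouter(Release("v1", "v4"), Release("v2", "v5"))
router.flip()         # release: Green goes live
print(router.flip())  # rollback: Blue (model v1 + encoder v4) is live again
```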

Example 3: Shadow before risky rollout

  1. Run v2 in shadow mode for 24 hours, mirroring 10% of traffic.
  2. Compare outputs v2 vs. v1. Drift spikes for a key segment at night.
  3. Decision: do not canary yet; fix night-segment feature before user-facing rollout (no rollback needed).
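
The v2 vs. v1 comparison in Example 3 can be approximated with a per-segment disagreement rate. The sketch below assumes each mirrored request is logged as a (segment, v1 score, v2 score) tuple; the tolerance, the blocking threshold, and both function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def disagreement_by_segment(records: Iterable[Tuple[str, float, float]],
                            tolerance: float = 0.05) -> Dict[str, float]:
    """Fraction of mirrored requests per segment where the v1 and v2
    scores differ by more than `tolerance`."""
    totals: Dict[str, int] = defaultdict(int)
    diffs: Dict[str, int] = defaultdict(int)
    for segment, v1_score, v2_score in records:
        totals[segment] += 1
        if abs(v1_score - v2_score) > tolerance:
            diffs[segment] += 1
    return {segment: diffs[segment] / totals[segment] for segment in totals}


def blocked_segments(rates: Dict[str, float], max_rate: float = 0.10) -> List[str]:
    """Segments whose disagreement rate is too high to start a canary."""
    return sorted(seg for seg, rate in rates.items() if rate > max_rate)


shadow_log = [("day", 0.50, 0.51), ("day", 0.40, 0.40),
              ("night", 0.50, 0.80), ("night", 0.30, 0.31)]
print(blocked_segments(disagreement_by_segment(shadow_log)))  # ['night']
```

Breaking the diff down by segment is what surfaces a problem like the night-time drift; an overall average could hide it.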

Minimal rollback runbook

1. Trigger
When: P95 latency up 30% for 5 minutes, OR error rate above 5% for 3 minutes, OR KPI down 2% for 15 minutes with statistical significance.
2. Action
Route 100% traffic to last-known-good (v1). Keep v1 warm to avoid cold starts.
3. Verify
Confirm SLOs recover within 10 minutes. Check dashboards, logs, and sample predictions.
4. Communicate
Post an incident update in the team channel; note start/end times and an impact summary.
5. Preserve evidence
Freeze v2 artifacts (image digest, model registry version, feature versions) and copy key request/response samples.
6. Follow-up
Find the root cause, add a test/alert to catch it earlier, and plan a safer re-release.
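
For the "Preserve evidence" step, one option is to write a small manifest that pins the v2 artifacts and carries a checksum so later edits are detectable. This is a sketch under assumptions: the field names, the `build_evidence_manifest` helper, and the placeholder values are all hypothetical.

```python
import hashlib
import json
from typing import Dict


def build_evidence_manifest(incident_id: str, image_digest: str,
                            model_version: str,
                            feature_versions: Dict[str, str]) -> Dict[str, object]:
    """Bundle the frozen v2 artifact identifiers with a checksum.

    The checksum covers the sorted JSON form of the other fields, so the
    same inputs always produce the same value.
    """
    manifest: Dict[str, object] = {
        "incident_id": incident_id,
        "image_digest": image_digest,
        "model_version": model_version,
        "feature_versions": dict(feature_versions),
    }
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["checksum"] = hashlib.sha256(payload).hexdigest()
    return manifest


# Placeholder values for illustration only.
manifest = build_evidence_manifest(
    incident_id="INC-EXAMPLE",
    image_digest="sha256:<image-digest>",
    model_version="v2",
    feature_versions={"encoder": "v5"},
)
print(json.dumps(manifest, indent=2))
```
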
Operational tips
  • Always keep the previous version deployed and healthy (warm standby).
  • Use immutable digests, not mutable tags.
  • Tie model and feature versions together in config.
  • Automate rollback on severe SLO breaches; manual approval for mild ones.
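
The last tip (automatic rollback for severe breaches, manual approval for mild ones) amounts to a two-tier threshold check. In the sketch below, the mild latency and severe error thresholds reuse the runbook figures (+30% and 5%); the other two thresholds and the function name are assumptions to adjust for your own SLOs.

```python
# Thresholds are illustrative; only +30% latency and 5% errors come from
# the runbook above, the rest are assumed values.
MILD_LATENCY_PCT = 30.0
SEVERE_LATENCY_PCT = 50.0
MILD_ERROR_PCT = 1.0
SEVERE_ERROR_PCT = 5.0


def rollback_decision(latency_increase_pct: float, error_rate_pct: float) -> str:
    """Classify the current state as "auto_rollback", "needs_approval",
    or "continue" based on latency increase and error rate."""
    severe = (latency_increase_pct >= SEVERE_LATENCY_PCT
              or error_rate_pct >= SEVERE_ERROR_PCT)
    mild = (latency_increase_pct >= MILD_LATENCY_PCT
            or error_rate_pct >= MILD_ERROR_PCT)
    if severe:
        return "auto_rollback"
    if mild:
        return "needs_approval"
    return "continue"


print(rollback_decision(latency_increase_pct=60, error_rate_pct=0.2))  # auto_rollback
print(rollback_decision(latency_increase_pct=35, error_rate_pct=0.2))  # needs_approval
```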

Common mistakes and self-check

  • No clear trigger thresholds → Self-check: can someone new to the team run the rollback without guessing?
  • Rolling back model but not features → Self-check: list current model, feature pipeline, and configs; are they version-paired?
  • Cold rollback → Self-check: is last-known-good instance warm and health-checked?
  • Missing post-rollback verification → Self-check: what specific metrics prove recovery?
  • Deleting evidence → Self-check: are artifacts and samples archived after incident?

Exercises

  1. Exercise 1 — Design a canary rollback plan
    Create a 5%→25%→50%→100% rollout with explicit rollback triggers and verification. Include which metrics you track and thresholds.
    Deliverable: a short runbook snippet.
    Time: 15–20 minutes.
  2. Exercise 2 — Write a rollback runbook snippet
    Write exact commands or steps your router/orchestrator would need to shift traffic back to v1, and what to check after rollback.
    Deliverable: step-by-step list.
  • [ ] I set clear triggers based on SLOs and KPIs
  • [ ] I keep last-known-good warm
  • [ ] I verify with both system and business metrics
  • [ ] I freeze artifacts and capture evidence
  • [ ] I define owner and communication steps

Practical projects

  • Project A: Build a canary controller config that shifts 1%→5%→25% traffic and auto-rolls back on SLO breach.
  • Project B: Create a blue/green setup for a model service with a one-click flip and a post-rollback verification job.
  • Project C: Implement shadow mode replay of 5% traffic and an output diff report to block risky rollouts.

Learning path

  • Before: Containerizing models; health checks; monitoring basics.
  • Now: Rollback strategy (this lesson).
  • Next: Automated promotion policies; chaos testing for ML services; drift detection.

Next steps

  • Templatize your rollback runbook.
  • Add alert rules that map 1:1 to rollback triggers.
  • Practice a game-day: simulate a bad release and execute the rollback.

Mini challenge

Your new model needs a new categorical encoding. Design a rollout where you can roll back the model without breakage from an encoder mismatch. Include how you would version and gate the encoder and how you would verify after rollback.

Practice Exercises

Instructions

Create a canary rollout plan for model v2 replacing v1. Start at 5% traffic and aim for 100%. Define:

  • Exact traffic steps and dwell times
  • Metrics and thresholds (latency, error rate, KPI)
  • Rollback command or action
  • Post-rollback verification checks
Format suggestion
Steps: 5% (15m) → 25% (30m) → 50% (60m) → 100%
Metrics: P95 latency, 5xx rate, KPI (e.g., conversion)
Triggers: ...
Rollback: ...
Verify: ...
Expected Output
A concise runbook-like plan showing traffic ramp, SLO/KPI thresholds, rollback action, and verification steps.

Rollback Strategy — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.
