Who this is for
- Machine Learning Engineers deploying models to production (batch or real-time).
- Data/ML platform engineers designing safe release pipelines.
- MLEs on-call for incidents who need fast, low-risk responses.
Prerequisites
- Basic CI/CD concepts (pipelines, artifacts, environments).
- Familiarity with model versioning and a model registry.
- Understanding of online/offline metrics and A/B or canary deployment basics.
Why this matters
A bad model release can hurt users and business metrics within minutes. Pre-agreed rollback and rollforward plans, backed by guardrails and observability, turn an incident into a fast, low-risk response instead of an improvised one.
Mental model
Imagine driving with two safety options: a reverse gear (rollback) and a clear detour route (rollforward). You pick whichever gets you to safety fastest with the least risk. Your dashboard (observability) and road rules (guardrails) tell you when to use which.
When to prefer rollback vs rollforward
- Rollback: Unknown root cause, severe metric drop, or broken compatibility. Goal: stabilize now.
- Rollforward: Known root cause with a small, tested fix ready. Goal: ship the fix quickly.
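These two rules can be written down as a tiny decision function so the call is mechanical at 2am. A minimal sketch; the inputs (root_cause_known, fix_is_small_and_tested, severity) are hypothetical labels, not fields your tooling necessarily provides:

```python
def choose_response(root_cause_known: bool, fix_is_small_and_tested: bool,
                    severity: str) -> str:
    """Return 'rollback' or 'rollforward' for a degraded model release.

    Hypothetical decision rule mirroring the guidance above:
    stabilize first unless a small, tested fix is ready.
    """
    if severity == "severe" or not root_cause_known:
        return "rollback"      # unknown cause or severe impact: stabilize now
    if fix_is_small_and_tested:
        return "rollforward"   # known cause, tested fix: ship it quickly
    return "rollback"          # default to the lower-risk action


# Example: known cause but the fix is still untested -> roll back first.
print(choose_response(root_cause_known=True, fix_is_small_and_tested=False,
                      severity="moderate"))  # -> "rollback"
```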
Core strategies
Deployment patterns
- Blue-Green: Keep two identical environments. Switch traffic to green only when healthy; switch back to blue to roll back.
- Canary: Send a small percentage of traffic to the new version. Automatically roll back if guardrails are breached.
- Shadow: The new model runs in parallel without impacting users. Compare outputs. Easy rollback: simply turn the shadow off (see the shadow-mode sketch after this list).
- A/B: Split traffic to compare impact. Rollback = route all traffic to control.
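As a concrete illustration of the shadow pattern, here is a minimal sketch that serves the primary model while logging, but never serving, the candidate's predictions. The predict functions and logger wiring are hypothetical placeholders, not tied to any specific framework:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shadow_compare")

def predict_primary(features: dict) -> float:
    # Placeholder for the production model; replace with a real model call.
    return 0.42

def predict_shadow(features: dict) -> float:
    # Placeholder for the candidate model running in shadow mode.
    return 0.40

def handle_request(features: dict) -> float:
    """Serve the primary prediction; log the shadow prediction for offline comparison."""
    primary = predict_primary(features)
    try:
        shadow = predict_shadow(features)
        logger.info("primary=%.4f shadow=%.4f delta=%.4f",
                    primary, shadow, shadow - primary)
    except Exception:
        # A failing shadow model must never affect users.
        logger.exception("shadow prediction failed")
    return primary

print(handle_request({"user_id": 123}))
```

Because users only ever see the primary output, rolling back the shadow is just disabling this logging path.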
Model toggles and config-driven selection
- Use a config flag or registry pointer (e.g., model_id=prod@1.14) to select the active model. Rollback = point back to the previous version.
- Keep artifacts immutable and reproducible. Never edit a deployed artifact in-place.
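A minimal sketch of config-driven selection, assuming a hypothetical serving_config.json pointer file and a local directory of immutable artifacts; rolling back is nothing more than editing the pointer back to the previous version:

```python
import json
import pathlib

CONFIG_PATH = pathlib.Path("serving_config.json")  # e.g. {"active_model_id": "prod@1.14"}
MODEL_DIR = pathlib.Path("models")                 # immutable artifacts, one folder per version

def active_model_path() -> pathlib.Path:
    """Resolve the active model artifact from the config pointer."""
    config = json.loads(CONFIG_PATH.read_text())
    model_id = config["active_model_id"]           # e.g. "prod@1.14"
    version = model_id.split("@", 1)[1]
    return MODEL_DIR / version / "model.bin"

def rollback(previous_model_id: str) -> None:
    """Rollback = point the config back at the previous version; artifacts are never edited."""
    config = json.loads(CONFIG_PATH.read_text())
    config["active_model_id"] = previous_model_id
    CONFIG_PATH.write_text(json.dumps(config, indent=2))

# Example (assumes the config file and model folders above exist):
# rollback("prod@1.13")
# print(active_model_path())
```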
Data and schema changes
- Prefer backward-compatible schema changes (additive fields). Use feature flags for readers.
- For breaking changes, deploy reader and writer changes in stages. The rollback plan must cover both code and data shape.
- Batch outputs: version the output tables or files. Rollback = republish previous version and update consumers to read it.
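One way to keep batch outputs reversible is to write every run to its own versioned path and publish a small pointer that consumers read. A minimal sketch using hypothetical local paths (the same idea applies to s3:// prefixes or table versions):

```python
import json
import pathlib

OUTPUT_ROOT = pathlib.Path("predictions")      # predictions/v1.13/, predictions/v1.14/, ...
POINTER_FILE = OUTPUT_ROOT / "LATEST.json"     # consumers read only this pointer

def publish(version: str, rows: list[dict]) -> None:
    """Write an immutable, versioned output and advance the read pointer."""
    out_dir = OUTPUT_ROOT / version
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "part-000.jsonl").write_text(
        "\n".join(json.dumps(r) for r in rows)
    )
    POINTER_FILE.write_text(json.dumps({"version": version}))

def rollback_pointer(previous_version: str) -> None:
    """Rollback = repoint consumers to the last known-good version; nothing is overwritten."""
    POINTER_FILE.write_text(json.dumps({"version": previous_version}))

publish("v1.14", [{"customer_id": 1, "score": 0.71}])
rollback_pointer("v1.13")                      # e.g. after a bad publish
print(json.loads(POINTER_FILE.read_text()))    # -> {"version": "v1.13"}
```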
Guardrails and triggers
- Triggers: metric degradation (latency, error rate), business KPIs (CTR/revenue), data/feature drift, concept drift, incidents (timeouts, memory spikes).
- Guardrails: pre-agreed thresholds and time windows for automated action.
Example guardrails
- Error rate > 2% for 5 minutes → rollback.
- P95 latency +30% vs baseline for 10 minutes → rollback.
- CTR drop > 5% on canary with p-value < 0.05 for 30 minutes → rollback.
- Feature null-rate > 1% above baseline for 10 minutes → rollback.
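Guardrails like these are easiest to enforce when they are written down as data and checked mechanically. A minimal sketch, assuming metrics have already been aggregated over the stated time windows; the WindowMetrics fields are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class WindowMetrics:
    """Metrics aggregated over the guardrail's time window (hypothetical structure)."""
    error_rate: float                # fraction of failed requests
    p95_latency_ms: float
    baseline_p95_latency_ms: float
    ctr_delta: float                 # relative CTR change vs control, e.g. -0.06 = -6%
    ctr_p_value: float
    feature_null_rate_delta: float   # null-rate minus baseline null-rate

def should_rollback(m: WindowMetrics) -> list[str]:
    """Return the list of breached guardrails; any breach means roll back."""
    breaches = []
    if m.error_rate > 0.02:
        breaches.append("error rate > 2%")
    if m.p95_latency_ms > 1.30 * m.baseline_p95_latency_ms:
        breaches.append("p95 latency +30% vs baseline")
    if m.ctr_delta < -0.05 and m.ctr_p_value < 0.05:
        breaches.append("CTR drop > 5% (significant)")
    if m.feature_null_rate_delta > 0.01:
        breaches.append("feature null-rate > 1% above baseline")
    return breaches

m = WindowMetrics(error_rate=0.005, p95_latency_ms=180.0, baseline_p95_latency_ms=120.0,
                  ctr_delta=-0.01, ctr_p_value=0.4, feature_null_rate_delta=0.0)
print(should_rollback(m))   # -> ['p95 latency +30% vs baseline']
```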
Step-by-step playbooks
Playbook: Real-time model service
- Detect: Alert fires (e.g., error rate > 2%).
- Isolate: Check canary vs baseline dashboards and request logs.
- Decide: If root cause unclear → rollback. If known and small fix ready → rollforward.
- Execute rollback: Switch model_id to previous version or route traffic back to blue. Confirm health probes pass.
- Execute rollforward (if chosen): Deploy the hotfix as a canary at 5%, validate guardrails, then ramp 5% → 25% → 50% → 100% (a minimal ramp sketch follows this playbook).
- Verify: Watch metrics for at least one service-level objective window.
- Follow-up: Create an incident note, a root-cause analysis, and tasks for the permanent fix.
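The ramp in the rollforward step can be automated so traffic only advances while guardrails stay healthy. A minimal sketch; set_canary_weight and guardrails_healthy are hypothetical hooks into your traffic router and metrics backend, not real APIs:

```python
import time

RAMP_STEPS = [0.05, 0.25, 0.50, 1.00]   # 5% -> 25% -> 50% -> 100%
HOLD_SECONDS = 600                      # watch each step for one guardrail window

def set_canary_weight(weight: float) -> None:
    # Placeholder: call your traffic router / service mesh here.
    print(f"canary weight set to {weight:.0%}")

def guardrails_healthy() -> bool:
    # Placeholder: query your metrics backend over the hold window.
    return True

def ramp_or_rollback() -> bool:
    """Advance the canary step by step; roll back to 0% on any guardrail breach."""
    for weight in RAMP_STEPS:
        set_canary_weight(weight)
        time.sleep(HOLD_SECONDS)        # in practice, poll metrics during this window
        if not guardrails_healthy():
            set_canary_weight(0.0)      # rollback: all traffic to the previous version
            return False
    return True                         # rollforward completed

# ramp_or_rollback()   # left commented: the hold windows make this a ~40-minute run
```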
Playbook: Batch predictions pipeline
- Detect: Data validation or QA checks fail; consumers report quality issues.
- Freeze: Stop downstream consumption by flipping a read pointer to last known-good partition/version.
- Rollback: Restore prior predictions (e.g., s3://predictions/v1.13/) and update consumers.
- Remediate: Fix pipeline or features; re-run with a new version id.
- Verify: Sample predictions; compare with baseline metrics; re-enable consumption.
- Document: Note impacted partitions and corrective actions.
Worked examples
Example 1: Canary hurts CTR
Situation: New ranking model at 10% canary shows -6.5% CTR vs control for 45 minutes.
- Action: The CTR guardrail triggers an automatic rollback. Traffic returns to control model v1.12.
- Verification: CTR returns to baseline within 10 minutes.
- Follow-up: Offline analysis finds feature scaling mismatch. Fix, retrain, and redeploy as v1.13 via canary again.
Example 2: Data drift at feature store
Situation: Real-time feature null-rate increases from 0.2% to 3% due to an upstream API change.
- Action: Roll forward with a small hotfix that supplies a safe default and guards against nulls; deploy it via canary (see the null-guard sketch after this example).
- Fallback: If the hotfix fails, roll back the model to a version that is robust to missing features.
- Outcome: Null-rate returns to under 0.3%, with no CTR impact.
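A minimal sketch of the kind of null-guard hotfix described in this example, assuming a hypothetical feature-fetch call and per-feature safe defaults agreed with the model owners:

```python
# Safe defaults agreed with the model owners (hypothetical values).
FEATURE_DEFAULTS = {"user_age_days": 0.0, "avg_session_minutes": 12.5}

def fetch_raw_features(user_id: int) -> dict:
    # Placeholder for the real-time feature store call; the upstream API
    # change means some values may come back as None.
    return {"user_age_days": None, "avg_session_minutes": 30.0}

def fetch_features_with_defaults(user_id: int) -> dict:
    """Replace missing/None features with safe defaults instead of failing or passing nulls."""
    raw = fetch_raw_features(user_id)
    return {
        name: (raw.get(name) if raw.get(name) is not None else default)
        for name, default in FEATURE_DEFAULTS.items()
    }

print(fetch_features_with_defaults(42))
# -> {'user_age_days': 0.0, 'avg_session_minutes': 30.0}
```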
Example 3: Batch job writes bad scores
Situation: Nightly credit risk scores shifted because of a one-hot encoding change.
- Action: Freeze consumers; repoint to previous partition; re-run batch with corrected transformer.
- Verification: Spot-check individual customers; compare the KS statistic against the baseline distribution (see the KS-check sketch after this example).
- Outcome: Corrected scores published; incident closed with schema change policy update.
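A minimal sketch of the KS check mentioned in the verification step, using scipy.stats.ks_2samp on synthetic baseline and corrected score samples (the data and the acceptance threshold here are purely illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)

# Synthetic stand-ins for last-known-good scores and the re-run corrected scores.
baseline_scores = rng.beta(2, 5, size=10_000)
corrected_scores = rng.beta(2, 5, size=10_000)

statistic, p_value = ks_2samp(baseline_scores, corrected_scores)
print(f"KS statistic = {statistic:.4f}, p-value = {p_value:.3f}")

# Hypothetical acceptance rule: the corrected distribution should stay close
# to baseline before downstream consumption is re-enabled.
MAX_KS = 0.05
print("publish OK" if statistic <= MAX_KS else "hold publish: distribution shift")
```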
Exercises
Note: Exercises and the quick test are available to everyone; only logged-in users will have progress saved.
Exercise 1: Define rollback thresholds for a canary
Your service serves recommendations. Define numeric guardrails and a decision tree for rollback vs rollforward.
- Include: error rate, latency, CTR delta, and a minimum sample size/time window.
- Deliverable: a short checklist and the decision rule.
Exercise 2: Batch pipeline rollback/rollforward playbook
Draft a 6-step playbook to undo a bad batch publish caused by a schema change, then safely roll forward.
- Include: versioned outputs, read pointer switch, validation checks, and communication to consumers.
Self-check checklist
- Did you pick specific, measurable thresholds?
- Did you separate real-time vs batch patterns?
- Is the plan reversible and verifiable?
- Did you include communication and documentation steps?
Common mistakes and self-check
- No pre-agreed thresholds → slow, emotional decisions. Self-check: Are thresholds written and alerting configured?
- Unversioned artifacts → impossible rollback. Self-check: Are models, features, and outputs immutable and versioned?
- Coupled schema changes → cascading failures. Self-check: Are changes backward-compatible or staged?
- Skipping verification after rollback → hidden regressions. Self-check: Do you monitor a full window post-action?
- Manual, error-prone steps. Self-check: Are actions scriptable via configs or pipeline jobs?
Practical projects
- Implement a config-driven model switch: a small service that reads active model_id and swaps between v1 and v2. Include health checks and a dry-run mode.
- Create a canary pipeline stage: deploy at 5% with automated rollback if two guardrails are breached for 10 minutes.
- Versioned batch outputs: publish predictions to versioned paths and build a read pointer switch job with validation gates.
Learning path
- Set up versioning for models, code, and data outputs.
- Add observability: dashboards for latency, errors, and key business metrics.
- Define guardrails and auto-rollback rules.
- Practice canary and blue-green in a staging environment.
- Write and rehearse your incident playbooks.
Next steps
- Automate your rollback/rollforward playbooks as pipeline jobs.
- Run a game day: simulate a regression and practice the response end-to-end.
- Expand guardrails to include data/feature quality signals.
Mini challenge
Design a one-page runbook for your team that fits on a single screen: triggers, actions, and verification steps for both rollback and rollforward. Keep it simple enough to use at 2am.