Why this matters
Machine learning systems fail in ways classic software doesn’t: data drift, degraded model quality, silent bias, broken features, and misbehaving online learners. A clear incident response plan reduces customer impact, protects revenue, and shortens recovery time.
- Real tasks you will do: set up model SLOs, triage alerts, execute rollbacks, coordinate cross-team comms, run post-incident reviews, and add guardrails to prevent repeats.
- Common stakes: incorrect predictions driving losses, unsafe outputs, compliance risk, and reputation damage.
Concept explained simply
Incident response for ML is a repeatable playbook you run when model behavior or the ML platform significantly deviates from expected performance or safety.
Mental model
- Think “smoke alarms + fire drill” for models: monitors detect smoke, you triage the fire, contain spread, restore normal service, then fireproof the house.
- Core loop: Detect → Triage → Contain → Recover → Verify → Learn.
What makes ML incidents special?
- Quality can degrade silently (no crashes, just worse predictions).
- Data-dependent failures (drift, upstream schema changes).
- Ethical/safety incidents (bias spikes, toxic LLM outputs).
- Coupled pipelines (feature store, retraining, deployment) increase blast radius.
Incident types, severities, and SLOs
Common ML incident types
- Data/feature issues: missing features, wrong ranges, schema drift, staleness.
- Model quality: accuracy drop, fairness regressions, concept drift.
- Serving/platform: latency spikes, timeouts, resource exhaustion, autoscaling issues.
- Training/ETL: failed jobs, corrupted artifacts, bad hyperparams pushed to prod.
- Safety/abuse: adversarial inputs, prompt injection, toxic outputs, PII leakage.
Severity (suggested)
- SEV1: Broad customer/business impact; unsafe or legally risky outputs.
- SEV2: Noticeable impact on KPIs; partial degradation.
- SEV3: Limited scope or early detection; workarounds exist.
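To keep severity calls consistent under pressure, some teams encode this mapping as a small lookup in the runbook. A minimal sketch; the signals and cutoffs are illustrative assumptions, not a standard.

```python
# Hypothetical severity mapping; adapt the signals and thresholds to your own SLOs.
def assign_severity(unsafe_outputs: bool, kpi_drop_pct: float, users_affected_pct: float) -> str:
    """Map coarse impact signals to a suggested SEV level."""
    if unsafe_outputs or users_affected_pct >= 50:
        return "SEV1"  # broad or legally risky impact: page immediately
    if kpi_drop_pct >= 5 or users_affected_pct >= 10:
        return "SEV2"  # noticeable KPI impact or partial degradation
    return "SEV3"      # limited scope, workarounds exist

print(assign_severity(unsafe_outputs=False, kpi_drop_pct=3.0, users_affected_pct=12.0))  # SEV2
```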
ML SLO examples
- Prediction SLOs: p95 latency ≤ target; availability ≥ target.
- Quality SLOs: keep proxy metrics (AUC, win rate, calibration error, PSI for drift) within defined thresholds.
- Safety SLOs: toxic or unsafe response rate ≤ threshold.
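One way to make these SLOs actionable is to keep the thresholds in versioned config and check them against live metrics on a schedule. The sketch below is a minimal illustration; the metric names and threshold values are assumptions you would replace with your own.

```python
# Minimal SLO check sketch. Metric names and thresholds are illustrative;
# wire this to your real monitoring source in practice.
SLOS = {
    "p95_latency_ms": {"max": 200},      # prediction SLO
    "availability":   {"min": 0.999},    # prediction SLO
    "auc":            {"min": 0.78},     # quality proxy
    "psi_drift":      {"max": 0.20},     # drift threshold
    "toxic_rate":     {"max": 0.001},    # safety SLO
}

def breached_slos(live_metrics: dict) -> list[str]:
    """Return the names of SLOs the current metrics violate."""
    breaches = []
    for name, bound in SLOS.items():
        value = live_metrics.get(name)
        if value is None:
            continue  # missing metric: alert on staleness separately
        if "max" in bound and value > bound["max"]:
            breaches.append(name)
        if "min" in bound and value < bound["min"]:
            breaches.append(name)
    return breaches

# Example: a drift breach that should open an incident at the mapped severity.
print(breached_slos({"p95_latency_ms": 140, "auc": 0.80, "psi_drift": 0.35}))  # ['psi_drift']
```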
Runbook essentials
- Pager/ownership: who is on-call; escalation paths.
- Decision table: when to roll back, fail open/closed, throttle, or switch to a fallback model (see the sketch after this list).
- Checklists: triage steps, containment actions, verification steps.
- Comms templates: internal updates and customer-friendly status notes.
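A decision table can be as plain as a signal-to-action mapping kept next to the pager docs, so the first responder does not have to improvise. A hedged sketch with made-up signal names:

```python
# Illustrative decision table: map the dominant failure signal to a containment action.
# Signal names and actions are examples; encode your own in the runbook.
DECISION_TABLE = {
    "quality_regression_after_deploy": "rollback_to_last_known_good_model",
    "upstream_feature_outage":         "serve_cached_or_default_features",
    "latency_or_capacity":             "throttle_noncritical_traffic",
    "safety_violation":                "fail_closed_and_enable_strict_filters",
}

def containment_action(signal: str) -> str:
    # Defaulting to the most conservative action is one reasonable policy choice.
    return DECISION_TABLE.get(signal, "fail_closed_and_enable_strict_filters")

print(containment_action("upstream_feature_outage"))
```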
Worked examples
Example 1: Sudden accuracy drop after a holiday (data drift)
- Detect: Drift monitor flags PSI = 0.35 (threshold 0.2); online AUC drops 5 points.
- Triage: Confirm feature distributions; segment impact (holiday traffic skew).
- Contain: Route 50% traffic to last-known-good model; increase sampling/guardrails.
- Recover: Hotfix with segmentation rule or retrain with recent data.
- Verify: Compare lift vs baseline; ensure drift back under threshold.
- Learn: Add seasonal features; update retraining cadence pre-holidays.
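The PSI value in the Detect step is easy to compute yourself from binned reference and live distributions. A minimal NumPy sketch; the bin count, the synthetic data, and the 0.2 alert threshold are assumptions, not recommendations.

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference window and a current window of one feature."""
    # Bin edges come from the reference distribution and are reused for the current window.
    # Note: current values outside the reference range fall out of these bins.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid division by zero and log(0).
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(100, 10, 50_000)      # pre-holiday traffic
shifted = rng.normal(112, 14, 50_000)       # holiday traffic skew
psi = population_stability_index(baseline, shifted)
print(f"PSI={psi:.2f}, alert={psi > 0.2}")  # values above ~0.2 commonly trigger review
```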
Example 2: Feature store outage causing timeouts
- Detect: p95 latency doubles; elevated 5xx from model service.
- Triage: Error logs show feature fetch failures; the feature store the model service depends on is the likely culprit.
- Contain: Switch to cached features or default values; temporarily reduce feature set.
- Recover: Coordinate platform fix; warm caches; scale serving pods.
- Verify: Latency and error rates return to SLO; quality proxy stable.
- Learn: Add circuit breaker; graceful degradation path; synthetic checks.
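The Contain step above (cached features or default values) is far easier if the serving path ships with a degradation branch from day one. A sketch of the idea, where `feature_store.get` is a stand-in for your real client and the cache and defaults are illustrative:

```python
# Hypothetical fallback path for feature fetching during a feature-store outage.
FEATURE_DEFAULTS = {"days_since_signup": 30.0, "avg_order_value": 0.0}
_local_cache: dict[str, dict] = {}

def get_features(entity_id: str, feature_store, timeout_s: float = 0.05) -> tuple[dict, str]:
    """Return (features, source); `source` records which path served the request."""
    try:
        features = feature_store.get(entity_id, timeout=timeout_s)
        _local_cache[entity_id] = features            # refresh cache on success
        return features, "feature_store"
    except Exception:
        if entity_id in _local_cache:
            return _local_cache[entity_id], "cache"   # possibly stale, but keeps serving
        return dict(FEATURE_DEFAULTS), "defaults"     # safe, conservative values
```

Logging the `source` tag gives you a direct metric for how often you are serving degraded features, which also doubles as a synthetic check.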
Example 3: LLM prompt injection producing unsafe content
- Detect: Safety monitor alerts on toxic rate > threshold; user reports.
- Triage: Reproduce with sample prompts; confirm safety filter failures.
- Contain: Enable stricter moderation; add rule-based guardrails; temporarily block risky tools.
- Recover: Update system prompt and safety policies; fine-tune or swap to safer model variant.
- Verify: Red-team test set passes; toxic rate below threshold.
- Learn: Add attack pattern monitors; pre-release red-teaming; canary before full rollout.
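The stricter-moderation containment can start as a crude extra check between the model and the user while the deeper fix lands. A deliberately simple sketch; the patterns and refusal text are placeholders, not a real safety policy:

```python
import re

# Placeholder deny-list guardrail; real deployments layer this with a moderation
# model and provider-side safety settings rather than relying on regexes alone.
BLOCK_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.IGNORECASE),  # injection tell
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                              # SSN-like PII
]
REFUSAL = "Sorry, I can't help with that request."

def guard_output(model_response: str) -> str:
    """Return the response unchanged, or a refusal if a blocked pattern appears."""
    for pattern in BLOCK_PATTERNS:
        if pattern.search(model_response):
            return REFUSAL
    return model_response
```

The value of a layer like this during an incident is that it can be tightened or toggled independently of the model itself.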
First 90 minutes: ML incident playbook
- Page and assign roles: Incident Commander (IC), Ops, ML Owner, Comms.
- Define severity: Use impact to users/KPIs/safety; set SEV and timestamps.
- Stabilize first: Roll back to a stable model or enable fallback rules (a rollback sketch follows this checklist). Prefer safe defaults over perfect predictions.
- Narrow scope: Identify affected segments, features, or endpoints; canary any containment changes before applying them broadly.
- Form hypothesis: Data drift? Upstream change? New model? Infrastructure?
- Gather evidence: Dashboards for latency, error rate, drift (PSI/JS), business KPIs, logs, recent deploys.
- Recover: Retrain, revert config, restore features, patch prompts/filters.
- Verify: Check SLOs, shadow tests, A/B or backtests; remove temporary throttles.
- Communicate: Regular updates with impact, actions, ETA, and next check-in.
- Record: Timeline, decisions, metrics for post-incident review.
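For the "stabilize first" step, rollback is fastest when traffic weights can be shifted without a redeploy. The sketch below shows the idea with made-up model names; in practice the weights live in your router, mesh, or serving platform config.

```python
import random

# Illustrative traffic-splitting shim: shift weight back to the last-known-good
# model during an incident by changing config, not code.
ROUTES = {"model_v42_candidate": 0.0, "model_v41_last_known_good": 1.0}  # incident setting

def pick_model(routes: dict[str, float]) -> str:
    """Choose a model variant according to the configured traffic weights."""
    roll, cumulative = random.random(), 0.0
    for model_name, weight in routes.items():
        cumulative += weight
        if roll < cumulative:
            return model_name
    return list(routes)[-1]  # numerical edge case: fall back to the last entry

print(pick_model(ROUTES))  # during the incident every request goes to v41
```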
Exercises
Practice these to build reflexes. Then take the quick test at the end. Everyone can take the test; only logged-in users get saved progress.
Exercise 1 (ex1) — Draft a 1-page ML incident runbook
- Timebox: 25 minutes.
- Use your current or an imaginary ML service (e.g., churn model, ranking, LLM assistant).
- Include: ownership, SLOs, severity map, detection triggers, first-hour checklist, rollback/fallback rules, verification, comms template.
Template hints
- Ownership: On-call, escalation contacts.
- SLOs: availability, latency, quality proxies, safety thresholds.
- Triggers: drift PSI > X, AUC drop > Y, toxic rate > Z, error rate > A.
- First-hour: stabilize → scope → evidence → comms.
- Fallbacks: last-known-good model, rules-only, cached features.
- Verify: metrics back within thresholds; canary OK.
Exercise 2 (ex2) — Triage a drift alert
Given: PSI(feature_price) = 0.32 (threshold 0.2), p95 latency stable, error rate normal, A/B shows -3% conversion among mobile users only, last deploy 2 hours ago (feature normalization change).
- Decide SEV level and immediate containment.
- List top 3 diagnostic checks.
- Pick a rollback or patch approach and define verification metric.
Tips
- Prioritize business impact and scope (mobile segment).
- Recent deploys are prime suspects.
- Roll back fast; fix forward after stabilization.
Common mistakes and self-check
- Chasing the perfect root cause before stabilizing the service. Fix: roll back first.
- No quality proxies in prod. Fix: add drift, calibration, and safety monitors.
- Confusing infrastructure latency with model quality issues. Fix: separate dashboards and alerts.
- Over-broad changes during incidents. Fix: one change at a time; document.
- Poor comms cadence. Fix: schedule updates (e.g., every 30 minutes) with impact and ETA.
Self-check
- Can you restore safe, predictable behavior within 60 minutes?
- Do you have clear rollback/fallback for every critical model?
- Are SLOs and thresholds defined and alerting?
- Can you run a tabletop drill end-to-end in under 45 minutes?
Practical projects
- Build a drift-and-safety dashboard with synthetic canaries; set actionable thresholds.
- Create a full runbook for one production model, including comms templates and decision tables.
- Run a 60-minute tabletop: simulate drift, execute rollback, post-incident review with action items.
- Implement a fallback path: last-known-good model + minimal feature set + rules-only mode.
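For the last project, the core idea is a predictor that still answers when the model or its features are unhealthy. A minimal sketch, assuming hypothetical model objects and a made-up churn-style heuristic as the rules-only mode:

```python
# Hypothetical layered fallback for a churn-style score:
# full model -> last-known-good model -> rules-only heuristic.
def predict_with_fallback(features: dict, primary_model=None, lkg_model=None) -> tuple[float, str]:
    for model, label in ((primary_model, "primary"), (lkg_model, "last_known_good")):
        if model is None:
            continue
        try:
            return model.predict(features), label
        except Exception:
            continue  # fall through to the next layer; log and alert in real code
    # Rules-only mode: conservative heuristic over a minimal feature set.
    score = 0.8 if features.get("days_since_last_login", 0) > 30 else 0.2
    return score, "rules_only"
```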
Who this is for, prerequisites, and learning path
Who this is for
- Machine Learning Engineers, Data Scientists on-call, MLOps/Platform Engineers, and Product Engineers who own ML features.
Prerequisites
- Basic model evaluation (AUC, accuracy, calibration).
- Understanding of your serving stack and deployment process.
- Access to monitoring dashboards and logs (or mock equivalents).
Learning path
- Start with SLOs and alert thresholds.
- Define rollback/fallback mechanisms.
- Write a runbook; run a tabletop drill.
- Automate: canaries, circuit breakers, safety filters.
- Institutionalize: post-incident reviews and recurring checks.
Mini challenge
You receive: AUC -7% on desktop traffic; drift PSI high on feature_city; p95 latency OK; last content campaign launched 3 hours ago. In 8 bullet points, outline your first 30 minutes: assign roles, set SEV, immediate containment, top diagnostics, rollback plan, comms note, verification metrics, and next checkpoint.
Next steps
- Finish the exercises and ensure your 1-page runbook is ready.
- Run a 30-minute tabletop with a teammate.
- Take the quick test below to lock in concepts. Everyone can take it; only logged-in users get saved progress.
Quick Test
Answer a few questions to check your readiness. Passing score: 70%.