Why this matters
Vision systems in production face changing conditions: lighting shifts, new camera models, seasonal scenes, and code updates. Incident response for degradation keeps your detection, segmentation, and classification services trustworthy when quality drops. Real tasks you will do:
- Interpret alerts on metrics such as mAP, IoU, precision/recall, and latency, and set severity.
- Contain the issue fast (rollback, switch to fallback model, route to human review).
- Communicate status and timelines to stakeholders.
- Run root-cause analysis (RCA) and define prevention actions.
Concept explained simply
Degradation is a measurable drop in quality or reliability of your vision system relative to its Service Level Objectives (SLOs). It can be caused by data drift, concept drift, software bugs, model regressions, or infrastructure issues.
What counts as degradation?
- Quality: mAP/IoU/precision/recall falls below SLO or confidence score distribution shifts.
- Service: latency spikes, error rate rises, GPU/CPU saturation increases.
- Coverage: more uncertain predictions, higher abstain rate, or missing detections.
Mental model: a thermostat for model quality
Think of your monitoring as a thermostat. When the temperature (metrics) drifts from the setpoint (SLO), the incident process brings it back: detect, triage, contain, recover, and prevent recurrence.
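A minimal sketch of this thermostat loop, assuming you already collect metrics in evaluation windows; the metric names, SLO targets, and return format below are illustrative, not a fixed interface.

```python
# "Thermostat" check: compare windowed metrics to SLO setpoints and emit an
# alert when a metric lands on the wrong side of its target. All thresholds
# and metric names are illustrative.
from dataclasses import dataclass

@dataclass
class SLO:
    metric: str
    target: float
    direction: str  # "min" = value must stay >= target, "max" = value must stay <= target

SLOS = [
    SLO("map", 0.60, "min"),
    SLO("recall", 0.75, "min"),
    SLO("p95_latency_ms", 120.0, "max"),
]

def breached(slo: SLO, value: float) -> bool:
    """True if the observed value is on the wrong side of the setpoint."""
    return value < slo.target if slo.direction == "min" else value > slo.target

def check_window(window_metrics: dict[str, float]) -> list[str]:
    """Return alert messages for every SLO breached in this metrics window."""
    alerts = []
    for slo in SLOS:
        value = window_metrics.get(slo.metric)
        if value is not None and breached(slo, value):
            alerts.append(f"{slo.metric}={value:.3f} violates SLO ({slo.direction} {slo.target})")
    return alerts

print(check_window({"map": 0.52, "recall": 0.81, "p95_latency_ms": 95.0}))
```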
The incident lifecycle (step cards)
- Detect — Alerts from metrics like mAP, precision/recall, IoU, confidence shift, PSI/KL drift on features (see the PSI sketch after this list), latency, and error rate.
- Triage — Set severity (SEV1 to SEV3). SEV1 = major impact now; SEV2 = moderate/widespread risk; SEV3 = minor/localized.
- Communicate — Announce incident channel, owner, ETA, impacted services/cameras.
- Contain — Roll back model/preprocess, switch to last good version, reduce traffic to canary, enable human-in-the-loop, or degrade gracefully.
- Verify — Confirm metrics return to SLO and stabilize.
- Recover — Restore full capacity, close incident after monitoring window.
- RCA & Prevent — Document cause, add tests/monitors, update runbooks.
Worked examples
Example 1 — Lighting change on a subset of cameras
- Signal: Confidence scores drop by 20%, recall falls from 0.82 to 0.67 on cameras 23–40; latency normal; deployment unchanged.
- Triage: SEV2 (localized but user-impacting).
- Containment: Apply per-camera exposure normalization preset; temporarily route these cameras to previous model with better robustness.
- Verification: mAP recovers from 0.55 to 0.61 (SLO ≥ 0.60) and stays stable over 60 minutes.
- RCA: The facility installed brighter LEDs, shifting the dynamic range. Action: add a histogram-based drift monitor (sketched below) and robust training augmentations for high-contrast scenes.
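A possible shape for that histogram-based drift monitor, assuming grayscale frames and a stored per-camera baseline; the chi-square-style distance and the 0.1 threshold are illustrative choices, not the only valid ones.

```python
# Per-camera brightness-drift monitor: compare the current brightness
# histogram of a camera to its stored baseline histogram.
import numpy as np

BINS = 32

def brightness_histogram(gray_image: np.ndarray) -> np.ndarray:
    """Normalized brightness histogram of a grayscale uint8 image."""
    hist, _ = np.histogram(gray_image, bins=BINS, range=(0, 255))
    return hist / max(hist.sum(), 1)

def drift_score(baseline: np.ndarray, current: np.ndarray) -> float:
    """Symmetric chi-square distance between two normalized histograms."""
    denom = baseline + current + 1e-9
    return float(np.sum((baseline - current) ** 2 / denom))

rng = np.random.default_rng(1)
baseline_img = rng.normal(110, 30, (480, 640)).clip(0, 255).astype(np.uint8)  # old lighting
bright_img = rng.normal(170, 45, (480, 640)).clip(0, 255).astype(np.uint8)    # new, brighter LEDs

score = drift_score(brightness_histogram(baseline_img), brightness_histogram(bright_img))
print(f"camera_23 drift={score:.3f}", "-> investigate" if score > 0.1 else "-> ok")
```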
Example 2 — Bad retrain caused regression
- Signal: mAP drops from 0.64 to 0.49 globally minutes after deploy; latency unchanged; confidence distribution narrows.
- Triage: SEV1 (global, immediate impact).
- Containment: Roll back to previous model; freeze canary. Announce SEV1 and give 15-min updates.
- Verification: mAP returns to 0.64; alerts clear.
- RCA: Label noise in a new dataset slice; missed in offline evaluation due to a skewed validation split. Actions: add stratified validation, per-slice metrics, and a deployment guard that fails if any slice's mAP drops by more than 3% (sketched below).
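That deployment guard could look roughly like the following, assuming you already compute per-slice mAP for both the deployed baseline and the candidate; the 3% relative threshold matches the text, while the slice names and numbers are placeholders.

```python
# Deployment guard: block promotion if any evaluation slice regresses by
# more than the allowed relative mAP drop versus the deployed baseline.
MAX_RELATIVE_DROP = 0.03  # 3% relative drop allowed per slice

def deployment_guard(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Return the slices that fail the guard; an empty list means safe to deploy."""
    failures = []
    for slice_name, base_map in baseline.items():
        cand_map = candidate.get(slice_name, 0.0)
        if base_map > 0 and (base_map - cand_map) / base_map > MAX_RELATIVE_DROP:
            failures.append(f"{slice_name}: {base_map:.3f} -> {cand_map:.3f}")
    return failures

baseline_map = {"day": 0.66, "night": 0.58, "rain": 0.52}
candidate_map = {"day": 0.68, "night": 0.49, "rain": 0.53}  # night slice regressed

failed = deployment_guard(baseline_map, candidate_map)
if failed:
    raise SystemExit("Blocking deploy, regressed slices: " + "; ".join(failed))
print("Guard passed, promoting candidate.")
```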
Example 3 — Preprocessing bug after library update
- Signal: IoU drops on segmentation models; only images resized by the new code path are affected; GPU utilization spikes.
- Triage: SEV2 (widespread, partial).
- Containment: Pin previous image library version; restart workers with known-good container.
- Verification: IoU recovers to baseline; GPU utilization normalizes.
- RCA: The library's default interpolation method changed. Prevention: lock preprocessing versions and add a canary-image golden test comparing pre- and post-update tensors (sketched below).
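A sketch of that golden test, with `preprocess()` standing in for the real resize/normalize path and a hypothetical tolerance; the idea is to record tensors with the known-good library version and fail CI (or the canary) when a later version produces different ones.

```python
# Golden test: run current preprocessing on a fixed canary image set and
# compare the resulting tensors to stored reference tensors.
import numpy as np

TOLERANCE = 1e-3  # max allowed mean absolute difference per image

def preprocess(image: np.ndarray) -> np.ndarray:
    """Placeholder for the production resize/normalize path."""
    return image[::2, ::2].astype(np.float32) / 255.0  # naive 2x downsample + scale

def golden_test(images: list[np.ndarray], references: list[np.ndarray]) -> list[int]:
    """Return indices of canary images whose tensors diverge from the references."""
    bad = []
    for i, (img, ref) in enumerate(zip(images, references)):
        out = preprocess(img)
        if out.shape != ref.shape or np.abs(out - ref).mean() > TOLERANCE:
            bad.append(i)
    return bad

rng = np.random.default_rng(2)
canary = [rng.integers(0, 256, (64, 64), dtype=np.uint8) for _ in range(5)]
golden = [preprocess(img) for img in canary]  # recorded with the known-good library version

# Prints [] here because nothing changed; a bad library update would list indices.
print("golden test failures:", golden_test(canary, golden))
```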
Runbooks you can reuse
Severity checklist (quick triage)
- SEV1: Global drop beyond SLO or safety-critical miss. Action: rollback/switch to last known good immediately.
- SEV2: Localized or partial drop with noticeable impact. Action: segment traffic, hotfix preprocess or fallback model.
- SEV3: Minor drift or early warning. Action: monitor closely, schedule a fix, expand canary tests.
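One way to encode the checklist above as a starting point for on-call tooling; the thresholds are assumptions, and real triage also weighs safety criticality and business context.

```python
# Rough triage helper: map drop size and scope (fraction of cameras affected)
# to a severity level following the checklist above.
def triage(metric_drop_pct: float, scope_pct: float, breaches_slo: bool) -> str:
    if breaches_slo and scope_pct >= 0.5:
        return "SEV1"   # global drop beyond SLO: roll back immediately
    if breaches_slo or scope_pct >= 0.2 or metric_drop_pct >= 0.10:
        return "SEV2"   # localized or partial drop with noticeable impact
    return "SEV3"       # minor drift or early warning

print(triage(metric_drop_pct=0.24, scope_pct=0.60, breaches_slo=True))   # SEV1
print(triage(metric_drop_pct=0.08, scope_pct=0.30, breaches_slo=False))  # SEV2
```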
Containment options
- Route N% of traffic to the previous model or a rules-based fallback (see the routing sketch after this list).
- Disable risky augmentations/preprocessing change; pin versions.
- Activate human review for low-confidence predictions.
- Reduce batch size or autoscale to meet latency SLO while investigating.
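A sketch of the first containment option, partial traffic routing, assuming routing is decided per request; the model names, rollback percentage, and hashing scheme are illustrative, and in most stacks this logic lives in the serving or gateway layer.

```python
# Deterministic per-request routing: send a fixed percentage of traffic to
# the last known-good model while the candidate is under investigation.
import hashlib

ROLLBACK_PCT = 80  # percent of requests sent to the known-good model

def pick_model(request_id: str) -> str:
    """Stable routing: the same request id always hits the same model."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "model_v12_known_good" if bucket < ROLLBACK_PCT else "model_v13_candidate"

for rid in ["cam23-0001", "cam31-0042", "cam40-0917"]:
    print(rid, "->", pick_model(rid))
```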
Communication template
Incident: [SEV#] Vision degradation detected at [time]. Owner: [name]. Scope: [cameras/regions/services]. Impact: [what users see]. Next update: [15/30 min]. Mitigation: [rollback/fallback].
Exercises
Do these to internalize the workflow. After you finish, compare with the solutions below or take the quick test.
Exercise 1 — Triage and contain
An alert shows: mAP fell from 0.62 to 0.45 in 20 minutes across 60% of cameras. Latency is normal. SLO: mAP ≥ 0.55 at p95. A new model was deployed 15 minutes ago to 100% traffic. Decide severity, your first three actions, and your success criteria.
Suggested solution
- Severity: SEV1.
- Actions: 1) Roll back to prior model immediately; 2) Announce incident and owner; 3) Freeze deploys and start RCA (check dataset/version diffs).
- Success: mAP ≥ 0.55 sustained for 60 minutes with normal confidence distribution.
Exercise 2 — Draft a mini runbook
Create a 7-step runbook for a suspected preprocessing regression after an image library update. Include detection signals, containment, verification, and prevention.
Suggested solution
- Confirm alert source (IoU drop, GPU spike).
- Identify recent preprocessing/library changes.
- Contain: pin previous version; restart workers.
- Golden-test: compare tensors on a fixed image set.
- Verify recovery (IoU, latency) over 60 minutes.
- Postmortem: document interpolation default change.
- Prevention: lock versions; add CI golden-tests and canary.
Checklist for any incident note:
- Owner and start time
- Impact and scope
- Hypotheses tried
- Mitigations applied
- Verification metrics and window
- Next update time
Common mistakes and self-checks
- Mistake: Chasing noise. Self-check: Is the change statistically significant and outside the SLO for a sustained window? (A sustained-window check is sketched after this list.)
- Mistake: Fixing without communicating. Self-check: Did you name an owner and set update cadence?
- Mistake: Rolling forward blindly. Self-check: Do you have a known-good version to revert to?
- Mistake: Ignoring slice metrics. Self-check: Did you inspect per-camera/scene slices?
- Mistake: Closing without prevention. Self-check: What monitor/test prevents this recurrence?
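For the first self-check, a sustained-window guard is a simple way to avoid paging on noise; the SLO value and window count below are assumptions.

```python
# Only flag a breach when the metric stays below its SLO for N consecutive
# evaluation windows, so a single noisy dip does not page anyone.
SLO_MAP = 0.55
SUSTAINED_WINDOWS = 3  # e.g., three consecutive 5-minute windows

def sustained_breach(recent_values: list[float]) -> bool:
    """True if the last SUSTAINED_WINDOWS values are all below the SLO."""
    tail = recent_values[-SUSTAINED_WINDOWS:]
    return len(tail) == SUSTAINED_WINDOWS and all(v < SLO_MAP for v in tail)

print(sustained_breach([0.61, 0.53, 0.62, 0.58]))  # False: a single dip
print(sustained_breach([0.60, 0.52, 0.51, 0.49]))  # True: sustained drop
```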
Practical projects
- Build a mock incident dashboard: ingest synthetic metrics (mAP, latency, confidence histograms) and trigger alerts when thresholds are crossed.
- Create a canary validation pack: 50 diverse images with expected outputs to sanity-check preprocessing and model versions (a minimal check is sketched after this list).
- Write two runbooks: (1) model regression after retrain, (2) camera/lighting drift affecting a subset of feeds.
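For the canary validation pack project, a minimal check might look like this; `run_model`, the pack contents, and the IoU threshold are placeholders, and a real pack would cover many more images and prediction types.

```python
# Canary pack check: run the deployed model on a fixed image set and flag any
# image whose predicted box drifts too far from the expected one.
from typing import Callable

def iou(a: tuple[float, float, float, float], b: tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def validate_pack(pack: dict[str, tuple], run_model: Callable, min_iou: float = 0.7) -> list[str]:
    """Return the canary image ids whose prediction drifts from the expected box."""
    return [img_id for img_id, expected in pack.items()
            if iou(run_model(img_id), expected) < min_iou]

# Tiny illustrative pack: image id -> expected box.
pack = {"img_001": (10, 10, 100, 120), "img_002": (40, 30, 200, 180)}

def fake_model(img_id: str) -> tuple:
    """Stand-in for the deployed model's top detection."""
    return (12, 11, 98, 118) if img_id == "img_001" else (80, 90, 150, 160)

print("failing canaries:", validate_pack(pack, fake_model))  # ['img_002']
```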
Who this is for
- Computer Vision Engineers shipping models to production.
- MLOps/Platform engineers supporting real-time vision services.
- Tech leads responsible for quality and SLOs in ML products.
Prerequisites
- Basic understanding of CV metrics (mAP, IoU, precision/recall) and latency/throughput.
- Familiarity with versioning models and containers.
- Ability to interpret monitoring dashboards and logs.
Learning path
- Set SLOs and alerts for your vision system.
- Define severity levels and on-call roles.
- Create rollback/fallback mechanisms and test them.
- Practice incident drills using the worked scenarios.
- Add prevention: slice tests, golden images, version locks.
Take the Quick Test
Next steps
- Operationalize runbooks in your team’s on-call rotation.
- Schedule quarterly incident drills with real metrics.
- Continuously refine SLOs and alert thresholds based on observed variance.
Mini challenge
Design a one-page runbook for “SEV2: day/night switch causes recall drop on 30% cameras.” Include: detection signals, first actions (within 10 minutes), communication template, verification criteria, and two prevention items. Keep it executable and time-boxed.