What is Service Health Monitoring?
Service Health Monitoring is the practice of continuously watching the uptime, latency, errors, and resource use of ML-powered services, along with their data and model signals. The goal: detect problems early, alert the right people, and fix issues fast with clear runbooks.
Quick glossary
- SLI (Service Level Indicator): A measured metric (e.g., p95 latency, 5xx rate).
- SLO (Service Level Objective): A target for an SLI (e.g., p95 latency ≤ 300 ms for 99% of requests over 30 days).
- SLA (Service Level Agreement): A contractual commitment with penalties; set after SLO maturity.
- Golden Signals: Latency, Traffic, Errors, Saturation.
- Model Health: Data quality, drift, model performance (e.g., AUC, accuracy lagging label availability), fairness.
Why this matters
- Keep revenue-critical ML APIs responsive during traffic spikes.
- Catch silent failures like data schema changes that degrade model predictions.
- Reduce pager fatigue with meaningful alerts and clear runbooks.
- Build trust with product teams via reliable SLOs and transparent error budgets.
Concept explained simply
Think of your ML service like a clinic:
- SLIs are vital signs (heart rate = latency, oxygen = errors, etc.).
- SLOs are the healthy ranges you aim to keep.
- Alerts are the alarms that ring when vitals go outside safe bounds long enough to matter.
- Runbooks are the triage steps the on-call follows.
Mental model: three layers of health you monitor together.
- Platform: compute, memory, GPU/CPU saturation.
- Service: API uptime, latency, error rates, queue length.
- Model/data: drift, data quality checks, model performance proxies.
Active vs. passive monitoring
- Passive (white-box): Metrics, logs, traces emitted by your service.
- Active (black-box): Synthetic checks that call your endpoints from outside to verify the user experience (a minimal probe sketch follows below).
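To make the active side concrete, here is a minimal synthetic-probe sketch in Python using the requests library. The endpoint URL, payload, and latency budget are hypothetical placeholders; adapt them to your own service contract.

```python
# Minimal synthetic (black-box) probe: call the service from outside and
# verify status, latency, and basic response shape.
# The endpoint URL, payload, and budget below are illustrative placeholders.
import time
import requests

ENDPOINT = "https://ml-api.example.com/v1/predict"  # hypothetical endpoint
PAYLOAD = {"text": "synthetic probe: the quick brown fox"}  # known test input
LATENCY_BUDGET_S = 0.3  # e.g., a p95 SLO of 300 ms

def run_probe() -> dict:
    start = time.monotonic()
    resp = requests.post(ENDPOINT, json=PAYLOAD, timeout=5)
    latency_s = time.monotonic() - start

    body = resp.json() if resp.ok else {}
    checks = {
        "status_ok": resp.status_code == 200,
        "latency_ok": latency_s <= LATENCY_BUDGET_S,
        # Basic output-shape check: the fields your API contract promises.
        "shape_ok": isinstance(body.get("score"), (int, float)) and "label" in body,
    }
    return {"latency_s": round(latency_s, 3), **checks}

if __name__ == "__main__":
    print(run_probe())
```

Run a probe like this on a schedule (cron, a workflow runner, or your monitoring vendor's synthetic checks) and alert on consecutive failures rather than single blips.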
Core metrics and practical thresholds
- Availability SLI: successful requests / total requests. Example SLO: ≥ 99.9% over 30 days.
- Latency SLI: p95 and p99 end-to-end latency. Example SLO: p95 ≤ 300 ms, p99 ≤ 1200 ms.
- Error rate SLI: 5xx and model-timeouts per minute. Example SLO: 3.0% or lower sustained over 15 min.
- Saturation SLI: CPU/GPU utilization, memory usage. Example SLO: p95 GPU utilization < 85%.
- Data quality SLIs: schema conformity, null ratio, categorical cardinality bounds.
- Drift SLIs: population stability index (PSI) or KL divergence within bounds. Example SLO: PSI < 0.2 for key features (a PSI sketch follows this list).
- Model proxy SLIs: acceptance rate, confidence distribution shape, shadow model delta. Example SLO: acceptance rate stable within 7% week-over-week.
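Here is a sketch of how a few of these SLIs can be computed, using plain Python and NumPy rather than any particular monitoring stack; the bucket count, sample data, and rule-of-thumb thresholds in the comments are illustrative only.

```python
# Sketch: computing a few of the SLIs above by hand. In production these
# usually come from your metrics backend; the formulas are the same.
import numpy as np

def availability(successful: int, total: int) -> float:
    """Availability SLI: successful requests / total requests."""
    return successful / total if total else 1.0

def p95_latency_ms(latencies_ms: np.ndarray) -> float:
    """Latency SLI: 95th percentile of observed end-to-end latencies."""
    return float(np.percentile(latencies_ms, 95))

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample.
    Common rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 investigate."""
    # Bucket edges come from quantiles of the reference distribution.
    interior_edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_frac = np.bincount(np.searchsorted(interior_edges, reference), minlength=bins) / len(reference)
    cur_frac = np.bincount(np.searchsorted(interior_edges, current), minlength=bins) / len(current)
    # Floor the fractions to avoid division by zero / log(0) on empty buckets.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    reference = rng.normal(100, 20, 10_000)  # e.g., last month's transaction amounts
    current = rng.normal(110, 25, 10_000)    # today's traffic, slightly shifted
    print("availability:", availability(successful=99_950, total=100_000))
    print("p95 latency (ms):", round(p95_latency_ms(rng.gamma(2.0, 80.0, 10_000)), 1))
    print("PSI:", round(psi(reference, current), 3))
```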
Choosing alert thresholds
Alert on symptoms users feel (availability, latency), and page only for sustained or severe breaches. For model signals, route issues to data/ML owners as tickets rather than pages, unless they impact business KPIs.
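A minimal sketch of that "sustained, not spiky" rule: evaluate the SLI once per window and page only after several consecutive breaches. The threshold and window count below are placeholders.

```python
# Sketch: page only on sustained breaches, not single bad samples.
# Evaluate the SLI once per window (e.g., every minute) and require
# N consecutive breaches before paging. Values are illustrative.
from collections import deque

class SustainedBreachAlert:
    def __init__(self, threshold_ms: float, breaches_to_page: int = 5):
        self.threshold_ms = threshold_ms
        self.recent = deque(maxlen=breaches_to_page)

    def observe(self, p95_latency_ms: float) -> str:
        """Feed one evaluation window; returns the action to take."""
        self.recent.append(p95_latency_ms > self.threshold_ms)
        if len(self.recent) == self.recent.maxlen and all(self.recent):
            return "page"      # sustained, user-visible symptom
        if self.recent[-1]:
            return "observe"   # breach, but not yet sustained
        return "ok"

alert = SustainedBreachAlert(threshold_ms=600, breaches_to_page=5)
for window_p95 in [250, 720, 310, 650, 700, 690, 710, 705]:
    print(window_p95, "->", alert.observe(window_p95))
```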
Worked examples
Example 1: Real-time fraud model API
- SLIs: p95 latency, availability, 5xx rate, model timeout rate, feature null ratio, PSI for transaction amount.
- SLOs: 99.95% availability; p95 ≤ 300 ms; 5xx < 0.5% over 15 min; PSI < 0.2.
- Alerts: Page if availability < 99% for 10 min OR p95 > 600 ms for 15 min; create ticket if PSI > 0.2 for 1 hour.
- Runbook snippet: Roll to previous model version; warm extra replicas; validate feature pipeline schema.
Example 2: Nightly batch retraining job
- SLIs: job success status, duration, data schema check pass rate, model quality vs. last baseline (AUC delta; a quality-gate sketch follows this example).
- SLOs: success by 03:00 daily; duration < 90 min; schema checks 100% pass; AUC drop < 2%.
- Alerts: Ticket if job misses 03:00; page if 2 consecutive misses; ticket if AUC drop > 2%.
- Runbook snippet: Check upstream data partitions; revert to previous dataset; re-run with last stable feature set.
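For this batch example, a hedged sketch of the AUC quality gate, assuming the drop is measured relative to the stored baseline; the synthetic holdout data and the 2% threshold mirror the SLO above but are otherwise illustrative.

```python
# Sketch: quality gate for a nightly retraining job (example 2 above).
# Compares the candidate model's AUC on a holdout set to the stored
# baseline and returns the action implied by the SLO (AUC drop < 2%).
import numpy as np
from sklearn.metrics import roc_auc_score

MAX_RELATIVE_AUC_DROP = 0.02  # mirrors the "AUC drop < 2%" SLO

def quality_gate(y_true, candidate_scores, baseline_auc: float) -> dict:
    candidate_auc = roc_auc_score(y_true, candidate_scores)
    relative_drop = (baseline_auc - candidate_auc) / baseline_auc
    action = "promote" if relative_drop < MAX_RELATIVE_AUC_DROP else "ticket_and_hold"
    return {
        "baseline_auc": round(baseline_auc, 4),
        "candidate_auc": round(candidate_auc, 4),
        "relative_drop": round(relative_drop, 4),
        "action": action,
    }

if __name__ == "__main__":
    # Synthetic holdout data for illustration only.
    rng = np.random.default_rng(7)
    y = rng.integers(0, 2, 5_000)
    scores = np.clip(y * 0.6 + rng.normal(0.2, 0.25, 5_000), 0, 1)
    print(quality_gate(y, scores, baseline_auc=0.91))
```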
Example 3: Recommender service rollout
- SLIs: click-through rate (lagging), p95 latency, 5xx rate, cache hit ratio.
- SLOs: p95 ≤ 500 ms; 5xx < 1%; cache hit ratio > 80%.
- Alerts: Page if p95 > 1 s; ticket if cache hit < 70% for 30 min; notify product if CTR drops > 10% day-over-day.
- Runbook snippet: Increase cache TTL; scale read replicas; shift traffic back with canary weight.
Step-by-step: minimum viable monitoring
- Instrument: Expose metrics (latency histograms, error counters), traces for key endpoints, and structured logs with request_id (a minimal instrumentation sketch follows this list).
- Define SLIs: Start with golden signals + 2 model/data signals.
- Set SLOs: Pick realistic targets based on the last 30 days of observed performance, with roughly 20% headroom.
- Dashboards: One top-level health view; one model/data deep-dive view.
- Alerts: Page on availability/latency; ticket on model/data anomalies; group alerts to avoid duplicates.
- Runbooks: For each alert, include probable causes, commands, rollback steps, and escalation contacts.
- Synthetic checks: External probes hitting health and inference endpoints with a representative payload.
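One way to do the instrumentation step, sketched with the Python prometheus_client library: a latency histogram, a labeled error counter, and a scrapeable metrics endpoint. Metric names, buckets, and the simulated handler are placeholders; the same idea applies to other stacks (StatsD, OpenTelemetry, vendor agents).

```python
# Sketch: exposing the basic service SLIs with prometheus_client.
# Metric names, labels, and buckets are illustrative placeholders.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "inference_request_latency_seconds",
    "End-to-end inference latency",
    buckets=(0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0),
)
REQUEST_ERRORS = Counter(
    "inference_request_errors_total",
    "Failed inference requests",
    ["error_type"],  # e.g., "5xx", "model_timeout"
)

def handle_request(payload: dict) -> dict:
    """Stand-in for your real handler; records latency and errors."""
    with REQUEST_LATENCY.time():  # observes elapsed time into the histogram
        try:
            time.sleep(random.uniform(0.05, 0.2))  # simulated model call
            return {"score": 0.87, "label": "positive"}
        except TimeoutError:
            REQUEST_ERRORS.labels(error_type="model_timeout").inc()
            raise

if __name__ == "__main__":
    start_http_server(8000)  # metrics scrapeable at http://localhost:8000/metrics
    while True:  # keep serving simulated traffic so the metrics move
        handle_request({"text": "hello"})
```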
Production tips
- Use time-based alert windows (e.g., 5 min) to avoid flapping.
- Include recent deploy ID and model version in logs and alert annotations.
- Track error budget burn rate to prioritize reliability work (see the burn-rate sketch below).
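A quick sketch of burn rate for an availability SLO: divide the observed error rate by the error budget implied by the SLO. A burn rate of 1.0 spends the budget exactly on schedule; the 14.4 fast-burn paging threshold mentioned in the comments is a common rule of thumb, not a requirement.

```python
# Sketch: error budget burn rate for an availability SLO.
# burn_rate = observed_error_rate / allowed_error_rate, where the allowed
# error rate (the "budget") is 1 - SLO target. Numbers are illustrative.

SLO_TARGET = 0.999             # 99.9% availability over 30 days
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(failed: int, total: int) -> float:
    observed_error_rate = failed / total if total else 0.0
    return observed_error_rate / ERROR_BUDGET

# Example: over the last hour, 1.2% of 120k requests failed.
rate = burn_rate(failed=1_440, total=120_000)
print(f"burn rate: {rate:.1f}x")  # 12.0x: burning the monthly budget ~12x too fast
# A common multi-window policy: page if the 1-hour burn rate exceeds ~14.4,
# i.e., the whole 30-day budget would be gone in about 2 days.
```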
Exercises
These mirror the tasks in the exercise panel below. Work through them here, then submit your answers in the exercises section.
Scenario: A sentiment analysis API handles 2k RPS daytime and 200 RPS nighttime. It returns JSON with a score and label. Occasional spikes happen after marketing campaigns.
- List 5-6 SLIs covering service and model/data.
- Propose realistic SLO targets with windows.
- Specify which alerts should page vs. create tickets.
Write one paging alert for latency and one ticketing alert for data drift. Add a short runbook (5-7 steps) that the on-call can follow.
Self-check checklist
- Do your SLIs map to user impact (availability, latency) and model integrity (drift, nulls)?
- Are SLOs achievable based on recent performance with headroom?
- Are alert windows long enough to avoid noise but short enough to catch real issues?
- Does your runbook include rollback and escalation?
Common mistakes and how to self-check
- Too many alerts: Leads to pager fatigue. Self-check: Can you explain why each alert exists and what action it triggers?
- No model-aware health: Service is up but predictions are wrong. Self-check: Do you track schema checks and drift?
- Alerting on raw spikes: No time window or burn-rate. Self-check: Use rolling windows and rate-of-breach (e.g., burn rate alerts).
- Missing runbooks: On-call scrambles. Self-check: Every alert links to a one-page runbook.
- Ignoring dependency health: Feature store or upstream API failing. Self-check: Add dependency SLIs (latency, error rate, freshness).
Practical projects
- Build a Health Dashboard: Create a single page that shows golden signals, top errors, and two model/data signals for one service.
- Drift Watchdog: Implement a daily PSI check on top 3 features with thresholded notifications.
- Synthetic Probe: A cron/scheduled job that calls your inference endpoint with a known payload, validating latency, status, and basic output shape.
Mini challenge
You are rolling out a new translation model. Choose 3 SLIs and 2 alerts that give early warning of user-impacting issues within 15 minutes of a bad deploy. Write one sentence per SLI explaining why it matters.
Hint
Think latency p95, availability, error rate, and one proxy model metric such as output empty-rate or confidence distribution shift.
Who this is for
- MLOps engineers running model services in production.
- Data/ML engineers responsible for pipelines and batch jobs.
- Backend engineers integrating ML APIs.
Prerequisites
- Basic understanding of HTTP services and status codes.
- Familiarity with metrics (counters, gauges, histograms) and logs.
- Knowledge of your model inputs/outputs and key business KPIs.
Learning path
- Before: Logging & Tracing fundamentals; Model Validation & Data Quality checks.
- Now: Service Health Monitoring (this lesson) with SLIs/SLOs and alerting.
- Next: Incident response, postmortems, and reliability budgeting.
Next steps
- Complete the exercises and take the quick test below to check your understanding.
- Implement one dashboard and one paging alert in a sandbox service.
- Write or refine one runbook and share it with your team.
Note: The quick test is available to everyone. Only logged-in users will have their progress saved.