Who this is for
Machine Learning Engineers and Data/ML platform practitioners who run online inference services or data APIs and need reliable, low-latency predictions at scale.
Prerequisites
- Basic understanding of web services (HTTP, requests, responses)
- Familiarity with ML inference (batch vs. real-time)
- Comfort with metrics like counters, gauges, and histograms
Learning path
- Grasp core definitions: latency, throughput, error rate, percentiles.
- Build a mental model: how these metrics interact in production.
- Instrument your service: add counters and histograms with clear labels.
- Design dashboards: visualize trends and tails.
- Set SLOs and alerts: choose thresholds, windows, and burn rates.
- Practice: size capacity, simulate spikes, tune alerting.
Why this matters
In real teams, you will:
- Keep prediction APIs fast under load (shopping recommendations, fraud checks, search ranking).
- Catch regressions after a model or feature change before users feel pain.
- Plan capacity for launches and seasonal traffic.
- Balance speed, cost, and accuracy.
Typical ML production tasks
- Decide how many workers/replicas you need for a new traffic forecast.
- Investigate a spike in P99 latency without harming P50.
- Set an alert so on-call is paged only when the user experience is truly at risk.
Concept explained simply
Key definitions
- Latency: time from request to response. Look at percentiles like P50 (median), P95, P99 to capture tail behavior.
- Throughput: requests processed per second (RPS). Often tracked as successful RPS and total RPS.
- Error rate: fraction of requests that fail from a system perspective (timeouts, 5xx, validation failures). Do not confuse with model accuracy.
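To make these definitions concrete, here is a minimal sketch that derives all three metrics from a batch of request records; the sample data and the one-minute window are invented for illustration.

```python
# Derive latency percentiles, throughput, and error rate from request records.
# The sample data and 60-second window are illustrative, not real traffic.
import math

# (latency in seconds, request succeeded?)
requests = [(0.045, True), (0.052, True), (0.210, True), (0.048, False),
            (0.960, False), (0.050, True), (0.047, True), (0.300, True)]
window_seconds = 60

latencies = sorted(lat for lat, _ in requests)

def percentile(sorted_values, p):
    """Nearest-rank percentile, p in 0..100."""
    k = math.ceil(p / 100 * len(sorted_values)) - 1
    return sorted_values[max(0, min(k, len(sorted_values) - 1))]

p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
total_rps = len(requests) / window_seconds
success_rps = sum(ok for _, ok in requests) / window_seconds
error_rate = 1 - success_rps / total_rps

print(f"P50={p50*1000:.0f} ms  P95={p95*1000:.0f} ms  P99={p99*1000:.0f} ms")
print(f"throughput: {total_rps:.2f} total RPS, {success_rps:.2f} successful RPS")
print(f"error rate: {error_rate:.1%}")
```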
Mental model
Think of your service like a checkout line: arrivals (throughput) join the line, are served (latency), and sometimes a checkout fails (error). If arrivals exceed serving capacity, the line grows, latency rises, and timeouts increase, causing errors. Improve capacity or reduce work per request to keep the line short.
Helpful rule of thumb
Concurrency ≈ arrival_rate × average_latency. If you expect 200 RPS and average latency is 100 ms (0.1 s), you need about 20 concurrent workers just to keep up.
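As a sanity check, the rule of thumb is a single multiplication; the snippet below simply replays the numbers from the paragraph above.

```python
# Concurrency ≈ arrival_rate × average_latency (Little's law applied to workers).
arrival_rate_rps = 200   # expected requests per second (example from above)
avg_latency_s = 0.100    # 100 ms average service time per request

in_flight = arrival_rate_rps * avg_latency_s
print(f"~{in_flight:.0f} requests in flight, so ~{in_flight:.0f} workers just to keep up")
```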
Why percentiles beat averages
- Averages can hide pain. Users who hit the tail (P95/P99) feel slowness.
- Track distributions with histograms and show P50/P90/P95/P99.
- Alert on tail percentiles for user-facing SLOs.
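A quick illustration of why the average misleads: the hypothetical sample below is 97% fast requests and 3% slow ones, and the mean looks acceptable while P99 does not.

```python
# Illustrative only: a mostly-fast workload with a slow 3% tail
# (e.g., cache misses or GC pauses in a hypothetical service).
import statistics

fast = [0.040] * 970   # 40 ms
slow = [1.200] * 30    # 1.2 s tail
latencies = sorted(fast + slow)

mean_ms = statistics.mean(latencies) * 1000
p99_ms = latencies[int(0.99 * len(latencies)) - 1] * 1000   # nearest-rank P99

print(f"mean = {mean_ms:.0f} ms")  # ~75 ms, looks fine on a dashboard
print(f"P99  = {p99_ms:.0f} ms")   # 1200 ms, tail users wait 30x longer
```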
How they interact
- Higher throughput with fixed capacity increases queueing, raising latency and eventually error rate (timeouts).
- Optimizing latency (e.g., batching, caching) can increase throughput headroom.
- Error rate often spikes after latency spikes; timeouts and circuit breakers trigger failures.
Worked examples
Example 1 – Capacity sizing from RPS and latency
Expected peak: 220 RPS. Average latency: 120 ms (0.12 s). Minimum concurrent workers = RPS × avg_latency = 220 × 0.12 = 26.4 → round up to 27. Add 30% headroom: 27 × 1.3 ≈ 35 workers (or equivalent replicas × threads).
What about percentiles?
If P95 is 2× average, assume 0.24 s for tail traffic. For safe sizing, consider that a portion of requests will take longer; headroom helps absorb that.
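A small helper makes the Example 1 arithmetic repeatable. The 30% headroom and the 2× P95 figure are the assumptions stated above, not universal constants.

```python
import math

def size_workers(peak_rps: float, avg_latency_s: float, headroom: float = 0.30) -> int:
    """Workers needed: concurrency ≈ peak RPS × latency, rounded up after headroom."""
    return math.ceil(peak_rps * avg_latency_s * (1 + headroom))

# Example 1: 220 RPS peak, 120 ms average latency, 30% headroom.
print(size_workers(220, 0.120))              # 26.4 × 1.3 = 34.3 → 35 workers
# If every request ran at the 2× P95 figure (240 ms), the bare minimum roughly doubles:
print(size_workers(220, 0.240, headroom=0))  # 52.8 → 53 workers, before any headroom
```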
Example 2 – Latency budget across components
End-to-end P95 SLO: ≤ 250 ms. Components:
- Gateway: 20 ms
- Feature store: 80 ms
- Model inference: 120 ms
- Post-processing: 20 ms
Total = 240 ms, leaving 10 ms margin. Action: add caching to the feature store to shave 20–30 ms and create a comfortable buffer.
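The budget check is just addition, but keeping it in one place makes it easy to replay whenever a component changes; the figures below are the ones listed above, and the 25 ms cache saving is hypothetical.

```python
# End-to-end P95 budget check for Example 2. Component figures are the ones above.
SLO_MS = 250
budget_ms = {
    "gateway": 20,
    "feature_store": 80,
    "model_inference": 120,
    "post_processing": 20,
}

total = sum(budget_ms.values())
print(f"total = {total} ms, margin = {SLO_MS - total} ms")   # 240 ms, 10 ms margin

# Hypothetical feature-store cache saving 25 ms widens the margin to 35 ms.
budget_ms["feature_store"] -= 25
print(f"with cache: margin = {SLO_MS - sum(budget_ms.values())} ms")
```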
Example 3 – Error-rate alert using burn rate
Availability SLO: 99.5% over 30 days → error budget = 0.5% of requests. Observed 10-minute error rate = 5%. Burn rate over 10 minutes = 5% / 0.5% = 10× the budgeted rate. Trigger a fast-page alert if burn ≥ 14× over 5–10 minutes, and a slower warning at ≥ 6× over 1 hour.
Why burn rate?
It ties alerts to how quickly you consume your monthly budget, reducing noisy paging on tiny blips.
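Here is a sketch of the burn-rate arithmetic from Example 3. The 14× and 6× thresholds follow the multi-window pattern described above; treat them as starting points to tune, not fixed rules.

```python
def burn_rate(observed_error_rate: float, slo_availability: float) -> float:
    """How many times faster than budgeted the error budget is being consumed."""
    error_budget = 1.0 - slo_availability        # 0.5% for a 99.5% SLO
    return observed_error_rate / error_budget

# Example 3: 99.5% availability SLO, 5% errors over the last 10 minutes.
fast_burn = burn_rate(0.05, 0.995)
print(f"10-minute burn = {fast_burn:.0f}x budget")           # 10x

PAGE_AT, WARN_AT = 14, 6                                     # assumed thresholds
print("fast page" if fast_burn >= PAGE_AT else "no page")    # 10x < 14x: no page
print("1-hour warning would fire if this rate persists" if fast_burn >= WARN_AT else "ok")
```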
Instrumentation and dashboards
- Record histograms for latency labeled by: endpoint, model_version, region, success/failure, and request_type (sync/batch).
- Track counters: total_requests, successful_requests, failed_requests (by error class: timeout, 5xx, 4xx-validation).
- Compute throughput: successful RPS and total RPS.
- Visualize: P50/P95/P99 latency, RPS, error rate by class, concurrency, and queue depth.
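One way to wire this up in Python is with the prometheus_client library. The metric names, label values, bucket boundaries, and the run_model call below are illustrative assumptions, not a prescribed setup.

```python
# Sketch using prometheus_client; metric names, labels, and buckets are example
# choices. Keep label cardinality low (no user IDs or raw feature values).
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "prediction_latency_seconds", "Latency of prediction requests",
    ["endpoint", "model_version", "region", "outcome"],
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)
REQUESTS = Counter(
    "prediction_requests_total", "Prediction requests by error class",
    ["endpoint", "model_version", "error_class"],  # none | timeout | 5xx | 4xx_validation
)

def handle_predict(features, endpoint="/predict", model_version="v3", region="us-east-1"):
    outcome, error_class = "failure", "unknown"
    start = time.perf_counter()
    try:
        result = run_model(features)        # hypothetical inference call
        outcome, error_class = "success", "none"
        return result
    except TimeoutError:
        error_class = "timeout"
        raise
    finally:
        REQUESTS.labels(endpoint, model_version, error_class).inc()
        REQUEST_LATENCY.labels(endpoint, model_version, region, outcome).observe(
            time.perf_counter() - start)

start_http_server(9100)  # expose /metrics for the scraper
```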
Dashboard layout idea
- Row 1 (Experience): P95 latency, error rate, successful RPS
- Row 2 (Capacity): concurrency, CPU/GPU, memory, queue length
- Row 3 (Details): latency histogram by endpoint and model_version
Alerting strategy
- Latency SLO: page on sustained P95 or P99 breach (e.g., P95 > 250 ms) using short (5–10 min) and long (1 h) windows to avoid flukes.
- Error rate: use multi-window burn-rate alerts (fast page for very high burn, slow warn for moderate burn).
- Throughput: warn on sudden drops (traffic loss) or sustained saturation (RPS near capacity and rising latency).
- Missing data: alert if metrics go silent (could mean an outage).
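If your monitoring backend exposes windowed P95 readings, the paging decision itself is tiny; the 250 ms threshold and the two windows are the example values used earlier, and the readings passed in below are made up.

```python
P95_SLO_S = 0.250   # example latency SLO threshold from above

def should_page(p95_short_window_s: float, p95_long_window_s: float,
                slo_s: float = P95_SLO_S) -> bool:
    """Page only when both the short and the long window breach the SLO,
    so a brief spike does not wake anyone up."""
    return p95_short_window_s > slo_s and p95_long_window_s > slo_s

print(should_page(0.310, 0.180))  # 10-min spike only -> False, no page
print(should_page(0.310, 0.290))  # sustained breach in both windows -> True, page
```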
Good alert hygiene
- Page on symptoms users feel (tail latency, availability), not on CPU alone.
- Use runbooks: steps to roll back model version, scale replicas, warm caches.
Exercises
These match the interactive exercises below.
Exercise 1 – Size your inference service
- Given 220 RPS peak and 120 ms average latency, estimate minimum workers and a safe headroom target.
- Propose two actions if P99 spikes during traffic bursts.
- [ ] Compute minimum workers with the concurrency rule of thumb.
- [ ] Add headroom and round up.
- [ ] Suggest actions (e.g., autoscale, cache warmup).
Exercise 2 – Design SLOs and alerts
- Pick an end-to-end latency SLO and an availability SLO.
- Define two alert windows and burn-rate thresholds that reduce noise.
- List the exact metrics and labels you will use.
- [ ] Latency SLO picks a percentile and threshold.
- [ ] Availability SLO defines success criteria.
- [ ] Alert windows and burn thresholds documented.
- [ ] Metrics and labels enumerated.
Common mistakes and self-check
- Mistake: Using average latency only. Self-check: Do you display and alert on P95/P99?
- Mistake: Mixing model accuracy with system error rate. Self-check: Are failures broken down by HTTP/status and timeout vs. bad prediction?
- Mistake: No labels. Self-check: Can you filter metrics by model_version, endpoint, and region?
- Mistake: Paging on CPU alone. Self-check: Are user-centric SLOs your primary paging signals?
- Mistake: Not accounting for traffic spikes. Self-check: Do you have headroom and autoscaling policies?
Practical projects
Project 1 – Build the golden signals dashboard
- Instrument latency histograms and request counters with labels.
- Create panels: P50/P95/P99 latency, successful RPS, error rate by class, queue depth.
- Add breakdowns by model_version and endpoint.
- [ ] Latency percentiles shown
- [ ] Error classes separated
- [ ] Filters for version/endpoint
Project 2 – SLOs and alerting
- Define latency and availability SLOs.
- Implement multi-window burn-rate alerts (e.g., 5–10 min page, 1 h warn).
- Add runbook notes to your alerts.
- [ ] SLOs documented
- [ ] Alerts tested with synthetic failures
- [ ] Runbook linked in alert description
Project 3 – Load test and tune
- Run a step-load to target peak RPS.
- Watch percentiles, concurrency, and errors.
- Apply a fix (batching, cache, more replicas) and compare before/after.
- [ ] Step-load results recorded
- [ ] Bottleneck identified
- [ ] Improvement quantified
Mini challenge
You expect 300 RPS. Average latency is 80 ms and P95 is 160 ms. Compute minimum workers and a safe target with 40% headroom. If timeouts begin at 500 ms, which percentile should you monitor to catch user impact earliest?
Reveal answer
Min workers ≈ 300 × 0.08 = 24. With 40% headroom: 24 × 1.4 = 33.6 → round up to 34 workers. Monitor at least P95; consider P99 if timeouts appear near 500 ms so you see tail growth early.
Next steps
- Implement histograms and counters if you have not already.
- Draft latency and availability SLOs and review with your team.
- Set up multi-window alerts and a simple runbook.
- Take the Quick Test below.