Why this matters
In production, great models fail if responses are slow or the service can’t handle load. As a Machine Learning Engineer, you’ll routinely:
- Hit strict SLAs like p95 < 120 ms for real-time inference.
- Handle traffic spikes (e.g., promo campaigns) without timeouts.
- Balance batching on GPUs with user-facing latency.
- Scale instances and tune concurrency safely.
- Diagnose bottlenecks across network, CPU, GPU, and storage.
Concept explained simply
Latency is how long one request takes end-to-end. Throughput is how many requests per second the system completes. You can trade one for the other via batching and concurrency, but only up to your hardware limits.
Mental model: Think of your service as a pipeline plus a waiting line.
- Pipeline stages: network in → deserialize → preprocess → model → postprocess → serialize → network out.
- Latency = queue time + service time. If the line is long (high utilization), queue time dominates.
- Throughput ≈ concurrency ÷ service_time.
Useful rules of thumb
- Little’s Law: L = λ × W (requests in the system = arrival rate × time in the system). As λ approaches capacity, W (queueing plus service time) explodes; the sketch after this list shows the effect.
- Tail latency (p95/p99) grows rapidly near 70–80% utilization. Keep headroom.
- Batching boosts throughput on accelerators but adds wait time. Use a max_wait_ms cap.
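To see why headroom matters, here is a tiny Python sketch using an M/M/1 queue approximation (an illustrative assumption, not a model of any particular service): with a 40 ms service time, average latency is already double the service time at 50% utilization and explodes past 90%.

```python
# How waiting time grows with utilization, using an M/M/1 queue approximation
# (an illustrative assumption, not a model of any particular service).
service_time_s = 0.040                               # average time to serve one request

for utilization in (0.5, 0.7, 0.8, 0.9, 0.95):
    w_s = service_time_s / (1.0 - utilization)       # M/M/1 time in system: W = S / (1 - ρ)
    arrival_rate = utilization / service_time_s      # arrival rate λ that produces this utilization
    in_flight = arrival_rate * w_s                   # Little's Law: L = λ × W
    print(f"util={utilization:.0%}  avg latency={w_s * 1000:5.0f} ms  in flight≈{in_flight:.1f}")
```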
Key formulas and targets
- Latency per request: latency = queue_time + service_time.
- Throughput (approx): throughput ≈ concurrency / service_time_seconds.
- Capacity planning: target_utilization ≤ 0.7–0.8 to protect tail latency.
- Batching on GPUs (simple model): service_time(b) ≈ fixed_overhead + per_item_time × b.
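The same relationships written as small helper functions (a sketch; the function names are illustrative, not any library’s API). The worked examples below apply exactly these formulas.

```python
# The formulas above as tiny helpers (names are illustrative, not a library API).

def latency_ms(queue_ms: float, service_ms: float) -> float:
    # latency = queue_time + service_time
    return queue_ms + service_ms

def throughput_rps(concurrency: int, service_time_s: float) -> float:
    # throughput ≈ concurrency / service_time
    return concurrency / service_time_s

def batched_service_ms(batch_size: int, fixed_overhead_ms: float, per_item_ms: float) -> float:
    # service_time(b) ≈ fixed_overhead + per_item_time * b
    return fixed_overhead_ms + per_item_ms * batch_size

def planned_load_rps(concurrency: int, service_time_s: float, target_utilization: float = 0.7) -> float:
    # capacity planning: plan at 70-80% of raw capacity to protect tail latency
    return target_utilization * throughput_rps(concurrency, service_time_s)
```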
Example targets you might set
- SLA: p95 < 120 ms, error rate < 0.1%.
- Headroom: maintain utilization ≤ 70% at expected peak.
- Batching: max_batch_size tuned to keep p95 within SLA; max_wait_ms ≤ 20 ms for UX.
Worked examples
Example 1: Estimating throughput vs concurrency
Assume average service time per request is 40 ms (0.04 s), and you set concurrency = 4 (e.g., 4 worker threads).
- Throughput ≈ concurrency / service_time = 4 / 0.04 = 100 RPS.
- At 100 RPS arrival, utilization ≈ 100% → tail latency spikes. To keep p95 healthy, target roughly 70 RPS (≈70% utilization).
Takeaway
Don’t plan capacity at 100% utilization. Use 70–80% to protect p95/p99.
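The same arithmetic as a quick check in code (numbers from the example above):

```python
# Example 1: 40 ms service time, concurrency 4, 70% utilization target.
service_time_s = 0.040
concurrency = 4
raw_capacity_rps = concurrency / service_time_s       # 4 / 0.04 = 100 RPS at 100% utilization
safe_operating_rps = 0.70 * raw_capacity_rps          # ≈ 70 RPS leaves tail-latency headroom
print(f"raw capacity ≈ {raw_capacity_rps:.0f} RPS, plan for ≈ {safe_operating_rps:.0f} RPS")
```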
Example 2: GPU batching trade-off
Model on GPU: service_time(b) ≈ 6 ms + 4 ms × b. SLA p95 < 120 ms.
- For b = 8 → service_time = 6 + 32 = 38 ms.
- If max_wait_ms = 20 ms, worst-case per-item latency ≈ 38 + 20 = 58 ms (ignoring network). Meets SLA.
- Throughput (single GPU) ≈ b / service_time = 8 / 0.038 ≈ 210 RPS.
Takeaway
Small batches (like 8) can keep latency low and dramatically increase throughput.
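A small sweep over batch sizes makes the trade-off visible (a sketch that reuses the assumed cost model 6 ms + 4 ms × b; measure your own model instead of trusting these constants):

```python
# Batch-size sweep under the assumed cost model service_time(b) = 6 ms + 4 ms * b.
SLA_P95_MS = 120
MAX_WAIT_MS = 20                                      # queue wait cap before flushing a partial batch

for b in (1, 2, 4, 8, 16):
    service_ms = 6 + 4 * b
    worst_item_ms = service_ms + MAX_WAIT_MS          # ignoring network, as in the example
    throughput = b / (service_ms / 1000)              # items per second from one GPU
    verdict = "meets SLA" if worst_item_ms <= SLA_P95_MS else "breaks SLA"
    print(f"b={b:2d}  latency≈{worst_item_ms:3d} ms  throughput≈{throughput:4.0f} RPS  ({verdict})")
```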
Example 3: Cold starts and autoscaling
If autoscaling spins up an instance with a cold-load time of 3 s, requests may queue or time out during warm-up.
- Mitigation: pre-warm new instances behind a readiness gate and keep a minimum number of warm instances during business hours.
- Benefit: avoids sudden p95 spikes caused by cold instances joining under load.
Takeaway
Autoscaling without pre-warm strategies can worsen tail latency during spikes.
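A minimal readiness-gate sketch, assuming a FastAPI app served by something like uvicorn (the 3-second sleep stands in for the real model load and warm-up inference). Pointing a load-balancer health check or Kubernetes readiness probe at /ready keeps traffic away until the instance is warm:

```python
# Readiness gate: /ready returns 503 until the (simulated) model load finishes, so a
# cold instance never receives traffic while it is still warming up.
import threading
import time

from fastapi import FastAPI, Response

app = FastAPI()
_ready = threading.Event()

def _load_and_warm_model():
    time.sleep(3)                                     # stand-in for loading weights and one warm-up inference
    _ready.set()

threading.Thread(target=_load_and_warm_model, daemon=True).start()

@app.get("/ready")
def ready(response: Response):
    if not _ready.is_set():
        response.status_code = 503                    # keep this instance out of rotation
        return {"status": "warming_up"}
    return {"status": "ready"}
```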
Techniques toolbox
Reduce service time
- Model optimizations: quantization, compilation (e.g., graph optimization), operator fusion.
- I/O cuts: avoid repeated model loads, cache tokenizers and pre/post artifacts.
- Serialization: prefer binary encodings for large payloads; avoid excessive JSON parsing.
Reduce queue time
- Scale out before saturation: target 60–80% utilization.
- Tune concurrency: right-size threads/workers and GPU streams.
- Prioritize critical requests; set fair timeouts and backpressure.
Batching heuristics
- Start small (e.g., max_batch_size 4–16) with max_wait_ms 10–20.
- Monitor p95, GPU utilization, and batch fullness; increase batch size gradually.
- Cap batch wait to protect latency when traffic is low.
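A minimal dynamic-batching sketch in Python asyncio showing how max_batch_size and max_wait_ms interact (the model call is a stand-in that reuses the cost model from the worked example; this is not any specific serving framework):

```python
# Dynamic batching: collect up to MAX_BATCH_SIZE queued requests, but never wait longer
# than MAX_WAIT_MS for the batch to fill, so added latency stays bounded at low traffic.
import asyncio
import time

MAX_BATCH_SIZE = 8        # start small (4-16) and grow while p95 stays inside the SLA
MAX_WAIT_MS = 20          # cap on added queue wait

async def run_model(batch):
    # Stand-in for the real model call, using the 6 ms + 4 ms per item cost model.
    await asyncio.sleep((6 + 4 * len(batch)) / 1000)
    return [x * 2 for x in batch]

async def batch_worker(queue: asyncio.Queue):
    while True:
        item, fut = await queue.get()                 # wait for the first request
        items, futures = [item], [fut]
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(items) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                item, fut = await asyncio.wait_for(queue.get(), timeout=remaining)
                items.append(item)
                futures.append(fut)
            except asyncio.TimeoutError:
                break                                 # flush a partial batch
        for f, result in zip(futures, await run_model(items)):
            f.set_result(result)

async def infer(queue: asyncio.Queue, x):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut                                  # resolves when the batch completes

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    print(await asyncio.gather(*(infer(queue, i) for i in range(20))))
    worker.cancel()

asyncio.run(main())
```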
Stabilize tail latency
- Time-box pre/post steps; cut heavy string ops.
- Isolate noisy neighbors: set CPU pinning or resource requests/limits.
- Use circuit breakers and strict timeouts.
Checklist: before you ship
- p50, p90, p95, p99 logged per stage and end-to-end.
- Load tested at 1.2× the expected peak to confirm headroom.
- Autoscaling tested with pre-warm and readiness probes.
- Batching tuned with max_wait_ms guarding UX.
- Concurrency validated (no oversubscription or thrashing).
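For the first checklist item, a minimal per-stage timing sketch (in production you would emit these samples to your metrics system; the stage names and sleeps below are illustrative):

```python
# Record per-stage latencies and report p50/p95/p99 for each stage.
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles

timings_ms = defaultdict(list)

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage].append((time.perf_counter() - start) * 1000)

# Simulated requests: wrap each pipeline stage in timed(...) inside the real handler.
for _ in range(200):
    with timed("preprocess"):
        time.sleep(0.002)
    with timed("model"):
        time.sleep(0.010)
    with timed("postprocess"):
        time.sleep(0.001)

for stage, samples in timings_ms.items():
    cuts = quantiles(samples, n=100)                  # cuts[49]≈p50, cuts[94]≈p95, cuts[98]≈p99
    print(f"{stage:12s} p50={cuts[49]:.1f} ms  p95={cuts[94]:.1f} ms  p99={cuts[98]:.1f} ms")
```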
Exercises
Exercise 1: Tune batch size and queue wait
Goal: Maximize throughput on a single GPU while keeping p95 ≤ 120 ms.
- Given: service_time(b) ≈ 6 ms + 4 ms × b. Arrival rate ≈ 200 RPS. Choose max_batch_size and max_wait_ms.
- Assume p95 ≈ service_time(b) + max_wait_ms (ignore network for simplicity).
Hint
Try b in {4, 8, 12}. Keep p95 ≤ 120 ms with a reasonable max_wait_ms (10–25 ms range).
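If you want to check your reasoning in code, a small sweep under the exercise’s assumptions looks like this (a utilization above 100% means that configuration cannot sustain the 200 RPS arrival rate):

```python
# Sweep candidate batch sizes and wait caps; keep only configurations that meet the SLA.
ARRIVAL_RPS = 200
SLA_P95_MS = 120

for b in (4, 8, 12):
    for max_wait_ms in (10, 15, 20, 25):
        service_ms = 6 + 4 * b
        p95_ms = service_ms + max_wait_ms             # the exercise's simplifying assumption
        capacity_rps = b / (service_ms / 1000)        # single-GPU throughput at full batches
        utilization = ARRIVAL_RPS / capacity_rps
        if p95_ms <= SLA_P95_MS:
            print(f"b={b:2d} wait={max_wait_ms:2d} ms  p95≈{p95_ms:3d} ms  "
                  f"capacity≈{capacity_rps:3.0f} RPS  util≈{utilization:.0%}")
```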
Exercise 2: Concurrency and autoscaling
Goal: Determine per-instance capacity and required instance count.
- Given: average service_time = 12 ms (0.012 s). To protect p95, use utilization cap = 70%.
- Set per-instance concurrency c = 2. Required steady load = 600 RPS with 20% headroom.
- Compute: per-instance RPS capacity and the number of instances to meet target capacity.
Hint
Throughput per instance ≈ c × (1 / service_time) × utilization_cap. Headroom: divide required RPS by 0.8 to size capacity.
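After working it out by hand, a short scaffold verifies the numbers (same assumptions as the hint above):

```python
# Per-instance capacity and instance count for Exercise 2's numbers.
import math

service_time_s = 0.012
concurrency = 2
utilization_cap = 0.70
required_rps = 600

per_instance_rps = concurrency * (1 / service_time_s) * utilization_cap
target_capacity_rps = required_rps / 0.80             # 20% headroom, per the hint
instances = math.ceil(target_capacity_rps / per_instance_rps)
print(f"per instance ≈ {per_instance_rps:.0f} RPS, "
      f"target ≈ {target_capacity_rps:.0f} RPS, instances = {instances}")
```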
Self-check checklist
- Did your chosen batch size keep p95 within the SLA?
- Is your headroom ≥ 20% above expected load?
- Did you avoid running at >80% utilization?
Common mistakes (and how to self-check)
- Planning at 100% utilization: Expect p95 to blow up. Self-check: Is target utilization ≤ 70–80%?
- Oversized batches when traffic is low: Users wait for batch fill. Self-check: Is max_wait_ms set small enough?
- Ignoring serialization: JSON marshalling can dominate small models. Self-check: Profile per stage.
- Autoscaling too late: Scaling on CPU alone misses queue build-up. Self-check: Use latency or queue-length signals as well.
- Starving GPU with too few workers: Low utilization. Self-check: Increase concurrency or streams incrementally and monitor.
Practical projects
- Build a toy inference service with two configs: (A) no batching, (B) batching with max_wait_ms. Compare p95 under 50–200 RPS.
- Instrument per-stage timings in your pipeline and produce a flame chart or stacked bars.
- Run a load test script increasing RPS until p95 breaches SLA; derive safe operating point and headroom.
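For the load-test project, a stdlib-only starting point might look like this (TARGET_URL, the endpoint path, and the GET method are placeholders; a dedicated load-testing tool is usually a better fit for serious runs):

```python
# Stepped load test: raise the request rate step by step and report p95 per step.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles

TARGET_URL = "http://localhost:8000/predict"          # placeholder; point at your service

def one_request() -> float:
    # End-to-end latency of a single request, in milliseconds.
    start = time.perf_counter()
    with urllib.request.urlopen(TARGET_URL, timeout=2) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000

def run_step(rps: int, duration_s: int = 10) -> float:
    # Open-loop pacing: submit `rps` requests per second for `duration_s` seconds.
    with ThreadPoolExecutor(max_workers=rps * 2) as pool:
        futures = []
        for _ in range(rps * duration_s):
            futures.append(pool.submit(one_request))
            time.sleep(1 / rps)
        latencies = [f.result() for f in futures]
    return quantiles(latencies, n=100)[94]            # index 94 ≈ p95

for rps in (50, 100, 150, 200):
    print(f"{rps} RPS -> p95 ≈ {run_step(rps):.0f} ms")
```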
Who this is for
- Machine Learning Engineers deploying real-time inference.
- Backend/Platform engineers supporting ML services.
Prerequisites
- Basic HTTP/gRPC service knowledge.
- Comfort with concurrency and asynchronous processing.
- Familiarity with profiling and simple load testing.
Learning path
- Measure: add per-stage timing and p95/p99 metrics.
- Stabilize: set utilization targets, timeouts, backpressure.
- Optimize: model speedups, serialization, batching.
- Scale: tune concurrency and autoscaling policies.
- Harden: cold-start strategy, regression alarms.
Next steps
- Apply batching and max_wait_ms in a staging environment; verify p95.
- Add alarms on p95 and queue depth; rehearse spike playbooks.
- Iterate on concurrency until you reach stable tail latency and desired throughput.
Mini challenge
Your current p95 is 140 ms vs. a 120 ms SLA. GPU utilization is 45%, batch fullness is 3/8 on average, and CPU is 90% from JSON parsing. Propose two changes that bring p95 under SLA without adding instances.