
Latency And Throughput Optimization

Learn Latency And Throughput Optimization for free with explanations, exercises, and a quick test (for MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Why this matters

In production ML, user experience and cost depend on two levers: how fast one request completes (latency) and how many requests you can handle per unit time (throughput). As an MLOps Engineer, you will: set latency SLOs (e.g., p95 < 150 ms), size infrastructure to meet peak RPS, tune batching/concurrency on CPUs/GPUs, prevent tail-latency spikes, and scale replicas safely. Getting this right prevents dropped requests, timeouts, and runaway costs.

Who this is for

  • MLOps engineers deploying online inference services
  • Backend engineers integrating ML models
  • Data scientists moving prototypes to production

Prerequisites

  • Basic containers and networking (Docker, REST/gRPC)
  • Familiarity with model runtimes (e.g., ONNX Runtime, TensorRT, TorchServe, Triton)
  • Basic performance metrics: latency percentiles, RPS, CPU/GPU utilization

Concept explained simply

Think of your service like a highway. Latency is how long one car takes to get from on-ramp to off-ramp. Throughput is how many cars pass per minute. If you pack too many cars (requests) without expanding lanes (resources) or optimizing flow (batching, concurrency), traffic jams (queues) cause tail-latency spikes.

Mental model

  • Budget: end-to-end p95 latency = network + queue wait + preprocessing + model + postprocessing.
  • Little's Law (steady state): concurrency ≈ RPS × average latency.
  • Capacity per replica: roughly (number of effective parallel workers) / (latency per request).
  • Batching: increases per-batch latency a little but boosts items processed per run. Keep queue wait bounded.
  • Tail control: limit queueing and reduce variability to tame p95/p99.

Quick formulas and checks

  • Concurrency target: C ≈ RPS × (p50 or mean latency in seconds). Use p95 for conservative planning.
  • Per-replica capacity (approx): capacity ≈ workers / latency_s (see the sketch below).
  • GPU batch latency (simple model): L(b) = L_fixed + L_transfer + b × L_item.
  • Queue wait bound: pick a max queue-delay target (e.g., 10–20 ms) and configure batcher timeouts accordingly.
  • Autoscaling hint: scale on a mix of p95 latency, in-flight requests, and RPS per replica; avoid CPU-only signals for GPU models.
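
These rules of thumb are easy to keep around as small helpers. Below is a minimal, framework-free Python sketch of the same approximations; the function names are illustrative, not taken from any serving stack.

```python
# Back-of-the-envelope capacity helpers (approximations only).
import math


def concurrency_target(rps: float, latency_s: float) -> float:
    """Little's Law: in-flight requests at steady state ~ RPS x latency."""
    return rps * latency_s


def replica_capacity_rps(workers: int, latency_s: float) -> float:
    """Approximate RPS one replica sustains with `workers` parallel slots."""
    return workers / latency_s


def replicas_needed(peak_rps: float, per_replica_rps: float, headroom: float = 0.2) -> int:
    """Replicas for peak traffic plus safety headroom (20% by default)."""
    return math.ceil(peak_rps * (1.0 + headroom) / per_replica_rps)


def batch_latency_ms(b: int, fixed_ms: float, transfer_ms: float, per_item_ms: float) -> float:
    """Simple GPU batch model: L(b) = L_fixed + L_transfer + b * L_item."""
    return fixed_ms + transfer_ms + b * per_item_ms
```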

Worked examples

Example 1 — CPU service capacity planning

Scenario: p95 latency target 120 ms. Expected peak 600 RPS. Node: 8 vCPU, single-model service, CPU-bound inference.

1) Required concurrency: C ≈ 600 × 0.12 = 72 in-flight requests system-wide.
2) Per-replica capacity: With 8 workers (1 per core), each request occupies a core for most of its 120 ms. Approx per-replica capacity ≈ 8 / 0.12 ≈ 67 RPS.
3) Replicas needed: 600 / 67 ≈ 8.96 → provision 9 replicas.
4) Worker config: 8 processes (one per core), 1 thread each for CPU-bound work. Avoid oversubscribing CPU with many threads.

Result: 9 replicas × ~67 RPS ≈ 603 RPS at p95 ≈ 120 ms.
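
For reference, here is a quick standalone check of these numbers; all inputs are taken from the scenario above.

```python
# Re-derive the Example 1 estimates; all inputs come from the scenario above.
import math

peak_rps = 600      # expected peak traffic
p95_s = 0.120       # p95 latency target, in seconds
workers = 8         # one worker process per vCPU (CPU-bound inference)

concurrency = peak_rps * p95_s                    # ~72 requests in flight system-wide
per_replica_rps = workers / p95_s                 # ~66.7 RPS per 8-worker replica
replicas = math.ceil(peak_rps / per_replica_rps)  # 9 replicas

print(concurrency, round(per_replica_rps, 1), replicas)  # 72.0 66.7 9
```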

Example 2 — GPU batching sweet spot

Scenario: Single GPU, dynamic batching allowed. Per-batch latency model: L(b) = 4 ms (launch) + 2 ms (transfer) + 1 ms × b (compute) → L(b) = 6 + b ms. End-to-end p95 budget: 50 ms, and you want queue wait ≤ 10 ms.

  • b=1 → 7 ms; throughput ≈ 1/7 items/ms ≈ 143 RPS
  • b=8 → 14 ms; throughput ≈ 571 RPS
  • b=16 → 22 ms; throughput ≈ 727 RPS
  • b=32 → 38 ms; throughput ≈ 842 RPS

All meet the 50 ms processing budget, but queue wait adds up. With a max queue wait of 10 ms, b=32 can push p95 right up against the budget when arrivals are bursty. Choose b=16–32 and set the max queue delay to 8–10 ms. Start with b=16 for safer tails, then validate with load tests.
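
The trade-off is easy to see with a small sweep. The script below assumes the example's latency model, budget, and queue-wait cap; none of the numbers are measured.

```python
# Sweep batch sizes under the assumed model L(b) = 6 + b ms,
# a 50 ms processing budget, and a 10 ms max queue wait.
BUDGET_MS = 50.0
MAX_QUEUE_WAIT_MS = 10.0


def batch_latency_ms(b: int) -> float:
    return 6.0 + 1.0 * b  # 4 ms launch + 2 ms transfer + 1 ms per item


for b in (1, 8, 16, 32):
    lat_ms = batch_latency_ms(b)
    throughput_rps = 1000.0 * b / lat_ms      # items/s when batches run full
    worst_ms = lat_ms + MAX_QUEUE_WAIT_MS     # processing plus maximum queue wait
    verdict = "ok" if worst_ms <= BUDGET_MS else "over budget"
    print(f"b={b:2d}  L={lat_ms:4.0f} ms  ~{throughput_rps:4.0f} RPS  worst case {worst_ms:4.0f} ms  {verdict}")
```

Note that b=32 passes on paper with only 2 ms of margin, which is why b=16 is the safer starting point.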

Example 3 — Reducing tail latency (p99) without losing throughput

Observed: p50=40 ms, p95=120 ms, p99=500 ms spikes during bursts. Causes often include unbounded queues or heavy GC.

  • Cap queueing: Set max in-flight requests per replica and a short batcher timeout (e.g., 8 ms). Reject or shed load gracefully.
  • Stabilize runtime: Preload model, warm pools, pin memory, use gRPC with keep-alives to cut handshake overhead.
  • Autoscale on signals that matter: p95 latency and in-flight requests, not CPU-only. Add 20–30% headroom.

Result: p99 drops to < 200 ms while sustaining target throughput.
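
As a sketch of the "cap queueing" step, one way to bound in-flight work per replica is a semaphore that sheds load instead of queueing indefinitely. The handler, limit, and exception below are hypothetical; map them onto your server framework's own hooks.

```python
import asyncio

MAX_IN_FLIGHT = 32                    # hard per-replica cap; tune from the capacity math
_slots = asyncio.Semaphore(MAX_IN_FLIGHT)


class Overloaded(Exception):
    """Replica is at capacity; translate to HTTP 429/503 at the edge."""


async def handle(request):
    # Shed load rather than letting an unbounded queue grow.
    if _slots.locked():               # no free slot right now
        raise Overloaded()
    async with _slots:
        return await run_inference(request)


async def run_inference(request):
    # Placeholder for preprocess -> model -> postprocess.
    await asyncio.sleep(0.05)
    return {"ok": True}
```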

Optimization playbook

  • Set clear SLOs: p95 latency, error rate, and cost per 1k requests.
  • Measure baselines: p50/p95/p99 latency, RPS, queue wait, CPU/GPU utilization, memory, network.
  • Cut fixed overhead: keep-alive connections, gRPC/HTTP2, reuse inference sessions, vectorize pre/post-processing.
  • Accelerate model: quantize, compile (ONNX Runtime/TensorRT), fuse ops, enable mixed precision; verify accuracy budget.
  • Tune batching: start small, cap queue wait (e.g., 8–15 ms), pick batch sizes that keep GPU busy without tail spikes (see the micro-batcher sketch after this list).
  • Right-size concurrency: CPU-bound → processes per core; I/O-bound → async workers. Avoid oversubscription.
  • Control tails: bounded queues, timeouts, load shedding, warm pools, JIT warmup.
  • Autoscale safely: use latency and in-flight requests as primary signals; add cooldowns to prevent thrash.
  • Validate: run load tests with realistic traffic shape (bursts, sustained peaks) and confirm SLOs.
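
To make the batching item concrete, here is a minimal asyncio micro-batcher with a max batch size and a max queue delay. Production servers such as Triton ship a dynamic batcher with these knobs, so treat this only as an illustration; run_model is a stand-in.

```python
import asyncio
import time

MAX_BATCH = 16        # cap on batch size
MAX_WAIT_S = 0.008    # max queue delay before flushing a partial batch (8 ms)

_queue: asyncio.Queue = asyncio.Queue()


async def predict(item):
    """Enqueue one item and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await _queue.put((item, fut))
    return await fut


async def batcher():
    """Collect items until MAX_BATCH or MAX_WAIT_S elapses, then run one batched call."""
    while True:
        item, fut = await _queue.get()            # block until the first item arrives
        batch, futs = [item], [fut]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                item, fut = await asyncio.wait_for(_queue.get(), remaining)
            except asyncio.TimeoutError:
                break
            batch.append(item)
            futs.append(fut)
        results = await run_model(batch)          # one model call for the whole batch
        for f, r in zip(futs, results):
            f.set_result(r)


async def run_model(batch):
    # Stand-in for the real batched inference call; mimics L(b) = 6 + b ms.
    await asyncio.sleep(0.006 + 0.001 * len(batch))
    return [f"result:{x}" for x in batch]
```

In a real service you would start batcher() as a background task at startup and call predict() from request handlers.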

Common mistakes and how to self-check

  • Unbounded queues cause p99 explosions. Self-check: is there a hard cap on in-flight requests and a short batch timeout?
  • CPU oversubscription. Self-check: processes × threads > cores by a large factor? If CPU-bound, reduce threads.
  • Ignoring network/serialization. Self-check: measure wire time; try gRPC, compression, or smaller payloads.
  • Batching without a wait cap. Self-check: does batcher enforce max queue delay (e.g., 10 ms)?
  • Scaling on CPU only for GPU workloads. Self-check: do you scale on p95 latency, in-flight requests, or RPS per replica rather than CPU (or GPU) utilization alone?
  • No warmup. Self-check: is first-request latency much higher than steady-state? Add warmup requests on startup (see the sketch after this list).
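
A warmup pass can be as simple as pushing dummy inputs through the model during startup, before the replica reports ready. The input shape and the model callable below are assumptions for illustration.

```python
# Warm a model at startup so the first real request doesn't pay for lazy
# initialization (weight loading, JIT compilation, memory-pool growth).
import numpy as np


def warmup(model, batch_sizes=(1, 8, 16), repeats: int = 5) -> None:
    """Run representative dummy inputs through `model` before serving traffic."""
    for b in batch_sizes:
        # Shape is an assumption; match your real preprocessed input.
        dummy = np.zeros((b, 3, 224, 224), dtype=np.float32)
        for _ in range(repeats):
            model(dummy)

# Typical use: call warmup(model) in the server's startup hook and only mark
# the replica ready for the load balancer once it completes.
```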

Practical projects

  • CPU inference API: Package a model, instrument p50/p95/p99, tune worker processes, and meet a 120 ms p95 at 200 RPS.
  • GPU batching demo: Deploy with a dynamic batcher, find the batch size and timeout that maximize throughput while keeping p95 under 60 ms.
  • Autoscaling experiment: Configure autoscaling using p95 latency and in-flight requests; run a bursty load and prove SLO adherence with 20% headroom.

Exercises

Do these, then check your answers below. A short checklist is provided to validate your setup.

Exercise 1 — Plan capacity and workers

You must handle 600 RPS with p95 ≤ 120 ms for a CPU-bound model. Each node has 8 vCPUs. Pick a per-replica worker configuration and estimate how many replicas you need to meet the SLO without oversubscribing CPU.

Hints
  • Concurrency ≈ RPS × latency_s.
  • Capacity per replica ≈ workers / latency_s (for CPU-bound, workers ≈ number of cores).

Solution

Concurrency target ≈ 600 × 0.12 = 72. Per-replica capacity ≈ 8 / 0.12 ≈ 67 RPS with 8 processes and 1 thread each. Replicas needed ≈ 600 / 67 ≈ 9. Configure 8 workers (one per core), 1 thread per worker, preloaded model.

Exercise 2 — Pick a GPU batch size

Latency model: L(b) = 6 + b ms. End-to-end p95 budget: 50 ms. Keep queue wait ≤ 10 ms. Which batch size would you start with and why?

Hints
  • Compute L(b) for b ∈ {8, 16, 32}.
  • Consider tail risk from queue wait when batches are large.

Solution

L(8)=14 ms, L(16)=22 ms, L(32)=38 ms. All fit within 50 ms processing. With ≤ 10 ms queue wait, b=32 risks tail spikes under bursty arrivals. Start with b=16 and a batch timeout ≈ 8–10 ms. Validate under load; increase to 32 only if tails remain within SLO.

Checklist — before moving on

  • You can compute concurrency from RPS and latency.
  • You can estimate per-replica capacity and replicas needed.
  • You can justify a batch size and queue-wait cap to protect p95.
  • You know when to scale out versus tune workers.

Learning path

  • Start: instrument latency percentiles, RPS, and utilization in a simple inference API.
  • Next: apply model acceleration (quantization/compilation) and re-measure.
  • Then: add batching/concurrency and bounded queues; confirm tails.
  • Finally: configure autoscaling on p95 latency and in-flight requests; run burst tests (see the scaling sketch after this list).
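
As a sketch of what scaling on those signals can look like at the decision level, the toy function below combines in-flight requests and p95 latency. The thresholds, names, and step cap are assumptions; in practice you would feed these metrics into your autoscaler rather than roll your own loop.

```python
import math


def desired_replicas(in_flight_total: float,
                     target_in_flight_per_replica: float,
                     p95_ms: float,
                     p95_slo_ms: float,
                     current: int,
                     max_step: int = 2) -> int:
    """Toy scale decision driven by in-flight requests and p95 latency."""
    want = math.ceil(in_flight_total / target_in_flight_per_replica)
    if p95_ms > p95_slo_ms:          # latency over SLO: force at least one more replica
        want = max(want, current + 1)
    # Cap the step size to avoid thrash; pair this with a cooldown in practice.
    return max(1, min(want, current + max_step))


print(desired_replicas(in_flight_total=72, target_in_flight_per_replica=8,
                       p95_ms=130, p95_slo_ms=150, current=8))   # -> 9
```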

Next steps

  • Integrate structured logging and tracing to pinpoint slow stages.
  • Add canary testing for new model versions to protect SLOs during rollout.
  • Design dashboards that show p50/p95/p99, RPS/replica, and queue wait together.

Mini challenge

Given a current deployment with p95=140 ms at 350 RPS, reduce p95 to 100 ms without adding replicas. You may: optimize serialization, compile the model, and rebalance batch size and timeout. Document each change, its measured effect, and the final configuration.

Quick test

Take the short test below to confirm you've got the essentials.

Latency And Throughput Optimization — Quick Test

Test your knowledge with 9 questions. Pass with 70% or higher.
