Why this matters
In production, great models fail if responses are slow or the service can’t handle load. As a Machine Learning Engineer, you’ll routinely:
- Hit strict SLAs like p95 < 120 ms for real-time inference.
- Handle traffic spikes (e.g., promo campaigns) without timeouts.
- Balance batching on GPUs with user-facing latency.
- Scale instances and tune concurrency safely.
- Diagnose bottlenecks across network, CPU, GPU, and storage.
Concept explained simply
Latency is how long one request takes end-to-end. Throughput is how many requests per second the system completes. You can trade one for the other via batching and concurrency, but only up to your hardware limits.
Mental model: Think of your service as a pipeline plus a waiting line.
- Pipeline stages: network in → deserialize → preprocess → model → postprocess → serialize → network out.
- Latency = queue time + service time. If the line is long (high utilization), queue time dominates.
- Throughput ≈ concurrency ÷ service_time.
Useful rules of thumb
- Little’s Law: L = λ × W (requests in the system = arrival rate × time in the system). As λ approaches capacity, W (queueing plus service time) explodes; the sketch after this list shows the effect.
- Tail latency (p95/p99) grows rapidly near 70–80% utilization. Keep headroom.
- Batching boosts throughput on accelerators but adds wait time. Use a max_wait_ms cap.
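To see why headroom matters, here is a tiny Python sketch using an M/M/1 queue approximation (an illustrative assumption, not a model of any particular service): with a 40 ms service time, average latency is already double the service time at 50% utilization and explodes past 90%.

```python
# How waiting time grows with utilization, using an M/M/1 queue approximation
# (an illustrative assumption, not a model of any particular service).
service_time_s = 0.040                               # average time to serve one request

for utilization in (0.5, 0.7, 0.8, 0.9, 0.95):
    w_s = service_time_s / (1.0 - utilization)       # M/M/1 time in system: W = S / (1 - ρ)
    arrival_rate = utilization / service_time_s      # arrival rate λ that produces this utilization
    in_flight = arrival_rate * w_s                   # Little's Law: L = λ × W
    print(f"util={utilization:.0%}  avg latency={w_s * 1000:5.0f} ms  in flight≈{in_flight:.1f}")
```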
Key formulas and targets
- Latency per request: latency = queue_time + service_time.
- Throughput (approx): throughput ≈ concurrency / service_time_seconds.
- Capacity planning: target_utilization ≤ 0.7–0.8 to protect tail latency.
- Batching on GPUs (simple model): service_time(b) ≈ fixed_overhead + per_item_time × b.
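The same relationships written as small helper functions (a sketch; the function names are illustrative, not any library’s API). The worked examples below apply exactly these formulas.

```python
# The formulas above as tiny helpers (names are illustrative, not a library API).

def latency_ms(queue_ms: float, service_ms: float) -> float:
    # latency = queue_time + service_time
    return queue_ms + service_ms

def throughput_rps(concurrency: int, service_time_s: float) -> float:
    # throughput ≈ concurrency / service_time
    return concurrency / service_time_s

def batched_service_ms(batch_size: int, fixed_overhead_ms: float, per_item_ms: float) -> float:
    # service_time(b) ≈ fixed_overhead + per_item_time * b
    return fixed_overhead_ms + per_item_ms * batch_size

def planned_load_rps(concurrency: int, service_time_s: float, target_utilization: float = 0.7) -> float:
    # capacity planning: plan at 70-80% of raw capacity to protect tail latency
    return target_utilization * throughput_rps(concurrency, service_time_s)
```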
Example targets you might set
- SLA: p95 < 120 ms, error rate < 0.1%.
- Headroom: maintain utilization ≤ 70% at expected peak.
- Batching: max_batch_size tuned to keep p95 within SLA; max_wait_ms ≤ 20 ms for UX.
Worked examples
Example 1: Estimating throughput vs concurrency
Assume average service time per request is 40 ms (0.04 s), and you set concurrency = 4 (e.g., 4 worker threads).
- Throughput ≈ concurrency / service_time = 4 / 0.04 = 100 RPS.
- At 100 RPS arrival, utilization ≈ 100% → tail latency spikes. To keep p95 healthy, target roughly 70 RPS (≈70% utilization).
Takeaway
Don’t plan capacity at 100% utilization. Use 70–80% to protect p95/p99.
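The same arithmetic as a quick check in code (numbers from the example above):

```python
# Example 1: 40 ms service time, concurrency 4, 70% utilization target.
service_time_s = 0.040
concurrency = 4
raw_capacity_rps = concurrency / service_time_s       # 4 / 0.04 = 100 RPS at 100% utilization
safe_operating_rps = 0.70 * raw_capacity_rps          # ≈ 70 RPS leaves tail-latency headroom
print(f"raw capacity ≈ {raw_capacity_rps:.0f} RPS, plan for ≈ {safe_operating_rps:.0f} RPS")
```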
Example 2: GPU batching trade-off
Model on GPU: service_time(b) ≈ 6 ms + 4 ms × b. SLA p95 < 120 ms.
- For b = 8 → service_time = 6 + 32 = 38 ms.
- If max_wait_ms = 20 ms, worst-case per-item latency ≈ 38 + 20 = 58 ms (ignoring network). Meets SLA.
- Throughput (single GPU) ≈ b / service_time = 8 / 0.038 ≈ 210 RPS.
Takeaway
Small batches (like 8) can keep latency low and dramatically increase throughput.
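A small sweep over batch sizes makes the trade-off visible (a sketch that reuses the assumed cost model 6 ms + 4 ms × b; measure your own model instead of trusting these constants):

```python
# Batch-size sweep under the assumed cost model service_time(b) = 6 ms + 4 ms * b.
SLA_P95_MS = 120
MAX_WAIT_MS = 20                                      # queue wait cap before flushing a partial batch

for b in (1, 2, 4, 8, 16):
    service_ms = 6 + 4 * b
    worst_item_ms = service_ms + MAX_WAIT_MS          # ignoring network, as in the example
    throughput = b / (service_ms / 1000)              # items per second from one GPU
    verdict = "meets SLA" if worst_item_ms <= SLA_P95_MS else "breaks SLA"
    print(f"b={b:2d}  latency≈{worst_item_ms:3d} ms  throughput≈{throughput:4.0f} RPS  ({verdict})")
```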
Example 3: Cold starts and autoscaling
If autoscaling spins up an instance with a cold-load time of 3 s, requests may queue or time out during warm-up.
- Mitigation: pre-warm new instances behind a readiness gate and keep a minimum number of warm instances during business hours.
- Benefit: avoids sudden p95 spikes caused by cold instances joining under load.
Takeaway
Autoscaling without pre-warm strategies can worsen tail latency during spikes.
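A minimal readiness-gate sketch, assuming a FastAPI app served by something like uvicorn (the 3-second sleep stands in for the real model load and warm-up inference). Pointing a load-balancer health check or Kubernetes readiness probe at /ready keeps traffic away until the instance is warm:

```python
# Readiness gate: /ready returns 503 until the (simulated) model load finishes, so a
# cold instance never receives traffic while it is still warming up.
import threading
import time

from fastapi import FastAPI, Response

app = FastAPI()
_ready = threading.Event()

def _load_and_warm_model():
    time.sleep(3)                                     # stand-in for loading weights and one warm-up inference
    _ready.set()

threading.Thread(target=_load_and_warm_model, daemon=True).start()

@app.get("/ready")
def ready(response: Response):
    if not _ready.is_set():
        response.status_code = 503                    # keep this instance out of rotation
        return {"status": "warming_up"}
    return {"status": "ready"}
```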
Techniques toolbox
Reduce service time
- Model optimizations: quantization, compilation (e.g., graph optimization), operator fusion.
- I/O cuts: avoid repeated model loads, cache tokenizers and pre/post artifacts.
- Serialization: prefer binary encodings for large payloads; avoid excessive JSON parsing.
Reduce queue time
- Scale out before saturation: target 60–80% utilization.
- Tune concurrency: right-size threads/workers and GPU streams.
- Prioritize critical requests; set fair timeouts and backpressure.
Batching heuristics
- Start small (e.g., max_batch_size 4–16) with max_wait_ms 10–20.
- Monitor p95, GPU utilization, and batch fullness; increase batch size gradually.
- Cap batch wait to protect latency when traffic is low.
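A minimal dynamic-batching sketch in Python asyncio showing how max_batch_size and max_wait_ms interact (the model call is a stand-in that reuses the cost model from the worked example; this is not any specific serving framework):

```python
# Dynamic batching: collect up to MAX_BATCH_SIZE queued requests, but never wait longer
# than MAX_WAIT_MS for the batch to fill, so added latency stays bounded at low traffic.
import asyncio
import time

MAX_BATCH_SIZE = 8        # start small (4-16) and grow while p95 stays inside the SLA
MAX_WAIT_MS = 20          # cap on added queue wait

async def run_model(batch):
    # Stand-in for the real model call, using the 6 ms + 4 ms per item cost model.
    await asyncio.sleep((6 + 4 * len(batch)) / 1000)
    return [x * 2 for x in batch]

async def batch_worker(queue: asyncio.Queue):
    while True:
        item, fut = await queue.get()                 # wait for the first request
        items, futures = [item], [fut]
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(items) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                item, fut = await asyncio.wait_for(queue.get(), timeout=remaining)
                items.append(item)
                futures.append(fut)
            except asyncio.TimeoutError:
                break                                 # flush a partial batch
        for f, result in zip(futures, await run_model(items)):
            f.set_result(result)

async def infer(queue: asyncio.Queue, x):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut                                  # resolves when the batch completes

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    print(await asyncio.gather(*(infer(queue, i) for i in range(20))))
    worker.cancel()

asyncio.run(main())
```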
Stabilize tail latency
- Time-box pre/post steps; cut heavy string ops.
- Isolate noisy neighbors: set CPU pinning or resource requests/limits.
- Use circuit breakers and strict timeouts.
Checklist: before you ship
- p50, p90, p95, p99 logged per stage and end-to-end.
- Load tested at 1.2× the expected peak to confirm headroom.
- Autoscaling tested with pre-warm and readiness probes.
- Batching tuned with max_wait_ms guarding UX.
- Concurrency validated (no oversubscription or thrashing).
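For the first checklist item, a minimal per-stage timing sketch (in production you would emit these samples to your metrics system; the stage names and sleeps below are illustrative):

```python
# Record per-stage latencies and report p50/p95/p99 for each stage.
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles

timings_ms = defaultdict(list)

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage].append((time.perf_counter() - start) * 1000)

# Simulated requests: wrap each pipeline stage in timed(...) inside the real handler.
for _ in range(200):
    with timed("preprocess"):
        time.sleep(0.002)
    with timed("model"):
        time.sleep(0.010)
    with timed("postprocess"):
        time.sleep(0.001)

for stage, samples in timings_ms.items():
    cuts = quantiles(samples, n=100)                  # cuts[49]≈p50, cuts[94]≈p95, cuts[98]≈p99
    print(f"{stage:12s} p50={cuts[49]:.1f} ms  p95={cuts[94]:.1f} ms  p99={cuts[98]:.1f} ms")
```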
Exercises
Exercise 1: Tune batch size and queue wait
Goal: Maximize throughput on a single GPU while keeping p95 ≤ 120 ms.
- Given: service_time(b) ≈ 6 ms + 4 ms × b. Arrival rate ≈ 200 RPS. Choose max_batch_size and max_wait_ms.
- Assume p95 ≈ service_time(b) + max_wait_ms (ignore network for simplicity).
Hint
Try b in {4, 8, 12}. Keep p95 ≤ 120 ms with a reasonable max_wait_ms (10–25 ms range).
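If you want to check your reasoning in code, a small sweep under the exercise’s assumptions looks like this (a utilization above 100% means that configuration cannot sustain the 200 RPS arrival rate):

```python
# Sweep candidate batch sizes and wait caps; keep only configurations that meet the SLA.
ARRIVAL_RPS = 200
SLA_P95_MS = 120

for b in (4, 8, 12):
    for max_wait_ms in (10, 15, 20, 25):
        service_ms = 6 + 4 * b
        p95_ms = service_ms + max_wait_ms             # the exercise's simplifying assumption
        capacity_rps = b / (service_ms / 1000)        # single-GPU throughput at full batches
        utilization = ARRIVAL_RPS / capacity_rps
        if p95_ms <= SLA_P95_MS:
            print(f"b={b:2d} wait={max_wait_ms:2d} ms  p95≈{p95_ms:3d} ms  "
                  f"capacity≈{capacity_rps:3.0f} RPS  util≈{utilization:.0%}")
```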
Exercise 2: Concurrency and autoscaling
Goal: Determine per-instance capacity and required instance count.
- Given: average service_time = 12 ms (0.012 s). To protect p95, use utilization cap = 70%.
- Set per-instance concurrency c = 2. Required steady load = 600 RPS with 20% headroom.
- Compute: per-instance RPS capacity and the number of instances to meet target capacity.
Hint
Throughput per instance ≈ c × (1 / service_time) × utilization_cap. Headroom: divide required RPS by 0.8 to size capacity.
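After working it out by hand, a short scaffold verifies the numbers (same assumptions as the hint above):

```python
# Per-instance capacity and instance count for Exercise 2's numbers.
import math

service_time_s = 0.012
concurrency = 2
utilization_cap = 0.70
required_rps = 600

per_instance_rps = concurrency * (1 / service_time_s) * utilization_cap
target_capacity_rps = required_rps / 0.80             # 20% headroom, per the hint
instances = math.ceil(target_capacity_rps / per_instance_rps)
print(f"per instance ≈ {per_instance_rps:.0f} RPS, "
      f"target ≈ {target_capacity_rps:.0f} RPS, instances = {instances}")
```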
Self-check checklist
- Did your chosen batch size keep p95 within the SLA?
- Is your headroom ≥ 20% above expected load?
- Did you avoid running at >80% utilization?
Common mistakes (and how to self-check)
- Planning at 100% utilization: Expect p95 to blow up. Self-check: Is target utilization ≤ 70–80%?
- Oversized batches when traffic is low: Users wait for batch fill. Self-check: Is max_wait_ms set small enough?
- Ignoring serialization: JSON marshalling can dominate small models. Self-check: Profile per stage.
- Autoscaling too late: Scaling on CPU alone misses queue build-up. Self-check: Use latency or queue-length signals as well.
- Starving GPU with too few workers: Low utilization. Self-check: Increase concurrency or streams incrementally and monitor.
Practical projects
- Build a toy inference service with two configs: (A) no batching, (B) batching with max_wait_ms. Compare p95 under 50–200 RPS.
- Instrument per-stage timings in your pipeline and produce a flame chart or stacked bars.
- Run a load test script increasing RPS until p95 breaches SLA; derive safe operating point and headroom.
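For the load-test project, a stdlib-only starting point might look like this (TARGET_URL, the endpoint path, and the GET method are placeholders; a dedicated load-testing tool is usually a better fit for serious runs):

```python
# Stepped load test: raise the request rate step by step and report p95 per step.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles

TARGET_URL = "http://localhost:8000/predict"          # placeholder; point at your service

def one_request() -> float:
    # End-to-end latency of a single request, in milliseconds.
    start = time.perf_counter()
    with urllib.request.urlopen(TARGET_URL, timeout=2) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000

def run_step(rps: int, duration_s: int = 10) -> float:
    # Open-loop pacing: submit `rps` requests per second for `duration_s` seconds.
    with ThreadPoolExecutor(max_workers=rps * 2) as pool:
        futures = []
        for _ in range(rps * duration_s):
            futures.append(pool.submit(one_request))
            time.sleep(1 / rps)
        latencies = [f.result() for f in futures]
    return quantiles(latencies, n=100)[94]            # index 94 ≈ p95

for rps in (50, 100, 150, 200):
    print(f"{rps} RPS -> p95 ≈ {run_step(rps):.0f} ms")
```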
Who this is for
- Machine Learning Engineers deploying real-time inference.
- Backend/Platform engineers supporting ML services.
Prerequisites
- Basic HTTP/gRPC service knowledge.
- Comfort with concurrency and asynchronous processing.
- Familiarity with profiling and simple load testing.
Learning path
- Measure: add per-stage timing and p95/p99 metrics.
- Stabilize: set utilization targets, timeouts, backpressure.
- Optimize: model speedups, serialization, batching.
- Scale: tune concurrency and autoscaling policies.
- Harden: cold-start strategy, regression alarms.
Next steps
- Apply batching and max_wait_ms in a staging environment; verify p95.
- Add alarms on p95 and queue depth; rehearse spike playbooks.
- Iterate on concurrency until you reach stable tail latency and desired throughput.
Mini challenge
Your current p95 is 140 ms vs. a 120 ms SLA. GPU utilization is 45%, batch fullness is 3/8 on average, and CPU is 90% from JSON parsing. Propose two changes that bring p95 under SLA without adding instances.