Latency And Throughput Optimization

Learn Latency And Throughput Optimization for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

In production, great models fail if responses are slow or the service can’t handle load. As a Machine Learning Engineer, you’ll routinely:

  • Hit strict SLAs like p95 < 120 ms for real-time inference.
  • Handle traffic spikes (e.g., promo campaigns) without timeouts.
  • Balance batching on GPUs with user-facing latency.
  • Scale instances and tune concurrency safely.
  • Diagnose bottlenecks across network, CPU, GPU, and storage.

Concept explained simply

Latency is how long one request takes end-to-end. Throughput is how many requests per second the system completes. You can trade one for the other via batching and concurrency, but only up to your hardware limits.

Mental model: Think of your service as a pipeline plus a waiting line.

  • Pipeline stages: network in → deserialize → preprocess → model → postprocess → serialize → network out.
  • Latency = queue time + service time. If the line is long (high utilization), queue time dominates.
  • Throughput ≈ concurrency ÷ service_time.

Useful rules of thumb
  • Little’s Law: L = λ × W, where L is the average number of requests in the system, λ is the arrival rate, and W is the time a request spends in it. As λ approaches capacity, W (queueing plus service time) explodes.
  • Tail latency (p95/p99) grows rapidly above roughly 70–80% utilization. Keep headroom (see the sketch below).
  • Batching boosts throughput on accelerators but adds wait time. Use a max_wait_ms cap.
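
To make these rules concrete, here is a minimal back-of-the-envelope sketch in plain Python. The function names and the 4-worker / 40 ms figures are illustrative assumptions (they match Example 1 below), not the API of any serving framework.

    # back_of_envelope.py: rough latency/throughput estimates (illustrative sketch)

    def throughput_rps(concurrency: int, service_time_s: float) -> float:
        """Approximate completions per second: concurrency / service_time."""
        return concurrency / service_time_s

    def utilization(arrival_rps: float, capacity_rps: float) -> float:
        """Fraction of capacity in use; keep below ~0.7-0.8 to protect tail latency."""
        return arrival_rps / capacity_rps

    def avg_in_flight(arrival_rps: float, latency_s: float) -> float:
        """Little's Law: L = lambda * W (average requests in the system)."""
        return arrival_rps * latency_s

    if __name__ == "__main__":
        capacity = throughput_rps(concurrency=4, service_time_s=0.040)        # ~100 RPS
        print(f"capacity            ~ {capacity:.0f} RPS")
        print(f"utilization @70 RPS ~ {utilization(70, capacity):.0%}")       # ~70%: healthy headroom
        print(f"in flight @70 RPS, 60 ms latency ~ {avg_in_flight(70, 0.060):.1f} requests")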

Key formulas and targets

  • Latency per request: latency = queue_time + service_time.
  • Throughput (approx): throughput ≈ concurrency / service_time_seconds.
  • Capacity planning: target_utilization ≤ 0.7–0.8 to protect tail latency.
  • Batching on GPUs (simple model): service_time(b) ≈ fixed_overhead + per_item_time × b.

Example targets you might set
  • SLA: p95 < 120 ms, error rate < 0.1%.
  • Headroom: maintain utilization ≤ 70% at expected peak.
  • Batching: max_batch_size tuned to keep p95 within SLA; max_wait_ms ≤ 20 ms for UX.
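
If you prefer code to formulas, the helpers below encode the same relationships and check the batching model against the example targets. The 6 ms / 4 ms coefficients match the GPU batching example further down; everything else is an assumption for illustration.

    # capacity_model.py: the key formulas as small helper functions (sketch)

    def latency_ms(queue_ms: float, service_ms: float) -> float:
        return queue_ms + service_ms                            # latency = queue + service

    def throughput_rps(concurrency: float, service_time_s: float) -> float:
        return concurrency / service_time_s                     # throughput ~ concurrency / service_time

    def batched_service_ms(batch_size: int, fixed_overhead_ms: float, per_item_ms: float) -> float:
        return fixed_overhead_ms + per_item_ms * batch_size     # simple linear batching model

    def within_sla(p95_ms: float, sla_ms: float = 120.0) -> bool:
        return p95_ms < sla_ms

    if __name__ == "__main__":
        service = batched_service_ms(8, fixed_overhead_ms=6, per_item_ms=4)   # 38 ms
        p95_estimate = latency_ms(queue_ms=20, service_ms=service)            # wait cap + service
        print(service, p95_estimate, within_sla(p95_estimate))                # 38 58 True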

Worked examples

Example 1: Estimating throughput vs concurrency

Assume average service time per request is 40 ms (0.04 s), and you set concurrency = 4 (e.g., 4 worker threads).

  1. Throughput ≈ concurrency / service_time = 4 / 0.04 = 100 RPS.
  2. At 100 RPS arrival, utilization ≈ 100% → tail latency spikes. To keep p95 healthy, plan for about 70 RPS (≈70% utilization).

Takeaway

Don’t plan capacity at 100% utilization. Use 70–80% to protect p95/p99.

Example 2: GPU batching trade-off

Model on GPU: service_time(b) ≈ 6 ms + 4 ms × b. SLA p95 < 120 ms.

  1. For b = 8 → service_time = 6 + 32 = 38 ms.
  2. If max_wait_ms = 20 ms, worst-case per-item latency ≈ 38 + 20 = 58 ms (ignoring network). Meets SLA.
  3. Throughput (single GPU) ≈ b / service_time = 8 / 0.038 ≈ 210 RPS.

Takeaway

Small batches (like 8) can keep latency low and dramatically increase throughput.
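
A quick sweep over candidate batch sizes makes the trade-off visible. This sketch reuses the example's service_time(b) = 6 ms + 4 ms × b model and the 20 ms wait cap; the numbers are carried over from the example, not defaults of any serving stack.

    # batch_sweep.py: compare batch sizes under service_time(b) = 6 ms + 4 ms * b (sketch)

    FIXED_MS, PER_ITEM_MS, MAX_WAIT_MS, SLA_MS = 6.0, 4.0, 20.0, 120.0

    for b in (1, 4, 8, 16, 32):
        service_ms = FIXED_MS + PER_ITEM_MS * b
        p95_ms = service_ms + MAX_WAIT_MS                      # worst case: full wait + service
        rps = b / (service_ms / 1000.0)                        # single-GPU throughput estimate
        verdict = "meets SLA" if p95_ms < SLA_MS else "breaks SLA"
        print(f"b={b:>2}  service={service_ms:5.1f} ms  p95~{p95_ms:5.1f} ms  ~{rps:4.0f} RPS  {verdict}")

In this simple model, throughput keeps rising with batch size, but the worst-case p95 crosses the 120 ms SLA once b exceeds 23, which is why max_batch_size needs a cap.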

Example 3: Cold starts and autoscaling

If autoscaling spins up an instance with a cold-load time of 3 s, requests may queue or time out during warm-up.

  1. Mitigation: pre-warm (readiness gate) and keep a minimum number of warm instances during business hours.
  2. Benefit: avoids sudden p95 spikes caused by cold instances joining under load.

Takeaway

Autoscaling without pre-warm strategies can worsen tail latency during spikes.
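
A rough way to size the damage: every second of cold start multiplies the excess arrival rate into queued (or timed-out) requests. All of the numbers in this sketch are illustrative assumptions:

    # cold_start_estimate.py: requests that pile up while a new instance warms up (sketch)

    arrival_rps = 300            # cluster-wide arrival rate during a spike (assumed)
    warm_instances = 3           # instances already serving
    per_instance_rps = 80        # safe per-instance capacity at ~70% utilization (assumed)
    cold_start_s = 3.0           # container pull + model load + warm-up

    excess_rps = max(0.0, arrival_rps - warm_instances * per_instance_rps)
    queued = excess_rps * cold_start_s       # requests queued or timed out during warm-up
    print(f"~{queued:.0f} requests queue while the new instance warms up")
    # Pre-warming and a minimum warm pool keep excess_rps near zero during the spike.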

Techniques toolbox

Reduce service time

  • Model optimizations: quantization, compilation (e.g., graph optimization), operator fusion.
  • I/O cuts: avoid repeated model loads, cache tokenizers and pre/post artifacts.
  • Serialization: prefer binary encodings for large payloads; avoid excessive JSON parsing.
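
One of the cheapest service-time wins is never reloading artifacts per request. Below is a minimal caching sketch; get_model, get_tokenizer, and the model path are hypothetical placeholders for whatever loading code your stack actually uses.

    # artifact_cache.py: load the model and tokenizer once per process, not per request (sketch)
    from functools import lru_cache

    @lru_cache(maxsize=1)
    def get_model(path: str):
        print(f"loading model from {path} (should print once per process)")
        return object()                          # placeholder for the real framework load call

    @lru_cache(maxsize=1)
    def get_tokenizer(path: str):
        print(f"loading tokenizer from {path} (should print once per process)")
        return object()

    def handle_request(text: str) -> str:
        model = get_model("/models/classifier-v3")           # hypothetical artifact path
        tokenizer = get_tokenizer("/models/classifier-v3")
        # ... preprocess with tokenizer, run model, postprocess ...
        return "ok"

    if __name__ == "__main__":
        for _ in range(3):
            handle_request("hello")              # only the first call pays the load cost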

Reduce queue time

  • Scale out before saturation: target 60–80% utilization.
  • Tune concurrency: right-size threads/workers and GPU streams.
  • Prioritize critical requests; set fair timeouts and backpressure.
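
Backpressure in miniature: cap in-flight work with a semaphore and shed excess load quickly rather than letting queue time grow without bound. The limits, timings, and simulated infer call are assumptions; a production limiter would also need careful cancellation handling.

    # backpressure.py: bounded concurrency with fast rejection (asyncio sketch)
    import asyncio

    MAX_IN_FLIGHT = 8           # sized to keep utilization below ~70-80% (assumed)
    QUEUE_TIMEOUT_S = 0.05      # give up fast rather than let queue time explode

    async def infer(payload: str) -> str:
        await asyncio.sleep(0.04)                # stand-in for ~40 ms of model work
        return f"result:{payload}"

    async def handle(limiter: asyncio.Semaphore, payload: str) -> str:
        try:
            # Wait briefly for a slot; reject (backpressure) if the service is saturated.
            await asyncio.wait_for(limiter.acquire(), timeout=QUEUE_TIMEOUT_S)
        except asyncio.TimeoutError:
            return "503 overloaded"              # the caller can retry, degrade, or fail over
        try:
            return await infer(payload)
        finally:
            limiter.release()

    async def main():
        limiter = asyncio.Semaphore(MAX_IN_FLIGHT)
        results = await asyncio.gather(*(handle(limiter, str(i)) for i in range(32)))
        print(sum(r == "503 overloaded" for r in results), "of 32 requests shed")

    asyncio.run(main())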

Batching heuristics

  • Start small (e.g., max_batch_size 4–16) with max_wait_ms of 10–20 ms.
  • Monitor p95, GPU utilization, and batch fullness; increase batch size gradually.
  • Cap batch wait to protect latency when traffic is low (sketched right after this list).
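
The coalescing pattern behind these heuristics, sketched with asyncio: collect items until max_batch_size is reached or max_wait_ms expires, then run one batched call. The constants and the run_batch body are placeholders; production inference servers implement dynamic batching far more robustly.

    # micro_batcher.py: coalesce requests under max_batch_size / max_wait_ms (asyncio sketch)
    import asyncio

    MAX_BATCH_SIZE = 8
    MAX_WAIT_MS = 20

    async def run_batch(items):
        await asyncio.sleep((6 + 4 * len(items)) / 1000)   # placeholder: 6 ms + 4 ms per item
        return [f"pred:{x}" for x in items]

    async def batcher(queue: asyncio.Queue):
        loop = asyncio.get_running_loop()
        while True:
            item, fut = await queue.get()                  # block until the first item arrives
            batch, futures = [item], [fut]
            deadline = loop.time() + MAX_WAIT_MS / 1000
            while len(batch) < MAX_BATCH_SIZE:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(queue.get(), remaining)
                except asyncio.TimeoutError:
                    break                                  # wait cap hit: ship a partial batch
                batch.append(item)
                futures.append(fut)
            for fut, pred in zip(futures, await run_batch(batch)):
                fut.set_result(pred)

    async def predict(queue: asyncio.Queue, x):
        fut = asyncio.get_running_loop().create_future()
        await queue.put((x, fut))
        return await fut                                   # resolved by the batcher

    async def main():
        queue = asyncio.Queue()
        asyncio.create_task(batcher(queue))
        print(await asyncio.gather(*(predict(queue, i) for i in range(20))))

    asyncio.run(main())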

Stabilize tail latency

  • Time-box pre/post steps; cut heavy string ops.
  • Isolate noisy neighbors: set CPU pinning or resource requests/limits.
  • Use circuit breakers and strict timeouts.
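
A minimal sketch of the last point: enforce a strict deadline on a downstream call and stop calling it after repeated failures. The thresholds, the hypothetical feature_lookup call, and the timings are illustrative, not a production-ready breaker.

    # breaker.py: strict timeout plus a tiny circuit breaker (illustrative sketch)
    import asyncio
    import time

    class CircuitBreaker:
        """Fails fast once a dependency has failed several times in a row."""

        def __init__(self, failure_threshold: int = 5, reset_after_s: float = 10.0):
            self.failure_threshold = failure_threshold
            self.reset_after_s = reset_after_s
            self.failures = 0
            self.opened_at = None

        def allow(self) -> bool:
            if self.opened_at is None:
                return True
            if time.monotonic() - self.opened_at > self.reset_after_s:
                self.failures, self.opened_at = 0, None    # half-open: let one attempt through
                return True
            return False

        def record(self, ok: bool) -> None:
            self.failures = 0 if ok else self.failures + 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

    breaker = CircuitBreaker()

    async def feature_lookup(key: str) -> str:
        await asyncio.sleep(0.02)                          # stand-in for a feature store call
        return f"features:{key}"

    async def call_with_guardrails(key: str, timeout_s: float = 0.100) -> str:
        if not breaker.allow():
            raise RuntimeError("circuit open: failing fast instead of queueing")
        try:
            result = await asyncio.wait_for(feature_lookup(key), timeout_s)   # strict deadline
            breaker.record(ok=True)
            return result
        except asyncio.TimeoutError:
            breaker.record(ok=False)
            raise

    print(asyncio.run(call_with_guardrails("user-42")))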

Checklist: before you ship
  • p50, p90, p95, p99 logged per stage and end-to-end.
  • Load tested for expected peak × 1.2 headroom.
  • Autoscaling tested with pre-warm and readiness probes.
  • Batching tuned with max_wait_ms guarding UX.
  • Concurrency validated (no oversubscription or thrashing).
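
For the first checklist item, a small instrumentation sketch: time each pipeline stage and summarize percentiles. The stage names, simulated sleeps, and the choice of statistics.quantiles are illustrative.

    # stage_timing.py: per-stage latencies with p50/p95/p99 summaries (sketch)
    import time
    from collections import defaultdict
    from contextlib import contextmanager
    from statistics import quantiles

    timings_ms = defaultdict(list)

    @contextmanager
    def timed(stage: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            timings_ms[stage].append((time.perf_counter() - start) * 1000)

    def report():
        for stage, samples in timings_ms.items():
            cuts = quantiles(samples, n=100)               # 99 percentile cut points
            p50, p95, p99 = cuts[49], cuts[94], cuts[98]
            print(f"{stage:<12} p50={p50:6.1f} ms  p95={p95:6.1f} ms  p99={p99:6.1f} ms")

    if __name__ == "__main__":
        for _ in range(200):                               # simulate 200 requests
            with timed("preprocess"):
                time.sleep(0.002)
            with timed("model"):
                time.sleep(0.020)
            with timed("postprocess"):
                time.sleep(0.001)
        report()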

Exercises

Everyone can take the exercises and the quick test. Note: only logged-in users will have progress saved.

Exercise 1: Tune batch size and queue wait

Goal: Maximize throughput on a single GPU while keeping p95 ≤ 120 ms.

  • Given: service_time(b) ≈ 6 ms + 4 ms × b. Arrival rate ≈ 200 RPS. Choose max_batch_size and max_wait_ms.
  • Assume p95 ≈ service_time(b) + max_wait_ms (ignore network for simplicity).

Hint

Try b in {4, 8, 12}. Keep p95 ≤ 120 ms with a reasonable max_wait_ms (10–25 ms range).

Exercise 2: Concurrency and autoscaling

Goal: Determine per-instance capacity and required instance count.

  • Given: average service_time = 12 ms (0.012 s). To protect p95, use utilization cap = 70%.
  • Set per-instance concurrency c = 2. Required steady load = 600 RPS with 20% headroom.
  • Compute: per-instance RPS capacity and the number of instances to meet target capacity.

Hint

Throughput per instance ≈ c × (1 / service_time) × utilization_cap. Headroom: divide required RPS by 0.8 to size capacity.

Self-check checklist
  • Did your chosen batch size keep p95 within the SLA?
  • Is your headroom ≥ 20% above expected load?
  • Did you avoid running at >80% utilization?

Common mistakes (and how to self-check)

  • Planning at 100% utilization: Expect p95 to blow up. Self-check: Is target utilization ≤ 70–80%?
  • Oversized batches when traffic is low: Users wait for batch fill. Self-check: Is max_wait_ms set small enough?
  • Ignoring serialization: JSON marshalling can dominate small models. Self-check: Profile per stage.
  • Autoscaling too late: scaling on CPU alone reacts after latency has already degraded. Self-check: Do you also scale on latency or queue-length signals?
  • Starving the GPU with too few workers: the accelerator sits at low utilization. Self-check: Increase concurrency or streams incrementally and monitor.

Practical projects

  • Build a toy inference service with two configs: (A) no batching, (B) batching with max_wait_ms. Compare p95 under 50–200 RPS.
  • Instrument per-stage timings in your pipeline and produce a flame chart or stacked bars.
  • Run a load test script increasing RPS until p95 breaches the SLA; derive a safe operating point and headroom (a starting-point sketch follows this list).
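
A self-contained starting point for that last project: it ramps the offered rate against a simulated handler and reports p95 at each step. The handler's 30–50 ms service time and the 4-slot concurrency limit are stand-ins; swap handler() for a real HTTP or gRPC call to load test an actual service.

    # ramp_load_test.py: raise offered RPS until p95 breaches the SLA (simulation sketch)
    import asyncio
    import random
    from statistics import quantiles

    SLA_MS = 120.0

    async def handler(limit: asyncio.Semaphore) -> float:
        """Simulated request; swap this for a real HTTP/gRPC call to test a live service."""
        loop = asyncio.get_running_loop()
        start = loop.time()
        async with limit:                                  # queue for a slot, then "serve"
            await asyncio.sleep(random.uniform(0.030, 0.050))
        return (loop.time() - start) * 1000                # end-to-end latency in ms

    async def run_step(limit: asyncio.Semaphore, rps: int, duration_s: float = 2.0) -> float:
        tasks = []
        for _ in range(int(rps * duration_s)):
            tasks.append(asyncio.create_task(handler(limit)))
            await asyncio.sleep(1.0 / rps)                 # open-loop arrivals at the target rate
        latencies = await asyncio.gather(*tasks)
        return quantiles(latencies, n=100)[94]             # p95

    async def main():
        limit = asyncio.Semaphore(4)                       # capacity roughly 4 / 0.04 s = ~100 RPS
        for rps in (40, 60, 80, 100, 120):
            p95 = await run_step(limit, rps)
            flag = "  <- breaches SLA" if p95 > SLA_MS else ""
            print(f"{rps:>3} RPS -> p95 ~ {p95:6.1f} ms{flag}")
            if p95 > SLA_MS:
                break

    asyncio.run(main())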

Who this is for

  • Machine Learning Engineers deploying real-time inference.
  • Backend/Platform engineers supporting ML services.

Prerequisites

  • Basic HTTP/gRPC service knowledge.
  • Comfort with concurrency and asynchronous processing.
  • Familiarity with profiling and simple load testing.

Learning path

  1. Measure: add per-stage timing and p95/p99 metrics.
  2. Stabilize: set utilization targets, timeouts, backpressure.
  3. Optimize: model speedups, serialization, batching.
  4. Scale: tune concurrency and autoscaling policies.
  5. Harden: cold-start strategy, regression alarms.

Next steps

  • Apply batching and max_wait_ms in a staging environment; verify p95.
  • Add alarms on p95 and queue depth; rehearse spike playbooks.
  • Iterate on concurrency until you reach stable tail latency and desired throughput.

Mini challenge

Your current p95 is 140 ms vs. a 120 ms SLA. GPU utilization is 45%, batch fullness is 3/8 on average, and CPU is 90% from JSON parsing. Propose two changes that bring p95 under SLA without adding instances.

Practice Exercises

2 exercises to complete

Instructions

Maximize throughput on a single GPU while keeping p95 ≤ 120 ms. Use service_time(b) ≈ 6 ms + 4 ms × b. Arrival rate ≈ 200 RPS.

  • Pick max_batch_size from {4, 8, 12}.
  • Pick max_wait_ms (10–25 ms) so p95 ≈ service_time(b) + max_wait_ms ≤ 120 ms.
  • Estimate throughput ≈ b / service_time(b) (requests per second).

Expected Output
A configuration such as max_batch_size=8, max_wait_ms=20 ms; throughput ≈ 210 RPS; p95 ≈ 58 ms (meets SLA).

Latency And Throughput Optimization — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
