
Latency And Throughput Optimization

Learn Latency And Throughput Optimization for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

As an NLP Engineer, you will ship models that answer user queries, classify texts, generate summaries, or embed documents. Users feel every millisecond of latency and your company pays for every token and instance you run. Optimizing latency and throughput keeps experiences smooth and costs controlled.

  • Keep p95/p99 response times under SLOs for chat and search.
  • Handle traffic spikes without errors or timeouts.
  • Reduce cost per request by right-sizing batching, concurrency, and autoscaling.
  • Maintain quality: no truncation, no timeouts, consistent results.

Concept explained simply

Latency is how long a single request takes. Throughput is how many requests you can complete per second. They are linked by queues and capacity. For NLP, two big workloads exist: non-generative (classification, embeddings) and generative (LLMs producing tokens over time).

Mental model

Imagine a checkout lane that batches customers. If you let more customers accumulate (a bigger batch), the cashier works efficiently (high throughput), but early arrivals wait longer (higher latency). If you serve each customer immediately (batch=1), latency is low but the cashier is underutilized (lower throughput). The right balance depends on your SLO and traffic.

Little's Law: concurrency ≈ throughput × latency. If you know two, you can estimate the third. If latency increases, concurrency rises for the same throughput, which can grow queues and tail latency.
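
A minimal sanity check in Python; the throughput and latency numbers below are made-up examples, not measurements from any particular service.

```python
# Little's Law: in-flight requests (concurrency) ~= throughput * latency.
throughput_rps = 300   # completed requests per second (assumed example value)
latency_s = 0.080      # average end-to-end latency in seconds (assumed example value)

concurrency = throughput_rps * latency_s
print(f"Expected in-flight requests: {concurrency:.0f}")  # ~24

# If latency doubles at the same throughput, in-flight requests double too,
# which is why slowdowns show up quickly as growing queues.
```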

Key metrics to track
  • p50, p95, p99 latency (ms) end-to-end (client to response).
  • Throughput (QPS/RPS) and tokens per second (for generative models).
  • Concurrency: in-flight requests. Queue length and wait time.
  • First-token latency vs total generation latency (for LLMs).
  • GPU/CPU utilization, memory, and batch size in inference.
  • Error rate/timeouts and saturation signals (thread pool, queue depth).
Levers that reduce latency
  • Smaller batch size or shorter batch window.
  • Streaming responses (send tokens/chunks as ready).
  • Caching (tokenization cache, embedding cache, model warmup); see the caching sketch after this list.
  • Quantization/distillation (smaller, faster models).
  • Pin memory, efficient tokenization, fewer pre/post-processing hops.
  • Autoscale earlier to avoid queues (concurrency or queue-length scaling).
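
To make the caching lever concrete, here is a minimal sketch using functools.lru_cache to memoize an embedding call. embed_text_cached and its fake vector are placeholders for illustration, not an API from any specific library.

```python
import functools
import hashlib

@functools.lru_cache(maxsize=50_000)
def embed_text_cached(text: str) -> tuple:
    # Placeholder for a real model call: returns a fake, deterministic vector.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return tuple(b / 255 for b in digest[:8])

# Repeated inputs (common in search and chat) skip the model entirely.
embed_text_cached("hello world")
embed_text_cached("hello world")                 # served from the cache
print(embed_text_cached.cache_info())            # hits=1, misses=1
```

In production you would typically use an external cache (for example Redis) keyed on a hash of the normalized text, so the cache survives restarts and is shared across replicas.
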
Levers that increase throughput
  • Dynamic batching (micro-batching within a small window); a minimal batching loop is sketched after this list.
  • Higher concurrency (multiple GPU streams/instances), if not saturating.
  • Kernel fusion, optimized runtimes, tensor cores, CPU vectorization.
  • Asynchronous I/O and thread pools sized to hardware.
  • Avoid unnecessary serialization and data copies.
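
Below is a minimal dynamic-batching sketch in Python using asyncio. It is a toy, not a production server: run_model is a placeholder for a real batched forward pass, and the window and batch-size constants are illustrative.

```python
import asyncio
import time

MAX_BATCH = 16
BATCH_WINDOW_S = 0.005  # 5 ms collection window

def run_model(batch):
    # Placeholder for a real batched forward pass (fake base + per-item cost).
    time.sleep(0.001 + 0.0005 * len(batch))
    return [f"label_for:{text}" for text in batch]

async def batcher(queue: asyncio.Queue):
    loop = asyncio.get_running_loop()
    while True:
        text, fut = await queue.get()
        batch, futures = [text], [fut]
        deadline = loop.time() + BATCH_WINDOW_S
        # Keep collecting until the window closes or the batch is full.
        while len(batch) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                text, fut = await asyncio.wait_for(queue.get(), timeout)
            except asyncio.TimeoutError:
                break
            batch.append(text)
            futures.append(fut)
        # Run the batched model call off the event loop, then resolve callers.
        results = await asyncio.to_thread(run_model, batch)
        for f, r in zip(futures, results):
            f.set_result(r)

async def classify(queue: asyncio.Queue, text: str) -> str:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(batcher(queue))
    results = await asyncio.gather(*(classify(queue, f"doc {i}") for i in range(32)))
    print(f"served {len(results)} requests")

asyncio.run(main())
```

The trade-off from the lists above is visible in the two constants: a larger BATCH_WINDOW_S or MAX_BATCH raises throughput but adds waiting time to every request in the batch.
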
Trade-offs to watch
  • Batching raises throughput but increases per-request wait. Keep window tight (e.g., 2–10 ms for real-time).
  • Concurrency improves throughput until you hit contention; then p95/p99 explode.
  • Quantization may slightly lower accuracy; validate quality metrics.
  • Streaming improves perceived latency but can raise transport overhead.

Worked examples

Example 1: Dynamic batching for a classifier

Assume a GPU model with this latency profile: 20 ms base at batch=1, plus ~3 ms per extra item. Network overhead = 5 ms. Batch window = 8 ms. Two GPU streams (concurrency=2).

  • Batch=8: inference ≈ 20 + 7×3 = 41 ms; total ≈ 41 + 5 + 8 = 54 ms. Throughput per stream ≈ 8 / 0.054 ≈ 148 rps; both streams ≈ 296 rps.
  • Batch=16: inference ≈ 20 + 15×3 = 65 ms; total ≈ 65 + 5 + 8 = 78 ms. Throughput per stream ≈ 16 / 0.078 ≈ 205 rps; both streams ≈ 410 rps.

p95 may be ~10–30% higher due to queueing; at batch=16, p95 ≈ 85–100 ms, still under a 150 ms SLO. Conclusion: batch=16 yields higher throughput with acceptable latency.
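
The arithmetic above is easy to script so you can sweep other batch sizes. This sketch hard-codes the assumed profile from this example (20 ms base, 3 ms per extra item, 5 ms network, 8 ms window, 2 streams); swap in your own measured numbers.

```python
def batch_latency_and_throughput(batch, base_ms=20, per_item_ms=3,
                                 network_ms=5, window_ms=8, streams=2):
    # Simplified latency model from Example 1 (not a measured GPU profile).
    inference_ms = base_ms + (batch - 1) * per_item_ms
    total_ms = inference_ms + network_ms + window_ms
    per_stream_rps = batch / (total_ms / 1000)
    return total_ms, per_stream_rps * streams

for b in (8, 16, 32):
    total_ms, rps = batch_latency_and_throughput(b)
    print(f"batch={b:>2}: ~{total_ms:.0f} ms per request, ~{rps:.0f} rps across streams")
# batch= 8: ~54 ms per request, ~296 rps across streams
# batch=16: ~78 ms per request, ~410 rps across streams
```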

Example 2: Streaming for a generative model

Assume an LLM produces 50 tokens/s after the first token. Prompt processing time ≈ 120 ms. First token ready ≈ 80 ms after that. For 100 tokens:

  • Non-streaming perceived latency ≈ 120 ms + 80 ms + 99 remaining tokens / 50 tokens/s ≈ 120 + 80 + 1980 ms ≈ 2.18 s.
  • Streaming perceived latency to first content ≈ 200 ms; the remaining tokens then flow for ~2 s.

Streaming cuts perceived latency by an order of magnitude, improving UX, while total compute stays similar.
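
The same comparison as a few lines of Python, using the assumed timings above (prompt processing, first-token delay, and decode speed are example values, not benchmarks):

```python
# Assumed timings from Example 2.
prompt_ms, first_token_ms, tokens, tok_per_s = 120, 80, 100, 50

time_to_first_content_ms = prompt_ms + first_token_ms   # what a streaming client perceives
total_ms = time_to_first_content_ms + (tokens - 1) / tok_per_s * 1000

print(f"Streaming: first content after ~{time_to_first_content_ms} ms")   # ~200 ms
print(f"Non-streaming: full response after ~{total_ms / 1000:.2f} s")     # ~2.18 s
```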

Example 3: Autoscaling and headroom

You run an embeddings service. One instance at batch=32 handles ~350 rps with p95 ≈ 120 ms. Required steady-state load: 2000 rps.

  • Instances needed (no headroom): ceil(2000 / 350) = 6.
  • Add 30% headroom for spikes and to keep queues short: ceil(2000×1.3 / 350) = ceil(2600/350) = 8.
  • Concurrency per instance ≈ throughput × latency ≈ 350 × 0.12 ≈ 42. Set max in-flight ≈ 50 and trigger scale-out if queue length grows.
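
A short script for this sizing, so you can rerun it when load or latency changes. The numbers are the assumed ones from this example; the 30% headroom and the in-flight cap of 50 are illustrative policy choices, not universal constants.

```python
import math

instance_rps = 350      # measured per-instance throughput at batch=32 (assumed)
p95_s = 0.12            # measured p95 latency in seconds (assumed)
target_rps = 2000       # required steady-state load
headroom = 1.3          # 30% spare capacity for spikes

replicas_min = math.ceil(target_rps / instance_rps)            # 6 with no headroom
replicas = math.ceil(target_rps * headroom / instance_rps)     # 8 with headroom
expected_in_flight = instance_rps * p95_s                      # ~42 per instance (Little's Law)

print(replicas_min, replicas, round(expected_in_flight))
```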

Hands-on: Measure and optimize

  1. Define SLOs: e.g., p95 < 150 ms, error rate < 1% at 600 rps.
  2. Baseline: record p50/p95, throughput, GPU/CPU utilization, concurrency, first-token latency.
  3. Load test: sweep QPS and capture tail latencies and saturation points (a minimal load-test sketch follows this list).
  4. Optimize in small steps: enable dynamic batching (2–10 ms window), adjust batch size, set safe concurrency, enable streaming for LLMs, consider quantization.
  5. Re-test and compare. Keep changes that improve both p95 and cost per request or meet SLOs with acceptable trade-offs.
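
A minimal, framework-free load-test sketch. send_request is a stand-in that sleeps with random jitter; replace it with a real HTTP or gRPC call to your endpoint, and treat the concurrency sweep values as starting points.

```python
import concurrent.futures
import random
import statistics
import time

def send_request() -> float:
    # Placeholder: swap in a real call to your service and return latency in ms.
    start = time.perf_counter()
    time.sleep(random.uniform(0.02, 0.08))   # simulated service latency
    return (time.perf_counter() - start) * 1000

def run_load(concurrency: int, total_requests: int):
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: send_request(), range(total_requests)))
    cuts = statistics.quantiles(latencies, n=100)   # 99 percentile cut points
    return cuts[49], cuts[94]                       # p50 and p95 in ms

for concurrency in (4, 16, 64):
    p50, p95 = run_load(concurrency, total_requests=300)
    print(f"concurrency={concurrency:>3}: p50={p50:.1f} ms  p95={p95:.1f} ms")
```

For real services, dedicated tools (for example k6, Locust, or vegeta) give you open-loop load generation and better percentile reporting; the point here is simply to sweep load and record tail latencies, not just averages.
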
Quick checklist
  • End-to-end timing includes tokenization and network.
  • p95/p99 tracked per route/model version.
  • Batch window tuned to SLO (start at 5 ms).
  • Concurrency capped to prevent queue explosions.
  • Autoscaling uses queue length / concurrency signals.
  • Streaming enabled for chat/generation.

Exercises

Do the exercise below, then check your work. Your progress is saved only if you are logged in; otherwise you can still complete everything for free.

Exercise 1: Sizing batch, concurrency, and replicas to meet an SLO.
  • Target: p95 < 150 ms at 600 rps for a text classifier.
  • Model latency: 18 ms at batch=1, plus ~2.5 ms per extra item.
  • Overheads: network 4 ms; batch window 6 ms for batch>1.
  • GPU supports 2 inference streams (concurrency=2).
  • Assume tail inflation ≈ +20% from queueing.
  • Tasks: choose batch size, estimate p95, compute throughput per instance, and replicas needed.
Tip

Total per-request time ≈ model_time + network + batch_window. Throughput per stream ≈ batch_size / total_time. Instance throughput ≈ streams × per-stream throughput. Replicas = ceil(target_rps / instance_throughput).
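
If you want to check your numbers, the tip translates directly into a small helper. This is a sketch of the simplified model above (size_deployment and its defaults mirror the exercise's assumptions and are not a general-purpose tool); work the exercise by hand first, then use it to verify.

```python
import math

def size_deployment(batch, target_rps=600, base_ms=18, per_item_ms=2.5,
                    network_ms=4, window_ms=6, streams=2, tail_inflation=1.2):
    # Direct translation of the tip's formulas with the exercise's assumptions.
    model_ms = base_ms + (batch - 1) * per_item_ms
    total_ms = model_ms + network_ms + (window_ms if batch > 1 else 0)
    p95_ms = total_ms * tail_inflation
    instance_rps = streams * batch / (total_ms / 1000)
    replicas = math.ceil(target_rps / instance_rps)
    return p95_ms, instance_rps, replicas

for batch in (1, 4, 8, 16, 32):
    p95, rps, n = size_deployment(batch)
    print(f"batch={batch:>2}: p95~{p95:6.1f} ms  instance~{rps:6.0f} rps  replicas={n}")
```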

Common mistakes and self-check

Common mistakes
  • Letting batch windows grow to meet throughput and accidentally violating p95 SLOs.
  • Using only CPU/GPU utilization to scale; queues explode before utilization maxes out.
  • Ignoring tokenization and serialization costs; p95 drifts upward.
  • Over-concurrency on a single GPU causing kernel contention and tail latency spikes.
  • Skipping warmup; cold starts hurt p99.
How to self-check
  • Compare p95 before/after each change; keep a small performance diary.
  • Ensure Little's Law holds approximately with your measurements (big deviations suggest hidden queues).
  • Verify streaming actually reduces time-to-first-byte in client measurements (a small client-side check is sketched below).
  • Re-run tests at least 3 times; check variance.
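
A client-side time-to-first-byte check, assuming the requests library is installed and the service exposes an HTTP endpoint that streams its response; the URL and payload are hypothetical placeholders.

```python
import time

import requests  # third-party HTTP client; pip install requests

def time_to_first_byte_ms(url: str, payload: dict) -> float:
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        next(resp.iter_content(chunk_size=1))   # block until the first byte arrives
    return (time.perf_counter() - start) * 1000

# Hypothetical usage: compare a streaming route against a non-streaming one.
# print(time_to_first_byte_ms("http://localhost:8000/generate", {"prompt": "hi"}))
```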

Practical projects

  • Serve a sentiment classifier with dynamic batching and show p50/p95 before and after.
  • Enable token streaming for a small LLM and plot perceived latency vs total time.
  • Implement a queue-length-based autoscaling policy in a staging environment and demonstrate stable p95 under a traffic spike.

Who this is for

  • NLP Engineers shipping real-time or near-real-time inference services.
  • ML Engineers optimizing cost and performance of production NLP endpoints.

Prerequisites

  • Basic Python and model serving experience.
  • Understanding of batching, GPU/CPU basics, and HTTP/gRPC APIs.
  • Familiarity with logging/metrics collection.

Learning path

  • Start with measurement: build a minimal load test and collect p50/p95.
  • Add dynamic batching and safe concurrency caps.
  • Introduce streaming for generative models.
  • Explore quantization/distillation if SLOs still tight.
  • Finalize with autoscaling and backpressure.

Next steps

  • Harden observability: dashboards for p95, queue depth, and tokens/s.
  • Introduce canary deployments for performance changes.
  • Run a cost-per-request analysis to guide future trade-offs (varies by country/company; treat as rough ranges).

Mini challenge

Reduce your service's p95 by 20% without increasing cost per request. Allowed: batch window tuning, streaming, and concurrency caps. Show before/after metrics and one graph.

Quick Test

The quick test is available to everyone. Only logged-in users will have their progress saved.

Practice Exercises

1 exercise to complete

Instructions

Work through Exercise 1 above (p95 < 150 ms at 600 rps for a text classifier, using the model latency, overhead, and tail-inflation assumptions listed there): choose a batch size, estimate the expected p95, compute the throughput per instance, and determine the number of replicas needed.

Show your calculations clearly.

Expected Output
A recommended batch size, predicted p95 latency (< 150 ms), estimated instance throughput (rps), and the number of replicas to achieve 600 rps.

Latency And Throughput Optimization: Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

