Topic 2 of 8

Latency And Throughput Optimization

Learn Latency And Throughput Optimization for free with explanations, exercises, and a quick test (for backend engineers).

Published: January 20, 2026 | Updated: January 20, 2026

Who this is for

Backend engineers who need to make APIs and services fast (low latency) and capable (high throughput), whether you work on microservices, monoliths, or data-heavy systems.

Prerequisites

  • Basic HTTP and REST or RPC knowledge
  • Comfort with one backend language
  • Basic database understanding (indexes, queries)

Why this matters

Real tasks you will face:

  • Reducing API p95 latency after users report slowness
  • Handling traffic spikes without errors or timeouts
  • Designing timeouts, retries, and backpressure so one slow dependency doesn’t take the system down
  • Sizing worker pools and queues to hit an RPS target
  • Fixing N+1 queries and adding the right indexes

Concept explained simply

Definitions you will use daily

  • Latency: Time to handle one request (e.g., 120 ms). Look at p50, p95, p99, not just average.
  • Throughput: Requests per second (RPS) your service completes.
  • Concurrency: How many requests are in flight at once.
  • Tail latency: The slowest requests (p95/p99). Users feel the tail.

Mental model: Pipes and queues

Imagine water flowing through pipes. Each component is a pipe segment with its own diameter (capacity) and length (latency). The slowest, narrowest part (bottleneck) controls total flow. Queues form in front of bottlenecks. If you push water faster than the narrowest pipe can pass it, the queue grows and latency skyrockets.

A handy relationship (Little’s Law, steady state)

Concurrency ≈ Throughput × Latency. If each request takes 0.2 s and you have 100 workers, your max steady throughput is roughly 100 / 0.2 = 500 RPS.
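
A minimal sketch of this arithmetic in Python (the worker count and per-request latency below are illustrative, not measured values):

    def max_steady_throughput(workers: int, latency_s: float) -> float:
        """Little's Law rearranged: throughput ≈ concurrency / latency."""
        return workers / latency_s

    print(max_steady_throughput(100, 0.2))   # 100 workers at 0.2 s each -> 500.0 RPS
    print(max_steady_throughput(100, 0.1))   # halving latency doubles the ceiling -> 1000.0 RPS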

Critical path vs. parallel work

Total latency is driven by the critical path (the longest chain of dependent steps). Independent steps should run in parallel to shorten this path. Batching many tiny calls can lower per-item overhead and improve throughput.

Backpressure and protection

  • Limit queue length and worker pool size to prevent overload.
  • Set timeouts; use retries with jitter for idempotent operations (sketched after this list).
  • Use circuit breakers when a dependency is failing.
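
A minimal sketch of these protections with asyncio (the dependency call, concurrency limit, timeout, and backoff values are illustrative assumptions, not recommendations):

    import asyncio
    import random

    async def call_dependency():                  # hypothetical downstream call
        await asyncio.sleep(0.05)
        return "ok"

    async def guarded_call(semaphore, attempts: int = 3, timeout_s: float = 0.2):
        for attempt in range(attempts):
            async with semaphore:                 # backpressure: wait here when saturated
                try:
                    # per-attempt timeout so one slow call cannot hold a worker forever
                    return await asyncio.wait_for(call_dependency(), timeout_s)
                except asyncio.TimeoutError:
                    pass
            # exponential backoff plus jitter before retrying (idempotent operations only)
            await asyncio.sleep((2 ** attempt) * 0.05 + random.uniform(0, 0.05))
        raise RuntimeError("dependency unavailable; shed load or trip a circuit breaker")

    async def main():
        semaphore = asyncio.Semaphore(50)         # bound in-flight calls to the dependency
        results = await asyncio.gather(*(guarded_call(semaphore) for _ in range(200)))
        print(len(results), "calls completed")

    asyncio.run(main())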

Measure the right things

  • RED metrics: Rate, Errors, Duration (p50/p95/p99); a quick way to compute these percentiles is sketched after this list
  • USE metrics (for resources): Utilization, Saturation (queue length), Errors
  • Traces show the critical path across services.
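
For example, a quick way to turn raw request durations into p50/p95/p99 (the sample data is made up):

    from statistics import quantiles

    durations_ms = [12, 15, 14, 18, 22, 30, 35, 48, 120, 410]   # hypothetical samples

    cuts = quantiles(durations_ms, n=100, method="inclusive")    # 99 percentile cut points
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    print(f"p50={p50:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")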

Worked examples

Example 1: Parallelizing independent calls

Steps: Auth 10 ms, DB 45 ms, Downstream service 75 ms. Sequential latency: 10 + 45 + 75 = 130 ms. If DB and downstream are independent, run them in parallel: 10 + max(45, 75) = 85 ms. With 50 workers, maximum throughput rises from ~384 RPS (50 / 0.13 s) to ~588 RPS (50 / 0.085 s).
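
A minimal sketch of the sequential vs. parallel structure, simulating the three steps with sleeps (real handlers would await the actual auth, DB, and downstream calls):

    import asyncio
    import time

    async def auth():
        await asyncio.sleep(0.010)                # 10 ms, simulated

    async def query_db():
        await asyncio.sleep(0.045)                # 45 ms, simulated

    async def call_downstream():
        await asyncio.sleep(0.075)                # 75 ms, simulated

    async def sequential():
        await auth()
        await query_db()
        await call_downstream()                   # ~10 + 45 + 75 = 130 ms

    async def parallel():
        await auth()
        # DB and downstream are independent, so run them concurrently
        await asyncio.gather(query_db(), call_downstream())   # ~10 + max(45, 75) = 85 ms

    for handler in (sequential, parallel):
        start = time.perf_counter()
        asyncio.run(handler())
        print(f"{handler.__name__}: {(time.perf_counter() - start) * 1000:.0f} ms")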

Example 2: Indexing a hot query

A full table scan takes ~60 ms at p95 under load. Adding an index on the filter column reduces it to ~8 ms. If this query is on the critical path, total latency can drop by ~52 ms, and throughput rises proportionally if workers were saturated.
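
A self-contained illustration with SQLite (the table, column names, and row count are made up; on a real database you would compare plans with your engine's EXPLAIN):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
    conn.executemany(
        "INSERT INTO orders (user_id, total) VALUES (?, ?)",
        [(i % 1000, i * 0.5) for i in range(100_000)],
    )

    query = "SELECT id, total FROM orders WHERE user_id = ?"

    # Before: the plan reports a full table scan
    print(conn.execute(f"EXPLAIN QUERY PLAN {query}", (42,)).fetchall())

    # Add an index on the filter column, then re-check the plan (search using the index)
    conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
    print(conn.execute(f"EXPLAIN QUERY PLAN {query}", (42,)).fetchall())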

Example 3: Batching reduces overhead

Calling a service 20 times (5 ms network + 2 ms processing per call) costs about 20 × 7 = 140 ms, of which 100 ms is pure network overhead. A single batch call with 7 ms network + 20 × 2 ms processing = 47 ms total saves roughly 93 ms and reduces fan-out load.
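
In code, the refactor usually means replacing a per-item loop with one batched call; a sketch with hypothetical client methods (get_price and get_prices are assumptions, not a real API):

    # Chatty: 20 round trips, each paying its own network overhead
    # roughly 20 × (5 ms network + 2 ms processing) = 140 ms
    def load_prices_chatty(client, item_ids):
        return {item_id: client.get_price(item_id) for item_id in item_ids}

    # Batched: one round trip carrying all ids, processed together on the server
    # roughly 7 ms network + 20 × 2 ms processing = 47 ms
    def load_prices_batched(client, item_ids):
        return client.get_prices(item_ids)        # e.g. one POST /prices with a list of ids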

Tactics toolbox

  • Do less work: cache responses, memoize expensive computations, use TTLs, avoid duplicate work with idempotency keys (see the cache sketch after this list).
  • Reduce hops/bytes: keep-alive, reuse connections, compress payloads when large, prune fields, binary or compact encodings when appropriate.
  • Fix data access: proper indexes, covering indexes, avoid N+1 queries, batch reads/writes, pagination, prepared statements, connection pooling.
  • Shorten the critical path: parallelize independent calls, bound fan-out, prefetch when safe.
  • Control concurrency: right-size worker pools, limit queue depth, apply backpressure, shed load gracefully.
  • Harden networks: timeouts per hop, retries with jitter for idempotent ops, circuit breakers.
  • Runtime/OS: avoid blocking the event loop, tune thread pools, watch GC pauses and long tail effects.
  • Observe: monitor p50/p95/p99, error rate, saturation, and per-span timings in traces.
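
As an example of the first tactic, a minimal in-process TTL cache (the decorator, TTL value, and cached function are illustrative; a shared cache such as Redis is the usual production choice, and this sketch is not thread-safe):

    import time
    from functools import wraps

    def ttl_cache(ttl_s: float):
        """Memoize a function's result per argument tuple for ttl_s seconds."""
        def decorator(fn):
            store = {}                            # args -> (expires_at, value)

            @wraps(fn)
            def wrapper(*args):
                now = time.monotonic()
                hit = store.get(args)
                if hit and hit[0] > now:
                    return hit[1]                 # fresh cached value: skip the expensive work
                value = fn(*args)
                store[args] = (now + ttl_s, value)
                return value
            return wrapper
        return decorator

    @ttl_cache(ttl_s=30)
    def load_user_profile(user_id: int):          # hypothetical expensive read
        return {"id": user_id}                    # imagine a database or downstream call here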

Step-by-step optimization recipe

  1. Define the goal: e.g., p95 < 120 ms at 800 RPS with < 0.5% errors.
  2. Measure baseline: capture p50/p95/p99, CPU, memory, queue lengths, and a representative trace.
  3. Find the bottleneck: look for the longest span on the critical path or the saturated resource.
  4. Apply quick wins: add obvious indexes, parallelize independent calls, batch tiny calls, cache hot reads.
  5. Guard the system: add timeouts, retries with jitter (idempotent only), circuit breakers, and backpressure.
  6. Load test: ramp traffic, watch for knee points where latency rises sharply. Adjust pool sizes/limits (see the load-test sketch after this list).
  7. Iterate: repeat trace → change → retest until the goal is met.
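
A toy closed-loop load test for step 6 (the target coroutine and concurrency levels are placeholders; tools such as k6, wrk, or Locust do this properly against a real endpoint):

    import asyncio
    import time
    from statistics import quantiles

    async def target():
        await asyncio.sleep(0.02)                 # stand-in for a real call to the endpoint

    async def run_level(concurrency: int, requests: int = 500):
        sem = asyncio.Semaphore(concurrency)
        durations = []

        async def one():
            async with sem:
                start = time.perf_counter()
                await target()
                durations.append(time.perf_counter() - start)

        await asyncio.gather(*(one() for _ in range(requests)))
        p95_ms = quantiles(durations, n=100, method="inclusive")[94] * 1000
        print(f"concurrency={concurrency:>3}  p95={p95_ms:.1f} ms")

    for level in (5, 20, 80):                     # ramp levels; against a real service,
        asyncio.run(run_level(level))             # watch where p95 starts to rise sharply
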
Mini task: pick quick wins

Look at your last production incident. List the top 2 spans by time on the critical path. For each, decide: cache, index, parallelize, or batch?

Exercises

Note: The quick test is available to everyone; if you are logged in, your progress will be saved.

Exercise 1 — Throughput and latency math (mirrors EX1)

An endpoint has: Auth 10 ms, DB 40 ms + 5 ms network, Downstream 70 ms + 5 ms network. DB and Downstream are independent.

  • 1) Compute sequential p95 latency.
  • 2) Compute parallel p95 latency.
  • 3) With 50 workers, estimate max RPS at sequential vs. parallel.
  • 4) If a cache reduces the downstream to 20 ms + 5 ms, what is the new p95 latency and max RPS?
Hint

Parallel latency = Auth + max(DB, Downstream). Throughput ≈ workers / latency (seconds).

Exercise 2 — Prioritize optimization plan (mirrors EX2)

Observations: p95 = 320 ms at 600 RPS; p99 spikes > 1.2 s. Traces show: Controller 5 ms, Auth 8 ms, DB query on orders by user_id = 120 ms with a full scan, 20 calls to Inventory (each ~12 ms) per request, response body ~1.2 MB.

  • Propose a prioritized plan with 5 changes to hit p95 < 150 ms at 800 RPS.
  • For each change, name the metric you will watch to verify improvement.
Hint

Think: index, batch calls, cache, compress/prune payload, pool sizes, timeouts/backpressure.

  • Submission checklist:
    • Your plan lists quick wins first and risky changes last.
    • Each change has a measurable metric (e.g., DB span time, payload size).
    • You considered tail latency (p99), not just average.

Common mistakes and how to self-check

  • Chasing averages: always include p95/p99 and error rate.
  • Unbounded fan-out: many tiny downstream calls. Self-check: count remote calls per request.
  • Ignoring queues: long queue → high latency. Self-check: observe queue depth and wait time.
  • Missing indexes: frequent filters scanning large tables. Self-check: examine query plans.
  • Over-retrying without timeouts/jitter: causes storms. Self-check: verify timeouts and idempotency.
  • Over-sizing pools: higher concurrency can increase tail latency if it overloads dependencies. Self-check: find the knee point via load tests.

Practical projects

  • Build a simple service with two dependencies; measure sequential vs parallel calls and document the latency change.
  • Create a read-heavy endpoint; add an index and compare query plans and p95.
  • Implement batching for a chatty integration; show reduced downstream calls and improved throughput.

Learning path

  • Before this: HTTP fundamentals, database basics, basic profiling.
  • Now: latency/throughput optimization, tail latency management, backpressure.
  • Next: reliability patterns (circuit breakers), caching strategies, performance testing at scale, database performance tuning.

Next steps

  • Run the exercises with a small load test.
  • Set a simple SLO for one endpoint and start monitoring p95.
  • Take the quick test below to check your understanding. If you log in, your progress will be saved.

Mini challenge

Your service calls 3 downstreams: A 40 ms, B 60 ms, C 35 ms. Currently sequential. Traffic doubles during peak and p99 jumps to 1 s.

  • What two changes would you deploy first to reduce tail latency safely?
Suggested answer
  • Run A, B, C in parallel with per-call timeouts and retries (idempotent only). Expected p95 ≈ max(40, 60, 35) plus small overhead.
  • Batch or cache the chattiest dependency (likely B) and cap queue depth with graceful shedding when saturated.

Practice Exercises

2 exercises to complete

Instructions

An endpoint has: Auth 10 ms, DB 40 ms + 5 ms network, Downstream 70 ms + 5 ms network. DB and Downstream are independent.

  1. Compute sequential p95 latency.
  2. Compute parallel p95 latency.
  3. With 50 workers, estimate max RPS at sequential vs. parallel.
  4. If a cache reduces downstream to 20 ms + 5 ms, what is the new p95 latency and max RPS?
Expected Output
Sequential ~130 ms; Parallel ~85 ms; ~384 RPS vs ~588 RPS; With cache ~55 ms and ~909 RPS.

Latency And Throughput Optimization — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
