Why this matters
As a Computer Vision Engineer, you ship models that power real-time cameras, robotics, AR, and safety systems. A great model that responds too slowly will be turned off in production. Optimizing latency (how long each request takes) and throughput (how many requests per second) ensures smooth user experiences and efficient use of hardware budgets.
- Retail cameras: keep p95 < 150 ms for real-time alerts.
- Video analytics: process many streams without dropping frames.
- Robotics/AR: tight control loops require predictable tail latency.
Concept explained simply
Latency is the time from request to response. Throughput is how many requests you handle per second (RPS) or frames per second (FPS). They influence each other: batching often increases throughput but can add waiting time, increasing latency. Your job is balancing both against an SLO (e.g., p95 < 120 ms).
Quick glossary
- p50 / p95 / p99: median / 95th / 99th percentile latency. Optimize for tail (p95/p99) in production.
- Service time: time to process when there is no queue (compute + I/O within the server).
- Queueing delay: time a request waits for resources before service.
- Little's Law (intuition): average concurrency ≈ throughput × latency.
Mental model
Picture a pipeline with five knobs you can tune:
- Compute: faster math (FP16/INT8, op fusion, better kernels).
- Memory: avoid copies, pin buffers, keep tensors contiguous and on device.
- I/O: decode/resize efficiently, compress wisely, avoid heavy serialization.
- Batching: increase GPU utilization; trade a bit of wait time for bigger batches.
- Concurrency: run multiple model instances or streams to reduce queueing.
Optimization playbook (practical)
- Set SLOs: example SLO p95 ≤ 120 ms; minimum throughput 100 RPS.
- Profile baseline: break down time into pre/post, network, model compute, queue.
- Speed up compute: convert to FP16/INT8, fuse ops, compile/optimize graph, prefer static shapes when possible.
- Fix I/O: batch decode, use zero-copy where possible, send tensors not images if safe.
- Tune batching: choose a max batch size and a small timeout (e.g., 5–15 ms) for micro-batching; a dispatcher sketch follows this list.
- Raise concurrency: more model instances/streams until tail latency or memory becomes a problem.
- Control queues: cap queue length, drop stale frames for video, apply timeouts, prioritize urgent traffic.
- Autoscale: scale out once single-node knobs are near their limits; warm up models to avoid cold starts.
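The batching and queue-control steps above can be combined into one small dispatcher loop. Below is a minimal sketch, assuming a Python service where producer threads put request objects onto a shared queue; `run_model` and the `tensor`/`reply` fields are placeholders for your own serving code, and the 4/10 ms settings come from the example sweep later in this section.

```python
import queue
import time

MAX_BATCH = 4        # largest batch your GPU handles efficiently (from your batch-size sweep)
MAX_WAIT_S = 0.010   # micro-batch timeout: wait at most 10 ms to fill a batch

request_q = queue.Queue(maxsize=64)   # bounded queue: back-pressure instead of unbounded growth

def batch_loop(run_model):
    """Collect up to MAX_BATCH requests or wait MAX_WAIT_S, whichever comes first."""
    while True:
        first = request_q.get()                    # block until at least one request arrives
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                              # timeout hit: run a partial batch
            try:
                batch.append(request_q.get(timeout=remaining))
            except queue.Empty:
                break
        inputs = [req["tensor"] for req in batch]  # hypothetical request dicts
        outputs = run_model(inputs)                # one forward pass for the whole micro-batch
        for req, out in zip(batch, outputs):
            req["reply"](out)                      # hand each result back to its caller
```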
Common trade-offs to remember
- Batch size ↑ → throughput ↑, but wait time may ↑.
- Concurrency ↑ → queueing ↓ initially, then contention ↑ (context switches, memory pressure).
- Compression ↑ → bandwidth ↓, but CPU decode cost ↑.
Worked examples
Example 1: Single-image API, GPU
Goal: 100 RPS, p95 ≤ 120 ms on one GPU.
- Baseline FP32 service time per image: 28 ms; pre/post: 6 ms; network: 4 ms.
- FP16 conversion reduces model compute to ~18 ms. Empirical batch times: b1=18 ms, b2=20 ms, b4=24 ms, b8=32 ms.
Pick micro-batch size 4 with 10 ms max wait. Estimated p95 latency ≈ 10 (wait) + 6 (pre/post) + 4 (net) + 24 (compute) = ~44 ms. Throughput per instance ≈ 4/0.024 = ~167 RPS. One instance with batch 4 meets both SLOs.
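The same estimate, scripted so it can be re-run whenever the batch times are re-measured (values are the ones from this example):

```python
# Rough latency/throughput estimates for Example 1 (FP16, measured per-batch compute times).
batch_compute_ms = {1: 18, 2: 20, 4: 24, 8: 32}   # forward-pass time for each batch size
pre_post_ms, network_ms, max_wait_ms = 6, 4, 10   # other per-request costs

for b, compute_ms in batch_compute_ms.items():
    # Worst case: a request waits the full micro-batch timeout before its batch runs.
    est_p95_ms = max_wait_ms + pre_post_ms + network_ms + compute_ms
    est_rps = b / (compute_ms / 1000)             # per-instance throughput if batches stay full
    print(f"batch {b}: ~{est_p95_ms} ms p95, ~{est_rps:.0f} RPS")

# batch 4: ~44 ms p95 and ~167 RPS, which meets p95 <= 120 ms and 100 RPS on one instance.
```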
Example 2: Multi-camera video
6 cameras × 25 FPS = 150 FPS input. Model compute: 7 ms at batch 8 (~1.14k FPS raw if fully saturated), 11 ms at batch 16 (diminishing returns due to memory). Preprocess 2 ms per frame on CPU.
- To protect tail latency p95 < 150 ms, cap per-camera queue to 1 and use dynamic micro-batching with max batch 8, timeout 5 ms.
- If spikes cause queue growth, set stride = 2 (process ~12.5 FPS per camera) and drop oldest frames.
Result: stable p95 well under 150 ms with graceful quality trade-off during bursts.
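One way to express the per-camera queue cap and drop-oldest policy is a bounded buffer per camera. This is a sketch, assuming one capture thread per camera and a separate inference thread; `on_frame` and `collect_batch` are illustrative names, not a specific library's API.

```python
from collections import deque

# One bounded buffer per camera: maxlen=1 means a newer frame silently evicts the stale one,
# which is the "drop oldest" policy that protects tail latency during bursts.
camera_buffers = {cam_id: deque(maxlen=1) for cam_id in range(6)}

def on_frame(cam_id, frame):
    # Called from a capture thread; deque.append is atomic in CPython, so it never blocks
    # and never lets old frames pile up behind new ones.
    camera_buffers[cam_id].append(frame)

def collect_batch(max_batch=8):
    """Inference thread: grab at most one fresh frame per camera, up to max_batch."""
    batch, ids = [], []
    for cam_id, buf in camera_buffers.items():
        if buf and len(batch) < max_batch:
            batch.append(buf.popleft())
            ids.append(cam_id)
    return ids, batch   # run the model on `batch`, route results back by camera id
```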
Example 3: Edge CPU-only
Object detection on CPU needs p95 ≤ 200 ms at ~10 FPS. Quantize to INT8 and use a vectorized runtime. Pre/post is 8 ms, model compute 12 ms INT8. With small batch 2 and a 5 ms timeout, latency ≈ 5 + 8 + 12 = 25 ms; throughput ~80 FPS theoretical, but you cap at 10 FPS to save power and allow other tasks.
Example 4: Capacity check via Little's Law
Service time ~40 ms per request, two parallel instances. Max RPS (no queueing) ≈ instances / service_time = 2 / 0.040 = 50 RPS. If your p95 budget is 200 ms, average concurrency budget ≈ RPS × latency ≈ 50 × 0.200 = 10 requests. Ensure queues and concurrency settings keep in-flight requests at roughly 10 or fewer.
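The same back-of-envelope check as a small helper you can reuse per service (numbers mirror this example):

```python
def capacity_check(service_time_s, instances, p95_budget_s):
    """Little's Law intuition: concurrency ~ throughput x latency."""
    max_rps = instances / service_time_s        # throughput ceiling with no queueing
    inflight_budget = max_rps * p95_budget_s    # average in-flight requests you can afford
    return max_rps, inflight_budget

max_rps, inflight = capacity_check(service_time_s=0.040, instances=2, p95_budget_s=0.200)
print(max_rps, inflight)   # ~50 RPS, ~10 in-flight requests
```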
Techniques you will use
- Model: FP16/INT8, pruning, distillation, kernel fusion, static shapes, compiled runtimes (a timing sketch follows this list).
- GPU: appropriate batch size, multiple streams, a few instances per GPU, avoid memory overcommit.
- I/O: hardware-accelerated decode, zero-copy, binary wire formats to reduce serialization overhead.
- Queues: micro-batching timeouts, per-source queue caps, drop/skip policies, priority lanes.
- Autoscaling: scale-out at sustained high utilization, pre-warm instances.
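As one concrete instance of the precision and warm-up points above, here is a hedged sketch assuming PyTorch, a recent torchvision, and a CUDA GPU. The ResNet-50 is only a stand-in for your own model, and INT8 normally goes through your runtime's quantization or calibration tooling rather than a simple cast.

```python
import time
import torch
import torchvision

# Illustrative model: replace with your own detector/classifier.
model = torchvision.models.resnet50(weights=None).eval().cuda().half()  # cast weights to FP16

x = torch.randn(4, 3, 224, 224, device="cuda", dtype=torch.float16)     # static shape, batch 4

with torch.inference_mode():
    for _ in range(10):                      # warm-up: kernel selection, allocator, caches
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()                 # CUDA launches are async; sync before stopping the clock
    print((time.perf_counter() - start) / 100 * 1000, "ms per batch of 4 in FP16")
```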
Common mistakes and self-check
- Mistake: Chasing average latency. Fix: monitor p95/p99; they drive user experience.
- Mistake: Unlimited queues. Fix: set queue limits; drop or degrade gracefully.
- Mistake: Oversized batch. Fix: sweep batch sizes; stop when tail latency grows faster than throughput.
- Mistake: CPU-bound preprocessing. Fix: profile; move hot paths to GPU or vectorized CPU ops.
- Mistake: One huge instance per GPU. Fix: try 2–4 streams/instances; keep GPU busy without thrashing memory.
Self-check prompts
- Can you state your end-to-end p95 budget and where time is spent?
- Do you know the best batch size under your real traffic pattern?
- What is your max in-flight requests before tail latency degrades?
Practical projects
- Build a micro-batched inference server for an image classifier. Compare p95 at batch sizes 1, 2, 4, 8 under synthetic load; a measurement harness is sketched after this list.
- Create a multi-stream video pipeline with per-camera queue caps and frame dropping. Measure how tail latency behaves during bursts.
- Convert a detection model to FP16 and INT8 and benchmark throughput vs latency vs accuracy changes.
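For the first project, a small harness you could start from to measure latency percentiles under synthetic load. `infer` stands in for a call to your server; a real load test should also run concurrent clients so batching and queueing actually get exercised.

```python
import random
import statistics
import time

def measure_percentiles(infer, n_requests=1000):
    """Send n_requests sequentially and report p50/p95/p99 latency in milliseconds."""
    samples = []
    for _ in range(n_requests):
        t0 = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - t0) * 1000)
    qs = statistics.quantiles(samples, n=100)   # qs[49] ~ p50, qs[94] ~ p95, qs[98] ~ p99
    return qs[49], qs[94], qs[98]

# Stand-in workload: replace the lambda with a real request to your batched server.
p50, p95, p99 = measure_percentiles(lambda: time.sleep(random.uniform(0.01, 0.03)))
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```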
Exercises
Try these now. Solutions are included in the exercise cards below. Use this quick checklist as you work:
- Write down your latency budget and throughput goal.
- Estimate per-stage times (pre/post, network, model, queue).
- Pick batch size and timeout; estimate latency and throughput.
- Decide on instances/streams; verify memory fit.
- Set queue caps and a drop/degrade policy.
Who this is for
- Engineers deploying CV models to production services, edge devices, or video pipelines.
- ML engineers moving from research to real-time inference systems.
Prerequisites
- Basic understanding of neural network inference and CV models.
- Comfort with Python or C++ for model serving, and familiarity with GPU/CPU performance basics.
Learning path
- Profiling and bottleneck identification.
- Model optimization (precision, pruning, compilation).
- Batching and micro-batching strategies.
- Concurrency and queue design.
- Autoscaling and resilience (warmup, load shedding).
Next steps
- Apply these steps to one of your existing services and record p50/p95 before and after.
- Automate a sweep over batch sizes, timeouts, and instance counts; keep the best config per hardware type.
Mini challenge
You have a segmentation model with b1=25 ms, b2=27 ms, b4=34 ms, b8=48 ms (compute only). Pre/post/network = 10 ms combined. Target: p95 β€ 140 ms at 120 RPS on one GPU. Propose batch size, timeout, and instance count. Justify with estimates and a short note on queue caps.
Ready to test?
Everyone can take the quick test below.