Why this matters
As a Computer Vision Engineer, you ship models that power real-time cameras, robotics, AR, and safety systems. A great model that responds too slowly will be turned off in production. Optimizing latency (how long each request takes) and throughput (how many requests per second) ensures smooth user experiences and efficient use of hardware budgets.
- Retail cameras: keep p95 < 150 ms for real-time alerts.
- Video analytics: process many streams without dropping frames.
- Robotics/AR: tight control loops require predictable tail latency.
Concept explained simply
Latency is the time from request to response. Throughput is how many requests you handle per second (RPS) or frames per second (FPS). They influence each other: batching often increases throughput but can add waiting time, increasing latency. Your job is balancing both against an SLO (e.g., p95 < 120 ms).
Quick glossary
- p50 / p95 / p99: median / 95th / 99th percentile latency. Optimize for tail (p95/p99) in production.
- Service time: time to process when there is no queue (compute + I/O within the server).
- Queueing delay: time a request waits for resources before service.
- Little's Law (intuition): average concurrency ≈ throughput × latency.
Mental model
Picture a pipeline with five knobs you can tune:
- Compute: faster math (FP16/INT8, op fusion, better kernels).
- Memory: avoid copies, pin buffers, keep tensors contiguous and on device.
- I/O: decode/resize efficiently, compress wisely, avoid heavy serialization.
- Batching: increase GPU utilization; trade a bit of wait time for bigger batches.
- Concurrency: run multiple model instances or streams to reduce queueing.
Optimization playbook (practical)
- Set SLOs: example SLO p95 ≤ 120 ms; minimum throughput 100 RPS.
- Profile baseline: break down time into pre/post, network, model compute, queue.
- Speed up compute: convert to FP16/INT8, fuse ops, compile/optimize graph, prefer static shapes when possible.
- Fix I/O: batch decode, use zero-copy where possible, send tensors not images if safe.
- Tune batching: choose a max batch size and a small timeout (e.g., 5–15 ms) for micro-batching; a dispatcher sketch follows this list.
- Raise concurrency: more model instances/streams until tail latency or memory becomes a problem.
- Control queues: cap queue length, drop stale frames for video, apply timeouts, prioritize urgent traffic.
- Autoscale: scale out once single-node knobs are near their limits; warm up models to avoid cold starts.
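The batching and queue-control steps above can be combined into one small dispatcher loop. Below is a minimal sketch, assuming a Python service where producer threads put request objects onto a shared queue; `run_model` and the `tensor`/`reply` fields are placeholders for your own serving code, and the 4/10 ms settings come from the example sweep later in this section.

```python
import queue
import time

MAX_BATCH = 4        # largest batch your GPU handles efficiently (from your batch-size sweep)
MAX_WAIT_S = 0.010   # micro-batch timeout: wait at most 10 ms to fill a batch

request_q = queue.Queue(maxsize=64)   # bounded queue: back-pressure instead of unbounded growth

def batch_loop(run_model):
    """Collect up to MAX_BATCH requests or wait MAX_WAIT_S, whichever comes first."""
    while True:
        first = request_q.get()                    # block until at least one request arrives
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                              # timeout hit: run a partial batch
            try:
                batch.append(request_q.get(timeout=remaining))
            except queue.Empty:
                break
        inputs = [req["tensor"] for req in batch]  # hypothetical request dicts
        outputs = run_model(inputs)                # one forward pass for the whole micro-batch
        for req, out in zip(batch, outputs):
            req["reply"](out)                      # hand each result back to its caller
```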
Common trade-offs to remember
- Batch size ↑ → throughput ↑, but wait time may ↑.
- Concurrency ↑ → queueing ↓ initially, then contention ↑ (context switches, memory pressure).
- Compression ↑ → bandwidth ↓, but CPU decode cost ↑.
Worked examples
Example 1: Single-image API, GPU
Goal: 100 RPS, p95 ≤ 120 ms on one GPU.
- Baseline FP32 service time per image: 28 ms; pre/post: 6 ms; network: 4 ms.
- FP16 conversion reduces model compute to ~18 ms. Empirical batch times: b1=18 ms, b2=20 ms, b4=24 ms, b8=32 ms.
Pick micro-batch size 4 with 10 ms max wait. Estimated p95 latency ≈ 10 (wait) + 6 (pre/post) + 4 (net) + 24 (compute) = ~44 ms. Throughput per instance ≈ 4/0.024 = ~167 RPS. One instance with batch 4 meets both SLOs.
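The same estimate, scripted so it can be re-run whenever the batch times are re-measured (values are the ones from this example):

```python
# Rough latency/throughput estimates for Example 1 (FP16, measured per-batch compute times).
batch_compute_ms = {1: 18, 2: 20, 4: 24, 8: 32}   # forward-pass time for each batch size
pre_post_ms, network_ms, max_wait_ms = 6, 4, 10   # other per-request costs

for b, compute_ms in batch_compute_ms.items():
    # Worst case: a request waits the full micro-batch timeout before its batch runs.
    est_p95_ms = max_wait_ms + pre_post_ms + network_ms + compute_ms
    est_rps = b / (compute_ms / 1000)             # per-instance throughput if batches stay full
    print(f"batch {b}: ~{est_p95_ms} ms p95, ~{est_rps:.0f} RPS")

# batch 4: ~44 ms p95 and ~167 RPS, which meets p95 <= 120 ms and 100 RPS on one instance.
```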
Example 2: Multi-camera video
6 cameras × 25 FPS = 150 FPS input. Model compute: 7 ms at batch 8 (~1.14k FPS raw if fully saturated), 11 ms at batch 16 (diminishing returns due to memory). Preprocess 2 ms per frame on CPU.
- To protect tail latency p95 < 150 ms, cap per-camera queue to 1 and use dynamic micro-batching with max batch 8, timeout 5 ms.
- If spikes cause queue growth, set stride = 2 (process ~12.5 FPS per camera) and drop oldest frames.
Result: stable p95 well under 150 ms with graceful quality trade-off during bursts.
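One way to express the per-camera queue cap and drop-oldest policy is a bounded buffer per camera. This is a sketch, assuming one capture thread per camera and a separate inference thread; `on_frame` and `collect_batch` are illustrative names, not a specific library's API.

```python
from collections import deque

# One bounded buffer per camera: maxlen=1 means a newer frame silently evicts the stale one,
# which is the "drop oldest" policy that protects tail latency during bursts.
camera_buffers = {cam_id: deque(maxlen=1) for cam_id in range(6)}

def on_frame(cam_id, frame):
    # Called from a capture thread; deque.append is atomic in CPython, so it never blocks
    # and never lets old frames pile up behind new ones.
    camera_buffers[cam_id].append(frame)

def collect_batch(max_batch=8):
    """Inference thread: grab at most one fresh frame per camera, up to max_batch."""
    batch, ids = [], []
    for cam_id, buf in camera_buffers.items():
        if buf and len(batch) < max_batch:
            batch.append(buf.popleft())
            ids.append(cam_id)
    return ids, batch   # run the model on `batch`, route results back by camera id
```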
Example 3: Edge CPU-only
Object detection on CPU needs p95 ≤ 200 ms at ~10 FPS. Quantize to INT8 and use a vectorized runtime. Pre/post is 8 ms, model compute 12 ms INT8. With small batch 2 and a 5 ms timeout, latency ≈ 5 + 8 + 12 = 25 ms; throughput ~80 FPS theoretical, but you cap at 10 FPS to save power and allow other tasks.
Example 4: Capacity check via Little's Law
Service time ~40 ms per request, two parallel instances. Max RPS (no queueing) ≈ instances / service_time = 2 / 0.040 = 50 RPS. If your p95 budget is 200 ms, average concurrency budget ≈ RPS × latency ≈ 50 × 0.200 = 10 requests. Ensure queues and concurrency settings keep in-flight requests at roughly 10 or fewer.
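The same back-of-envelope check as a small helper you can reuse per service (numbers mirror this example):

```python
def capacity_check(service_time_s, instances, p95_budget_s):
    """Little's Law intuition: concurrency ~ throughput x latency."""
    max_rps = instances / service_time_s        # throughput ceiling with no queueing
    inflight_budget = max_rps * p95_budget_s    # average in-flight requests you can afford
    return max_rps, inflight_budget

max_rps, inflight = capacity_check(service_time_s=0.040, instances=2, p95_budget_s=0.200)
print(max_rps, inflight)   # ~50 RPS, ~10 in-flight requests
```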
Techniques you will use
- Model: FP16/INT8, pruning, distillation, kernel fusion, static shapes, compiled runtimes (a timing sketch follows this list).
- GPU: appropriate batch size, multiple streams, a few instances per GPU, avoid memory overcommit.
- I/O: hardware-accelerated decode, zero-copy, binary wire formats to reduce serialization overhead.
- Queues: micro-batching timeouts, per-source queue caps, drop/skip policies, priority lanes.
- Autoscaling: scale-out at sustained high utilization, pre-warm instances.
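As one concrete instance of the precision and warm-up points above, here is a hedged sketch assuming PyTorch, a recent torchvision, and a CUDA GPU. The ResNet-50 is only a stand-in for your own model, and INT8 normally goes through your runtime's quantization or calibration tooling rather than a simple cast.

```python
import time
import torch
import torchvision

# Illustrative model: replace with your own detector/classifier.
model = torchvision.models.resnet50(weights=None).eval().cuda().half()  # cast weights to FP16

x = torch.randn(4, 3, 224, 224, device="cuda", dtype=torch.float16)     # static shape, batch 4

with torch.inference_mode():
    for _ in range(10):                      # warm-up: kernel selection, allocator, caches
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()                 # CUDA launches are async; sync before stopping the clock
    print((time.perf_counter() - start) / 100 * 1000, "ms per batch of 4 in FP16")
```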
Common mistakes and self-check
- Mistake: Chasing average latency. Fix: monitor p95/p99; they drive user experience.
- Mistake: Unlimited queues. Fix: set queue limits; drop or degrade gracefully.
- Mistake: Oversized batch. Fix: sweep batch sizes; stop when tail latency grows faster than throughput.
- Mistake: CPU-bound preprocessing. Fix: profile; move hot paths to GPU or vectorized CPU ops.
- Mistake: One huge instance per GPU. Fix: try 2–4 streams/instances; keep GPU busy without thrashing memory.
Self-check prompts
- Can you state your end-to-end p95 budget and where time is spent?
- Do you know the best batch size under your real traffic pattern?
- What is your max in-flight requests before tail latency degrades?
Practical projects
- Build a micro-batched inference server for an image classifier. Compare p95 at batch sizes 1, 2, 4, 8 under synthetic load; a measurement harness is sketched after this list.
- Create a multi-stream video pipeline with per-camera queue caps and frame dropping. Measure how tail latency behaves during bursts.
- Convert a detection model to FP16 and INT8 and benchmark throughput vs latency vs accuracy changes.
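For the first project, a small harness you could start from to measure latency percentiles under synthetic load. `infer` stands in for a call to your server; a real load test should also run concurrent clients so batching and queueing actually get exercised.

```python
import random
import statistics
import time

def measure_percentiles(infer, n_requests=1000):
    """Send n_requests sequentially and report p50/p95/p99 latency in milliseconds."""
    samples = []
    for _ in range(n_requests):
        t0 = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - t0) * 1000)
    qs = statistics.quantiles(samples, n=100)   # qs[49] ~ p50, qs[94] ~ p95, qs[98] ~ p99
    return qs[49], qs[94], qs[98]

# Stand-in workload: replace the lambda with a real request to your batched server.
p50, p95, p99 = measure_percentiles(lambda: time.sleep(random.uniform(0.01, 0.03)))
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```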
Exercises
Try these now. Solutions are included in the exercise cards below. Use this quick checklist as you work:
- Write down your latency budget and throughput goal.
- Estimate per-stage times (pre/post, network, model, queue).
- Pick batch size and timeout; estimate latency and throughput.
- Decide on instances/streams; verify memory fit.
- Set queue caps and a drop/degrade policy.
Who this is for
- Engineers deploying CV models to production services, edge devices, or video pipelines.
- ML engineers moving from research to real-time inference systems.
Prerequisites
- Basic understanding of neural network inference and CV models.
- Comfort with Python or C++ for model serving, and familiarity with GPU/CPU performance basics.
Learning path
- Profiling and bottleneck identification.
- Model optimization (precision, pruning, compilation).
- Batching and micro-batching strategies.
- Concurrency and queue design.
- Autoscaling and resilience (warmup, load shedding).
Next steps
- Apply these steps to one of your existing services and record p50/p95 before and after.
- Automate a sweep over batch sizes, timeouts, and instance counts; keep the best config per hardware type.
Mini challenge
You have a segmentation model with b1=25 ms, b2=27 ms, b4=34 ms, b8=48 ms (compute only). Pre/post/network = 10 ms combined. Target: p95 β€ 140 ms at 120 RPS on one GPU. Propose batch size, timeout, and instance count. Justify with estimates and a short note on queue caps.
Ready to test?
Everyone can take the quick test below.