Who this is for
- Computer Vision Engineers deciding how to serve models in production.
- ML practitioners weighing cost, latency, and throughput trade-offs.
- Engineers moving from prototypes to reliable, scalable inference systems.
Prerequisites
- Basic understanding of CV model types (classification, detection, segmentation).
- Familiarity with model formats (ONNX, TorchScript) and inference runtimes (CPU/GPU).
- Basic knowledge of containers and APIs.
Why this matters
Choosing batch vs. real-time inference changes cost, user experience, and operational complexity. As a Computer Vision Engineer, you will:
- Design pipelines for CCTV analytics (live alerting vs. nightly reporting).
- Serve models behind APIs for apps that need sub-second responses.
- Plan GPU usage, scheduling, and autoscaling to control costs.
- Meet service-level objectives (SLOs) such as p95 latency and throughput.
Concept explained simply
Definitions
- Batch inference: process many items together on a schedule or in large chunks. Optimized for throughput and cost.
- Real-time (online) inference: process each request as it arrives with strict latency targets. Optimized for responsiveness.
Mental model
Imagine parcels at a warehouse. Batch = fill a truck, then ship all at once (cheap per package, slower delivery). Real-time = motorcycle courier for each parcel (fast delivery, higher cost per package). Hybrid = motorcycles for urgent parcels, truck for the rest.
Key metrics
- Latency: p50/p95/p99 response time per request.
- Throughput: queries per second (QPS) or frames per second (FPS) the system can sustain.
- Freshness: how soon after an input arrives (or the data changes) the prediction must be available.
- Cost efficiency: GPU/CPU utilization, concurrency, batching efficiency.
- Reliability: error rate, timeouts, retry behavior.
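For illustration, a minimal sketch of computing these latency percentiles from recorded request times; the sample values are made up and NumPy is assumed to be available:

```python
import numpy as np

# Hypothetical per-request latencies in milliseconds collected during a test run.
latencies_ms = np.array([18, 22, 25, 31, 40, 55, 90, 120, 23, 27])

# Tail percentiles matter more than the mean for user-facing SLOs.
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```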
Decision framework
Quick decision checklist
- If the user-facing latency target is under 300 ms at p95, favor real-time.
- If results can wait minutes to hours and cost matters, favor batch.
- If some events are urgent and most are not, design a hybrid (streaming path + batch).
- If GPU utilization is low under real-time serving, consider dynamic batching or micro-batching.
- If inputs arrive in bursts, consider a queue + autoscaling + backpressure.
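The checklist can also be encoded as a rough triage helper; the thresholds below simply mirror the bullets above and are illustrative, not hard rules:

```python
def choose_mode(p95_target_ms: float, tolerable_delay_s: float, urgent_fraction: float) -> str:
    """Rough mode suggestion based on the checklist above; thresholds are illustrative."""
    if 0 < urgent_fraction < 0.2:
        return "hybrid"       # a few urgent events: streaming path + batch for the rest
    if p95_target_ms <= 300:
        return "real-time"    # user-facing, sub-300 ms p95
    if tolerable_delay_s >= 60:
        return "batch"        # results can wait minutes to hours
    return "real-time"

print(choose_mode(p95_target_ms=200, tolerable_delay_s=0.2, urgent_fraction=0.0))  # real-time
```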
Real-time patterns to hit latency SLOs
- Model optimizations: quantization, TensorRT/ONNX Runtime optimizations, smaller backbones.
- Serve on GPU with concurrency and dynamic batching (small batch sizes: 1–8).
- Warm pools to avoid cold starts; keep models loaded.
- Edge inference for camera streams to reduce network latency.
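As a sketch of the "keep the model loaded" pattern, assuming a model already exported to ONNX and the onnxruntime package; the file name and input layout are placeholders:

```python
import numpy as np
import onnxruntime as ort

# Load once at startup and keep the session warm across requests (avoids cold starts).
# "model.onnx" is a placeholder path; onnxruntime falls back to CPU if CUDA is unavailable.
session = ort.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
input_name = session.get_inputs()[0].name

def infer_frame(frame: np.ndarray) -> np.ndarray:
    """Single-frame (batch size 1) inference on an already-preprocessed float32 frame."""
    outputs = session.run(None, {input_name: frame[None, ...]})  # add batch dimension
    return outputs[0]
```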
Batch patterns for cost and scale
- Process large object stores of images/videos on a schedule or event trigger.
- Use large batch sizes to maximize GPU throughput.
- Run on spot/preemptible instances if re-runs are acceptable.
- Checkpoint progress and write idempotent outputs.
Worked examples
1) Retail shelf audit (Batch)
Context: Stores upload photos nightly. The business needs next-morning stock-out reports.
- Mode: Batch. Freshness target: next morning (not instant).
- Pipeline: object store ingest → distributed batch job → GPU workers (batch size 32–128) → results table → dashboard.
- Why: Maximize GPU utilization, lower cost per image, no strict latency per image.
2) Driver monitoring alerts (Real-time)
Context: In-cabin camera detects drowsiness to alert the driver within 200 ms.
- Mode: Real-time. Latency target p95 ≤ 200 ms.
- Pipeline: camera frames → edge GPU model (batch size 1–4) → immediate alert.
- Why: Safety-critical, cannot wait for batch. Optimize model and keep it warm.
3) Content moderation (Hybrid)
Context: A platform scans uploads. Urgent cases need fast blocking; full review can take minutes.
- Mode: Hybrid. Real-time lightweight classifier for immediate block/allow; batch heavy model for deeper review.
- Pipeline: upload event → real-time API (fast model) → provisional decision; queued asset → batch job (heavy model) → final label.
- Why: Balances user experience with cost and accuracy.
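A minimal sketch of the hybrid routing in this example, with a stand-in for the lightweight online scorer and a local queue standing in for the real message queue feeding the heavy batch job:

```python
import queue
import random

# Placeholders: in practice these are a small online classifier and a durable message queue.
review_queue: queue.Queue = queue.Queue()

def fast_model_score(image) -> float:
    """Stand-in for the lightweight classifier served online."""
    return random.random()

def moderate_upload(asset_id: str, image) -> str:
    """Return a provisional decision immediately; queue the asset for heavier batch review."""
    decision = "block" if fast_model_score(image) > 0.9 else "allow"
    review_queue.put(asset_id)   # the batch job picks this up and writes the final label
    return decision
```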
Implementation patterns
Batch pipeline steps
- Collect items into a queue or object store.
- Launch distributed workers with large batch sizes.
- Write outputs with idempotent keys and checkpoints.
- Emit aggregate metrics and failures for retries.
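A minimal sketch of such a batch loop with idempotent outputs that double as checkpoints; the on-disk layout and the `run_model` callable are assumptions for illustration:

```python
import json
from pathlib import Path

RESULTS_DIR = Path("results")            # stands in for a results table or object store
RESULTS_DIR.mkdir(exist_ok=True)

def already_done(image_id: str) -> bool:
    """Idempotency check: skip items whose output already exists (safe after preemption)."""
    return (RESULTS_DIR / f"{image_id}.json").exists()

def process_batch(image_ids: list[str], run_model) -> None:
    """Run one large batch and write one output file per input key."""
    pending = [i for i in image_ids if not already_done(i)]
    if not pending:
        return
    predictions = run_model(pending)      # placeholder: batched GPU inference, e.g. batch size 64
    for image_id, pred in zip(pending, predictions):
        out = RESULTS_DIR / f"{image_id}.json"
        out.write_text(json.dumps({"image_id": image_id, "prediction": pred}))  # acts as a checkpoint
```

Each worker can call `process_batch` on its own shard of the image list; because outputs are written per key, a preempted spot instance can simply be restarted and will skip completed items.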
Real-time pipeline steps
- Expose a low-latency API/stream consumer.
- Keep the model loaded (warm) and use small dynamic batches.
- Scale horizontally by adding concurrent model instances.
- Monitor p95/p99 latency and error rates; autoscale on QPS/backlog.
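A minimal real-time endpoint along these lines, sketched with FastAPI and a stand-in model; the payload format and model are placeholders:

```python
import time
import numpy as np
from fastapi import FastAPI

app = FastAPI()

def dummy_model(batch: np.ndarray) -> np.ndarray:
    """Stand-in for a real detector/classifier; replace with your warm ONNX/TorchScript model."""
    return batch.mean(axis=(1, 2, 3))

MODEL = dummy_model   # assigned once at process start so every request hits a warm model

@app.post("/predict")
def predict(payload: dict) -> dict:
    start = time.perf_counter()
    # "pixels" is assumed to be an H x W x C nested list; real services would send encoded images.
    frame = np.asarray(payload["pixels"], dtype=np.float32)[None, ...]   # batch of one
    score = float(MODEL(frame)[0])
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {"score": score, "latency_ms": latency_ms}   # feed latency into your metrics system
```

Assuming the file is saved as app.py, it can be run with `uvicorn app:app`; horizontal scaling then means running more replicas behind a load balancer.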
Hybrid pattern ideas
- Micro-batching: accumulate requests for 5–50 ms to batch 4–16 items.
- Two-tier models: fast filter online, accurate model offline.
- Backfill: real-time first; batch recomputes with improved models.
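A sketch of the micro-batching idea: a worker drains a shared queue for up to a small time window, then runs one batched forward pass. The 25 ms window and batch cap of 16 are illustrative, and `run_model` is a placeholder for your batched inference call:

```python
import queue
import time

requests_q: queue.Queue = queue.Queue()   # each item is (input_array, result_holder_dict)

def microbatch_worker(run_model, window_ms: float = 25.0, max_batch: int = 16) -> None:
    """Accumulate requests for up to window_ms, then run one batched forward pass."""
    while True:
        batch = [requests_q.get()]                        # block until the first request arrives
        deadline = time.perf_counter() + window_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.perf_counter()
            if remaining <= 0:
                break
            try:
                batch.append(requests_q.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = run_model([item[0] for item in batch])  # placeholder: one batched GPU call
        for (_, holder), out in zip(batch, outputs):
            holder["result"] = out                        # hand the result back to the caller
```

In a real server this worker runs on its own thread; each request handler puts an `(input, holder)` pair on the queue and waits on the holder (or a `Future`) until the result appears.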
Monitoring and testing
- Load testing: estimate max QPS/FPS, observe p95 latency and GPU utilization.
- Canary releases: route a small percent to new model, compare metrics.
- Data drift checks: sample inputs, compare distributions and accuracy.
- Retries and timeouts: set safe timeouts; retry only idempotent operations.
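A rough load-test sketch along these lines, assuming an HTTP /predict endpoint like the one sketched earlier; the URL, payload, request count, and concurrency level are placeholders:

```python
import concurrent.futures
import time
import numpy as np
import requests

URL = "http://localhost:8000/predict"          # placeholder endpoint
PAYLOAD = {"pixels": [[[0.0, 0.0, 0.0]]]}      # placeholder 1x1x3 image

def one_request(_: int) -> float:
    """Send one request and return its latency in milliseconds."""
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=5)
    return (time.perf_counter() - start) * 1000.0

t0 = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    latencies = list(pool.map(one_request, range(500)))
wall_s = time.perf_counter() - t0

print(f"p95 = {np.percentile(latencies, 95):.1f} ms, "
      f"throughput ≈ {len(latencies) / wall_s:.1f} req/s")
```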
Exercises
Exercise 1: Classify scenarios
Decide batch vs. real-time (or hybrid) and justify using latency, freshness, and cost.
- A) Factory defect detection on a conveyor that triggers an ejector.
- B) Weekly wildlife camera traps to count species.
- C) Social app avatar nudity filter at upload time with final review later.
Hints
- What happens if you wait minutes?
- Is there a user in the loop waiting?
- Could a two-tier approach save cost?
Expected output format
For each scenario: chosen mode + one-sentence justification.
Exercise 2: Capacity and batching
You serve a detection model on a GPU. Single-image latency is 20 ms. Dynamic batching adds 5 ms of overhead per batch. Assume near-linear throughput scaling with batch size up to 8.
- 1) Choose a batch size to keep p95 under 120 ms.
- 2) Estimate images/second for that batch size.
- 3) If incoming rate is 200 img/s with bursty traffic, name two controls to protect latency.
Hints
- Per-batch latency ≈ overhead + max(single-inference times within batch).
- Throughput ≈ batch_size / batch_latency.
- Think queue depth and autoscaling.
Expected output format
Chosen batch size, p95 estimate, images/s estimate, and two controls.
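If you want to sanity-check your Exercise 2 numbers, the hint formulas translate into a few lines; the 20 ms and 5 ms figures come from the exercise statement, and the batch sizes printed are just examples:

```python
SINGLE_MS = 20.0    # single-image latency from the exercise
OVERHEAD_MS = 5.0   # dynamic-batching overhead per batch

def batch_latency_ms() -> float:
    # Under near-linear speedup up to batch size 8, one batch takes roughly
    # the single-image time plus the batching overhead.
    return OVERHEAD_MS + SINGLE_MS

def throughput_ips(batch_size: int) -> float:
    """Images per second ≈ batch_size / batch latency (in seconds)."""
    return batch_size / (batch_latency_ms() / 1000.0)

for b in (1, 4, 8):
    print(f"batch={b}: latency≈{batch_latency_ms():.0f} ms, throughput≈{throughput_ips(b):.0f} img/s")
```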
Self-check checklist
- You tied the choice to explicit latency/freshness targets.
- You considered GPU utilization and batching efficiency.
- You mentioned autoscaling or backpressure for bursty loads.
Common mistakes and how to self-check
- Ignoring p95/p99: Optimizing average latency while missing tail behavior. Self-check: track p95 under realistic load.
- No backpressure: Queues grow unbounded. Self-check: set a maximum queue size and shed load gracefully (see the sketch after this list).
- Over-sized batches: Great throughput, but latency SLOs are violated. Self-check: cap batch size by the latency budget.
- Cold starts: On-demand containers cause latency spikes. Self-check: keep warm instances during peak hours.
- One-size-fits-all: Forcing real-time when batch is cheaper. Self-check: ask “What breaks if results arrive later?”
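For the backpressure point in particular, a bounded queue with graceful load shedding can be this small; the queue size of 100 is illustrative:

```python
import queue

pending: queue.Queue = queue.Queue(maxsize=100)   # cap queue depth instead of growing unbounded

def submit(request) -> bool:
    """Try to enqueue a request; shed load when the queue is full."""
    try:
        pending.put_nowait(request)
        return True
    except queue.Full:
        return False    # caller should reply with a fast, explicit rejection rather than time out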
Practical projects
Project 1: Batch image warehouse
- Ingest 10k images from storage and run batch inference with checkpoints.
- Measure throughput vs. batch size (8, 16, 32, 64) and report cost/time trade-offs.
Project 2: Real-time micro-batched API
- Build a small API that collects incoming requests for up to 25 ms before running a batched inference.
- Compare p95 latency and throughput for batch size 1 vs. the micro-batched version.
Learning path
- Clarify SLOs: latency targets, freshness, error budget.
- Prototype baseline inference and measure single-item latency.
- Introduce batching or micro-batching and re-measure.
- Add autoscaling and backpressure for bursts.
- Set up monitoring for p95/p99, throughput, and errors.
Next steps
- Pick one of the practical projects and implement a minimal version.
- Run a 30-minute load test and record p95, throughput, and GPU utilization.
- Take the quick test to confirm the key trade-offs.
Mini challenge
A traffic management team wants incident detection from city cameras. Alerts should fire within 2 seconds, but detailed incident summaries can arrive within 10 minutes. Propose an architecture (mode choice, batching strategy, and two monitoring metrics).