What this skill covers for Computer Vision Engineers
Deployment and model serving turns your trained vision models into reliable, responsive products. You will design inference APIs, choose between batch and real-time serving, hit latency/throughput targets, optimize with quantization and TensorRT, deploy to edge devices, and release safely with versioning, canarying, and rollbacks.
- Unlocks: production-grade image/video inference services, on-device AI features, A/B testing new model versions, and stable releases.
- You will learn to measure and meet p50/p95 latency, scale for throughput, and recover fast when a new model underperforms.
Who this is for
- Computer Vision Engineers moving from prototyping to production.
- ML Engineers who need GPU/edge-optimized inference.
- Backend engineers integrating CV models into services.
Prerequisites
- Comfort with Python and a CV framework (PyTorch or TensorFlow).
- Basic model export (TorchScript or ONNX).
- Familiarity with REST/gRPC and containers is helpful.
- GPU basics (CUDA concepts) for acceleration sections.
Learning path
- Model packaging (1–2 days): export to TorchScript/ONNX; write deterministic preprocessing; add health checks.
- Inference API design (2–3 days): REST/gRPC endpoints, request/response schema, error handling, timeouts, idempotency.
- Batch vs real-time (1–2 days): pick modes, micro-batching, and queues.
- Latency & throughput (3–4 days): p50/p95 measurement, concurrency, batching, warmup, caching.
- Quantization (2–3 days): post-training INT8, accuracy checks, fallbacks.
- TensorRT basics (3–5 days): build engines from ONNX, FP16/INT8, per-layer tactics.
- Edge deployment (3–4 days): small binaries, offline mode, thermal limits, provider choices.
- Versioning & safe releases (2–3 days): semantic versions, canaries, rollbacks, metrics & alarms.
Milestone checklist
- Model exported with reproducible preprocessing
- Health, readiness, and metrics endpoints implemented
- Meets agreed p95 latency SLO with a defined batch size
- Quantized or TensorRT-accelerated variant tested
- Edge-friendly build produced (if relevant)
- Versioned release with canary and rollback plan
Worked examples
1) FastAPI image inference microservice (CPU or GPU)
# fastapi_infer.py
from fastapi import FastAPI, UploadFile, File
from pydantic import BaseModel
from PIL import Image
import io, torch
import torchvision.transforms as T
from torchvision.models import resnet18, ResNet18_Weights
app = FastAPI()
class PredictOut(BaseModel):
    top_class: int
    confidence: float
    p95_latency_ms: float | None = None
# Global objects: created once per process and reused across requests
device = "cuda" if torch.cuda.is_available() else "cpu"
model = None
transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
@app.on_event("startup")
async def load_model():
    global model
    # Pretrained weights keep the demo functional; swap in your own checkpoint via load_state_dict
    model = resnet18(weights=ResNet18_Weights.DEFAULT)
    model.eval().to(device)
    # Optional: torch.compile(model) on supported PyTorch versions
    # Optional: torch.jit.script(model)
@app.get("/health")
async def health():
    return {"status": "ok"}
@app.post("/predict", response_model=PredictOut)
async def predict(file: UploadFile = File(...)):
    img_bytes = await file.read()
    img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    x = transform(img).unsqueeze(0).to(device)
    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=1)
        conf, cls = torch.max(probs, dim=1)
    return {"top_class": int(cls.item()), "confidence": float(conf.item()), "p95_latency_ms": None}
# Run (example): uvicorn fastapi_infer:app --host 0.0.0.0 --port 8080 --workers 2
- Notes: add timeouts, request limits, and image size caps; preload model on startup; use multiple workers for CPU-bound work.
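A minimal sketch of the size-cap idea from the notes above (the 5 MB limit and the helper name are illustrative; call it from the /predict handler in place of the bare file.read()):
# upload_guard.py (sketch)
from fastapi import HTTPException, UploadFile
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # illustrative cap; set from your API contract
async def read_limited(file: UploadFile, max_bytes: int = MAX_UPLOAD_BYTES) -> bytes:
    # Reject oversized uploads with 413 before decoding the image
    data = await file.read()
    if len(data) > max_bytes:
        raise HTTPException(status_code=413, detail="image too large")
    return data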
2) Batch inference job with micro-batching
# batch_infer.py
import os, time, torch
from PIL import Image
import torchvision.transforms as T
from torchvision.models import resnet18
BATCH = 16
IN_DIR = "./images"
# weights=None (random weights) is fine for a throughput test; load real weights for real predictions
model = resnet18(weights=None).eval()
# For real inference, use the same Resize/Crop/Normalize pipeline as training
transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
files = [f for f in os.listdir(IN_DIR) if f.lower().endswith((".jpg", ".png"))]
start = time.perf_counter()
outputs = []
with torch.no_grad():
    for i in range(0, len(files), BATCH):
        chunk = files[i:i+BATCH]
        imgs = []
        for f in chunk:
            img = Image.open(os.path.join(IN_DIR, f)).convert("RGB")
            imgs.append(transform(img))
        x = torch.stack(imgs)
        logits = model(x)
        preds = logits.argmax(dim=1).tolist()
        outputs.extend(list(zip(chunk, preds)))
t = (time.perf_counter() - start) * 1000
print(f"Processed {len(files)} images in {t:.1f} ms; avg per image {(t/len(files)):.1f} ms")
- Use larger batches until GPU is saturated and p95 latency stays within SLO.
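To find that saturation point, a quick sweep over candidate batch sizes with synthetic inputs is often enough (a sketch assuming the model above; sizes and run counts are illustrative):
# batch_sweep.py (sketch)
import time, torch
def sweep_batch_sizes(model, sizes=(1, 4, 8, 16, 32), shape=(3, 224, 224), runs=20):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.eval().to(device)
    results = {}
    with torch.no_grad():
        for b in sizes:
            x = torch.randn(b, *shape, device=device)
            model(x)  # warmup for this batch size
            if device == "cuda":
                torch.cuda.synchronize()
            t0 = time.perf_counter()
            for _ in range(runs):
                model(x)
            if device == "cuda":
                torch.cuda.synchronize()
            dt = time.perf_counter() - t0
            results[b] = {"ms_per_batch": 1000 * dt / runs, "imgs_per_s": b * runs / dt}
    return results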
3) Measuring latency distribution (p50/p95)
import time, statistics as stats
# Given a sync predict(img) function
def bench(predict, images, warmup=5, runs=50):
    # Warmup so cold-start costs (lazy init, caches) do not skew the timed runs
    for i in range(min(warmup, len(images))):
        predict(images[i])
    # Timed runs
    lats = []
    for i in range(runs):
        img = images[i % len(images)]
        t0 = time.perf_counter()
        predict(img)
        lats.append((time.perf_counter() - t0) * 1000)
    lats_sorted = sorted(lats)
    def pct(p):
        # Nearest-rank percentile over the sorted latencies
        k = int(round((p / 100.0) * (len(lats_sorted) - 1)))
        return lats_sorted[k]
    return {"p50_ms": pct(50), "p95_ms": pct(95), "avg_ms": stats.mean(lats)}
- Always report distribution (p50/p95), not just average. Track separately for cold vs warm paths.
4) Post-training quantization (ONNX Runtime INT8)
# onnx_int8_quant.py
# Requires an ONNX model path and calibration data
from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantType
class ImageData(CalibrationDataReader):
    # Feeds calibration batches (numpy arrays) to the quantizer one at a time
    def __init__(self, input_name, samples):
        self.input_name = input_name
        self.samples = iter(samples)
    def get_next(self):
        try:
            return {self.input_name: next(self.samples)}
        except StopIteration:
            return None
onnx_model = "model.onnx"
onnx_model_q = "model_int8.onnx"
# samples: iterable of numpy arrays shaped as model input
calib_reader = ImageData(input_name="input", samples=...)
quantize_static(
    model_input=onnx_model,
    model_output=onnx_model_q,
    calibration_data_reader=calib_reader,
    weight_type=QuantType.QInt8,
)
# Validate accuracy on a holdout set and compare to FP32.
- Measure accuracy drop vs FP32. If drop is high, try per-channel quantization or keep sensitive layers in FP16/FP32.
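If accuracy does drop, quantize_static exposes knobs for exactly this; a sketch mirroring the call above (use a fresh calibration reader, since an exhausted one yields no samples, and note the excluded node names are placeholders: inspect your own graph first):
# Retry with per-channel weight quantization and sensitive nodes left un-quantized
from onnxruntime.quantization import quantize_static, QuantType
quantize_static(
    model_input="model.onnx",
    model_output="model_int8_pc.onnx",
    calibration_data_reader=calib_reader,       # construct a fresh reader for this run
    weight_type=QuantType.QInt8,
    per_channel=True,                           # per-channel scales for conv/linear weights
    nodes_to_exclude=["Conv_0", "Gemm_42"],     # placeholder names for sensitive layers
)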
5) TensorRT engine from ONNX (FP16)
# tensorrt_build.py
# Minimal TensorRT build script (API as in TensorRT 8.x and later)
import tensorrt as trt
onnx_path = "model.onnx"
engine_path = "model_fp16.plan"
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# Explicit batch is required for ONNX parsing (and is the default in newer TensorRT releases)
network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(network_flags)
parser = trt.OnnxParser(network, logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
with open(onnx_path, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")
# Dynamic-shape profile: min / optimal / max input shapes
profile = builder.create_optimization_profile()
input_name = network.get_input(0).name
profile.set_shape(input_name, (1, 3, 224, 224), (8, 3, 224, 224), (16, 3, 224, 224))
config.add_optimization_profile(profile)
# build_serialized_network replaces the deprecated build_engine (TensorRT >= 8)
serialized_engine = builder.build_serialized_network(network, config)
assert serialized_engine is not None, "engine build failed"
with open(engine_path, "wb") as f:
    f.write(serialized_engine)
print("Saved:", engine_path)
- Benchmark FP32 vs FP16 vs INT8. Validate outputs against a reference within a tolerance.
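A sketch of such a tolerance check on numpy outputs from any two runners (thresholds are illustrative; tighten or loosen them per model):
# compare_outputs.py (sketch)
import numpy as np
def compare_outputs(ref, test, rtol=1e-2, atol=1e-3):
    # ref/test: arrays of logits from the FP32 reference and the optimized engine
    max_abs_diff = float(np.max(np.abs(ref - test)))
    top1_agree = float(np.mean(ref.argmax(axis=1) == test.argmax(axis=1)))
    return {
        "allclose": bool(np.allclose(ref, test, rtol=rtol, atol=atol)),
        "max_abs_diff": max_abs_diff,
        "top1_agreement": top1_agree,
    }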
6) Simple canary router (90% v1, 10% v2)
# canary_router.py
import random, requests
SERVERS = {
    "v1": {"url": "http://inference-v1:8080/predict", "weight": 0.9},
    "v2": {"url": "http://inference-v2:8080/predict", "weight": 0.1},
}
def choose_version():
    # Weighted random pick; weights should sum to 1.0
    r = random.random()
    acc = 0.0
    for k, s in SERVERS.items():
        acc += s["weight"]
        if r <= acc:
            return k
    return "v2"
def route_request(image_bytes):
    k = choose_version()
    url = SERVERS[k]["url"]
    resp = requests.post(url, files={"file": ("img.jpg", image_bytes)}, timeout=5)
    return k, resp.json()
# Track per-version metrics (latency, error rate) and be ready to shift weight or rollback.
- Expose a manual override to 0%/100% for emergency rollback.
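One way to wire that override into the router above (a sketch; the environment variable name is an assumption):
# Emergency override: route 100% of traffic to a single version
import os
def choose_version_with_override():
    forced = os.getenv("FORCE_VERSION")  # e.g. FORCE_VERSION=v1 during a rollback
    if forced in SERVERS:
        return forced
    return choose_version()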
Drills and exercises
- Define a request schema for image inference: content limits, timeouts, and error messages.
- Export your model to ONNX and verify output parity vs native framework on 100 images.
- Benchmark p50/p95 for batch sizes 1, 4, 8, 16; chart latency vs throughput.
- Quantize to INT8 and report accuracy delta vs FP32; set an acceptable threshold.
- Build an FP16 TensorRT engine; compare throughput to FP32.
- Create health/readiness endpoints and simulate a failing dependency (return 503); see the readiness sketch after this list.
- Implement a canary release plan: weights, rollback criteria, and monitoring signals.
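For the readiness drill above, a minimal sketch building on the FastAPI service from worked example 1 (the only dependency checked here is the model global; add probes for your own downstream dependencies):
# readiness.py (sketch): add to the FastAPI app from fastapi_infer.py
from fastapi import Response
@app.get("/ready")
async def ready(response: Response):
    deps_ok = model is not None  # extend with checks for GPUs, caches, downstream services
    if not deps_ok:
        response.status_code = 503
        return {"status": "not ready"}
    return {"status": "ready"}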
Common mistakes and debugging tips
Mistake: Measuring only average latency
Always track distributions (p50, p95, p99). Spikes often hide in tails. Use warmup and separate cold-start metrics.
Mistake: Mismatched preprocessing
Keep the exact same resize, crop, normalization at train and serve time. Version your preprocessing alongside the model.
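A lightweight way to do that is to ship a small preprocessing manifest with every model artifact (a sketch; the file name, fields, and values are illustrative):
# Save alongside the model weights and load it in the serving code
import json
preprocess_cfg = {
    "resize": 256, "center_crop": 224,
    "mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225],
    "model_version": "1.4.0",
}
with open("preprocess.json", "w") as f:
    json.dump(preprocess_cfg, f, indent=2)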
Mistake: Over-quantizing sensitive layers
Some layers are especially sensitive to quantization (typically the first and last layers, attention blocks, and layers with small activation ranges). Use mixed precision: keep those layers in FP16/FP32 and quantize the rest to INT8.
Mistake: No safe rollback
Without instant rollback you risk long outages. Keep N-1 hot, maintain versioned configs, and a manual 100% switch.
Mistake: Ignoring memory and thermal limits on edge
Measure peak RAM/VRAM and device temperature under load. Add throttles, lower resolution, or smaller models.
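For the memory side, PyTorch's CUDA statistics give a quick peak-usage number (a sketch; for temperature, sample nvidia-smi, or tegrastats on Jetson-class devices, while the load test runs):
# peak_vram.py (sketch)
import torch
def peak_vram_mb(model, batch, device="cuda"):
    # Assumes model is already on `device`; returns peak allocated MB for one forward pass
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        model(batch.to(device))
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / (1024 ** 2)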
Debugging slow inference
- Profile per-stage (decode, preprocess, model, postprocess).
- Check batch size vs GPU utilization.
- Use FP16/INT8, let TensorRT select fast kernels (tactics), pin host memory for transfers, and avoid CPU-GPU ping-pong by keeping tensors on the device between stages.
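A coarse per-stage timer is often enough to locate the bottleneck (a sketch; the stage names and the calls in the usage comments are placeholders for your own pipeline):
# stage_timer.py (sketch)
import time
from contextlib import contextmanager
timings = {}
@contextmanager
def stage(name):
    # Accumulates wall-clock milliseconds per named stage
    t0 = time.perf_counter()
    yield
    timings[name] = timings.get(name, 0.0) + (time.perf_counter() - t0) * 1000
# Usage inside a request handler:
# with stage("decode"):      img = decode(payload)
# with stage("preprocess"):  x = transform(img)
# with stage("model"):       y = model(x)
# with stage("postprocess"): out = to_response(y)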
Mini project: Real-time object classification service with safe rollout
- Pick a small pretrained classifier; export to ONNX (FP32) and build an FP16 TensorRT engine.
- Implement a FastAPI service with /health, /ready, and /predict (image upload).
- Add micro-batching (collect up to 8 requests or wait at most 10 ms, whichever comes first) and measure p50/p95; see the sketch after this list.
- Create an INT8 model; run accuracy checks and adopt if delta is acceptable.
- Implement canary routing between FP16 and INT8 versions (e.g., 90/10) with metrics by version.
- Prepare rollback: a config flag to route 100% to the stable model; test it.
- Edge profile: run on a low-power device or simulate CPU-only; adjust batch/resolution to meet SLO.
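For the micro-batching step, one common pattern is an asyncio queue that flushes on size or deadline, whichever comes first (a sketch; the constants, the queue wiring, and running the blocking forward pass directly on the event loop are all simplifications):
# micro_batcher.py (sketch)
import asyncio, torch
MAX_BATCH = 8
MAX_WAIT_MS = 10
queue: asyncio.Queue = asyncio.Queue()
async def batching_loop(model):
    # Background task: gather tensors until the batch is full or the deadline passes, then run once
    while True:
        x, fut = await queue.get()
        batch, futures = [x], [fut]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                x, fut = await asyncio.wait_for(queue.get(), timeout)
                batch.append(x)
                futures.append(fut)
            except asyncio.TimeoutError:
                break
        with torch.no_grad():
            out = model(torch.stack(batch))  # consider an executor for heavy models
        for f, o in zip(futures, out):
            f.set_result(o)
async def predict_batched(x):
    # Request handlers enqueue one preprocessed tensor and await the pooled result
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut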
Deliverables checklist
- API design doc with request/response schema and limits
- Latency report with charts (p50/p95) across batch sizes and precisions
- Accuracy comparison FP32 vs FP16 vs INT8
- Canary and rollback plan with success/failure criteria
Practical projects
- Retail shelf detector: batch nightly audits from photos; real-time mode for staff scanners.
- Factory defect classifier: FP16 TensorRT on an edge GPU; logs to a central server when online.
- Vehicle camera redaction: blur faces/plates in real time; fall back to CPU with reduced resolution.
Subskills
- Inference API Design: Design REST/gRPC endpoints with clear contracts, health/readiness, timeouts, and structured errors.
- Batch Versus Real Time Inference: Choose the right mode; use micro-batching for throughput while respecting latency SLOs.
- Latency And Throughput Optimization: Profile, tune batch size and concurrency, add warmup, and remove bottlenecks.
- Model Quantization Basics: Apply post-training quantization (INT8) and validate accuracy and speed trade-offs.
- Acceleration With TensorRT Basics: Build and deploy FP16/INT8 engines from ONNX for GPUs.
- Edge Deployment Basics: Optimize for limited compute, memory, and intermittent connectivity.
- Model Versioning And Canary Releases: Version artifacts and traffic-split new models safely.
- Rollback And Safe Releases: Define and test fast rollback paths with clear triggers.
Next steps
- Automate benchmarks and regression checks in CI.
- Add structured logging, tracing IDs, and per-version metrics dashboards.
- Prepare disaster drills: force rollback, simulate degraded accuracy, and verify alerts.
Skill exam
Take the exam below to validate your understanding.