What this skill covers for Computer Vision Engineers
Deployment and model serving turns your trained vision models into reliable, responsive products. You will design inference APIs, choose between batch and real-time serving, hit latency/throughput targets, optimize with quantization and TensorRT, deploy to edge devices, and release safely with versioning, canarying, and rollbacks.
- Unlocks: production-grade image/video inference services, on-device AI features, A/B testing new model versions, and stable releases.
- You will learn to measure and meet p50/p95 latency, scale for throughput, and recover fast when a new model underperforms.
Who this is for
- Computer Vision Engineers moving from prototyping to production.
- ML Engineers who need GPU/edge-optimized inference.
- Backend engineers integrating CV models into services.
Prerequisites
- Comfort with Python and a CV framework (PyTorch or TensorFlow).
- Basic model export (TorchScript or ONNX).
- Familiarity with REST/gRPC and containers is helpful.
- GPU basics (CUDA concepts) for acceleration sections.
Learning path
- Model packaging (1–2 days): export to TorchScript/ONNX; write deterministic preprocessing; add health checks.
- Inference API design (2–3 days): REST/gRPC endpoints, request/response schema, error handling, timeouts, idempotency.
- Batch vs real-time (1–2 days): pick modes, micro-batching, and queues.
- Latency & throughput (3–4 days): p50/p95 measurement, concurrency, batching, warmup, caching.
- Quantization (2–3 days): post-training INT8, accuracy checks, fallbacks.
- TensorRT basics (3–5 days): build engines from ONNX, FP16/INT8, per-layer tactics.
- Edge deployment (3–4 days): small binaries, offline mode, thermal limits, provider choices.
- Versioning & safe releases (2–3 days): semantic versions, canaries, rollbacks, metrics & alarms.
Milestone checklist
- Model exported with reproducible preprocessing
- Health, readiness, and metrics endpoints implemented
- Meets agreed p95 latency SLO with a defined batch size
- Quantized or TensorRT-accelerated variant tested
- Edge-friendly build produced (if relevant)
- Versioned release with canary and rollback plan
Worked examples
1) FastAPI image inference microservice (CPU or GPU)
# fastapi_infer.py
from fastapi import FastAPI, UploadFile, File
from pydantic import BaseModel
from PIL import Image
import io, torch
import torchvision.transforms as T
from torchvision.models import resnet18, ResNet18_Weights
app = FastAPI()
class PredictOut(BaseModel):
    top_class: int
    confidence: float
    p95_latency_ms: float | None = None
# Global objects: created once per process and reused across requests
device = "cuda" if torch.cuda.is_available() else "cpu"
model = None
transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
@app.on_event("startup")
async def load_model():
    global model
    # Pretrained weights keep the demo functional; swap in your own checkpoint via load_state_dict
    model = resnet18(weights=ResNet18_Weights.DEFAULT)
    model.eval().to(device)
    # Optional: torch.compile(model) on supported PyTorch versions
    # Optional: torch.jit.script(model)
@app.get("/health")
async def health():
    return {"status": "ok"}
@app.post("/predict", response_model=PredictOut)
async def predict(file: UploadFile = File(...)):
    img_bytes = await file.read()
    img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    x = transform(img).unsqueeze(0).to(device)
    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=1)
        conf, cls = torch.max(probs, dim=1)
    return {"top_class": int(cls.item()), "confidence": float(conf.item()), "p95_latency_ms": None}
# Run (example): uvicorn fastapi_infer:app --host 0.0.0.0 --port 8080 --workers 2
- Notes: add timeouts, request limits, and image size caps; preload model on startup; use multiple workers for CPU-bound work.
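A minimal sketch of the size-cap idea from the notes above (the 5 MB limit and the helper name are illustrative; call it from the /predict handler in place of the bare file.read()):
# upload_guard.py (sketch)
from fastapi import HTTPException, UploadFile
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # illustrative cap; set from your API contract
async def read_limited(file: UploadFile, max_bytes: int = MAX_UPLOAD_BYTES) -> bytes:
    # Reject oversized uploads with 413 before decoding the image
    data = await file.read()
    if len(data) > max_bytes:
        raise HTTPException(status_code=413, detail="image too large")
    return data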
2) Batch inference job with micro-batching
# batch_infer.py
import os, time, torch
from PIL import Image
import torchvision.transforms as T
from torchvision.models import resnet18
BATCH = 16
IN_DIR = "./images"
# weights=None (random weights) is fine for a throughput test; load real weights for real predictions
model = resnet18(weights=None).eval()
# For real inference, use the same Resize/Crop/Normalize pipeline as training
transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
files = [f for f in os.listdir(IN_DIR) if f.lower().endswith((".jpg", ".png"))]
start = time.perf_counter()
outputs = []
with torch.no_grad():
    for i in range(0, len(files), BATCH):
        chunk = files[i:i+BATCH]
        imgs = []
        for f in chunk:
            img = Image.open(os.path.join(IN_DIR, f)).convert("RGB")
            imgs.append(transform(img))
        x = torch.stack(imgs)
        logits = model(x)
        preds = logits.argmax(dim=1).tolist()
        outputs.extend(list(zip(chunk, preds)))
t = (time.perf_counter() - start) * 1000
print(f"Processed {len(files)} images in {t:.1f} ms; avg per image {(t/len(files)):.1f} ms")
- Use larger batches until GPU is saturated and p95 latency stays within SLO.
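To find that saturation point, a quick sweep over candidate batch sizes with synthetic inputs is often enough (a sketch assuming the model above; sizes and run counts are illustrative):
# batch_sweep.py (sketch)
import time, torch
def sweep_batch_sizes(model, sizes=(1, 4, 8, 16, 32), shape=(3, 224, 224), runs=20):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.eval().to(device)
    results = {}
    with torch.no_grad():
        for b in sizes:
            x = torch.randn(b, *shape, device=device)
            model(x)  # warmup for this batch size
            if device == "cuda":
                torch.cuda.synchronize()
            t0 = time.perf_counter()
            for _ in range(runs):
                model(x)
            if device == "cuda":
                torch.cuda.synchronize()
            dt = time.perf_counter() - t0
            results[b] = {"ms_per_batch": 1000 * dt / runs, "imgs_per_s": b * runs / dt}
    return results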
3) Measuring latency distribution (p50/p95)
import time, statistics as stats
# Given a sync predict(img) function
def bench(predict, images, warmup=5, runs=50):
    # Warmup so cold-start costs (lazy init, caches) do not skew the timed runs
    for i in range(min(warmup, len(images))):
        predict(images[i])
    # Timed runs
    lats = []
    for i in range(runs):
        img = images[i % len(images)]
        t0 = time.perf_counter()
        predict(img)
        lats.append((time.perf_counter() - t0) * 1000)
    lats_sorted = sorted(lats)
    def pct(p):
        # Nearest-rank percentile over the sorted latencies
        k = int(round((p / 100.0) * (len(lats_sorted) - 1)))
        return lats_sorted[k]
    return {"p50_ms": pct(50), "p95_ms": pct(95), "avg_ms": stats.mean(lats)}
- Always report distribution (p50/p95), not just average. Track separately for cold vs warm paths.
4) Post-training quantization (ONNX Runtime INT8)
# onnx_int8_quant.py
# Requires an ONNX model path and calibration data
from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantType
class ImageData(CalibrationDataReader):
    # Feeds calibration batches (numpy arrays) to the quantizer one at a time
    def __init__(self, input_name, samples):
        self.input_name = input_name
        self.samples = iter(samples)
    def get_next(self):
        try:
            return {self.input_name: next(self.samples)}
        except StopIteration:
            return None
onnx_model = "model.onnx"
onnx_model_q = "model_int8.onnx"
# samples: iterable of numpy arrays shaped as model input
calib_reader = ImageData(input_name="input", samples=...)
quantize_static(
    model_input=onnx_model,
    model_output=onnx_model_q,
    calibration_data_reader=calib_reader,
    weight_type=QuantType.QInt8,
)
# Validate accuracy on a holdout set and compare to FP32.
- Measure accuracy drop vs FP32. If drop is high, try per-channel quantization or keep sensitive layers in FP16/FP32.
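If accuracy does drop, quantize_static exposes knobs for exactly this; a sketch mirroring the call above (use a fresh calibration reader, since an exhausted one yields no samples, and note the excluded node names are placeholders: inspect your own graph first):
# Retry with per-channel weight quantization and sensitive nodes left un-quantized
from onnxruntime.quantization import quantize_static, QuantType
quantize_static(
    model_input="model.onnx",
    model_output="model_int8_pc.onnx",
    calibration_data_reader=calib_reader,       # construct a fresh reader for this run
    weight_type=QuantType.QInt8,
    per_channel=True,                           # per-channel scales for conv/linear weights
    nodes_to_exclude=["Conv_0", "Gemm_42"],     # placeholder names for sensitive layers
)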
5) TensorRT engine from ONNX (FP16)
# tensorrt_build.py
# Minimal TensorRT build script (API as in TensorRT 8.x and later)
import tensorrt as trt
onnx_path = "model.onnx"
engine_path = "model_fp16.plan"
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# Explicit batch is required for ONNX parsing (and is the default in newer TensorRT releases)
network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(network_flags)
parser = trt.OnnxParser(network, logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
with open(onnx_path, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")
# Dynamic-shape profile: min / optimal / max input shapes
profile = builder.create_optimization_profile()
input_name = network.get_input(0).name
profile.set_shape(input_name, (1, 3, 224, 224), (8, 3, 224, 224), (16, 3, 224, 224))
config.add_optimization_profile(profile)
# build_serialized_network replaces the deprecated build_engine (TensorRT >= 8)
serialized_engine = builder.build_serialized_network(network, config)
assert serialized_engine is not None, "engine build failed"
with open(engine_path, "wb") as f:
    f.write(serialized_engine)
print("Saved:", engine_path)
- Benchmark FP32 vs FP16 vs INT8. Validate outputs against a reference within a tolerance.
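A sketch of such a tolerance check on numpy outputs from any two runners (thresholds are illustrative; tighten or loosen them per model):
# compare_outputs.py (sketch)
import numpy as np
def compare_outputs(ref, test, rtol=1e-2, atol=1e-3):
    # ref/test: arrays of logits from the FP32 reference and the optimized engine
    max_abs_diff = float(np.max(np.abs(ref - test)))
    top1_agree = float(np.mean(ref.argmax(axis=1) == test.argmax(axis=1)))
    return {
        "allclose": bool(np.allclose(ref, test, rtol=rtol, atol=atol)),
        "max_abs_diff": max_abs_diff,
        "top1_agreement": top1_agree,
    }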
6) Simple canary router (90% v1, 10% v2)
# canary_router.py
import random, requests
SERVERS = {
    "v1": {"url": "http://inference-v1:8080/predict", "weight": 0.9},
    "v2": {"url": "http://inference-v2:8080/predict", "weight": 0.1},
}
def choose_version():
    # Weighted random pick; weights should sum to 1.0
    r = random.random()
    acc = 0.0
    for k, s in SERVERS.items():
        acc += s["weight"]
        if r <= acc:
            return k
    return "v2"
def route_request(image_bytes):
    k = choose_version()
    url = SERVERS[k]["url"]
    resp = requests.post(url, files={"file": ("img.jpg", image_bytes)}, timeout=5)
    return k, resp.json()
# Track per-version metrics (latency, error rate) and be ready to shift weight or rollback.
- Expose a manual override to 0%/100% for emergency rollback.
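One way to wire that override into the router above (a sketch; the environment variable name is an assumption):
# Emergency override: route 100% of traffic to a single version
import os
def choose_version_with_override():
    forced = os.getenv("FORCE_VERSION")  # e.g. FORCE_VERSION=v1 during a rollback
    if forced in SERVERS:
        return forced
    return choose_version()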
Drills and exercises
- Define a request schema for image inference: content limits, timeouts, and error messages.
- Export your model to ONNX and verify output parity vs native framework on 100 images.
- Benchmark p50/p95 for batch sizes 1, 4, 8, 16; chart latency vs throughput.
- Quantize to INT8 and report accuracy delta vs FP32; set an acceptable threshold.
- Build an FP16 TensorRT engine; compare throughput to FP32.
- Create health/readiness endpoints and simulate a failing dependency (return 503); see the readiness sketch after this list.
- Implement a canary release plan: weights, rollback criteria, and monitoring signals.
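For the readiness drill above, a minimal sketch building on the FastAPI service from worked example 1 (the only dependency checked here is the model global; add probes for your own downstream dependencies):
# readiness.py (sketch): add to the FastAPI app from fastapi_infer.py
from fastapi import Response
@app.get("/ready")
async def ready(response: Response):
    deps_ok = model is not None  # extend with checks for GPUs, caches, downstream services
    if not deps_ok:
        response.status_code = 503
        return {"status": "not ready"}
    return {"status": "ready"}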
Common mistakes and debugging tips
Mistake: Measuring only average latency
Always track distributions (p50, p95, p99). Spikes often hide in tails. Use warmup and separate cold-start metrics.
Mistake: Mismatched preprocessing
Keep the exact same resize, crop, normalization at train and serve time. Version your preprocessing alongside the model.
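A lightweight way to do that is to ship a small preprocessing manifest with every model artifact (a sketch; the file name, fields, and values are illustrative):
# Save alongside the model weights and load it in the serving code
import json
preprocess_cfg = {
    "resize": 256, "center_crop": 224,
    "mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225],
    "model_version": "1.4.0",
}
with open("preprocess.json", "w") as f:
    json.dump(preprocess_cfg, f, indent=2)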
Mistake: Over-quantizing sensitive layers
Some layers are especially sensitive to quantization (typically the first and last layers, attention blocks, and layers with small activation ranges). Use mixed precision: keep those layers in FP16/FP32 and quantize the rest to INT8.
Mistake: No safe rollback
Without instant rollback you risk long outages. Keep N-1 hot, maintain versioned configs, and a manual 100% switch.
Mistake: Ignoring memory and thermal limits on edge
Measure peak RAM/VRAM and device temperature under load. Add throttles, lower resolution, or smaller models.
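For the memory side, PyTorch's CUDA statistics give a quick peak-usage number (a sketch; for temperature, sample nvidia-smi, or tegrastats on Jetson-class devices, while the load test runs):
# peak_vram.py (sketch)
import torch
def peak_vram_mb(model, batch, device="cuda"):
    # Assumes model is already on `device`; returns peak allocated MB for one forward pass
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        model(batch.to(device))
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / (1024 ** 2)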
Debugging slow inference
- Profile per-stage (decode, preprocess, model, postprocess).
- Check batch size vs GPU utilization.
- Use FP16/INT8, let TensorRT select fast kernels (tactics), pin host memory for transfers, and avoid CPU-GPU ping-pong by keeping tensors on the device between stages.
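A coarse per-stage timer is often enough to locate the bottleneck (a sketch; the stage names and the calls in the usage comments are placeholders for your own pipeline):
# stage_timer.py (sketch)
import time
from contextlib import contextmanager
timings = {}
@contextmanager
def stage(name):
    # Accumulates wall-clock milliseconds per named stage
    t0 = time.perf_counter()
    yield
    timings[name] = timings.get(name, 0.0) + (time.perf_counter() - t0) * 1000
# Usage inside a request handler:
# with stage("decode"):      img = decode(payload)
# with stage("preprocess"):  x = transform(img)
# with stage("model"):       y = model(x)
# with stage("postprocess"): out = to_response(y)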
Mini project: Real-time object classification service with safe rollout
- Pick a small pretrained classifier; export to ONNX (FP32) and build an FP16 TensorRT engine.
- Implement a FastAPI service with /health, /ready, and /predict (image upload).
- Add micro-batching (collect up to 8 requests or wait at most 10 ms, whichever comes first) and measure p50/p95; see the sketch after this list.
- Create an INT8 model; run accuracy checks and adopt if delta is acceptable.
- Implement canary routing between FP16 and INT8 versions (e.g., 90/10) with metrics by version.
- Prepare rollback: a config flag to route 100% to the stable model; test it.
- Edge profile: run on a low-power device or simulate CPU-only; adjust batch/resolution to meet SLO.
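For the micro-batching step, one common pattern is an asyncio queue that flushes on size or deadline, whichever comes first (a sketch; the constants, the queue wiring, and running the blocking forward pass directly on the event loop are all simplifications):
# micro_batcher.py (sketch)
import asyncio, torch
MAX_BATCH = 8
MAX_WAIT_MS = 10
queue: asyncio.Queue = asyncio.Queue()
async def batching_loop(model):
    # Background task: gather tensors until the batch is full or the deadline passes, then run once
    while True:
        x, fut = await queue.get()
        batch, futures = [x], [fut]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                x, fut = await asyncio.wait_for(queue.get(), timeout)
                batch.append(x)
                futures.append(fut)
            except asyncio.TimeoutError:
                break
        with torch.no_grad():
            out = model(torch.stack(batch))  # consider an executor for heavy models
        for f, o in zip(futures, out):
            f.set_result(o)
async def predict_batched(x):
    # Request handlers enqueue one preprocessed tensor and await the pooled result
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut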
Deliverables checklist
- API design doc with request/response schema and limits
- Latency report with charts (p50/p95) across batch sizes and precisions
- Accuracy comparison FP32 vs FP16 vs INT8
- Canary and rollback plan with success/failure criteria
Practical projects
- Retail shelf detector: batch nightly audits from photos; real-time mode for staff scanners.
- Factory defect classifier: FP16 TensorRT on an edge GPU; logs to a central server when online.
- Vehicle camera redaction: blur faces/plates in real time; fall back to CPU with reduced resolution.
Subskills
- Inference API Design: Design REST/gRPC endpoints with clear contracts, health/readiness, timeouts, and structured errors.
- Batch Versus Real Time Inference: Choose the right mode; use micro-batching for throughput while respecting latency SLOs.
- Latency And Throughput Optimization: Profile, tune batch size and concurrency, add warmup, and remove bottlenecks.
- Model Quantization Basics: Apply post-training quantization (INT8) and validate accuracy and speed trade-offs.
- Acceleration With TensorRT Basics: Build and deploy FP16/INT8 engines from ONNX for GPUs.
- Edge Deployment Basics: Optimize for limited compute, memory, and intermittent connectivity.
- Model Versioning And Canary Releases: Version artifacts and traffic-split new models safely.
- Rollback And Safe Releases: Define and test fast rollback paths with clear triggers.
Next steps
- Automate benchmarks and regression checks in CI.
- Add structured logging, tracing IDs, and per-version metrics dashboards.
- Prepare disaster drills: force rollback, simulate degraded accuracy, and verify alerts.
Skill exam
Take the exam below to validate your understanding.