What you'll learn
- Why quantization matters for computer vision deployment
- Core concepts: INT8/INT16, scale, zero-point, symmetric vs asymmetric
- PTQ (Post-Training Quantization) vs QAT (Quantization-Aware Training)
- Per-tensor vs per-channel quantization and when to use them
- Calibration, accuracy checks, and debugging tips
Who this is for
- Computer Vision Engineers shipping models to edge/mobile, servers with tight latency, or GPUs with batch size constraints
- ML practitioners preparing models for inference in production
Prerequisites
- Comfort with tensors, convolution layers, and ReLU/BatchNorm
- Basic numerical understanding (min/max, rounding)
- Familiarity with your serving backend (e.g., runtime that supports INT8)
Why this matters
Real tasks you will face:
- Meeting hard latency SLAs (e.g., < 10 ms per frame for real-time detection)
- Fitting models into limited memory on edge devices
- Reducing cloud inference cost by increasing throughput per GPU/CPU
Quantization shrinks model size and can speed up inference by using lower-precision arithmetic (e.g., INT8). Done right, accuracy loss is small.
Concept explained simply
Quantization maps floating-point values to integers with a scale and a zero-point. For UINT8:
- real_value ≈ scale × (int_value − zero_point)
- scale is a small positive float; zero_point aligns zero in real space to an integer
Two common choices:
- Symmetric: zero_point = 0 (the integer range is centered on zero); a good fit for weights, whose ranges are roughly centered around zero
- Asymmetric: non-zero zero_point; a better fit for activations with skewed, mostly non-negative ranges such as post-ReLU outputs (a short sketch of both follows)
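A minimal sketch of this mapping in NumPy; the helper names and the tiny weight tensor are illustrative only and not tied to any particular runtime:

```python
import numpy as np

def quantize(x, scale, zero_point, qmin, qmax):
    # q = round(x / scale + zero_point), clamped to the integer range
    return np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    # real_value ≈ scale * (int_value - zero_point)
    return scale * (q.astype(np.float32) - zero_point)

# Symmetric INT8 for weights centered around zero: zero_point = 0
w = np.array([-0.12, 0.04, 0.11], dtype=np.float32)
w_scale = np.abs(w).max() / 127
w_q = quantize(w, w_scale, 0, -127, 127)
print(w_q)                             # [-127   42  116]
print(dequantize(w_q, w_scale, 0))     # close to w; round-trip error is at most about w_scale / 2
```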
Mental model
Think of quantization as choosing a ruler. scale is the tick spacing; zero_point decides where "0" sits. Per-channel rulers (one per output channel) fit each channel's spread better than a single global ruler.
Key terms (quick reference)
- PTQ: Quantize a trained FP32 model using calibration data
- QAT: Simulate quantization during training to recover accuracy
- Per-tensor vs per-channel: single scale/zero_point for whole tensor vs per output channel
- Calibration: estimating ranges (min/max or percentiles) to compute scale and zero_point
Worked examples
Example 1: Size reduction estimate
A 45M-parameter model stored as FP32 takes about 45M × 4 bytes ≈ 180 MB. If you quantize weights to INT8, size ≈ 45M × 1 byte ≈ 45 MB. That's a 4× reduction. If activations are also INT8 at inference, you save bandwidth too.
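The same estimate as a two-line calculation (parameter count and byte widths as above):

```python
params = 45_000_000
fp32_mb, int8_mb = params * 4 / 1e6, params * 1 / 1e6
print(f"FP32 ≈ {fp32_mb:.0f} MB, INT8 ≈ {int8_mb:.0f} MB ({fp32_mb / int8_mb:.0f}x smaller)")
```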
Example 2: Latency intuition
On many CPUs/NPUs, INT8 operations have higher throughput than FP32 due to vectorized instructions and lower memory bandwidth pressure. If a layer is memory-bound, cutting activation and weight precision to 8-bit can reduce traffic by up to 4×, translating to noticeable latency improvements.
Example 3: Compute scale and zero-point (asymmetric UINT8)
Suppose the observed activation range is min = −0.9, max = 5.0. For UINT8 (0–255):
- scale = (max − min) / 255 = (5.0 − (−0.9)) / 255 = 5.9 / 255 ≈ 0.0231
- zero_point = round(−min / scale) = round(0.9 / 0.0231) ≈ round(38.9) = 39
- Quantize value 0.3: q = round(value / scale + zero_point) ≈ round(0.3 / 0.0231 + 39) ≈ 52
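The same arithmetic, checked in a few lines of plain Python:

```python
# Asymmetric UINT8 parameters for the observed range [-0.9, 5.0]
x_min, x_max = -0.9, 5.0
scale = (x_max - x_min) / 255          # ≈ 0.0231
zero_point = round(-x_min / scale)     # 39

q = round(0.3 / scale + zero_point)    # quantize 0.3 -> 52
x_hat = scale * (q - zero_point)       # dequantize back -> ≈ 0.301
print(scale, zero_point, q, x_hat)
```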
When to use what
- Weights: symmetric per-channel INT8 is a strong default for conv layers (a per-tensor vs per-channel comparison follows this list)
- Activations: asymmetric per-tensor often works well
- PTQ: fastest path; try first with good calibration (hundreds to a few thousand samples)
- QAT: use when PTQ accuracy drop is unacceptable
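To make the per-channel recommendation for conv weights concrete, here is a small NumPy comparison of per-tensor vs per-channel symmetric INT8 round-trip error; the weight shapes and random ranges are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake conv weights, shape (out_channels, in_channels, kh, kw), with very different per-channel spreads
w = rng.normal(0.0, 1.0, size=(8, 16, 3, 3)).astype(np.float32)
w *= rng.uniform(0.01, 1.0, size=(8, 1, 1, 1)).astype(np.float32)

def sym_int8_roundtrip(x, scale):
    # Quantize to signed INT8 symmetrically, then dequantize
    return np.clip(np.round(x / scale), -127, 127) * scale

# Per-tensor: a single scale for the whole tensor
scale_tensor = np.abs(w).max() / 127
mse_tensor = np.mean((w - sym_int8_roundtrip(w, scale_tensor)) ** 2)

# Per-channel: one scale per output channel
scale_channel = np.abs(w).reshape(8, -1).max(axis=1).reshape(8, 1, 1, 1) / 127
mse_channel = np.mean((w - sym_int8_roundtrip(w, scale_channel)) ** 2)

print(f"per-tensor MSE:  {mse_tensor:.2e}")
print(f"per-channel MSE: {mse_channel:.2e}")   # typically much lower when channel ranges differ
```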
Implementation playbook
- Choose target precision: start with INT8
- Check operator support in your inference runtime
- Collect calibration data representative of production (diverse, not just "easy" cases)
- Run PTQ with per-channel weights and percentile-based calibration for activations (a calibration sketch follows this list)
- Measure accuracy and latency; compare to FP32 baseline
- If accuracy loss is too high, try: better calibration, outlier handling, or QAT
- Lock inference settings (threading, batch, I/O sizes) and re-measure before shipping
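A backend-agnostic sketch of the calibration step in the playbook: accumulate activation samples from representative batches and derive a UINT8 scale/zero-point from percentiles. The commented-out usage relies on a hypothetical run_model_and_capture hook; real runtimes expose their own observers for this.

```python
import numpy as np

class PercentileCalibrator:
    """Accumulate activation samples and derive UINT8 quantization parameters from percentiles."""

    def __init__(self, lower_pct=0.1, upper_pct=99.9):
        self.lower_pct, self.upper_pct = lower_pct, upper_pct
        self.samples = []

    def observe(self, activation):
        # Subsample to keep memory bounded; a few thousand values per batch gives stable percentiles
        flat = np.asarray(activation, dtype=np.float32).ravel()
        idx = np.random.choice(flat.size, size=min(flat.size, 10_000), replace=False)
        self.samples.append(flat[idx])

    def quant_params(self):
        values = np.concatenate(self.samples)
        lo = min(np.percentile(values, self.lower_pct), 0.0)   # keep zero exactly representable
        hi = max(np.percentile(values, self.upper_pct), 0.0)
        scale = max((hi - lo) / 255, 1e-12)
        zero_point = int(round(-lo / scale))
        return scale, zero_point

# Usage sketch with hypothetical hooks:
# calib = PercentileCalibrator()
# for batch in calibration_loader:
#     acts = run_model_and_capture(model, batch, layer="conv3")   # hypothetical capture helper
#     calib.observe(acts)
# scale, zero_point = calib.quant_params()
```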
Accuracy guardrails
- Use a held-out validation set; track top-1/top-5 accuracy, mAP, IoU, F1, or other task-specific metrics
- Compare class-wise metrics to spot drift in rare classes
- Consider percentile clipping (e.g., 99.9%) for activations to reduce outlier impact
- Bias correction for weights can recover some accuracy (a sketch follows this list)
- For detection/segmentation, inspect qualitative outputs on hard samples
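The bias-correction item above, sketched for one fully connected layer: estimate the systematic output shift caused by weight quantization on calibration data and fold it back into the bias. How you capture the layer's inputs is backend-specific; the function below is an illustrative assumption.

```python
import numpy as np

def corrected_bias(w_fp32, w_dequant, x_calib, bias):
    """Return a bias adjusted for the systematic error introduced by weight quantization.

    w_fp32:    (out_features, in_features) original weights
    w_dequant: (out_features, in_features) quantize-dequantized weights
    x_calib:   (num_samples, in_features) calibration inputs to this layer
    bias:      (out_features,) original bias
    """
    mean_x = x_calib.mean(axis=0)              # E[x] over the calibration set
    shift = (w_fp32 - w_dequant) @ mean_x      # expected per-output error E[(W - W_q) x]
    return bias + shift
```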
Debugging and self-check
- Compare FP32 vs INT8 outputs layer by layer to locate the largest deltas (a sketch follows this list)
- Ensure consistent preprocessing between calibration and inference
- Check for unsupported ops silently falling back to FP32, hurting speed
- Verify that per-channel quantization is actually enabled for conv weights
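A sketch of the layer-by-layer comparison, assuming you can capture intermediate outputs from both models for the same batch; the capture_* helpers in the usage comment are hypothetical.

```python
import numpy as np

def layer_deltas(fp32_outputs, quant_outputs):
    """Rank layers by how far the (dequantized) quantized outputs deviate from the FP32 reference.

    Both arguments map layer_name -> activation array for the same input batch.
    """
    report = []
    for name, ref in fp32_outputs.items():
        out = np.asarray(quant_outputs[name], dtype=np.float32)
        mse = float(np.mean((ref - out) ** 2))
        # Signal-to-quantization-noise ratio in dB; higher is better
        sqnr_db = 10.0 * np.log10(float(np.mean(ref ** 2)) / (mse + 1e-12))
        report.append((name, mse, sqnr_db))
    return sorted(report, key=lambda r: r[2])   # worst layers (lowest SQNR) first

# Usage sketch with hypothetical capture helpers:
# for name, mse, sqnr in layer_deltas(capture_fp32(model, batch), capture_int8(qmodel, batch))[:5]:
#     print(f"{name:30s}  MSE={mse:.3e}  SQNR={sqnr:.1f} dB")
```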
Common mistakes and how to self-check
- Using too little calibration data: self-check by plotting activation histograms; if spiky or clipped, increase data diversity
- Mismatched input scaling at inference: self-check by running a few images through FP32 vs quantized preproc and comparing ranges
- Quantizing layers that do not benefit: self-check by excluding first/last layers and re-measuring accuracy
- Ignoring per-channel for conv weights: self-check runtime logs or model summary to confirm per-channel scales
Exercises you can do now
- Exercise 1 (mirrors task ex1): Compute model size after quantization and pick a quantization scheme for weights vs activations. See the Exercises section below for details.
- Exercise 2 (mirrors task ex2): Compute scale, zero-point, and quantized values for a small activation set with asymmetric UINT8.
Quick checklist before you start:
- I can explain scale and zero_point in one sentence
- I know when to choose symmetric vs asymmetric
- I can estimate size reduction and latency benefits
- I understand PTQ vs QAT trade-offs
Mini tasks
- Sketch a calibration plan: how many samples, from which scenarios, and why
- Decide which layers to exclude initially (e.g., first conv, last classifier)
- Write down your acceptance criteria: allowed accuracy drop and target latency
Mini challenge
Challenge brief
You have a MobileNet-based detector. FP32 mAP is 34.0. INT8 PTQ gives 33.4 mAP and a 2.1× speedup on CPU. Product asks for at least a 2× speedup and no more than a 0.7 mAP drop. Propose a plan to meet both.
Hint: consider per-channel weights, percentile calibration for activations, excluding last layer from quantization, and light QAT if needed.
Practical projects
- Quantize a small classifier (e.g., CIFAR-like) with PTQ and compare per-tensor vs per-channel
- Run layer-wise error analysis on a quantized segmentation model and identify the worst 2 layers
- Implement a tiny QAT loop: fake-quantize activations and weights and retrain for a few epochs (a fake-quantization sketch follows below)
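For the tiny QAT loop project, a minimal fake-quantization building block in PyTorch using a straight-through estimator; the symmetric per-tensor scheme, 8-bit width, and the small conv block are illustrative choices, not a prescribed recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(x, num_bits=8):
    """Simulate symmetric per-tensor quantization in the forward pass while
    letting gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return x + (x_q - x).detach()      # forward: quantized value, backward: identity

class QATConvReLU(nn.Module):
    """Conv + ReLU with fake-quantized weights and activations."""

    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, kernel_size=3, padding=1)

    def forward(self, x):
        w_q = fake_quant(self.conv.weight)
        y = F.conv2d(x, w_q, self.conv.bias, padding=1)
        return fake_quant(F.relu(y))

# Usage sketch: replace conv blocks in a small model with QATConvReLU and fine-tune
# for a few epochs with your usual loss and optimizer.
```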
Learning path
- Model Quantization Basics (this page)
- Backend-specific deployment (operators, kernels, and runtime flags)
- Mixed-precision and calibration strategies
- Quantization-aware training for accuracy recovery
- Profiling, monitoring, and rollback plans in production
Next steps
- Complete the exercises and mini challenge
- Take the quick test to confirm your grasp
- Apply PTQ on a model you already use and record baseline vs INT8 metrics