What you'll learn
- Why quantization matters for computer vision deployment
- Core concepts: INT8/INT16, scale, zero-point, symmetric vs asymmetric
- PTQ (Post-Training Quantization) vs QAT (Quantization-Aware Training)
- Per-tensor vs per-channel quantization and when to use them
- Calibration, accuracy checks, and debugging tips
Who this is for
- Computer Vision Engineers shipping models to edge/mobile, servers with tight latency, or GPUs with batch size constraints
- ML practitioners preparing models for inference in production
Prerequisites
- Comfort with tensors, convolution layers, and ReLU/BatchNorm
- Basic numerical understanding (min/max, rounding)
- Familiarity with your serving backend (e.g., runtime that supports INT8)
Why this matters
Real tasks you will face:
- Meeting hard latency SLAs (e.g., < 10 ms per frame for real-time detection)
- Fitting models into limited memory on edge devices
- Reducing cloud inference cost by increasing throughput per GPU/CPU
Quantization shrinks model size and can speed up inference by using lower-precision arithmetic (e.g., INT8). Done right, accuracy loss is small.
Concept explained simply
Quantization maps floating-point values to integers with a scale and a zero-point. For UINT8:
- real_value ≈ scale × (int_value − zero_point)
- scale is a small positive float; zero_point aligns zero in real space to an integer
Two common choices:
- Symmetric: zero_point = 0 (the integer range is centered on zero); a good fit for weights, whose ranges are roughly centered around zero
- Asymmetric: non-zero zero_point; a better fit for activations with skewed, mostly non-negative ranges such as post-ReLU outputs (a short sketch of both follows)
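A minimal sketch of this mapping in NumPy; the helper names and the tiny weight tensor are illustrative only and not tied to any particular runtime:

```python
import numpy as np

def quantize(x, scale, zero_point, qmin, qmax):
    # q = round(x / scale + zero_point), clamped to the integer range
    return np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    # real_value ≈ scale * (int_value - zero_point)
    return scale * (q.astype(np.float32) - zero_point)

# Symmetric INT8 for weights centered around zero: zero_point = 0
w = np.array([-0.12, 0.04, 0.11], dtype=np.float32)
w_scale = np.abs(w).max() / 127
w_q = quantize(w, w_scale, 0, -127, 127)
print(w_q)                             # [-127   42  116]
print(dequantize(w_q, w_scale, 0))     # close to w; round-trip error is at most about w_scale / 2
```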
Mental model
Think of quantization as choosing a ruler. scale is the tick spacing; zero_point decides where "0" sits. Per-channel rulers (one per output channel) fit each channel's spread better than a single global ruler.
Key terms (quick reference)
- PTQ: Quantize a trained FP32 model using calibration data
- QAT: Simulate quantization during training to recover accuracy
- Per-tensor vs per-channel: single scale/zero_point for whole tensor vs per output channel
- Calibration: estimating ranges (min/max or percentiles) to compute scale and zero_point
Worked examples
Example 1: Size reduction estimate
A 45M-parameter model stored as FP32 takes about 45M × 4 bytes ≈ 180 MB. If you quantize weights to INT8, size ≈ 45M × 1 byte ≈ 45 MB. That's a 4× reduction. If activations are also INT8 at inference, you save bandwidth too.
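The same estimate as a two-line calculation (parameter count and byte widths as above):

```python
params = 45_000_000
fp32_mb, int8_mb = params * 4 / 1e6, params * 1 / 1e6
print(f"FP32 ≈ {fp32_mb:.0f} MB, INT8 ≈ {int8_mb:.0f} MB ({fp32_mb / int8_mb:.0f}x smaller)")
```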
Example 2: Latency intuition
On many CPUs/NPUs, INT8 operations have higher throughput than FP32 due to vectorized instructions and lower memory bandwidth pressure. If a layer is memory-bound, cutting activation and weight precision to 8-bit can reduce traffic by up to 4×, translating to noticeable latency improvements.
Example 3: Compute scale and zero-point (asymmetric UINT8)
Suppose the observed activation range is min = −0.9, max = 5.0. For UINT8 (0–255):
- scale = (max − min) / 255 = (5.0 − (−0.9)) / 255 = 5.9 / 255 ≈ 0.0231
- zero_point = round(−min / scale) = round(0.9 / 0.0231) ≈ round(38.9) = 39
- Quantize value 0.3: q = round(value / scale + zero_point) ≈ round(0.3 / 0.0231 + 39) ≈ 52
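The same arithmetic, checked in a few lines of plain Python:

```python
# Asymmetric UINT8 parameters for the observed range [-0.9, 5.0]
x_min, x_max = -0.9, 5.0
scale = (x_max - x_min) / 255          # ≈ 0.0231
zero_point = round(-x_min / scale)     # 39

q = round(0.3 / scale + zero_point)    # quantize 0.3 -> 52
x_hat = scale * (q - zero_point)       # dequantize back -> ≈ 0.301
print(scale, zero_point, q, x_hat)
```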
When to use what
- Weights: symmetric per-channel INT8 is a strong default for conv layers (a per-tensor vs per-channel comparison follows this list)
- Activations: asymmetric per-tensor often works well
- PTQ: fastest path; try first with good calibration (hundreds to a few thousand samples)
- QAT: use when PTQ accuracy drop is unacceptable
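To make the per-channel recommendation for conv weights concrete, here is a small NumPy comparison of per-tensor vs per-channel symmetric INT8 round-trip error; the weight shapes and random ranges are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake conv weights, shape (out_channels, in_channels, kh, kw), with very different per-channel spreads
w = rng.normal(0.0, 1.0, size=(8, 16, 3, 3)).astype(np.float32)
w *= rng.uniform(0.01, 1.0, size=(8, 1, 1, 1)).astype(np.float32)

def sym_int8_roundtrip(x, scale):
    # Quantize to signed INT8 symmetrically, then dequantize
    return np.clip(np.round(x / scale), -127, 127) * scale

# Per-tensor: a single scale for the whole tensor
scale_tensor = np.abs(w).max() / 127
mse_tensor = np.mean((w - sym_int8_roundtrip(w, scale_tensor)) ** 2)

# Per-channel: one scale per output channel
scale_channel = np.abs(w).reshape(8, -1).max(axis=1).reshape(8, 1, 1, 1) / 127
mse_channel = np.mean((w - sym_int8_roundtrip(w, scale_channel)) ** 2)

print(f"per-tensor MSE:  {mse_tensor:.2e}")
print(f"per-channel MSE: {mse_channel:.2e}")   # typically much lower when channel ranges differ
```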
Implementation playbook
- Choose target precision: start with INT8
- Check operator support in your inference runtime
- Collect calibration data representative of production (diverse, not just "easy" cases)
- Run PTQ with per-channel weights and percentile-based calibration for activations (a calibration sketch follows this list)
- Measure accuracy and latency; compare to FP32 baseline
- If accuracy loss is too high, try: better calibration, outlier handling, or QAT
- Lock inference settings (threading, batch, I/O sizes) and re-measure before shipping
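A backend-agnostic sketch of the calibration step in the playbook: accumulate activation samples from representative batches and derive a UINT8 scale/zero-point from percentiles. The commented-out usage relies on a hypothetical run_model_and_capture hook; real runtimes expose their own observers for this.

```python
import numpy as np

class PercentileCalibrator:
    """Accumulate activation samples and derive UINT8 quantization parameters from percentiles."""

    def __init__(self, lower_pct=0.1, upper_pct=99.9):
        self.lower_pct, self.upper_pct = lower_pct, upper_pct
        self.samples = []

    def observe(self, activation):
        # Subsample to keep memory bounded; a few thousand values per batch gives stable percentiles
        flat = np.asarray(activation, dtype=np.float32).ravel()
        idx = np.random.choice(flat.size, size=min(flat.size, 10_000), replace=False)
        self.samples.append(flat[idx])

    def quant_params(self):
        values = np.concatenate(self.samples)
        lo = min(np.percentile(values, self.lower_pct), 0.0)   # keep zero exactly representable
        hi = max(np.percentile(values, self.upper_pct), 0.0)
        scale = max((hi - lo) / 255, 1e-12)
        zero_point = int(round(-lo / scale))
        return scale, zero_point

# Usage sketch with hypothetical hooks:
# calib = PercentileCalibrator()
# for batch in calibration_loader:
#     acts = run_model_and_capture(model, batch, layer="conv3")   # hypothetical capture helper
#     calib.observe(acts)
# scale, zero_point = calib.quant_params()
```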
Accuracy guardrails
- Use a held-out validation set; track top-1/top-5 accuracy, mAP, IoU, F1, or other task-specific metrics
- Compare class-wise metrics to spot drift in rare classes
- Consider percentile clipping (e.g., 99.9%) for activations to reduce outlier impact
- Bias correction for weights can recover some accuracy (a sketch follows this list)
- For detection/segmentation, inspect qualitative outputs on hard samples
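The bias-correction item above, sketched for one fully connected layer: estimate the systematic output shift caused by weight quantization on calibration data and fold it back into the bias. How you capture the layer's inputs is backend-specific; the function below is an illustrative assumption.

```python
import numpy as np

def corrected_bias(w_fp32, w_dequant, x_calib, bias):
    """Return a bias adjusted for the systematic error introduced by weight quantization.

    w_fp32:    (out_features, in_features) original weights
    w_dequant: (out_features, in_features) quantize-dequantized weights
    x_calib:   (num_samples, in_features) calibration inputs to this layer
    bias:      (out_features,) original bias
    """
    mean_x = x_calib.mean(axis=0)              # E[x] over the calibration set
    shift = (w_fp32 - w_dequant) @ mean_x      # expected per-output error E[(W - W_q) x]
    return bias + shift
```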
Debugging and self-check
- Compare FP32 vs INT8 outputs layer by layer to locate the largest deltas (a sketch follows this list)
- Ensure consistent preprocessing between calibration and inference
- Check for unsupported ops silently falling back to FP32, hurting speed
- Verify that per-channel quantization is actually enabled for conv weights
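A sketch of the layer-by-layer comparison, assuming you can capture intermediate outputs from both models for the same batch; the capture_* helpers in the usage comment are hypothetical.

```python
import numpy as np

def layer_deltas(fp32_outputs, quant_outputs):
    """Rank layers by how far the (dequantized) quantized outputs deviate from the FP32 reference.

    Both arguments map layer_name -> activation array for the same input batch.
    """
    report = []
    for name, ref in fp32_outputs.items():
        out = np.asarray(quant_outputs[name], dtype=np.float32)
        mse = float(np.mean((ref - out) ** 2))
        # Signal-to-quantization-noise ratio in dB; higher is better
        sqnr_db = 10.0 * np.log10(float(np.mean(ref ** 2)) / (mse + 1e-12))
        report.append((name, mse, sqnr_db))
    return sorted(report, key=lambda r: r[2])   # worst layers (lowest SQNR) first

# Usage sketch with hypothetical capture helpers:
# for name, mse, sqnr in layer_deltas(capture_fp32(model, batch), capture_int8(qmodel, batch))[:5]:
#     print(f"{name:30s}  MSE={mse:.3e}  SQNR={sqnr:.1f} dB")
```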
Common mistakes and how to self-check
- Using too little calibration data: self-check by plotting activation histograms; if spiky or clipped, increase data diversity
- Mismatched input scaling at inference: self-check by running a few images through FP32 vs quantized preproc and comparing ranges
- Quantizing layers that do not benefit: self-check by excluding first/last layers and re-measuring accuracy
- Ignoring per-channel for conv weights: self-check runtime logs or model summary to confirm per-channel scales
Exercises you can do now
- Exercise 1 (mirrors task ex1): Compute model size after quantization and pick a quantization scheme for weights vs activations. See the Exercises section below for details.
- Exercise 2 (mirrors task ex2): Compute scale, zero-point, and quantized values for a small activation set with asymmetric UINT8.
Quick checklist before you start:
- I can explain scale and zero_point in one sentence
- I know when to choose symmetric vs asymmetric
- I can estimate size reduction and latency benefits
- I understand PTQ vs QAT trade-offs
Mini tasks
- Sketch a calibration plan: how many samples, from which scenarios, and why
- Decide which layers to exclude initially (e.g., first conv, last classifier)
- Write down your acceptance criteria: allowed accuracy drop and target latency
Mini challenge
Challenge brief
You have a MobileNet-based detector. FP32 mAP is 34.0. INT8 PTQ gives 33.4 mAP and a 2.1× speedup on CPU. Product asks for at least a 2× speedup and no more than a 0.7 mAP drop. Propose a plan to meet both.
Hint: consider per-channel weights, percentile calibration for activations, excluding last layer from quantization, and light QAT if needed.
Practical projects
- Quantize a small classifier (e.g., CIFAR-like) with PTQ and compare per-tensor vs per-channel
- Run layer-wise error analysis on a quantized segmentation model and identify the worst 2 layers
- Implement a tiny QAT loop: fake-quantize activations and weights and retrain for a few epochs (a fake-quantization sketch follows below)
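For the tiny QAT loop project, a minimal fake-quantization building block in PyTorch using a straight-through estimator; the symmetric per-tensor scheme, 8-bit width, and the small conv block are illustrative choices, not a prescribed recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(x, num_bits=8):
    """Simulate symmetric per-tensor quantization in the forward pass while
    letting gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return x + (x_q - x).detach()      # forward: quantized value, backward: identity

class QATConvReLU(nn.Module):
    """Conv + ReLU with fake-quantized weights and activations."""

    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, kernel_size=3, padding=1)

    def forward(self, x):
        w_q = fake_quant(self.conv.weight)
        y = F.conv2d(x, w_q, self.conv.bias, padding=1)
        return fake_quant(F.relu(y))

# Usage sketch: replace conv blocks in a small model with QATConvReLU and fine-tune
# for a few epochs with your usual loss and optimizer.
```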
Learning path
- Model Quantization Basics (this page)
- Backend-specific deployment (operators, kernels, and runtime flags)
- Mixed-precision and calibration strategies
- Quantization-aware training for accuracy recovery
- Profiling, monitoring, and rollback plans in production
Next steps
- Complete the exercises and mini challenge
- Take the quick test to confirm your grasp
- Apply PTQ on a model you already use and record baseline vs INT8 metrics