
Model Quantization Basics

Learn Model Quantization Basics for free with explanations, exercises, and a quick test (for Computer Vision Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

What you'll learn

  • Why quantization matters for computer vision deployment
  • Core concepts: INT8/INT16, scale, zero-point, symmetric vs asymmetric
  • PTQ (Post-Training Quantization) vs QAT (Quantization-Aware Training)
  • Per-tensor vs per-channel quantization and when to use them
  • Calibration, accuracy checks, and debugging tips

Who this is for

  • Computer Vision Engineers shipping models to edge/mobile, servers with tight latency, or GPUs with batch size constraints
  • ML practitioners preparing models for inference in production

Prerequisites

  • Comfort with tensors, convolution layers, and ReLU/BatchNorm
  • Basic numerical understanding (min/max, rounding)
  • Familiarity with your serving backend (e.g., runtime that supports INT8)

Why this matters

Real tasks you will face:

  • Meeting hard latency SLAs (e.g., < 10 ms per frame for real-time detection)
  • Fitting models into limited memory on edge devices
  • Reducing cloud inference cost by increasing throughput per GPU/CPU

Quantization shrinks model size and can speed up inference by using lower-precision arithmetic (e.g., INT8). Done right, accuracy loss is small.

Concept explained simply

Quantization maps floating-point values to integers with a scale and a zero-point. For UINT8:

  • real_value β‰ˆ scale Γ— (int_value βˆ’ zero_point)
  • scale is a small positive float; zero_point aligns zero in real space to an integer

Two common choices:

  • Symmetric: zero_point = 0 (or the midpoint of the range for unsigned types), better for weights, whose ranges are usually centered around zero
  • Asymmetric: non-zero zero_point, better for activations with skewed or non-negative ranges (e.g., after ReLU)
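
A minimal NumPy sketch of both schemes (the helper names are made up for this example; real runtimes compute these parameters internally):

  import numpy as np

  def asymmetric_params(x_min, x_max, n_levels=256):
      """Asymmetric (UINT8-style): scale and zero-point from an observed range."""
      scale = (x_max - x_min) / (n_levels - 1)
      zero_point = int(round(-x_min / scale))
      return scale, zero_point

  def symmetric_scale(x, n_levels=256):
      """Symmetric (INT8-style): zero_point is fixed at 0."""
      return np.abs(x).max() / (n_levels // 2 - 1)  # 127 for INT8

  x = np.array([-0.9, 0.0, 0.3, 2.5, 5.0], dtype=np.float32)

  # Asymmetric UINT8 round trip: real_value ~ scale * (int_value - zero_point)
  scale, zp = asymmetric_params(x.min(), x.max())
  q = np.clip(np.round(x / scale) + zp, 0, 255).astype(np.uint8)
  x_hat = scale * (q.astype(np.float32) - zp)
  print(q, x_hat)                 # x_hat matches x to within about one scale step
  print(symmetric_scale(x))       # scale you would use if x were weights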

Mental model

Think of quantization as choosing a ruler. scale is the tick spacing; zero_point decides where "0" sits. Per-channel rulers (one per output channel) fit each channel's spread better than a single global ruler.

Key terms (quick reference)

  • PTQ: Quantize a trained FP32 model using calibration data
  • QAT: Simulate quantization during training to recover accuracy
  • Per-tensor vs per-channel: single scale/zero_point for whole tensor vs per output channel
  • Calibration: estimating ranges (min/max or percentiles) to compute scale and zero_point

Worked examples

Example 1: Size reduction estimate

A 45M-parameter model stored as FP32 takes about 45M Γ— 4 bytes β‰ˆ 180 MB. If you quantize weights to INT8, size β‰ˆ 45M Γ— 1 byte β‰ˆ 45 MB. That's a 4Γ— reduction. If activations are also INT8 at inference, you save bandwidth too.
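
The same back-of-the-envelope estimate in a few lines of Python (it ignores the per-tensor scale/zero-point metadata, which is usually negligible):

  params = 45_000_000
  fp32_mb = params * 4 / 1e6    # about 180 MB
  int8_mb = params * 1 / 1e6    # about 45 MB
  print(fp32_mb, int8_mb, fp32_mb / int8_mb)   # 180.0 45.0 4.0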

Example 2: Latency intuition

On many CPUs/NPUs, INT8 operations have higher throughput than FP32 due to vectorized instructions and lower memory bandwidth pressure. If a layer is memory-bound, cutting activation and weight precision to 8-bit can reduce traffic by up to 4Γ—, translating to noticeable latency improvements.

Example 3: Compute scale and zero-point (asymmetric UINT8)

Suppose the observed activation range is min = −0.9, max = 5.0. For UINT8 (0–255):

  1. scale = (max βˆ’ min) / 255 = (5.0 βˆ’ (βˆ’0.9)) / 255 β‰ˆ 5.9 / 255 β‰ˆ 0.0231
  2. zero_point = round(βˆ’min / scale) = round(0.9 / 0.0231) β‰ˆ round(38.9) = 39
  3. Quantize value 0.3: q = round(value / scale + zero_point) β‰ˆ round(0.3 / 0.0231 + 39) β‰ˆ 52
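
The same arithmetic, checked in plain Python (clamping to the UINT8 range as a real quantizer would):

  x_min, x_max = -0.9, 5.0
  scale = (x_max - x_min) / 255              # about 0.0231
  zero_point = int(round(-x_min / scale))    # 39

  def quantize(v):
      return max(0, min(255, round(v / scale) + zero_point))

  print(round(scale, 4), zero_point, quantize(0.3))   # 0.0231 39 52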

When to use what

  • Weights: symmetric per-channel INT8 is a strong default for conv layers (a sketch follows this list)
  • Activations: asymmetric per-tensor often works well
  • PTQ: fastest path; try first with good calibration (hundreds to a few thousand samples)
  • QAT: use when PTQ accuracy drop is unacceptable
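
A sketch of the per-channel symmetric default for conv weights, assuming a weight tensor of shape (out_channels, in_channels, kH, kW); this is illustrative NumPy, not a specific framework API:

  import numpy as np

  def per_channel_symmetric_int8(weights):
      """One symmetric INT8 scale per output channel of a conv weight tensor."""
      out_ch = weights.shape[0]
      flat = weights.reshape(out_ch, -1)
      scales = np.abs(flat).max(axis=1) / 127.0            # one scale per output channel
      scales = np.maximum(scales, 1e-12)                    # guard against all-zero channels
      q = np.clip(np.round(flat / scales[:, None]), -127, 127).astype(np.int8)
      return q.reshape(weights.shape), scales

  w = np.random.randn(16, 3, 3, 3).astype(np.float32)       # toy conv weights
  w_q, w_scales = per_channel_symmetric_int8(w)
  w_hat = w_q.reshape(16, -1).astype(np.float32) * w_scales[:, None]
  print(np.abs(w.reshape(16, -1) - w_hat).max())            # worst-case rounding error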

Implementation playbook

  1. Choose target precision: start with INT8
  2. Check operator support in your inference runtime
  3. Collect calibration data representative of production (diverse, not just "easy" cases)
  4. Run PTQ with per-channel weights and percentile-based calibration for activations (see the calibration sketch below)
  5. Measure accuracy and latency; compare to FP32 baseline
  6. If accuracy loss is too high, try: better calibration, outlier handling, or QAT
  7. Lock inference settings (threading, batch, I/O sizes) and re-measure before shipping
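
A sketch of step 4's activation calibration with percentile clipping; it assumes you can capture activation tensors from calibration batches (names are illustrative, framework-agnostic NumPy):

  import numpy as np

  def calibrate_activation(samples, low_pct=0.1, high_pct=99.9, n_levels=256):
      """Asymmetric scale/zero-point from pooled activations, with outliers clipped
      at percentiles instead of using the raw min/max."""
      pooled = np.concatenate([s.ravel() for s in samples])
      lo, hi = np.percentile(pooled, [low_pct, high_pct])
      lo = min(lo, 0.0)                        # keep real zero exactly representable
      scale = (hi - lo) / (n_levels - 1)
      zero_point = int(round(-lo / scale))
      return scale, zero_point

  # toy stand-in for activations captured from a few hundred calibration images
  batches = [np.random.rand(8, 64, 56, 56).astype(np.float32) * 6.0 for _ in range(10)]
  print(calibrate_activation(batches))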

Accuracy guardrails

  • Use a held-out validation set; track top-1/top-5 accuracy, mAP, IoU, F1, or other task-specific metrics
  • Compare class-wise metrics to spot drift in rare classes (see the snippet after this list)
  • Consider percentile clipping (e.g., 99.9%) for activations to reduce outlier impact
  • Bias correction for weights can recover some accuracy
  • For detection/segmentation, inspect qualitative outputs on hard samples
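
A tiny sketch of the class-wise check: given per-class metrics for the FP32 baseline and the quantized model (hypothetical numbers below), flag classes whose drop exceeds your budget:

  fp32_ap = {"car": 0.62, "person": 0.55, "bicycle": 0.31}   # hypothetical per-class AP
  int8_ap = {"car": 0.61, "person": 0.54, "bicycle": 0.25}
  budget = 0.02                                               # allowed absolute drop

  for cls, base in fp32_ap.items():
      drop = base - int8_ap[cls]
      if drop > budget:
          print(f"{cls}: AP dropped {drop:.3f} - inspect calibration and outliers")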

Debugging and self-check

  • Compare FP32 vs INT8 outputs layer by layer to locate the largest deltas (see the sketch after this list)
  • Ensure consistent preprocessing between calibration and inference
  • Check for unsupported ops silently falling back to FP32, hurting speed
  • Verify per-channel quantization actually enabled for conv weights
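
A sketch of the layer-by-layer check: given outputs captured from matching layers of the FP32 and quantized models (how you capture them depends on your framework, e.g. forward hooks), rank layers by relative error:

  import numpy as np

  def layer_errors(fp32_outputs, int8_outputs):
      """Layers sorted by mean relative error between FP32 and dequantized INT8 outputs."""
      errors = {}
      for name, ref in fp32_outputs.items():
          test = int8_outputs[name]
          errors[name] = np.abs(ref - test).mean() / (np.abs(ref).mean() + 1e-12)
      return sorted(errors.items(), key=lambda kv: kv[1], reverse=True)

  # toy stand-ins for captured layer outputs
  fp32_outputs = {"conv1": np.random.randn(1, 16, 8, 8), "conv2": np.random.randn(1, 32, 4, 4)}
  int8_outputs = {k: v + 0.01 * np.random.randn(*v.shape) for k, v in fp32_outputs.items()}
  print(layer_errors(fp32_outputs, int8_outputs))             # worst layers first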

Common mistakes and how to self-check

  • Using too little calibration data: self-check by plotting activation histograms; if they look spiky or clipped, increase data diversity (a histogram snippet follows this list)
  • Mismatched input scaling at inference: self-check by running a few images through FP32 vs quantized preproc and comparing ranges
  • Quantizing layers that do not benefit: self-check by excluding first/last layers and re-measuring accuracy
  • Ignoring per-channel for conv weights: self-check runtime logs or model summary to confirm per-channel scales
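
A quick text-only histogram for the first self-check (no plotting library needed); long spiky tails or mass piled at the edges suggest you need more, or more diverse, calibration data:

  import numpy as np

  # toy stand-in for activations pooled from calibration batches
  acts = np.abs(np.random.randn(100_000).astype(np.float32)) * 2.0

  counts, edges = np.histogram(acts, bins=20)
  for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
      print(f"[{lo:6.2f}, {hi:6.2f}) " + "#" * int(50 * c / counts.max()))

  clip = np.percentile(acts, 99.9)                    # candidate percentile clip point
  print("fraction clipped at 99.9%:", float((acts > clip).mean()))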

Exercises you can do now

Note: Everyone can take the exercises and the quick test. Only logged-in users will have their progress saved.

  1. Exercise 1 (mirrors task ex1): Compute model size after quantization and pick a quantization scheme for weights vs activations. See the Exercises section below for details.
  2. Exercise 2 (mirrors task ex2): Compute scale, zero-point, and quantized values for a small activation set with asymmetric UINT8.

Quick checklist before you start:

  • I can explain scale and zero_point in one sentence
  • I know when to choose symmetric vs asymmetric
  • I can estimate size reduction and latency benefits
  • I understand PTQ vs QAT trade-offs

Mini tasks

  • Sketch a calibration plan: how many samples, from which scenarios, and why
  • Decide which layers to exclude initially (e.g., first conv, last classifier)
  • Write down your acceptance criteria: allowed accuracy drop and target latency

Mini challenge

Challenge brief

You have a MobileNet-based detector. FP32 mAP is 34.0. INT8 PTQ gives 33.4 mAP and 2.1Γ— speedup on CPU. Product asks for at least 2Γ— speedup and no more than 0.7 mAP drop. Propose a plan to meet both.

Hint: consider per-channel weights, percentile calibration for activations, excluding last layer from quantization, and light QAT if needed.

Practical projects

  • Quantize a small classifier (e.g., CIFAR-like) with PTQ and compare per-tensor vs per-channel
  • Run layer-wise error analysis on a quantized segmentation model and identify the worst 2 layers
  • Implement a tiny QAT loop: fake-quantize activations and weights and retrain for a few epochs
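
For the tiny QAT project above, a minimal fake-quantization helper with a straight-through estimator (PyTorch, symmetric 8-bit; a sketch of the idea, not a production QAT recipe):

  import torch

  def fake_quantize(x, num_bits=8):
      """Round x onto a symmetric num_bits grid in the forward pass, but let
      gradients flow through unchanged (straight-through estimator)."""
      qmax = 2 ** (num_bits - 1) - 1                      # 127 for 8-bit
      scale = x.detach().abs().amax().clamp(min=1e-8) / qmax
      x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
      return x + (x_q - x).detach()                       # forward: x_q, backward: identity

  # usage inside a training step: wrap weights/activations before they are used
  w = torch.randn(16, 3, 3, 3, requires_grad=True)
  loss = fake_quantize(w).pow(2).sum()
  loss.backward()                                         # gradients still reach w
  print(w.grad.abs().mean())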

Learning path

  1. Model Quantization Basics (this page)
  2. Backend-specific deployment (operators, kernels, and runtime flags)
  3. Mixed-precision and calibration strategies
  4. Quantization-aware training for accuracy recovery
  5. Profiling, monitoring, and rollback plans in production

Next steps

  • Complete the exercises and mini challenge
  • Take the quick test to confirm your grasp
  • Apply PTQ on a model you already use and record baseline vs INT8 metrics

Practice Exercises

2 exercises to complete

Instructions

You have a 75M-parameter vision model. In FP32, each parameter takes 4 bytes. You plan to quantize weights to INT8 and run activations in INT8 at inference.

  1. Estimate model file size after quantizing weights to INT8.
  2. State the default scheme you would choose for: (a) convolution weights, (b) activations. Briefly justify.

Expected Output

Approximate model size after quantization and chosen schemes for weights and activations with 1–2 sentence rationale.

Model Quantization Basics β€” Quick Test

Test your knowledge with 9 questions. Pass with 70% or higher.

