Why this matters
Quantization shrinks model size and speeds up inference by representing numbers with fewer bits (for example, from float32 to int8 or int4). In NLP serving, this means fitting bigger models on limited hardware, lowering latency, and cutting costs.
- Serve a 7B-parameter model on a single GPU by moving from 16-bit to 8-bit or 4-bit weights.
- Hit strict latency SLOs for chatbots, search rerankers, and content moderation pipelines.
- Increase throughput per machine for batch scoring of documents or logs.
Real tasks you will do
- Quantize a transformer to int8 and validate accuracy with a small calibration set.
- Decide when to use dynamic quantization, static (post-training) quantization, or quantization-aware training (QAT).
- Measure memory savings and latency improvements and compare against quality metrics.
Concept explained simply
Quantization maps real-valued tensors to a smaller set of integers. At inference, operations run in lower precision, using less memory and compute.
Basic mapping uses two parameters per tensor (or per channel): scale and zero_point.
Real value r is approximated by: r ≈ scale × (q − zero_point) where q is an integer (for example −128..127 for int8).
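Here is a minimal NumPy sketch of that mapping. The quantize/dequantize helpers are illustrative, not a library API:

```python
import numpy as np

def quantize(r, scale, zero_point, qmin=-128, qmax=127):
    # q = clip(round(r / scale) + zero_point, qmin, qmax)
    q = np.round(r / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    # r' = scale * (q - zero_point), an approximation of the original r
    return scale * (q.astype(np.float32) - zero_point)

r = np.array([2.1, -0.07, 5.9], dtype=np.float32)
scale, zero_point = 0.05, 0                # symmetric int8: zero_point = 0
q = quantize(r, scale, zero_point)         # [42, -1, 118]
r_hat = dequantize(q, scale, zero_point)   # [2.1, -0.05, 5.9]  (note the rounding error on -0.07)
```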
Mental model
- Think of compressing a photo: fewer colors → smaller size, slight loss in quality. Quantization does the same to numbers in your model.
- Quality depends on how well the chosen integer range covers real data. Good calibration means better use of the limited integer codes.
- Speedups come from cheaper integer math and lower memory bandwidth.
Core building blocks
- Precision: float32, float16/bfloat16, int8, int4.
- Weight-only vs weight+activation quantization.
- Per-tensor vs per-channel: one scale for the entire tensor vs a separate scale for each output channel, which is often better for conv/linear weights (see the sketch after this list).
- Symmetric (zero_point = 0) vs asymmetric (non-zero zero_point) quantization.
- Calibration: pick representative samples to estimate ranges for activations.
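To make per-tensor vs per-channel concrete, here is a small NumPy sketch comparing the two symmetric schemes on a toy weight matrix. The layer shape and the error metric are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16)).astype(np.float32)   # toy linear weight: (out_channels, in_features)

# Per-tensor symmetric: one scale for the whole matrix
scale_tensor = np.abs(W).max() / 127.0
W_pt = np.clip(np.round(W / scale_tensor), -127, 127) * scale_tensor

# Per-channel symmetric: one scale per output channel (row)
scale_channel = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_pc = np.clip(np.round(W / scale_channel), -127, 127) * scale_channel

# Per-channel usually reconstructs the weights more faithfully
print("per-tensor  MAE:", np.abs(W - W_pt).mean())
print("per-channel MAE:", np.abs(W - W_pc).mean())
```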
Common flavors and when to use
- Dynamic quantization: quantize weights offline; activations are quantized on the fly. Fast to apply and a great first step on CPUs for linear-heavy models (a sketch follows this list).
- Static (post-training) quantization: use a calibration set to fix activation ranges ahead of time. Typically more accurate than dynamic quantization.
- Quantization-aware training (QAT): simulate quantization during training/fine-tuning. Best accuracy, more work.
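As a minimal sketch of the dynamic flavor, PyTorch exposes dynamic int8 quantization of linear layers. The toy model below stands in for the linear-heavy parts of a transformer, and the sizes are arbitrary; check your framework version's docs before relying on this entry point:

```python
import torch
import torch.nn as nn

# Toy stand-in for linear-heavy transformer blocks
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.ReLU(),
    nn.Linear(3072, 768),
).eval()

# Dynamic quantization: weights stored as int8, activations quantized on the fly
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
with torch.no_grad():
    diff = (model(x) - quantized(x)).abs().max().item()
print(f"max abs output difference: {diff:.4f}")
```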
Worked examples
Example 1: Memory math for a large model
Suppose a 7B-parameter model.
- fp16 (2 bytes): 7B × 2 B ≈ 14 GB
- int8 (1 byte): 7B × 1 B ≈ 7 GB
- int4 (0.5 byte): 7B × 0.5 B ≈ 3.5 GB
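The same arithmetic in a few lines of Python (weights only; activations, KV cache, and runtime overhead are ignored, and GB here means 10^9 bytes, matching the estimates above):

```python
params = 7e9  # 7B parameters
bytes_per_param = {"fp16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: ~{params * nbytes / 1e9:.1f} GB")
# fp16: ~14.0 GB, int8: ~7.0 GB, int4: ~3.5 GB
```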
Why it helps
Less memory reduces paging and increases effective batch size, improving throughput and latency.
Example 2: Mapping a real value to int8
Given symmetric int8, range −128..127. If the tensor real range is approximately −6.35..6.35, then scale ≈ 6.35/127 ≈ 0.05, zero_point = 0. For r = 2.1, q = round(r/scale) = round(2.1/0.05) = round(42) = 42. Dequantized r' = q × scale ≈ 2.1.
What if it clips?
If r exceeds the representable range, q saturates at −128 or 127. Good calibration reduces clipping.
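The same arithmetic, plus the clipping case, as a short Python check (to_int8 is just an illustrative helper):

```python
import numpy as np

scale, zero_point = 6.35 / 127, 0     # scale = 0.05, symmetric int8

def to_int8(r):
    return int(np.clip(round(r / scale) + zero_point, -128, 127))

q = to_int8(2.1)                      # 42
r_prime = scale * (q - zero_point)    # ~2.1 (recovered almost exactly)

q_sat = to_int8(9.0)                  # 9.0 is outside ±6.35, so q saturates at 127
r_sat = scale * (q_sat - zero_point)  # ~6.35, a clipping error of about 9.0 - 6.35 = 2.65
print(q, r_prime, q_sat, r_sat)
```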
Example 3: Static quantization with calibration
- Collect: 512–2048 representative texts covering typical lengths and vocab.
- Run: Pass them through the unquantized model to log activation ranges.
- Freeze: Compute scales and zero_points.
- Quantize: Convert weights and add quant/dequant nodes or fusions.
- Validate: Compare metrics (accuracy, BLEU, F1, or perplexity) and latency.
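A minimal sketch of the Run and Freeze steps using PyTorch forward hooks to record min/max activation ranges. Real static-quantization flows use the framework's own observers and fusions; the random batches below are a stand-in for encoded calibration texts, and the tiny model is illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2)).eval()
ranges = {}  # layer name -> (min, max) seen during calibration

def make_hook(name):
    def hook(_module, _inputs, output):
        lo, hi = output.min().item(), output.max().item()
        prev_lo, prev_hi = ranges.get(name, (lo, hi))
        ranges[name] = (min(lo, prev_lo), max(hi, prev_hi))
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, nn.Linear)]

# Run: pass calibration data through the fp32 model (random batches as a stand-in)
with torch.no_grad():
    for _ in range(256):
        model(torch.randn(8, 768))
for h in handles:
    h.remove()

# Freeze: derive symmetric per-tensor scales from the observed ranges
scales = {name: max(abs(lo), abs(hi)) / 127.0 for name, (lo, hi) in ranges.items()}
print(scales)
```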
Tip
Include edge cases like very short and very long inputs to avoid activation outliers later.
Example 4: Throughput impact
If your pipeline is bandwidth-bound, halving model size can yield close to 2× tokens/sec. If it is compute-bound, integer kernels may still give 1.2×–1.5× speedups. Results vary by hardware and kernels.
Choosing a scheme
- Need quick wins on CPU for classification or reranking? Try dynamic int8 on linear layers.
- Serving on GPU with tight quality needs? Try static int8 with good calibration or weight-only 4-bit for large LLMs.
- Mission-critical quality? Use QAT or selective quantization (skip sensitive layers like embeddings or final layer norm).
- Checklist before you begin:
- ☐ Target hardware understood (integer kernel support present).
- ☐ SLOs defined (latency, throughput, memory cap).
- ☐ Representative calibration set prepared.
- ☐ Quality metric chosen and baseline recorded.
How to validate quality
- Run baseline (unquantized) and record metrics and latency.
- Quantize and re-run on the same data.
- Compare accuracy/quality deltas and latency improvements.
- Spot-check failure cases and long inputs.
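For the latency side of that comparison, a rough harness like the sketch below can work (p95_latency_ms is an illustrative helper; for GPU runs add device synchronization, and prefer a proper profiler for production numbers):

```python
import time
import torch

def p95_latency_ms(model, inputs, warmup=10, runs=100):
    # Rough wall-clock measurement of single-request latency on the current device
    model.eval()
    times = []
    with torch.no_grad():
        for _ in range(warmup):
            model(inputs)
        for _ in range(runs):
            start = time.perf_counter()
            model(inputs)
            times.append((time.perf_counter() - start) * 1000.0)
    times.sort()
    return times[int(0.95 * len(times)) - 1]

# Usage: run the same harness on the baseline and the quantized model
model = torch.nn.Linear(768, 768)
x = torch.randn(8, 768)
print(f"P95 latency: {p95_latency_ms(model, x):.2f} ms")
```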
Useful quick checks
- Cosine similarity of layer outputs before/after quantization on a few samples (a minimal sketch follows this list).
- Perplexity change for language models on a small validation set.
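A minimal version of the cosine-similarity check. In practice you would capture the same layer's outputs from the fp32 and quantized models on the same batch; the simulated noise here merely stands in for quantization error:

```python
import torch
import torch.nn.functional as F

def mean_layer_cosine(baseline_out, quantized_out):
    # Flatten each sample and average the per-sample cosine similarity
    a = baseline_out.flatten(start_dim=1)
    b = quantized_out.flatten(start_dim=1)
    return F.cosine_similarity(a, b, dim=1).mean().item()

baseline = torch.randn(4, 128, 768)                       # fp32 layer output (stand-in)
quantized = baseline + 0.01 * torch.randn_like(baseline)  # simulated quantization noise
print(f"mean cosine similarity: {mean_layer_cosine(baseline, quantized):.4f}")
```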
Exercises
Complete these here, then open the solutions below when ready.
Exercise 1 — Memory and latency budgeting
You have a 1.3B-parameter transformer for an FAQ bot. Estimate memory at fp16, int8, and int4. Assume the app is memory-bandwidth-bound; how might latency change when moving from fp16 to int8?
Exercise 2 — Scale and zero_point math
Use symmetric int8. Given a tensor real range of −3.2..2.8, pick scale = max(|min|, |max|)/127. Compute q for r = 1.6 and the dequantized r'.
- Exercise checklist:
- ☐ Show numeric steps, not just final numbers.
- ☐ State any assumptions you make.
- ☐ Note potential error sources (clipping, rounding).
Common mistakes and self-check
- Too small calibration set → activation clipping in production. Self-check: compare activation histograms from calibration vs production logs on a sample.
- Quantizing everything blindly. Self-check: try skipping embeddings or final layer norm and re-measure.
- Ignoring long-tail inputs. Self-check: include very long and very short sequences in calibration.
- Mismatched kernels on hardware. Self-check: confirm integer kernel availability and that they are actually used (profiling).
Quick recovery playbook
- Quality drop too high? Switch to per-channel weight quantization or asymmetric quantization for activations.
- Latency not improving? You may be compute-bound; test batch size and check kernel fusions.
Practical projects
- Build a small calibration set (1–2k texts) from your domain.
- Quantize a sentiment classifier to int8 with per-channel weight quantization.
- Report metrics: accuracy delta (target ≤ 0.5% drop), 95th-percentile latency, and memory usage.
- Selective quantization: skip embeddings and final layer norm; compare.
- Shadow test: serve quantized model alongside baseline for a day; compare logs.
Who this is for
NLP engineers and ML practitioners deploying transformer models who need faster, cheaper inference without major quality loss.
Prerequisites
- Comfort with tensors and basic linear layers.
- Ability to run inference and measure latency/throughput.
- Basic understanding of your model’s evaluation metric.
Learning path
- Quantization basics (this lesson).
- Calibration and static quantization.
- Selective quantization and per-channel strategies.
- Advanced: 4-bit weight-only methods and QAT.
- Production validation: canary, shadow, rollback.
Next steps
- Apply static int8 to a real model with a representative calibration set.
- Measure accuracy change and latency improvement; iterate on per-channel and asymmetric options.
- Consider 4-bit weight-only if memory is still a bottleneck.
Mini challenge
Design a quantization plan for a 3B-parameter reranker with a 50 ms P95 latency target on CPU-only hardware. Specify: scheme (dynamic/static/QAT), per-tensor vs per-channel, calibration data size and composition, metrics to monitor, and a rollback trigger. Keep your accuracy drop under 0.5% if possible.