Why this matters
As an Applied Scientist, you often need to ship models that are both accurate and efficient. Mixed precision (using lower-precision floating point like FP16/BF16 where safe) and quantization (mapping tensors to INT8 or lower) can:
- Cut training time (often 1.5–3× on modern GPUs) while keeping accuracy steady.
- Reduce inference latency and cloud costs by shrinking model memory and bandwidth needs.
- Enable on-device/edge deployment where memory and power are tight.
Typical tasks you will face:
- Choose FP16/BF16 for training to speed up large models.
- Quantize a model to INT8 for fast CPU or accelerator inference with minimal accuracy drop.
- Debug underflow/overflow issues from low precision and apply loss scaling or calibration.
Concept explained simply
Precision decides how many bits represent numbers. Fewer bits mean faster math and less memory, but also less numeric resolution. Mixed precision uses high precision where it matters (e.g., reductions) and lower precision elsewhere. Quantization stores and computes with integers by scaling real values into a fixed range.
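A minimal sketch of that scaling idea, assuming symmetric signed INT8 and NumPy (the function names here are illustrative, not a library API):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float values onto INT8 codes with a single (per-tensor) scale."""
    scale = max(np.abs(x).max(), 1e-8) / 127.0            # symmetric range [-127, 127]
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the INT8 codes."""
    return q.astype(np.float32) * scale

x = np.random.randn(4).astype(np.float32)
q, scale = quantize_int8(x)
print(x, dequantize(q, scale))   # values agree up to quantization error
```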
Mental model: "Camera resolution" analogy
Think of numeric precision like camera resolution. High resolution (FP32) captures every detail but costs storage and time. Medium resolution (FP16/BF16) is lighter and still clear enough for most scenes. Quantization (INT8) is like a well-tuned compressed photo: smaller and fast to send, but you must calibrate compression to avoid visible artifacts.
Quick glossary
- FP32/FP16/BF16: 32- and 16-bit floating-point formats. BF16 keeps an FP32-sized exponent for greater dynamic range at 16-bit cost.
- INT8: 8-bit integer representation used after scaling (quantization).
- Loss scaling: Multiply the loss by a large factor so small gradients don't underflow in FP16, then unscale the gradients before updating.
- Static quantization: Calibrate with a sample dataset to determine activation scales.
- Dynamic quantization: Determine activation scales on-the-fly at runtime; weights are pre-quantized (see the sketch after this glossary).
- Per-tensor vs per-channel: One scale for the whole tensor vs one scale per output channel (more accurate, slightly more overhead).
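The dynamic-quantization entry corresponds to a one-call conversion in PyTorch; a minimal sketch on a toy model (the model itself is just an illustrative stand-in):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Linear weights are quantized to INT8 ahead of time; activation scales
# are computed on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(quantized(x).shape)   # same output shape, smaller weights, INT8 matmuls on CPU
```

On older PyTorch versions the same call lives under torch.quantization.quantize_dynamic.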
Worked examples
Example 1: Memory and bandwidth savings
Suppose a model has 200M parameters.
- FP32 weights ≈ 200M × 4 bytes = 800 MB
- FP16/BF16 weights ≈ 200M × 2 bytes = 400 MB
- INT8 weights ≈ 200M × 1 byte = 200 MB
If your inference is bandwidth-bound, moving from FP32 to FP16 can approach ~2× throughput; FP32 to INT8 can approach ~4×, assuming kernels and hardware support align.
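The arithmetic as a quick script (a sketch of the back-of-the-envelope estimate above; the bandwidth-bound speedup is an upper bound, not a guarantee):

```python
# Weight memory and bandwidth-bound throughput ceiling for a 200M-parameter model.
PARAMS = 200e6
BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "INT8": 1}

for fmt, nbytes in BYTES_PER_PARAM.items():
    size_mb = PARAMS * nbytes / 1e6
    speedup = BYTES_PER_PARAM["FP32"] / nbytes
    print(f"{fmt}: {size_mb:.0f} MB weights, up to ~{speedup:.0f}x vs FP32 if bandwidth-bound")
```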
Example 2: Mixed precision training with loss scaling
Workflow (conceptual):
- Autocast forward to FP16/BF16 for most layers; keep softmax/log-sum-exp, normalization, and reductions in FP32.
- Compute loss in FP32. Multiply by a scale S (e.g., 2^10 = 1024).
- Backprop. If any gradient is Inf/NaN, skip the step and reduce S.
- Unscale gradients, apply clipping if needed, and update FP32 master weights.
Result: typically 1.5–3× faster training on modern GPUs with similar accuracy for many vision and NLP models.
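A minimal sketch of this workflow with PyTorch AMP, assuming a CUDA GPU; the toy model and random batches stand in for a real training setup:

```python
import torch
from torch import nn

device = "cuda"  # GradScaler-based loss scaling targets FP16 on CUDA GPUs
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()              # manages the loss scale S automatically

for step in range(100):
    x = torch.randn(32, 128, device=device)       # toy batch standing in for real data
    y = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():               # most ops run in FP16; sensitive ops stay FP32
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                 # backprop through the scaled loss
    scaler.unscale_(optimizer)                    # unscale gradients before clipping
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                        # skips the update if Inf/NaN gradients appear
    scaler.update()                               # shrinks S after overflow, grows it when stable
```

With BF16 the wider exponent usually makes loss scaling unnecessary; autocast with dtype=torch.bfloat16 and no GradScaler is a common choice.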
Example 3: Post-training INT8 quantization (per-channel)
For a conv layer with 256 output channels, per-channel weight quantization assigns each output channel its own scale and zero-point. This better fits channel-specific distributions and often recovers 0.5–1.5% accuracy compared to per-tensor, with a small compute overhead. Model size drops to ~25% of FP32 size for quantized weights.
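A minimal sketch of per-channel weight scales for such a conv layer (NumPy, symmetric quantization; the shapes follow the example above):

```python
import numpy as np

def quantize_per_channel(w: np.ndarray):
    """Quantize a conv weight of shape (out_channels, in_channels, kH, kW)
    with one symmetric INT8 scale per output channel."""
    out_ch = w.shape[0]
    flat = w.reshape(out_ch, -1)
    scales = np.maximum(np.abs(flat).max(axis=1), 1e-8) / 127.0   # one scale per channel
    q = np.clip(np.round(flat / scales[:, None]), -127, 127).astype(np.int8)
    return q.reshape(w.shape), scales

w = np.random.randn(256, 64, 3, 3).astype(np.float32)   # 256 output channels
q, scales = quantize_per_channel(w)
print(q.dtype, scales.shape)   # int8 weights, one scale per output channel: (256,)
```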
Example 4: Latency vs batch size
Quantization helps especially at batch size 1 (low latency). Mixed precision can also help throughput at larger batch sizes. If your service is latency-critical (e.g., real-time recommendations), prioritize INT8 with per-channel quantization and carefully calibrated activations. If your workload is throughput-heavy (batching allowed), mixed precision may already deliver much of the benefit with minimal accuracy work.
Step-by-step: choose the right precision
- Check hardware: Does it have Tensor Cores (FP16/BF16) or vector INT8 support? Use what the hardware accelerates best.
- Pick training precision: Start with BF16 (if available) or FP16 with loss scaling. Keep numerically sensitive ops in FP32.
- Pick inference precision: Try INT8 post-training quantization. If accuracy drops too much, use per-channel, better calibration, or fallback to FP16/BF16.
- Calibrate: For static quantization, run a few hundred to a few thousand representative samples to collect activation ranges (see the sketch after this list).
- Validate: Compare latency, throughput, memory, and top-line metrics (e.g., accuracy, F1). Accept only if KPIs hold.
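For the Calibrate step, here is a minimal sketch of collecting activation ranges with forward hooks; the toy model and random calibration batches are stand-ins for your network and representative data:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
calib_batches = [torch.randn(32, 128) for _ in range(10)]   # stand-in for real samples

ranges = {}   # module name -> (min, max) observed over the calibration data

def make_hook(name):
    def hook(module, inputs, output):
        lo, hi = ranges.get(name, (float("inf"), float("-inf")))
        ranges[name] = (min(lo, output.min().item()), max(hi, output.max().item()))
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, nn.Linear)]

with torch.no_grad():
    for batch in calib_batches:
        model(batch)

for h in handles:
    h.remove()

# Convert observed ranges into symmetric INT8 activation scales.
act_scales = {name: max(abs(lo), abs(hi)) / 127.0 for name, (lo, hi) in ranges.items()}
print(act_scales)
```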
Exercises
These mirror the tasks in the Exercises section below; your answers won't be auto-checked here, so use the checklist to verify your work.
Exercise 1 — Estimate memory and throughput gains
You are quantizing a 1.2B-parameter model for inference. Assume:
- Weights only: FP32=4 bytes, FP16=2 bytes, INT8=1 byte per parameter.
- Activations at batch size 1 consume ~4 GB in FP32. Assume they scale with bytes per element (2 GB in FP16, 1 GB in INT8).
- The system is bandwidth-bound, so throughput scales roughly inversely with the bytes moved.
Tasks:
- Compute weights memory for FP32, FP16, INT8.
- Compute activation memory for FP16 and INT8.
- Estimate throughput gain vs FP32 for FP16 and INT8.
Exercise 2 — Draft a mixed precision training plan
Given a transformer fine-tuning job on GPU with autocast support:
- List which operations should remain FP32.
- Describe how to use loss scaling and when to reduce the scale.
- Explain where to keep FP32 master weights and when to unscale and clip gradients.
- Add one check to detect numerical instability.
Checklist
- You calculated memory and throughput correctly in Exercise 1.
- Your plan in Exercise 2 keeps numerically sensitive ops in FP32.
- You included loss scaling, overflow detection, and FP32 master weights.
- You unscale gradients before clipping and apply the optimizer update after clipping.
Common mistakes and self-check
- Turning everything into low precision: Keep softmax/log-sum-exp, normalization, and reductions in FP32. Self-check: Inspect op output dtypes during the forward pass (see the sketch after this list).
- No loss scaling in FP16 training: Small gradients underflow to zero. Self-check: Watch for many zero gradients or sudden training stalls.
- Poor calibration data: Using non-representative samples causes big INT8 accuracy drops. Self-check: Calibrate with real traffic-like inputs.
- Ignoring per-channel quantization: Per-tensor may underperform for conv/linear layers. Self-check: Try per-channel and compare accuracy.
- Skipping accuracy/latency measurement: Always A/B before rollout. Self-check: Log model metrics and p95 latency after changes.
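For the dtype and underflow self-checks above, a minimal sketch of instrumentation (PyTorch; the toy model stands in for your own, and CPU BF16 autocast is used only so the snippet runs without a GPU):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 10))

def log_dtype(name):
    def hook(module, inputs, output):
        print(f"{name}: output dtype = {output.dtype}")
    return hook

for name, module in model.named_modules():
    if not list(module.children()):               # leaf modules only
        module.register_forward_hook(log_dtype(name))

x = torch.randn(8, 64)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).sum()
loss.backward()

# Flag non-finite gradients before the optimizer step.
bad = [n for n, p in model.named_parameters()
       if p.grad is not None and not torch.isfinite(p.grad).all()]
if bad:
    print("Non-finite gradients in:", bad)
```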
Practical projects
- Project 1: Convert a ResNet or small Transformer to FP16/BF16 training, add loss scaling, and report speedup, memory, and final accuracy.
- Project 2: Post-training quantize a classification model to INT8 with per-channel weights. Compare top-1 accuracy and latency at batch size 1 vs FP16.
- Project 3: Build a small calibration pipeline: sample N=1000 representative inputs, run static calibration, and plot accuracy vs N.
Who this is for and prerequisites
- Who: Applied Scientists and ML Engineers deploying or training deep learning models under latency/cost constraints.
- Prerequisites: Familiarity with tensors, training loops, inference pipelines, and basic floating-point numerics.
Learning path
- Start with mixed precision training on your current model (BF16/FP16 + loss scaling).
- Quantize for inference (start with post-training INT8; add per-channel; calibrate well).
- Evaluate trade-offs (accuracy, latency, memory). Iterate.
Next steps
- Instrument your training to log dtypes per op and detect Inf/NaN early.
- Add automated calibration and evaluation scripts to your CI for quantized builds.
- Document precision policies (which ops stay FP32, acceptable accuracy deltas).
Mini challenge
You must deploy a text classification model on CPU for batch size 1 with tight latency. Outline a plan: choose INT8 with per-channel weights, static calibration with 1000 samples from production-like data, and define a rollback if accuracy drops >0.5%. State your acceptance criteria.
Quick Test (progress note)
You can take the Quick Test below. Everyone can use it for free; only logged-in users will have their progress saved.