Why this matters
Mixed precision training lets you train vision models faster and fit larger batches on the same GPU by using lower-precision math where it is safe. For Computer Vision Engineers, this means:
- Shorter experiment cycles when tuning backbones and augmentations.
- Training larger models (e.g., high-res segmentation, multi-scale detection) on limited VRAM.
- Lower cloud costs by making better use of available hardware.
Real tasks you’ll face:
- Fitting a Mask R-CNN on a 12–16 GB GPU without reducing image size.
- Speeding up ResNet/ViT classification training to iterate on data augmentation quickly.
- Stabilizing training when FP16 causes NaNs by switching the right ops back to FP32 or using BF16.
Concept explained simply
Idea: use 16-bit numbers (FP16 or BF16) for most compute to go faster and use less memory, but keep a safe FP32 copy of weights for accuracy. If small gradients underflow in FP16, scale them up temporarily (loss scaling) to keep information.
Mental model
- Two buckets of math: fast bucket (16-bit) and safe bucket (32-bit).
- Autocast decides which ops can run in 16-bit without hurting stability.
- A master FP32 weight copy is updated by the optimizer for accuracy.
- Loss scaling prevents tiny gradients from vanishing in FP16.
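To see the underflow problem concretely, here is a minimal PyTorch sketch; the 1e-8 gradient value and the 2**16 scale are purely illustrative (dynamic loss scaling picks the scale automatically).
import torch

tiny_grad = torch.tensor(1e-8)                     # magnitude of a very small gradient
print(tiny_grad.half())                            # tensor(0., dtype=torch.float16): underflow, information lost
scale = 2.0 ** 16                                  # illustrative loss scale
print((tiny_grad * scale).half())                  # now representable in FP16
print((tiny_grad * scale).half().float() / scale)  # unscale in FP32 to recover roughly 1e-8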
Core components you will use
- Autocast region: runs most layers (conv, matmul) in FP16/BF16; critical reductions happen in FP32.
- Master FP32 weights: optimizer updates happen in FP32 even if forward was 16-bit.
- Loss scaling: static or dynamic; dynamic usually safer and automatic in modern frameworks.
- Precision choice:
  - FP16: faster on many NVIDIA GPUs, smaller dynamic range; may need more careful loss scaling.
  - BF16: similar speed, larger exponent range, fewer overflows; supported on many newer GPUs/TPUs/CPUs.
- Ops that often prefer FP32: softmax+log combos, exp/log, variance/mean stats, normalizations, division by small numbers, some NMS and indexing ops.
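As a rough sketch of the precision choice (assuming a CUDA device; BF16 availability depends on the GPU generation), you can pick the autocast dtype at runtime:
import torch

# Prefer BF16 where supported: its wider exponent range makes loss scaling unnecessary.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16

with torch.autocast(device_type='cuda', dtype=amp_dtype):
    ...  # forward pass and loss computation
# With FP16 keep a GradScaler; with BF16 it can usually be omitted.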
Worked examples
Example 1: PyTorch image classification with AMP (FP16)
import torch
from torch import nn, optim
from torch.cuda.amp import autocast, GradScaler

model = ...  # e.g., torchvision.models.resnet50()
model = model.to('cuda')
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()  # dynamic loss scaling

for images, labels in train_loader:
    images = images.to('cuda', non_blocking=True)
    labels = labels.to('cuda', non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with autocast():  # forward pass and loss run in FP16 where safe
        logits = model(images)
        loss = criterion(logits, labels)
    scaler.scale(loss).backward()  # scale the loss so small gradients do not underflow
    scaler.step(optimizer)         # unscales gradients, skips the step if they overflowed
    scaler.update()                # adjusts the scale factor for the next iteration
- Expected: 1.2–1.6× training speedup and 30–50% less VRAM use (rough guideline).
- If you see NaNs, start by checking data, then try BF16 or keep sensitive ops in FP32.
Example 2: TensorFlow/Keras segmentation with mixed_float16
import tensorflow as tf
from tensorflow import keras
# Enable mixed precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')
inputs = keras.Input(shape=(512,512,3))
x = keras.layers.Conv2D(32, 3, activation='relu')(inputs)
# ... your UNet/DeepLab-like model ...
outputs = keras.layers.Conv2D(21, 1, dtype='float32', activation='softmax')(x) # keep numerically sensitive output in float32
model = keras.Model(inputs, outputs)
opt = keras.optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds, epochs=20)
- Policy applies mixed precision automatically; loss scaling is handled for you.
- For final logits/softmax, outputting float32 can improve numeric stability.
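If you write a custom training loop instead of model.fit, loss scaling is not applied for you. Here is a sketch using tf.keras.mixed_precision.LossScaleOptimizer (TF 2.x; the train_step name and its arguments are illustrative):
import tensorflow as tf

opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam(1e-3))

@tf.function
def train_step(model, loss_fn, images, labels):
    with tf.GradientTape() as tape:
        preds = model(images, training=True)
        loss = loss_fn(labels, preds)
        scaled_loss = opt.get_scaled_loss(loss)        # scale to avoid FP16 gradient underflow
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = opt.get_unscaled_gradients(scaled_grads)   # unscale before applying the update
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss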
Example 3: Object detection with selective autocast
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
model = ...  # detection model (e.g., torchvision Faster R-CNN / Mask R-CNN)

for images, targets in loader:
    images = [img.cuda(non_blocking=True) for img in images]
    # Assumes targets are dicts of tensors, as in torchvision detection models.
    targets = [{k: v.cuda(non_blocking=True) for k, v in t.items()} for t in targets]
    optimizer.zero_grad(set_to_none=True)
    with autocast():
        losses = model(images, targets)  # many detection libs return a dict of loss terms
        loss = sum(losses.values())
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# If post-processing like NMS misbehaves, move it out of autocast or cast inputs to float32:
# with torch.cuda.amp.autocast(enabled=False):
#     keep = nms(boxes.float(), scores.float(), iou_threshold)
- Tip: Keep post-processing (NMS, top-k indexing) in float32 if you see inconsistent results.
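A self-contained sketch of that tip using torchvision.ops.nms (the box values and the 0.5 IoU threshold are dummy data for illustration):
import torch
from torchvision.ops import nms

# Pretend these came out of an FP16 forward pass.
boxes = (torch.rand(100, 4, device='cuda') * 512).half()
boxes[:, 2:] += boxes[:, :2]          # ensure x2 > x1 and y2 > y1
scores = torch.rand(100, device='cuda').half()

with torch.autocast(device_type='cuda', enabled=False):
    keep = nms(boxes.float(), scores.float(), iou_threshold=0.5)  # run NMS in float32
print(keep.shape)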
Performance and stability playbook
- Establish an FP32 baseline for loss/accuracy and throughput.
- Enable mixed precision (FP16 or BF16) and check that early-epoch loss curves match the FP32 baseline.
- If NaNs or divergence appear:
  - Switch to BF16 if supported.
  - Ensure dynamic loss scaling is on; try gradient clipping (e.g., max norm 1.0; see the sketch after this list).
  - Run sensitive ops in float32 (softmax/logits, normalization stats).
- Profile: confirm higher GPU utilization and lower memory. Increase batch size if VRAM allows.
- Converged? Compare final metrics to FP32. Aim for parity within noise.
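For the gradient clipping step above, unscale first so the norm is computed on the real gradient values. A drop-in replacement for the backward/step/update lines of Example 1 (same scaler, model, and optimizer names):
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # gradients are now in true (unscaled) units
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)      # uses the already-unscaled gradients; the step is skipped on inf/NaN
scaler.update()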
Common mistakes and self-check
- Mistake: Forgetting master FP32 weights. Symptom: accuracy drop. Fix: use built-in AMP/keras policy; do not manually cast model weights to half.
- Mistake: Keeping all ops in 16-bit. Symptom: NaNs/Inf. Fix: run reductions/normalizations/softmax in float32.
- Mistake: Loss scaling disabled, or a static scale set too small. Symptom: gradient underflow, stalled learning. Fix: use dynamic scaling or increase the static scale.
- Mistake: Comparing to a weak FP32 baseline. Symptom: misleading speedup. Fix: use the same data pipeline, cudnn benchmarking, etc.
Self-check
- Do your mixed-precision curves track FP32 in the first few epochs?
- Is throughput at least ~1.2× faster or batch size larger?
- No NaNs in loss or gradients after 1000 steps?
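A small helper for the NaN check (the check_finite name is illustrative; call it every few hundred steps after backward):
import torch

def check_finite(model, loss):
    """Return False if the loss or any gradient contains NaN/Inf."""
    if not torch.isfinite(loss):
        return False
    return all(torch.isfinite(p.grad).all()
               for p in model.parameters() if p.grad is not None)

# Note: with dynamic loss scaling an occasional Inf gradient is expected and handled
# by GradScaler; persistent NaNs in the loss itself are the real warning sign.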
Practical projects
- Project 1: ResNet50 on a medium dataset (e.g., 100–200 classes). Goal: reach FP32 accuracy within 0.2% using AMP and increase batch size 1.5×.
- Project 2: UNet segmentation at 512×512. Goal: fit a larger batch or higher resolution using mixed precision; verify IoU parity with FP32.
- Project 3: Custom detector with NMS. Goal: ensure NMS and post-processing run in float32 if needed; validate precision/recall parity.
Exercises
Do these in order, then take the quick test.
Exercise 1: Convert a PyTorch loop to AMP
Start with a working FP32 loop. Add autocast and GradScaler properly.
- Use autocast only around forward and loss.
- Scale, backward, step via scaler; then update.
- Confirm no NaNs and measure speed/memory.
Exercise 2: Enable mixed precision in Keras and stabilize output
Turn on mixed_float16 policy, then ensure final logits or softmax computations are float32 if training becomes unstable.
- Set global policy.
- Cast output layer dtype to float32.
- Confirm val accuracy matches FP32 within noise.
- [Checklist] I compared FP32 vs mixed precision curves.
- [Checklist] I verified no NaNs/Infs for 1k steps.
- [Checklist] I measured speed and VRAM gains.
Mini challenge
Take a model that currently fits batch size 16 in FP32. With mixed precision, target batch size 24–32 while keeping accuracy within 0.2% of baseline. Write down which ops you kept at float32 and why.
Who this is for
- Computer Vision Engineers training CNN/Transformer models on GPUs or accelerators.
- ML practitioners wanting faster iteration and larger batch sizes.
Prerequisites
- Comfortable with PyTorch or TensorFlow training loops.
- Basic understanding of floating-point arithmetic and training stability.
- Ability to profile GPU memory and throughput.
Learning path
- Review your FP32 training loop and record baseline metrics.
- Enable AMP (PyTorch) or mixed_float16 (Keras) and verify parity.
- Handle instability: BF16, dynamic loss scaling, float32-sensitive ops.
- Profile, then increase batch size or resolution (see the sketch after this list).
- Automate checks: NaN detection, gradient norms, convergence tracking.
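A minimal PyTorch sketch for the profiling step (measure the FP32 and mixed-precision runs over the same number of steps; exact numbers depend on the GPU):
import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()

...  # run a fixed number of training steps here

torch.cuda.synchronize()  # wait for queued GPU work to finish before stopping the clock
elapsed = time.perf_counter() - start
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"elapsed: {elapsed:.1f} s, peak memory: {peak_gb:.2f} GB")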
Next steps
- Integrate mixed precision into your default training templates.
- Add automatic fallbacks (BF16/FP32 for sensitive paths).
- Combine with gradient checkpointing or distributed training for further gains.