
Mixed Precision Training

Learn Mixed Precision Training for free with explanations, exercises, and a quick test (for Computer Vision Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

Mixed precision training lets you train vision models faster and fit larger batches on the same GPU by using lower-precision math where it is safe. For Computer Vision Engineers, this means:

  • Shorter experiment cycles when tuning backbones and augmentations.
  • Training larger models (e.g., high-res segmentation, multi-scale detection) on limited VRAM.
  • Lower cloud costs by making better use of available hardware.

Real tasks you’ll face:

  • Fitting a Mask R-CNN on a 12–16 GB GPU without reducing image size.
  • Speeding up ResNet/ViT classification training to iterate on data augmentation quickly.
  • Stabilizing training when FP16 causes NaNs by switching the right ops back to FP32 or using BF16.

Note: The quick test is available to everyone; only logged-in users get saved progress.

Concept explained simply

Idea: use 16-bit numbers (FP16 or BF16) for most compute to go faster and use less memory, but keep a safe FP32 copy of weights for accuracy. If small gradients underflow in FP16, scale them up temporarily (loss scaling) to keep information.

Mental model
  • Two buckets of math: fast bucket (16-bit) and safe bucket (32-bit).
  • Autocast decides which ops can run in 16-bit without hurting stability.
  • A master FP32 weight copy is updated by the optimizer for accuracy.
  • Loss scaling prevents tiny gradients from vanishing in FP16.
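
To make the loss-scaling idea concrete, here is a minimal sketch (the scale factor 65536 is just an illustrative value; frameworks choose and adjust it for you):

import torch

tiny_grad = torch.tensor(1e-8)       # a gradient value too small for FP16
print(tiny_grad.half())              # tensor(0., dtype=torch.float16): underflows to zero
scaled = (tiny_grad * 65536).half()  # scale first, then cast: the value survives
print(scaled.float() / 65536)        # unscale in FP32: ~1e-8, information preserved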

Core components you will use

  • Autocast region: runs most layers (conv, matmul) in FP16/BF16; critical reductions happen in FP32.
  • Master FP32 weights: optimizer updates happen in FP32 even if forward was 16-bit.
  • Loss scaling: static or dynamic; dynamic is usually safer and is handled automatically by modern frameworks.
  • Precision choice:
    • FP16: faster on many NVIDIA GPUs, smaller range; may need more careful scaling.
    • BF16: similar speed, larger exponent range, fewer overflows, supported on many newer GPUs/TPUs/CPUs.
  • Ops that often prefer FP32: softmax+log combos, exp/log, variance/mean stats, normalizations, division by small numbers, some NMS and indexing ops.
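
The range difference between FP16 and BF16 mentioned above is easy to inspect directly; a minimal sketch:

import torch

# Compare the numeric range of the two 16-bit formats.
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, 'max:', info.max, 'smallest normal:', info.tiny)
# float16:  max 65504,   smallest normal ~6.1e-5  -> narrow range, usually needs loss scaling
# bfloat16: max ~3.4e38, smallest normal ~1.2e-38 -> FP32-like range, rarely overflows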

Worked examples

Example 1: PyTorch image classification with AMP (FP16)
import torch
from torch import nn, optim
from torch.cuda.amp import autocast, GradScaler

model = ...  # e.g., ResNet50
model = model.to('cuda')
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()  # dynamic loss scaling

for images, labels in train_loader:
    images = images.to('cuda', non_blocking=True)
    labels = labels.to('cuda', non_blocking=True)
    optimizer.zero_grad(set_to_none=True)

    with autocast():
        logits = model(images)
        loss = criterion(logits, labels)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
  • Expected: 1.2–1.6× training speedup and 30–50% less VRAM use (rough guideline).
  • If you see NaNs, start by checking data, then try BF16 or keep sensitive ops in FP32.
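
If FP16 keeps producing NaNs and your hardware supports BF16 (e.g., NVIDIA Ampere or newer), a minimal BF16 variant of the same loop could look like this. GradScaler is usually unnecessary because BF16 has an FP32-sized exponent range:

import torch
from torch import nn, optim

model = ...  # same ResNet50 as above, moved to 'cuda'
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for images, labels in train_loader:
    images = images.to('cuda', non_blocking=True)
    labels = labels.to('cuda', non_blocking=True)
    optimizer.zero_grad(set_to_none=True)

    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        logits = model(images)
        loss = criterion(logits, labels)

    loss.backward()   # no scaler needed: BF16 gradients rarely underflow
    optimizer.step()
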
Example 2: TensorFlow/Keras segmentation with mixed_float16
import tensorflow as tf
from tensorflow import keras

# Enable mixed precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')

inputs = keras.Input(shape=(512,512,3))
x = keras.layers.Conv2D(32, 3, activation='relu')(inputs)
# ... your UNet/DeepLab-like model ...
outputs = keras.layers.Conv2D(21, 1, dtype='float32', activation='softmax')(x)  # keep numerically sensitive output in float32
model = keras.Model(inputs, outputs)

opt = keras.optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.fit(train_ds, validation_data=val_ds, epochs=20)
  • Policy applies mixed precision automatically; loss scaling is handled for you.
  • For final logits/softmax, outputting float32 can improve numeric stability.
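
Under the mixed_float16 policy, model.fit wraps the optimizer in a LossScaleOptimizer for you. If you write a custom training loop instead, you can apply the same mechanism explicitly; a minimal sketch using the TF2-era Keras API:

import tensorflow as tf
from tensorflow import keras

tf.keras.mixed_precision.set_global_policy('mixed_float16')
opt = keras.mixed_precision.LossScaleOptimizer(keras.optimizers.Adam(1e-3))

@tf.function
def train_step(model, loss_fn, x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        loss = loss_fn(y, y_pred)
        scaled_loss = opt.get_scaled_loss(loss)        # scale before computing gradients
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = opt.get_unscaled_gradients(scaled_grads)   # unscale before the update
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss
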
Example 3: Object detection with selective autocast
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
model = ...  # detection model
optimizer = ...  # e.g., torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)

for images, targets in loader:
    images = [img.cuda(non_blocking=True) for img in images]
    targets = [{k: v.cuda(non_blocking=True) for k, v in t.items()} for t in targets]  # torchvision-style targets
    optimizer.zero_grad(set_to_none=True)

    with autocast():
        losses = model(images, targets)  # returns dict of loss terms in many detection libs
        loss = sum(losses.values())

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    # If post-processing like NMS breaks, move it out of autocast or cast to float32
    with autocast(enabled=False):
        # e.g., run NMS on boxes.float() and scores.float() here
        pass
  • Tip: Keep post-processing (NMS, top-k indexing) in float32 if you see inconsistent results.
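
One way to do that with a torchvision-style detector, assuming you have raw boxes and scores available (postprocess here is a hypothetical helper, not part of any library):

import torch
from torchvision.ops import nms

def postprocess(boxes, scores, iou_threshold=0.5):
    # FP16 box coordinates can lose precision at large image sizes, so cast back to float32.
    boxes = boxes.float()
    scores = scores.float()
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep]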

Performance and stability playbook

  1. Establish an FP32 baseline for loss/accuracy and throughput.
  2. Enable mixed precision (FP16 or BF16) and re-check that early-epoch loss curves match the baseline.
  3. If NaNs or divergence appear:
    • Switch to BF16 if supported.
    • Ensure dynamic loss scaling is on; try gradient clipping (e.g., max norm 1.0; see the sketch after this list).
    • Run sensitive ops in float32 (softmax/logits, normalization stats).
  4. Profile: confirm higher GPU utilization and lower memory. Increase batch size if VRAM allows.
  5. Converged? Compare final metrics to FP32. Aim for parity within noise.
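
For the gradient clipping in step 3, remember that with a GradScaler the gradients are still scaled after backward(); clip only after unscaling. A minimal sketch, continuing the loop from Example 1:

from torch.nn.utils import clip_grad_norm_

scaler.scale(loss).backward()
scaler.unscale_(optimizer)                         # gradients are now in their true (unscaled) range
clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)                             # skips the update if grads contain inf/NaN
scaler.update()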

Common mistakes and self-check

  • Mistake: Forgetting the master FP32 weights. Symptom: accuracy drop. Fix: use the built-in PyTorch AMP or Keras mixed-precision policy; do not manually cast model weights to half.
  • Mistake: Keeping all ops in 16-bit. Symptom: NaNs/Inf. Fix: run reductions/normalizations/softmax in float32.
  • Mistake: Loss scaling disabled, or a static scale set too small. Symptom: gradient underflow, stalled learning. Fix: enable dynamic scaling or increase the static scale.
  • Mistake: Comparing against a weak FP32 baseline. Symptom: misleading speedup numbers. Fix: use the same data pipeline, cuDNN benchmarking settings, etc. for both runs.
Self-check
  • Do your mixed-precision curves track FP32 in the first few epochs?
  • Is throughput at least ~1.2× higher, or is your batch size larger?
  • No NaNs in loss or gradients after 1000 steps?
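
For the last check, a lightweight helper you can call every few hundred steps (check_finite is a hypothetical name, not a framework function):

import torch

def check_finite(model, loss):
    # Fail fast if the loss or any gradient has gone NaN/Inf.
    if not torch.isfinite(loss).all():
        raise RuntimeError('loss is NaN/Inf')
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError(f'non-finite gradient in {name}')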

Practical projects

  • Project 1: ResNet50 on a medium dataset (e.g., 100–200 classes). Goal: reach FP32 accuracy within 0.2% using AMP and increase batch size 1.5×.
  • Project 2: UNet segmentation at 512×512. Goal: fit a larger batch or higher resolution using mixed precision; verify IoU parity with FP32.
  • Project 3: Custom detector with NMS. Goal: ensure NMS and post-processing run in float32 if needed; validate precision/recall parity.

Exercises

Do these in order, then take the quick test.

Exercise 1: Convert a PyTorch loop to AMP

Start with a working FP32 loop. Add autocast and GradScaler properly.

  • Use autocast only around forward and loss.
  • Scale, backward, step via scaler; then update.
  • Confirm no NaNs and measure speed/memory.
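
To measure speed and memory for the before/after comparison, one simple approach (a sketch; adapt the step count to your setup):

import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.time()
# ... run, say, 100 training steps here ...
torch.cuda.synchronize()                           # wait for queued GPU work before stopping the clock
elapsed = time.time() - start
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f'{elapsed:.1f}s for 100 steps, peak VRAM {peak_gb:.2f} GiB')
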
Exercise 2: Enable mixed precision in Keras and stabilize output

Turn on mixed_float16 policy, then ensure final logits or softmax computations are float32 if training becomes unstable.

  • Set global policy.
  • Cast output layer dtype to float32.
  • Confirm val accuracy matches FP32 within noise.
  • [Checklist] I compared FP32 vs mixed precision curves.
  • [Checklist] I verified no NaNs/Infs for 1k steps.
  • [Checklist] I measured speed and VRAM gains.
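
If your final activation is not part of a Conv2D/Dense layer, one way to force a float32 output under the mixed_float16 policy is a separate Activation layer with an explicit dtype; a minimal sketch:

import tensorflow as tf
from tensorflow import keras

tf.keras.mixed_precision.set_global_policy('mixed_float16')

inputs = keras.Input(shape=(32,))
x = keras.layers.Dense(64, activation='relu')(inputs)                  # runs in float16 under the policy
logits = keras.layers.Dense(10)(x)
outputs = keras.layers.Activation('softmax', dtype='float32')(logits)  # softmax in float32
model = keras.Model(inputs, outputs)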

Mini challenge

Take a model that currently fits batch size 16 in FP32. With mixed precision, target batch size 24–32 while keeping accuracy within 0.2% of baseline. Write down which ops you kept at float32 and why.

Who this is for

  • Computer Vision Engineers training CNN/Transformer models on GPUs or accelerators.
  • ML practitioners wanting faster iteration and larger batch sizes.

Prerequisites

  • Comfortable with PyTorch or TensorFlow training loops.
  • Basic understanding of floating-point arithmetic and training stability.
  • Ability to profile GPU memory and throughput.

Learning path

  1. Review your FP32 training loop and record baseline metrics.
  2. Enable AMP (PyTorch) or mixed_float16 (Keras) and verify parity.
  3. Handle instability: BF16, dynamic loss scaling, float32-sensitive ops.
  4. Profile, then increase batch size or resolution.
  5. Automate checks: NaN detection, gradient norms, convergence tracking.

Next steps

  • Integrate mixed precision into your default training templates.
  • Add automatic fallbacks (BF16/FP32 for sensitive paths).
  • Combine with gradient checkpointing or distributed training for further gains.
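
As a sketch of the last point, gradient checkpointing combines cleanly with autocast in recent PyTorch versions; the two-stage backbone below is a hypothetical stand-in for your model:

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

stages = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU()),
]).cuda()
images = torch.randn(8, 3, 224, 224, device='cuda')

with torch.autocast(device_type='cuda', dtype=torch.float16):
    x = images
    for stage in stages:
        x = checkpoint(stage, x, use_reentrant=False)  # recompute activations in backward to save memory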

Practice Exercises

2 exercises to complete

Instructions

Take an existing FP32 classification loop and add AMP with GradScaler.

  1. Wrap forward + loss in autocast.
  2. Use scaler.scale(loss).backward(), scaler.step(optimizer), scaler.update().
  3. Measure speed and memory before/after. Confirm no NaNs for 1k steps.
# Starter (FP32)
for images, labels in loader:
    images, labels = images.cuda(), labels.cuda()
    optimizer.zero_grad(set_to_none=True)
    logits = model(images)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
Expected Output
Training runs without NaNs, achieves accuracy similar to FP32, and shows at least a ~1.2× speedup or a larger batch size.

Mixed Precision Training — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

