Why this matters
Mixed precision training lets you train vision models faster and fit larger batches on the same GPU by using lower-precision math where it is safe. For Computer Vision Engineers, this means:
- Shorter experiment cycles when tuning backbones and augmentations.
- Training larger models (e.g., high-res segmentation, multi-scale detection) on limited VRAM.
- Lower cloud costs by making better use of available hardware.
Real tasks you’ll face:
- Fitting a Mask R-CNN on a 12–16 GB GPU without reducing image size.
- Speeding up ResNet/ViT classification training to iterate on data augmentation quickly.
- Stabilizing training when FP16 causes NaNs by switching the right ops back to FP32 or using BF16.
Concept explained simply
Idea: use 16-bit numbers (FP16 or BF16) for most compute to go faster and use less memory, but keep a safe FP32 copy of weights for accuracy. If small gradients underflow in FP16, scale them up temporarily (loss scaling) to keep information.
Mental model
- Two buckets of math: fast bucket (16-bit) and safe bucket (32-bit).
- Autocast decides which ops can run in 16-bit without hurting stability.
- A master FP32 weight copy is updated by the optimizer for accuracy.
- Loss scaling prevents tiny gradients from vanishing in FP16.
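To see the underflow problem concretely, here is a minimal PyTorch sketch; the 1e-8 gradient value and the 2**16 scale are purely illustrative (dynamic loss scaling picks the scale automatically).
import torch

tiny_grad = torch.tensor(1e-8)                     # magnitude of a very small gradient
print(tiny_grad.half())                            # tensor(0., dtype=torch.float16): underflow, information lost
scale = 2.0 ** 16                                  # illustrative loss scale
print((tiny_grad * scale).half())                  # now representable in FP16
print((tiny_grad * scale).half().float() / scale)  # unscale in FP32 to recover roughly 1e-8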
Core components you will use
- Autocast region: runs most layers (conv, matmul) in FP16/BF16; critical reductions happen in FP32.
- Master FP32 weights: optimizer updates happen in FP32 even if forward was 16-bit.
- Loss scaling: static or dynamic; dynamic usually safer and automatic in modern frameworks.
- Precision choice:
  - FP16: faster on many NVIDIA GPUs, smaller dynamic range; may need more careful loss scaling.
  - BF16: similar speed, larger exponent range, fewer overflows; supported on many newer GPUs/TPUs/CPUs.
- Ops that often prefer FP32: softmax+log combos, exp/log, variance/mean stats, normalizations, division by small numbers, some NMS and indexing ops.
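As a rough sketch of the precision choice (assuming a CUDA device; BF16 availability depends on the GPU generation), you can pick the autocast dtype at runtime:
import torch

# Prefer BF16 where supported: its wider exponent range makes loss scaling unnecessary.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16

with torch.autocast(device_type='cuda', dtype=amp_dtype):
    ...  # forward pass and loss computation
# With FP16 keep a GradScaler; with BF16 it can usually be omitted.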
Worked examples
Example 1: PyTorch image classification with AMP (FP16)
import torch
from torch import nn, optim
from torch.cuda.amp import autocast, GradScaler

model = ...  # e.g., torchvision.models.resnet50()
model = model.to('cuda')
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()  # dynamic loss scaling

for images, labels in train_loader:
    images = images.to('cuda', non_blocking=True)
    labels = labels.to('cuda', non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with autocast():  # forward pass and loss run in FP16 where safe
        logits = model(images)
        loss = criterion(logits, labels)
    scaler.scale(loss).backward()  # scale the loss so small gradients do not underflow
    scaler.step(optimizer)         # unscales gradients, skips the step if they overflowed
    scaler.update()                # adjusts the scale factor for the next iteration
- Expected: 1.2–1.6× training speedup and 30–50% less VRAM use (rough guideline).
- If you see NaNs, start by checking data, then try BF16 or keep sensitive ops in FP32.
Example 2: TensorFlow/Keras segmentation with mixed_float16
import tensorflow as tf
from tensorflow import keras
# Enable mixed precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')
inputs = keras.Input(shape=(512,512,3))
x = keras.layers.Conv2D(32, 3, activation='relu')(inputs)
# ... your UNet/DeepLab-like model ...
outputs = keras.layers.Conv2D(21, 1, dtype='float32', activation='softmax')(x) # keep numerically sensitive output in float32
model = keras.Model(inputs, outputs)
opt = keras.optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds, epochs=20)
- Policy applies mixed precision automatically; loss scaling is handled for you.
- For final logits/softmax, outputting float32 can improve numeric stability.
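If you write a custom training loop instead of model.fit, loss scaling is not applied for you. Here is a sketch using tf.keras.mixed_precision.LossScaleOptimizer (TF 2.x; the train_step name and its arguments are illustrative):
import tensorflow as tf

opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam(1e-3))

@tf.function
def train_step(model, loss_fn, images, labels):
    with tf.GradientTape() as tape:
        preds = model(images, training=True)
        loss = loss_fn(labels, preds)
        scaled_loss = opt.get_scaled_loss(loss)        # scale to avoid FP16 gradient underflow
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = opt.get_unscaled_gradients(scaled_grads)   # unscale before applying the update
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss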
Example 3: Object detection with selective autocast
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
model = ...  # detection model (e.g., torchvision Faster R-CNN / Mask R-CNN)

for images, targets in loader:
    images = [img.cuda(non_blocking=True) for img in images]
    # Assumes targets are dicts of tensors, as in torchvision detection models.
    targets = [{k: v.cuda(non_blocking=True) for k, v in t.items()} for t in targets]
    optimizer.zero_grad(set_to_none=True)
    with autocast():
        losses = model(images, targets)  # many detection libs return a dict of loss terms
        loss = sum(losses.values())
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# If post-processing like NMS misbehaves, move it out of autocast or cast inputs to float32:
# with torch.cuda.amp.autocast(enabled=False):
#     keep = nms(boxes.float(), scores.float(), iou_threshold)
- Tip: Keep post-processing (NMS, top-k indexing) in float32 if you see inconsistent results.
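A self-contained sketch of that tip using torchvision.ops.nms (the box values and the 0.5 IoU threshold are dummy data for illustration):
import torch
from torchvision.ops import nms

# Pretend these came out of an FP16 forward pass.
boxes = (torch.rand(100, 4, device='cuda') * 512).half()
boxes[:, 2:] += boxes[:, :2]          # ensure x2 > x1 and y2 > y1
scores = torch.rand(100, device='cuda').half()

with torch.autocast(device_type='cuda', enabled=False):
    keep = nms(boxes.float(), scores.float(), iou_threshold=0.5)  # run NMS in float32
print(keep.shape)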
Performance and stability playbook
- Establish an FP32 baseline for loss/accuracy and throughput.
- Enable mixed precision (FP16 or BF16) and check that early-epoch loss curves match the FP32 baseline.
- If NaNs or divergence appear:
  - Switch to BF16 if supported.
  - Ensure dynamic loss scaling is on; try gradient clipping (e.g., max norm 1.0; see the sketch after this list).
  - Run sensitive ops in float32 (softmax/logits, normalization stats).
- Profile: confirm higher GPU utilization and lower memory. Increase batch size if VRAM allows.
- Converged? Compare final metrics to FP32. Aim for parity within noise.
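For the gradient clipping step above, unscale first so the norm is computed on the real gradient values. A drop-in replacement for the backward/step/update lines of Example 1 (same scaler, model, and optimizer names):
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # gradients are now in true (unscaled) units
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)      # uses the already-unscaled gradients; the step is skipped on inf/NaN
scaler.update()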
Common mistakes and self-check
- Mistake: Forgetting master FP32 weights. Symptom: accuracy drop. Fix: use built-in AMP/keras policy; do not manually cast model weights to half.
- Mistake: Keeping all ops in 16-bit. Symptom: NaNs/Inf. Fix: run reductions/normalizations/softmax in float32.
- Mistake: Loss scaling disabled, or a static scale set too small. Symptom: gradient underflow, stalled learning. Fix: use dynamic scaling or increase the static scale.
- Mistake: Comparing to a weak FP32 baseline. Symptom: misleading speedup. Fix: use the same data pipeline, cudnn benchmarking, etc.
Self-check
- Do your mixed-precision curves track FP32 in the first few epochs?
- Is throughput at least ~1.2× faster or batch size larger?
- No NaNs in loss or gradients after 1000 steps?
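A small helper for the NaN check (the check_finite name is illustrative; call it every few hundred steps after backward):
import torch

def check_finite(model, loss):
    """Return False if the loss or any gradient contains NaN/Inf."""
    if not torch.isfinite(loss):
        return False
    return all(torch.isfinite(p.grad).all()
               for p in model.parameters() if p.grad is not None)

# Note: with dynamic loss scaling an occasional Inf gradient is expected and handled
# by GradScaler; persistent NaNs in the loss itself are the real warning sign.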
Practical projects
- Project 1: ResNet50 on a medium dataset (e.g., 100–200 classes). Goal: reach FP32 accuracy within 0.2% using AMP and increase batch size 1.5×.
- Project 2: UNet segmentation at 512×512. Goal: fit a larger batch or higher resolution using mixed precision; verify IoU parity with FP32.
- Project 3: Custom detector with NMS. Goal: ensure NMS and post-processing run in float32 if needed; validate precision/recall parity.
Exercises
Do these in order, then take the quick test.
Exercise 1: Convert a PyTorch loop to AMP
Start with a working FP32 loop. Add autocast and GradScaler properly.
- Use autocast only around forward and loss.
- Scale, backward, step via scaler; then update.
- Confirm no NaNs and measure speed/memory.
Exercise 2: Enable mixed precision in Keras and stabilize output
Turn on mixed_float16 policy, then ensure final logits or softmax computations are float32 if training becomes unstable.
- Set global policy.
- Cast output layer dtype to float32.
- Confirm val accuracy matches FP32 within noise.
- [Checklist] I compared FP32 vs mixed precision curves.
- [Checklist] I verified no NaNs/Infs for 1k steps.
- [Checklist] I measured speed and VRAM gains.
Mini challenge
Take a model that currently fits batch size 16 in FP32. With mixed precision, target batch size 24–32 while keeping accuracy within 0.2% of baseline. Write down which ops you kept at float32 and why.
Who this is for
- Computer Vision Engineers training CNN/Transformer models on GPUs or accelerators.
- ML practitioners wanting faster iteration and larger batch sizes.
Prerequisites
- Comfortable with PyTorch or TensorFlow training loops.
- Basic understanding of floating-point arithmetic and training stability.
- Ability to profile GPU memory and throughput.
Learning path
- Review your FP32 training loop and record baseline metrics.
- Enable AMP (PyTorch) or mixed_float16 (Keras) and verify parity.
- Handle instability: BF16, dynamic loss scaling, float32-sensitive ops.
- Profile, then increase batch size or resolution (see the sketch after this list).
- Automate checks: NaN detection, gradient norms, convergence tracking.
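A minimal PyTorch sketch for the profiling step (measure the FP32 and mixed-precision runs over the same number of steps; exact numbers depend on the GPU):
import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()

...  # run a fixed number of training steps here

torch.cuda.synchronize()  # wait for queued GPU work to finish before stopping the clock
elapsed = time.perf_counter() - start
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"elapsed: {elapsed:.1f} s, peak memory: {peak_gb:.2f} GB")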
Next steps
- Integrate mixed precision into your default training templates.
- Add automatic fallbacks (BF16/FP32 for sensitive paths).
- Combine with gradient checkpointing or distributed training for further gains.