Why Training and Optimization matters for Computer Vision Engineers
Training and optimization turn a model architecture into a reliable, fast, and generalizable solution. For Computer Vision Engineers, this skill unlocks: faster model iteration, higher accuracy on challenging datasets, stable multi-GPU training, and the ability to reproduce and explain results for production.
- Ship models that meet latency/accuracy targets.
- Handle class imbalance typical in detection/segmentation.
- Scale to multi-GPU without losing correctness.
- Track experiments and reproduce wins on demand.
What you’ll learn
- Choose and implement the right loss for classification, detection, and segmentation.
- Tune hyperparameters efficiently with small budgets.
- Train with mixed precision and distributed setups safely.
- Build efficient data pipelines and control overfitting.
- Track experiments, seed runs, and reproduce results.
Who this is for
- Engineers building CV systems (classification, detection, segmentation).
- Data scientists moving from notebooks to production-grade training.
- Researchers who need consistent, comparable experiments.
Prerequisites
- Python and PyTorch basics (tensors, nn.Module, DataLoader).
- Familiarity with at least one CV task (e.g., image classification).
- Comfort using a GPU environment (CUDA awareness helps).
Learning path
- Milestone 1: Loss functions for vision. Understand cross-entropy, BCEWithLogits, SmoothL1, IoU/Dice; add label smoothing.
- Milestone 2: Handle imbalance. Use Focal Loss and class weighting; verify per-class metrics.
- Milestone 3: Hyperparameter tuning. Design small search spaces; run random search; adopt LR scheduling.
- Milestone 4: Mixed precision. Use autocast + GradScaler; test stability and speed.
- Milestone 5: Distributed training. DDP with DistributedSampler; ensure correct seeding and evaluation.
- Milestone 6: Data loading + regularization. Optimize DataLoader; use augmentation, weight decay, early stopping.
- Milestone 7: Tracking + reproducibility. Seed everything, log configs/metrics, save/resume checkpoints consistently.
Worked examples
1) Classification with label smoothing, weight decay, and early stopping (PyTorch)
import torch, torch.nn as nn, torch.optim as optim
from torchvision import models

# train_loader and val_loader are assumed to be existing DataLoaders.
epochs = 50
model = models.resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
# Schedulers live in optim.lr_scheduler. T_max matches the epoch count so the
# LR decays once instead of rising again after the first 20 epochs.
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
best_val = float('inf')
patience, bad_epochs = 5, 0
for epoch in range(epochs):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # step the schedule once per epoch
    # validate
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for x, y in val_loader:
            val_loss += criterion(model(x), y).item()
    val_loss /= len(val_loader)
    if val_loss < best_val:
        # new best: reset patience and save a checkpoint
        best_val = val_loss
        bad_epochs = 0
        torch.save({'model': model.state_dict()}, 'best.pt')
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print('Early stopping')
            break
Why it works: label smoothing reduces overconfidence; AdamW with weight decay improves generalization; early stopping prevents overfitting.
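To make the label smoothing part concrete, here is a minimal sketch that rebuilds the smoothed loss by hand from soft targets and compares it with the built-in option. The helper name smoothed_cross_entropy and the random inputs are illustrative assumptions, not part of the example above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against soft targets: (1 - eps) * one_hot + eps / C."""
    num_classes = logits.size(1)
    one_hot = F.one_hot(target, num_classes).float()
    soft = one_hot * (1.0 - eps) + eps / num_classes
    # Per-sample loss is -sum_c soft_c * log p_c, averaged over the batch.
    return -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

torch.manual_seed(0)
logits = torch.randn(8, 10)            # illustrative batch: 8 samples, 10 classes
target = torch.randint(0, 10, (8,))

builtin = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, target)
manual = smoothed_cross_entropy(logits, target, eps=0.1)
print(builtin.item(), manual.item())   # the two values should agree closely
```

With eps=0 the soft targets reduce to one-hot labels and the loss is plain cross-entropy; larger eps moves probability mass to the other classes, which is what discourages overconfident logits.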
2) Implement Focal Loss for multi-class imbalance
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, alpha=None, reduction='mean'):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha  # tensor of shape [C] or scalar or None
        self.reduction = reduction

    def forward(self, logits, target):
        # logits: [N, C], target: [N]
        logp = F.log_softmax(logits, dim=1)
        p = logp.exp()
        pt = p.gather(1, target.unsqueeze(1)).squeeze(1)  # [N]
        logpt = logp.gather(1, target.unsqueeze(1)).squeeze(1)
        focal = (1 - pt).pow(self.gamma) * (-logpt)
        if self.alpha is not None:
            if isinstance(self.alpha, torch.Tensor):
                at = self.alpha.to(logits.device).gather(0, target)
            else:
                at = torch.full_like(pt, float(self.alpha))
            focal = at * focal
        if self.reduction == 'mean':
            return focal.mean()
        if self.reduction == 'sum':
            return focal.sum()
        return focal
Tip: Inspect per-class recall before and after. Focal Loss should help rare classes by emphasizing hard examples.
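One way to follow the tip is a small per-class recall helper, sketched below. The function name per_class_recall and the toy predictions are assumptions for illustration; in practice preds would come from the argmax of your model's logits on the validation set.

```python
import torch

def per_class_recall(preds, targets, num_classes):
    """Recall per class: correct predictions of class c / samples labeled c."""
    correct = torch.bincount(targets[preds == targets], minlength=num_classes)
    support = torch.bincount(targets, minlength=num_classes)
    # Classes with no samples get recall 0 instead of a division by zero.
    return correct.float() / support.clamp(min=1).float()

# Toy example with 3 classes (illustrative values only).
preds = torch.tensor([0, 0, 1, 2, 2, 2])
targets = torch.tensor([0, 1, 1, 2, 2, 0])
recall = per_class_recall(preds, targets, num_classes=3)
print(recall)
```

Computing this once with the baseline loss and once with Focal Loss, on the same validation split, shows whether the rare classes actually improved.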
3) Mixed precision training (autocast + GradScaler)
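import torch
# Newer PyTorch releases prefer torch.amp.autocast('cuda') and
# torch.amp.GradScaler('cuda'); the torch.cuda.amp imports below still work.
from torch.cuda.amp import autocast, GradScaler

# model, optimizer, criterion, and train_loader are assumed to be defined already.
scaler = GradScaler()
model = model.cuda()
for epoch in range(10):
    model.train()
    for x, y in train_loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with autocast():  # forward pass and loss in mixed precision
            logits = model(x)
            loss = criterion(logits, y)
        scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
        scaler.step(optimizer)         # unscales gradients, then steps
        scaler.update()                # adjusts the scale factor for the next step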
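Benefits: faster training and lower memory with minimal code changes. GradScaler skips optimizer steps whose gradients contain inf/NaN, so an occasional skipped step is expected. If the loss itself becomes inf/NaN, try a lower LR or disable AMP for numerically unstable ops.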
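If instability persists, gradient clipping is a common next step. Under AMP the gradients must be unscaled before clipping, otherwise the threshold is applied to scaled values. The sketch below shows one step of that pattern; the linear model, random batch, and max_norm=1.0 are stand-in assumptions, and AMP is only enabled when a GPU is available so the snippet also runs on CPU.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # fall back to full precision on CPU

model = nn.Linear(16, 4).to(device)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(32, 16, device=device)          # stand-in batch
y = torch.randint(0, 4, (32,), device=device)

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type=device, enabled=use_amp):
    loss = criterion(model(x), y)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # gradients are back in their true scale
# clip_grad_norm_ returns the total norm measured before clipping.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)      # skipped automatically if gradients contain inf/NaN
scaler.update()
print("grad norm before clipping:", float(grad_norm))
```

Logging grad_norm every few steps also gives a cheap signal for spotting exploding gradients early.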
4) Single-node DistributedDataParallel (DDP) essentials
# Launched via: torchrun --nproc_per_node=2 train.py
# model, optimizer, criterion, and train_dataset are assumed to be defined elsewhere.
import os, torch, torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DistributedSampler

# torchrun sets these environment variables for every process.
rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])
dist.init_process_group('nccl')
torch.cuda.set_device(local_rank)  # pin this process to its own GPU
model = model.cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank, shuffle=True)
# batch_size is per process, so the effective batch size is 64 * world_size.
loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, sampler=sampler, num_workers=4, pin_memory=True, persistent_workers=True)
for epoch in range(20):
    sampler.set_epoch(epoch)  # gives a different shuffle each epoch
    for x, y in loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(x), y)
        loss.backward()  # DDP all-reduces gradients across ranks here
        optimizer.step()
dist.destroy_process_group()
Critical points: use DistributedSampler, call sampler.set_epoch each epoch, pin each process to its GPU, and keep the sampler seed identical on all ranks so that each rank draws a disjoint shard of the same shuffle.
5) Efficient DataLoader tuning
import time
import torch
from torchvision import transforms

# Pass this transform to your dataset when constructing train_dataset.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor()
])
loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
    prefetch_factor=2,
    persistent_workers=True
)
# Quick throughput probe (warm up, then time a few batches).
# Assumes the loader yields at least 25 batches.
it = iter(loader)
for _ in range(5):
    next(it)  # warmup
start = time.perf_counter()
for _ in range(20):
    batch = next(it)
elapsed = time.perf_counter() - start
print('~batches/sec:', 20 / elapsed)
Try varying batch size, num_workers, and prefetch_factor. On GPUs, pin_memory often improves throughput.
6) Simple random search for LR and weight decay
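import random
import torch

# make_model() and train_for_few_epochs() are placeholders for your own helpers;
# the latter is assumed to return a dict containing validation accuracy under 'acc'.
search = []
for _ in range(8):
    lr = 10 ** random.uniform(-5, -3)  # log-uniform in [1e-5, 1e-3]
    wd = 10 ** random.uniform(-6, -2)  # log-uniform in [1e-6, 1e-2]
    search.append({'lr': lr, 'wd': wd})
results = []
for cfg in search:
    model = make_model()  # fresh model for every trial
    opt = torch.optim.AdamW(model.parameters(), lr=cfg['lr'], weight_decay=cfg['wd'])
    val = train_for_few_epochs(model, opt, train_loader, val_loader, max_epochs=5)  # small budget
    results.append((cfg, val['acc']))
best = max(results, key=lambda x: x[1])
print('Best config:', best[0], 'Val acc:', best[1])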
Random search explores more unique values with the same budget than grid search and often finds better configs quickly.
Mini tasks
- Swap CrossEntropyLoss for Focal Loss on an imbalanced classifier and compare per-class recall.
- Enable mixed precision and report speedup and max batch size difference.
- Increase DataLoader num_workers stepwise (0, 2, 4, 8) and record batches/sec.
- Run a small random search over LR and weight decay using only 10% of data.
- If you have 2+ GPUs, run the same training with DDP and confirm that validation metrics match across repeated runs (within run-to-run noise).
Drills / exercises
- Implement label smoothing both via CrossEntropyLoss and manually with soft labels; verify equivalence.
- Create a training loop with early stopping and best-checkpoint saving.
- Add cosine LR schedule with warmup and compare convergence to fixed LR.
- Log metrics to a JSONL file (one JSON per line) and include hyperparameters in every record.
- Write a reproducibility helper that sets seeds and configures deterministic backends.
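As a starting point for the warmup drill, a cosine schedule with linear warmup can be expressed as an LR multiplier and plugged into LambdaLR. This is a sketch under assumptions: the function name warmup_cosine, the 5 warmup epochs, and the base LR of 3e-4 are illustrative choices, and the single dummy parameter stands in for model.parameters().

```python
import math
import torch

def warmup_cosine(step, warmup_steps, total_steps):
    """LR multiplier: linear warmup to 1.0, then cosine decay toward 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

epochs, warmup_epochs, base_lr = 50, 5, 3e-4     # illustrative values
params = [torch.nn.Parameter(torch.zeros(1))]    # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=base_lr)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: warmup_cosine(e, warmup_epochs, epochs)
)

lrs = []
for epoch in range(epochs):
    lrs.append(scheduler.get_last_lr()[0])  # LR used during this epoch
    # ... train one epoch here ...
    optimizer.step()
    scheduler.step()
print(lrs[:6], lrs[-1])
```

Plotting lrs next to the validation loss of a fixed-LR run makes the convergence comparison in the drill straightforward.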
Common mistakes and debugging tips
- Validating in train mode: always use model.eval() and torch.no_grad() for validation.
- Imbalance ignored: overall accuracy looks fine but minority recall is near zero; inspect per-class metrics.
- DDP shuffling errors: forgetting sampler.set_epoch(epoch) makes every epoch reuse the same shuffle order, which can hurt generalization.
- Unstable AMP: loss becomes NaN; reduce LR, disable AMP for specific ops, or clip gradients.
- Data bottlenecks: GPU idle while CPU loads data; raise num_workers, enable pin_memory, and simplify heavy augmentations.
- Over-regularization: too much weight decay/dropout leads to underfitting; monitor train vs val gap.
Quick debugging checklist
- Sanity check: overfit a tiny subset (e.g., 512 samples). If it can’t, there’s a bug in data/labels/model.
- Monitor gradient norms. Exploding? Lower LR, add gradient clipping.
- Use an LR finder: sweep the LR on a log scale for a few hundred steps and pick the largest LR at which the loss is still decreasing steadily.
- Print a few batch labels and predictions to verify order, shapes, and label mapping.
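The first item in the checklist, overfitting a tiny subset, can be sketched as follows. The synthetic dataset, the small MLP, and the subset size of 64 (smaller than the 512 suggested above, to keep the demo fast) are assumptions; swap in your real dataset and model.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset, TensorDataset

torch.manual_seed(0)
# Synthetic stand-in dataset: the label depends only on the first feature.
features = torch.randn(2048, 20)
labels = (features[:, 0] > 0).long()
full_dataset = TensorDataset(features, labels)

tiny = Subset(full_dataset, range(64))  # small fixed subset
loader = DataLoader(tiny, batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

def run_epoch():
    """Train for one pass over the tiny subset and return the mean loss."""
    total = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        total += loss.item() * x.size(0)
    return total / len(tiny)

first_loss = run_epoch()
for _ in range(99):
    last_loss = run_epoch()
print(f"first epoch loss {first_loss:.3f} -> last epoch loss {last_loss:.3f}")
```

A healthy pipeline should drive the training loss on such a subset close to zero; a loss that plateaus well above zero points to a data, label, or model bug rather than a tuning problem.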
Mini project: Imbalanced defect classifier
Goal: Build a small image classifier where positive defects are rare.
- Dataset: create splits; inspect class distribution.
- Baseline: ResNet-18 with CrossEntropyLoss; compute per-class metrics.
- Imbalance handling: switch to Focal Loss or class weights; compare recall for the rare class.
- Regularization: try label smoothing and weight decay; add basic augmentation.
- Efficiency: enable mixed precision; tune DataLoader workers.
- Tuning: random search LR and weight decay on a 5-epoch budget; keep the best.
- Reproducibility: seed runs, log configs/metrics to JSONL, save best checkpoint.
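For the reproducibility step, a seeding helper along these lines is a reasonable starting point. The function name seed_everything and its defaults are assumptions; note that the cuDNN settings trade some speed for determinism, and that DataLoader workers may additionally need a seeded generator or worker_init_fn.

```python
import random

import numpy as np
import torch

def seed_everything(seed=42, deterministic=True):
    """Seed Python, NumPy, and PyTorch RNGs; optionally make cuDNN deterministic."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    if deterministic:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False  # autotuning can pick different kernels per run

seed_everything(123)
first = (random.random(), float(np.random.rand()), torch.rand(3))
seed_everything(123)
second = (random.random(), float(np.random.rand()), torch.rand(3))
print(first[0] == second[0], first[1] == second[1], torch.equal(first[2], second[2]))
```

Call the helper once at the start of every run and store the seed alongside the other hyperparameters so a result can be regenerated later.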
Deliverables to verify
- Training/validation curves with best checkpoint noted.
- Per-class precision/recall before and after imbalance mitigation.
- Throughput measurements (batches/sec) for data loading changes.
- Config file or JSON record of best hyperparameters.
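One lightweight way to produce the last deliverable is to append one JSON record per evaluation to a JSONL file and read the best one back. The helper names log_record and best_record, the val_acc key, and the metric values below are placeholders for illustration, not measured results.

```python
import json
import os
import tempfile
import time

def log_record(path, config, metrics):
    """Append one JSON object per line, always including the hyperparameters."""
    record = {"time": time.time(), "config": config, **metrics}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def best_record(path, key="val_acc"):
    """Return the logged record with the highest value for `key`."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return max(records, key=lambda r: r[key])

# Placeholder values purely for demonstration.
log_path = os.path.join(tempfile.mkdtemp(), "runs.jsonl")
log_record(log_path, {"lr": 3e-4, "wd": 1e-4}, {"epoch": 1, "val_acc": 0.81})
log_record(log_path, {"lr": 1e-3, "wd": 1e-5}, {"epoch": 1, "val_acc": 0.78})
best = best_record(log_path)
print("best config:", best["config"])
```

Because every line carries its own config, the file stays useful even when runs with different hyperparameters are appended to the same log.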
Subskills
- Loss Functions For Vision Tasks — choose/implement CrossEntropy, BCEWithLogits, SmoothL1, IoU/Dice correctly for task type.
- Class Imbalance Losses Focal Loss — mitigate imbalance via Focal Loss, class weights, or sampling strategies.
- Hyperparameter Tuning Basics — design small yet effective search spaces; use random/grid search and LR schedules.
- Mixed Precision Training — train with autocast + GradScaler safely and measure speed/accuracy.
- Distributed Training Basics — run DDP with DistributedSampler and correct seeding.
- Efficient Data Loaders — optimize workers, pin_memory, and prefetch to remove bottlenecks.
- Regularization And Overfitting Control — apply weight decay, dropout, augmentation, and early stopping.
- Experiment Tracking And Reproducibility — log configs/metrics, set seeds, save/resume checkpoints.
Next steps
- Try the skill exam to check your understanding.
- Extend the mini project to detection or segmentation and re-apply the same optimization toolkit.
- Prepare for deployment by learning quantization, pruning, and runtime profiling.