Why Training and Optimization matters for Computer Vision Engineers

Training and optimization turn a model architecture into a reliable, fast, and generalizable solution. For Computer Vision Engineers, this skill unlocks: faster model iteration, higher accuracy on challenging datasets, stable multi-GPU training, and the ability to reproduce and explain results for production.

Ship models that meet latency/accuracy targets.
Handle class imbalance typical in detection/segmentation.
Scale to multi-GPU without losing correctness.
Track experiments and reproduce wins on demand.

What you’ll learn

Choose and implement the right loss for classification, detection, and segmentation.
Tune hyperparameters efficiently with small budgets.
Train with mixed precision and distributed setups safely.
Build efficient data pipelines and control overfitting.
Track experiments, seed runs, and reproduce results.

Who this is for

Engineers building CV systems (classification, detection, segmentation).
Data scientists moving from notebooks to production-grade training.
Researchers who need consistent, comparable experiments.

Prerequisites

Python and PyTorch basics (tensors, nn.Module, DataLoader).
Familiarity with at least one CV task (e.g., image classification).
Comfort using a GPU environment (CUDA awareness helps).

Learning path

Milestone 1: Loss functions for vision
Understand cross-entropy, BCEWithLogits, SmoothL1, IoU/Dice; add label smoothing.
Milestone 2: Handle imbalance
Use Focal Loss and class weighting; verify per-class metrics.
Milestone 3: Hyperparameter tuning
Design small search spaces; run random search; adopt LR scheduling.
Milestone 4: Mixed precision
Use autocast + GradScaler; test stability and speed.
Milestone 5: Distributed training
DDP with DistributedSampler; ensure correct seeding and evaluation.
Milestone 6: Data loading + regularization
Optimize DataLoader; use augmentation, weight decay, early stopping.
Milestone 7: Tracking + reproducibility
Seed everything, log configs/metrics, save/resume checkpoints consistently.

Worked examples

1) Classification with label smoothing, weight decay, and early stopping (PyTorch)

import torch, torch.nn as nn, torch.optim as optim
from torchvision import models

model = models.resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

best_val = float('inf')
patience, bad_epochs = 5, 0

for epoch in range(50):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
    scheduler.step()

    # validate
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for x, y in val_loader:
            val_loss += criterion(model(x), y).item()
    val_loss /= len(val_loader)

    if val_loss < best_val:
        best_val = val_loss
        bad_epochs = 0
        torch.save({'model': model.state_dict()}, 'best.pt')
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print('Early stopping')
            break

Why it works: label smoothing reduces overconfidence; AdamW's decoupled weight decay regularizes the weights; early stopping halts training once validation loss has not improved for patience epochs, which limits overfitting.
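
To make the effect of label_smoothing=0.1 concrete, here is a small standard-library sketch that computes the smoothed cross-entropy for a single sample by hand. It assumes the convention documented for nn.CrossEntropyLoss, where the target becomes (1 - eps) * one_hot + eps / C; the function name smoothed_cross_entropy and the example logits are illustrative, not part of the original example.

```python
import math

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy for one sample against a label-smoothed target."""
    num_classes = len(logits)
    # Numerically stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    # Smoothed target: eps is spread uniformly, the rest stays on the true class.
    q = [eps / num_classes] * num_classes
    q[target] += 1.0 - eps
    return -sum(qi * lp for qi, lp in zip(q, log_probs))

confident = [10.0, 0.0, 0.0]  # made-up logits for a very confident prediction
print("no smoothing :", smoothed_cross_entropy(confident, target=0, eps=0.0))
print("eps = 0.1    :", smoothed_cross_entropy(confident, target=0, eps=0.1))
```

With smoothing, the very confident prediction is penalized more than without it, which is the pressure against overconfidence described above.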

2) Implement Focal Loss for multi-class imbalance

import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, alpha=None, reduction='mean'):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha  # tensor of shape [C] or scalar or None
        self.reduction = reduction

    def forward(self, logits, target):
        # logits: [N, C], target: [N]
        logp = F.log_softmax(logits, dim=1)
        p = logp.exp()
        pt = p.gather(1, target.unsqueeze(1)).squeeze(1)  # [N]
        logpt = logp.gather(1, target.unsqueeze(1)).squeeze(1)
        focal = (1 - pt).pow(self.gamma) * (-logpt)
        if self.alpha is not None:
            if isinstance(self.alpha, torch.Tensor):
                at = self.alpha.to(logits.device).gather(0, target)
            else:
                at = torch.full_like(pt, float(self.alpha))
            focal = at * focal
        if self.reduction == 'mean':
            return focal.mean()
        if self.reduction == 'sum':
            return focal.sum()
        return focal

Tip: Inspect per-class recall before and after. Focal Loss should help rare classes by emphasizing hard examples.
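
To see how strongly the (1 - pt)^gamma factor re-weights examples, the sketch below evaluates the per-sample focal term from the class above for a few true-class probabilities. It uses plain Python floats instead of tensors; focal_term and ce_term are hypothetical helper names, and the probabilities are chosen only for illustration.

```python
import math

def ce_term(pt):
    """Cross-entropy for one sample whose true-class probability is pt."""
    return -math.log(pt)

def focal_term(pt, gamma=2.0):
    """Focal loss for the same sample (no alpha weighting)."""
    return ((1.0 - pt) ** gamma) * (-math.log(pt))

for pt in (0.9, 0.5, 0.1):
    ratio = focal_term(pt) / ce_term(pt)  # equals (1 - pt) ** gamma
    print(f"pt={pt:.1f}  CE={ce_term(pt):.4f}  focal={focal_term(pt):.4f}  kept={ratio:.2f}")
```

With gamma = 2, an easy example at pt = 0.9 keeps about 1% of its cross-entropy loss, while a hard example at pt = 0.1 keeps about 81%. Setting gamma = 0 recovers plain cross-entropy.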

3) Mixed precision training (autocast + GradScaler)

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
model = model.cuda()

for epoch in range(10):
    model.train()
    for x, y in train_loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with autocast():
            logits = model(x)
            loss = criterion(logits, y)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

Benefits: faster training and lower memory with minimal code changes. If you see inf/NaN, try lower LR or disable AMP for numerically unstable ops.
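
The reason for scaling the loss can be shown without a GPU. The toy sketch below uses the standard library's half-precision packing to show a small gradient value underflowing to zero in float16, then surviving once it is multiplied by a scale factor first. The gradient value 1e-8 is made up for illustration; 65536 (2**16) is GradScaler's documented default init_scale. This only illustrates the numeric idea and does not reproduce GradScaler's implementation.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8        # a small gradient value, chosen for illustration
scale = 65536.0    # 2**16

print("unscaled in fp16:", to_fp16(grad))      # too small for half precision
scaled = to_fp16(grad * scale)                 # the scaled value is representable
print("scaled in fp16  :", scaled)
print("after unscaling :", scaled / scale)     # unscale in higher precision
```

This is why the loop above calls scaler.scale(loss) before backward: the gradients are computed at the larger magnitude and divided by the scale again before the optimizer step.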

4) Single-node DistributedDataParallel (DDP) essentials

# Launched via: torchrun --nproc_per_node=2 train.py
import os, torch, torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DistributedSampler

rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])

dist.init_process_group('nccl')
torch.cuda.set_device(local_rank)

model = model.cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank, shuffle=True)
loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, sampler=sampler, num_workers=4, pin_memory=True, persistent_workers=True)

for epoch in range(20):
    sampler.set_epoch(epoch)
    for x, y in loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

dist.destroy_process_group()

Critical points: use DistributedSampler, call sampler.set_epoch each epoch, pin each process to its GPU, and keep the sampler seed identical on every rank so that the per-rank shards stay disjoint.
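
The interaction between the sampler seed, the epoch, and the rank is easier to follow in a simplified model. The sketch below mimics the sharding idea in plain Python: every rank builds the same permutation from seed + epoch, then takes every world_size-th index starting at its own rank. shard_indices is a hypothetical helper, and Python's random module stands in for the torch generator, so the actual index order will differ from PyTorch's.

```python
import random

def shard_indices(dataset_len, num_replicas, rank, epoch, seed=0):
    """Return the sample indices one rank would see in a given epoch."""
    indices = list(range(dataset_len))
    # Every rank builds the same permutation because the seed depends only
    # on (seed, epoch), never on the rank.
    random.Random(seed + epoch).shuffle(indices)
    # Pad so the list divides evenly, then take every num_replicas-th index.
    per_rank = -(-dataset_len // num_replicas)  # ceiling division
    total = per_rank * num_replicas
    indices += indices[: total - len(indices)]
    return indices[rank:total:num_replicas]

world_size = 2
epoch0 = [shard_indices(10, world_size, r, epoch=0) for r in range(world_size)]
epoch1 = [shard_indices(10, world_size, r, epoch=1) for r in range(world_size)]
print("epoch 0:", epoch0)
print("epoch 1:", epoch1)
```

If the epoch passed in never changes, which is what happens when set_epoch is skipped, each rank gets the same indices in the same order every epoch.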

5) Efficient DataLoader tuning

from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor()
])

loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
    prefetch_factor=2,
    persistent_workers=True
)

# Quick throughput probe (warm up then measure a few batches)
import time
it = iter(loader)
for _ in range(5): next(it)  # warmup
start = time.time()
for _ in range(20):
    batch = next(it)
end = time.time()
print('~batches/sec:', 20/(end-start))

Try varying batch size, num_workers, and prefetch_factor. On GPUs, pin_memory often improves throughput.

6) Simple random search for LR and weight decay

import random
search = []
for _ in range(8):
    lr = 10 ** random.uniform(-5, -3)
    wd = 10 ** random.uniform(-6, -2)
    search.append({'lr': lr, 'wd': wd})

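# make_model and train_for_few_epochs are user-defined helpers;
# the latter is expected to return a dict with an 'acc' entry.
results = []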
for cfg in search:
    model = make_model()
    opt = torch.optim.AdamW(model.parameters(), lr=cfg['lr'], weight_decay=cfg['wd'])
    val = train_for_few_epochs(model, opt, train_loader, val_loader, max_epochs=5)  # small budget
    results.append((cfg, val['acc']))

best = max(results, key=lambda x: x[1])
print('Best config:', best[0], 'Val acc:', best[1])

Random search tries a different value of every hyperparameter on each trial, so with the same budget it covers more unique values per hyperparameter than grid search and often finds better configs quickly.
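
The claim about unique values can be checked by counting. The sketch below spends the same budget of 9 trials on a 3 x 3 grid and on random draws from the same log-uniform ranges used above; the grid values themselves are arbitrary picks for illustration.

```python
import random

budget = 9
rng = random.Random(0)

# Grid search: 3 learning rates x 3 weight decays = 9 trials.
grid_lrs = [1e-5, 1e-4, 1e-3]
grid_wds = [1e-6, 1e-4, 1e-2]
grid = [(lr, wd) for lr in grid_lrs for wd in grid_wds]

# Random search: 9 trials, each drawing both values log-uniformly.
rand = [(10 ** rng.uniform(-5, -3), 10 ** rng.uniform(-6, -2)) for _ in range(budget)]

print("grid   unique LRs:", len({lr for lr, _ in grid}))
print("random unique LRs:", len({lr for lr, _ in rand}))
```

The grid only ever tests three learning rates, while the random draws test nine. This matters most when one hyperparameter dominates the result.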

Mini tasks

Swap CrossEntropyLoss for Focal Loss on an imbalanced classifier and compare per-class recall.
Enable mixed precision and report speedup and max batch size difference.
Increase DataLoader num_workers stepwise (0, 2, 4, 8) and record batches/sec.
Run a small random search over LR and weight decay using only 10% of data.
If you have 2+ GPUs, run the same training with DDP and confirm that validation metrics match the single-GPU run within run-to-run noise.

Drills / exercises

Implement label smoothing both via CrossEntropyLoss and manually with soft labels; verify equivalence.
Create a training loop with early stopping and best-checkpoint saving.
Add cosine LR schedule with warmup and compare convergence to fixed LR.
Log metrics to a JSONL file (one JSON per line) and include hyperparameters in every record.
Write a reproducibility helper that sets seeds and configures deterministic backends.
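
As a starting point for the warmup drill, here is one possible shape for a cosine schedule with linear warmup, written as a pure function of the step index. The name lr_at_step, the linear warmup form, and the example step counts are assumptions made for this sketch; other warmup shapes are equally valid.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_step(s, total_steps=100, base_lr=3e-4, warmup_steps=10)
            for s in range(100)]
print("first step :", schedule[0])
print("end warmup :", schedule[9])
print("midway     :", schedule[50])
print("last step  :", schedule[99])
```

In PyTorch, a function like this could be wired in through torch.optim.lr_scheduler.LambdaLR by returning lr_at_step(...) / base_lr as the multiplier, with scheduler.step() called once per optimizer step.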

Common mistakes and debugging tips

Validating in train mode: always use model.eval() and torch.no_grad() for validation.
Imbalance ignored: overall accuracy looks fine but minority recall is near zero; inspect per-class metrics.
DDP shuffling errors: forgetting sampler.set_epoch(epoch) makes every epoch reuse the same shuffle order, which can hurt generalization.
Unstable AMP: loss becomes NaN; reduce LR, disable AMP for specific ops, or clip gradients after calling scaler.unscale_(optimizer).
Data bottlenecks: GPU idle while CPU loads data; raise num_workers, enable pin_memory, and simplify heavy augmentations.
Over-regularization: too much weight decay/dropout leads to underfitting; monitor train vs val gap.
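
The "imbalance ignored" failure is easy to reproduce with made-up labels. In the sketch below, a classifier that always predicts the majority class reaches high accuracy while recall on the rare class is zero. per_class_recall is a hypothetical helper written for this illustration, and the label counts are invented.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for each class that appears in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in sorted(support)}

# Made-up labels: 95 negatives (0) and 5 positives (1);
# the "model" predicts 0 for every sample.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("recall  :", per_class_recall(y_true, y_pred))
```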

Quick debugging checklist

Sanity check: overfit a tiny subset (e.g., 512 samples). If it can’t, there’s a bug in data/labels/model.
Monitor gradient norms. Exploding? Lower LR, add gradient clipping.
Use an LR finder: sweep the LR on a log scale for a few hundred steps and pick the largest LR at which the loss is still stable.
Print a few batch labels and predictions to verify order, shapes, and label mapping.
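
For the gradient-norm item, the arithmetic behind clipping by global norm is sketched below on plain Python lists. It follows the common rule of multiplying every gradient by min(1, max_norm / (total_norm + eps)); clip_by_global_norm is a hypothetical helper and the gradient values are made up. In PyTorch the corresponding utility is torch.nn.utils.clip_grad_norm_, which also returns the total norm so it can be logged.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    coef = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * coef for g in tensor] for tensor in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]  # made-up gradients for two parameters
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print("norm before:", norm_before)
print("norm after :", norm_after)
```

Because a single factor is applied to all gradients, the update direction is preserved and only its length shrinks. Gradients already below the threshold pass through unchanged.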

Mini project: Imbalanced defect classifier

Goal: Build a small image classifier where positive defects are rare.

Dataset: create splits; inspect class distribution.
Baseline: ResNet-18 with CrossEntropyLoss; compute per-class metrics.
Imbalance handling: switch to Focal Loss or class weights; compare recall for the rare class.
Regularization: try label smoothing and weight decay; add basic augmentation.
Efficiency: enable mixed precision; tune DataLoader workers.
Tuning: random search LR and weight decay on a 5-epoch budget; keep the best.
Reproducibility: seed runs, log configs/metrics to JSONL, save best checkpoint.
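
For the reproducibility step, a seeding helper might look like the sketch below. seed_everything is a name chosen for this example; NumPy and PyTorch are seeded only if they are installed, and the cuDNN flags trade some speed for more repeatable results. Full determinism may require further settings depending on the ops in use, so treat this as a starting point.

```python
import random

def seed_everything(seed):
    """Seed the RNGs that commonly affect a training run."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed

seed_everything(123)
first = random.random()
seed_everything(123)
second = random.random()
print("same draw after reseeding:", first == second)
```

Under DDP, calling this with the same seed on every rank keeps model initialization consistent across processes.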

Deliverables to verify

Training/validation curves with best checkpoint noted.
Per-class precision/recall before and after imbalance mitigation.
Throughput measurements (batches/sec) for data loading changes.
Config file or JSON record of best hyperparameters.
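
The JSON record of the best hyperparameters can come straight from a JSONL run log. The sketch below appends one JSON object per line, stores the hyperparameters in every record, and reads the file back to pick the best epoch. The helper names, the field names, and the metric values are all invented for illustration.

```python
import json
import os
import tempfile

def log_record(path, hparams, metrics):
    """Append one JSON object per line; hyperparameters go into every record."""
    record = {"hparams": hparams, **metrics}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")

def read_records(path):
    """Load all records from a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

hparams = {"lr": 3e-4, "weight_decay": 1e-4, "seed": 0}    # example values
path = os.path.join(tempfile.mkdtemp(), "run.jsonl")
log_record(path, hparams, {"epoch": 0, "val_loss": 0.91})  # made-up metrics
log_record(path, hparams, {"epoch": 1, "val_loss": 0.74})

best = min(read_records(path), key=lambda r: r["val_loss"])
print("best epoch:", best["epoch"], "with", best["hparams"])
```

Keeping the hyperparameters in each line means any single record is enough to rerun that configuration, even if the file is truncated or merged with other runs.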

Subskills

Loss Functions For Vision Tasks — choose/implement CrossEntropy, BCEWithLogits, SmoothL1, IoU/Dice correctly for task type.
Class Imbalance Losses Focal Loss — mitigate imbalance via Focal Loss, class weights, or sampling strategies.
Hyperparameter Tuning Basics — design small yet effective search spaces; use random/grid search and LR schedules.
Mixed Precision Training — train with autocast + GradScaler safely and measure speed/accuracy.
Distributed Training Basics — run DDP with DistributedSampler and correct seeding.
Efficient Data Loaders — optimize workers, pin_memory, and prefetch to remove bottlenecks.
Regularization And Overfitting Control — apply weight decay, dropout, augmentation, and early stopping.
Experiment Tracking And Reproducibility — log configs/metrics, set seeds, save/resume checkpoints.

Next steps

Extend the mini project to detection or segmentation and re-apply the same optimization toolkit.
Prepare for deployment by learning quantization, pruning, and runtime profiling.
