Why this matters
As a Computer Vision Engineer, you will often train heavy models (ResNets, YOLO, ViT) on large datasets. Single-GPU training can be too slow or impossible due to memory limits. Distributed training lets you:
- Shorten training time from days to hours for experiments and releases.
- Fit larger models and batches into memory to improve accuracy and stability.
- Scale from one machine to multi-node clusters when datasets grow.
- Keep teams productive by running multiple experiments in parallel.
Concept explained simply
Distributed training splits the work across multiple GPUs or machines. The two main ways:
Data parallelism (most common)
Each GPU gets a different shard of the batch, runs a forward/backward pass, then gradients are averaged so all model replicas stay in sync.
- Easy to adopt (e.g., PyTorch DDP, Horovod).
- Scales well until communication becomes the bottleneck.
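A minimal PyTorch DDP sketch of this pattern, assuming a torchrun launch with one process per GPU; the tiny linear model and random batch are placeholders for your real components:

```python
# Minimal data-parallel sketch with PyTorch DDP.
# Assumed launch: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")             # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 10).cuda(local_rank)   # placeholder model
model = DDP(model, device_ids=[local_rank])          # gradients get all-reduced automatically

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(256, 512, device=f"cuda:{local_rank}")    # this rank's shard of the batch
y = torch.randint(0, 10, (256,), device=f"cuda:{local_rank}")

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()                                      # DDP syncs gradients during backward
optimizer.step()
dist.destroy_process_group()
```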
Model/pipeline parallelism
When a single model cannot fit into one device’s memory:
- Model (tensor) parallel: split layers or matrices across devices.
- Pipeline parallel: split layers into stages; micro-batches flow through stages like an assembly line.
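A toy sketch of splitting a model into two stages on two GPUs (layer sizes are illustrative; a real pipeline also needs micro-batch scheduling, shown in a later example):

```python
import torch
import torch.nn as nn

# Two-stage split: first half of the layers on cuda:0, second half on cuda:1.
class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))   # activations move between devices

model = TwoStageNet()
out = model(torch.randn(32, 1024))           # output lives on cuda:1
```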
Synchronous vs asynchronous
- Synchronous (default): all workers compute gradients, then synchronize. Stable convergence, predictable.
- Asynchronous: workers push/pull parameters independently (e.g., parameter servers). Faster in some cases, but gradients can be stale.
Communication primitives
- All-reduce: aggregates (e.g., sums) tensors across workers and distributes the result back to all.
- Broadcast: sends a tensor from one worker to others.
- All-gather: gathers tensors from all workers.
Ring all-reduce communication cost per worker is roughly 2*(N-1)/N * tensor_size.
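With torch.distributed, these primitives look roughly like the sketch below (it assumes the process group is already initialized, one process per GPU):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called.
t = torch.ones(4, device="cuda") * dist.get_rank()

dist.all_reduce(t, op=dist.ReduceOp.SUM)    # every rank now holds the element-wise sum
dist.broadcast(t, src=0)                    # every rank now holds rank 0's tensor

gathered = [torch.empty_like(t) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, t)                # every rank gets a list of every rank's tensor
```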
Global batch size and LR scaling
If each GPU uses batch B, and there are N GPUs, global batch is N*B. A common rule is linear LR scaling: if you 4x the global batch, 4x the learning rate (use warmup to stabilize).
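As a small sketch, with illustrative base values:

```python
def scaled_lr(base_lr, base_global_batch, new_global_batch):
    # Linear scaling rule: LR grows in proportion to the global batch size.
    return base_lr * new_global_batch / base_global_batch

def warmup_lr(target_lr, step, warmup_steps):
    # Ramp linearly from near zero to the target LR over the first warmup_steps.
    return target_lr * min(1.0, (step + 1) / warmup_steps)

lr = scaled_lr(base_lr=0.1, base_global_batch=256, new_global_batch=1024)  # -> 0.4
print(warmup_lr(lr, step=0, warmup_steps=500))                             # small LR at the start
```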
Data loading and sharding
Each worker should see a unique slice of data each epoch. Use per-worker seeds and shuffling; avoid duplicate samples to prevent skewed gradients.
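In PyTorch, DistributedSampler is the usual way to shard; dataset and num_epochs below are placeholders from your own script. Note set_epoch, which reshuffles each epoch consistently across ranks:

```python
from torch.utils.data import DataLoader, DistributedSampler

# 'dataset' stands in for your CV dataset object.
sampler = DistributedSampler(dataset, shuffle=True)        # unique slice per rank
loader = DataLoader(dataset, batch_size=256, sampler=sampler, num_workers=8)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)   # different shuffle each epoch, identical across ranks
    for images, targets in loader:
        ...
```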
Mixed precision and gradient accumulation
- Mixed precision uses float16/bfloat16 math to speed up training and reduce memory; FP16 typically needs loss scaling to prevent gradient underflow.
- Gradient accumulation simulates larger batches by summing gradients over multiple micro-steps before synchronizing.
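A hedged sketch combining both in PyTorch; model, loader, criterion, and optimizer are assumed to exist in your script:

```python
import torch

scaler = torch.cuda.amp.GradScaler()     # loss scaling to avoid FP16 underflow
accum_steps = 4                          # simulate a 4x larger batch

optimizer.zero_grad(set_to_none=True)
for i, (images, targets) in enumerate(loader):
    with torch.cuda.amp.autocast():      # FP16/BF16 math where it is safe
        loss = criterion(model(images.cuda()), targets.cuda()) / accum_steps
    scaler.scale(loss).backward()        # accumulate scaled gradients
    # With DDP, wrapping the non-final micro-steps in model.no_sync()
    # skips the per-step gradient all-reduce and reduces sync frequency.
    if (i + 1) % accum_steps == 0:
        scaler.step(optimizer)           # unscale, then take the optimizer step
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```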
Fault tolerance and reproducibility
- Checkpointing: save model/optimizer states often so you can resume after failures.
- Determinism: set seeds, control dataloader order, and choose deterministic ops when possible.
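A minimal checkpoint/resume sketch (the path and the rank-0 guard are illustrative):

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, epoch, path="ckpt.pt"):
    # Only rank 0 writes, so processes don't race on the same file.
    # For a DDP-wrapped model, save model.module.state_dict() to drop the "module." prefix.
    if dist.get_rank() == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, path)

def load_checkpoint(model, optimizer, path="ckpt.pt"):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"] + 1             # epoch to resume from
```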
Mental model
Imagine building cars in a factory:
- Data parallelism: several identical lines build the same car model; each line works on different cars, then the results are combined.
- Pipeline parallelism: one car moves through specialized stations (stages); many cars are in flight simultaneously.
- Communication is the conveyor belt; if it’s slow or narrow, everything backs up.
Worked examples
Example 1: Scale classification from 1 to 4 GPUs (DDP)
- Baseline: 1 GPU, batch 256, throughput 200 img/s, dataset 1.28M images.
- Move to 4 GPUs with the same per-GPU batch 256: global batch = 1024.
- Steps/epoch = 1,280,000 / 1024 = 1250.
- Estimate throughput: 4 × 200 ≈ 800 img/s. With mixed precision (+~30%) and 10% comm overhead: ≈ 800 × 1.3 × 0.9 ≈ 936 img/s (ballpark).
- Time/epoch ≈ 1,280,000 / 936 ≈ 1368 s ≈ 22.8 min. For 90 epochs ≈ 34 hours.
- LR: if baseline LR was 0.1, with global batch 4× larger, try LR ≈ 0.4 with warmup.
Numbers are approximations. Validate with a short dry run.
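The same arithmetic as a quick script, with the rough factors above written out as assumptions:

```python
dataset = 1_280_000
gpus, per_gpu_batch, single_gpu_ips = 4, 256, 200
amp_boost, comm_penalty = 1.3, 0.9              # rough assumptions; measure on your hardware

global_batch = gpus * per_gpu_batch                               # 1024
steps_per_epoch = dataset // global_batch                         # 1250
throughput = gpus * single_gpu_ips * amp_boost * comm_penalty     # ~936 img/s
epoch_seconds = dataset / throughput                              # ~1368 s (~22.8 min)
print(90 * epoch_seconds / 3600)                                  # ~34 hours for 90 epochs
```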
Example 2: Model doesn’t fit — pipeline parallel
- You have a ViT that OOMs on a single GPU.
- Split the model into two stages (e.g., encoder blocks 1–6 on GPU0, 7–12 on GPU1).
- Use micro-batches so while GPU0 processes micro-batch k+1, GPU1 processes k.
- Expect some pipeline bubble (idle time) at start/end; increase micro-batches to reduce it.
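A toy sketch of the micro-batch idea across two stages. It runs the stages sequentially for clarity; real pipeline schedules overlap them so both GPUs stay busy:

```python
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(768, 768), nn.GELU()).to("cuda:0")  # e.g., blocks 1-6
stage1 = nn.Sequential(nn.Linear(768, 768), nn.GELU()).to("cuda:1")  # e.g., blocks 7-12

batch = torch.randn(64, 768)
outputs = []
for micro in batch.chunk(8):                 # more micro-batches -> smaller pipeline bubble
    h = stage0(micro.to("cuda:0"))
    outputs.append(stage1(h.to("cuda:1")))   # a real schedule overlaps this with stage0's next micro-batch
out = torch.cat(outputs)
```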
Example 3: Estimate communication vs compute
- 4 GPUs, gradients total 100 MB (FP32). Ring all-reduce per GPU cost ≈ 2*(4-1)/4*100 = 150 MB.
- On a 10 Gbps link (~1.25 GB/s), 150 MB takes ~0.12 s (120 ms).
- If compute per step is 300 ms, comm ≈ 28% of step time. Using FP16 halves comm to ~60 ms (~17%).
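The same estimate as code; the bandwidth, gradient size, and compute time are the assumptions above:

```python
def ring_allreduce_mb(total_mb, n_gpus):
    return 2 * (n_gpus - 1) / n_gpus * total_mb

grad_mb, bandwidth_mb_s, compute_ms = 100, 1250, 300     # FP32 gradients, ~10 Gbps link
comm_ms = ring_allreduce_mb(grad_mb, 4) / bandwidth_mb_s * 1000   # ~120 ms
print(comm_ms / (comm_ms + compute_ms))                           # ~0.29 of step time
fp16_ms = comm_ms / 2                                             # ~60 ms with FP16 gradients
print(fp16_ms / (fp16_ms + compute_ms))                           # ~0.17
```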
How to choose a strategy
- If your model fits on one GPU and you want speed: data parallel (DDP).
- If your model does NOT fit on one GPU: model or pipeline parallel (or hybrid).
- If network is slow: reduce gradient size (mixed precision), overlap comm with compute, or use gradient accumulation to reduce sync frequency.
- Start simple, measure, then optimize the bottleneck you see.
Performance quick math
- Global batch = GPUs × per-GPU batch.
- Steps per epoch = dataset_size / global_batch.
- Speedup = T1 / TN; Efficiency = Speedup / N.
- Ring all-reduce bytes per worker ≈ 2*(N-1)/N * tensor_bytes.
- Linear LR scaling: new_LR ≈ base_LR × (new_global_batch / base_global_batch). Use warmup.
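The same rules as small helpers (the example numbers are only illustrations):

```python
def global_batch(gpus, per_gpu_batch):
    return gpus * per_gpu_batch

def steps_per_epoch(dataset_size, gb):
    return dataset_size // gb

def speedup_and_efficiency(t1, tn, n):
    s = t1 / tn
    return s, s / n

def ring_allreduce_bytes(tensor_bytes, n):
    return 2 * (n - 1) / n * tensor_bytes

print(steps_per_epoch(1_280_000, global_batch(4, 256)))   # 1250
print(speedup_and_efficiency(t1=100.0, tn=28.0, n=4))     # ~3.57x speedup, ~0.89 efficiency
```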
Hands-on DDP setup checklist
- Seed everything (Python, framework, dataloader workers).
- Shard data: each worker sees a unique subset each epoch.
- Wrap model in DDP and move it to the right device.
- Use mixed precision and gradient scaler.
- Log per-worker metrics and averaged metrics; verify equal steps per worker.
- Checkpoint with rank 0 only; make resume robust.
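A condensed skeleton that follows this checklist; the toy dataset, tiny model, and hyperparameters are placeholders to swap for your real components:

```python
# Assumed launch: torchrun --nproc_per_node=4 train_ddp.py
import os, random
import numpy as np
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Seed everything; the per-rank offset diversifies augmentation randomness,
    # while DistributedSampler handles consistent sharding on its own.
    seed = 42 + rank
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)

    # Toy stand-ins so the skeleton runs; swap in your real dataset and model.
    dataset = TensorDataset(torch.randn(4096, 3, 32, 32),
                            torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=256, sampler=sampler, num_workers=4)

    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.4, momentum=0.9)
    scaler = torch.cuda.amp.GradScaler()

    for epoch in range(3):
        sampler.set_epoch(epoch)
        for images, targets in loader:
            optimizer.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast():
                loss = torch.nn.functional.cross_entropy(
                    model(images.cuda(local_rank)), targets.cuda(local_rank))
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        # Average the last loss across ranks for a global metric.
        avg = loss.detach().clone()
        dist.all_reduce(avg, op=dist.ReduceOp.SUM)
        avg /= dist.get_world_size()
        if rank == 0:                               # checkpoint from rank 0 only
            print(f"epoch {epoch} avg loss {avg.item():.3f}")
            torch.save({"model": model.module.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "epoch": epoch}, "ckpt.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```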
Exercises
Complete these to lock in the concepts, then check your answers against the hints and self-checks.
Exercise 1 — Plan a 4-GPU run with LR scaling
Given: 1 GPU baseline with batch 256, throughput 200 img/s, dataset 1.28M, 90 epochs. Move to 4 GPUs (DDP), keep per-GPU batch 256, use mixed precision.
- Compute global batch and steps per epoch.
- Estimate images/sec with a +30% AMP boost and 10% comm overhead.
- Estimate time per epoch and total time.
- Suggest a new LR from a base 0.1 using linear scaling.
Hint
- Global batch = N × per-GPU batch.
- Steps/epoch = dataset_size / global_batch.
- Throughput estimate = N × single-GPU throughput × 1.3 × 0.9.
- LR scales with global batch ratio.
Exercise 2 — Communication budget
4 GPUs, 10 Gbps link (~1.25 GB/s). Gradient size 25M params in FP32. Estimate per-step ring all-reduce time and the effect of FP16 gradients.
- Compute gradient size in MB.
- Compute ring all-reduce bytes per GPU.
- Compute comm time per step for FP32 and FP16.
Hint
- Size = params × bytes_per_param.
- Ring cost per worker = 2*(N-1)/N × size.
- Time = bytes / bandwidth.
- Self-check: Your numbers should be close to those in the solutions; small differences due to rounding are fine.
- Track assumptions (e.g., bandwidth efficiency) explicitly.
Common mistakes and self-check
- Duplicated data across workers. Self-check: count unique sample IDs per epoch; should match dataset size.
- Mismatched seeds causing skewed shuffles. Self-check: log first 5 sample IDs per worker; verify no duplicates across workers in the same step.
- Wrong LR after scaling batch size. Self-check: compare training loss curves vs baseline; too high LR often explodes early.
- Underutilized GPUs due to slow dataloaders. Self-check: monitor GPU utilization; if low, increase workers, use caching, or prefetching.
- Saving checkpoints from every rank. Self-check: ensure only one rank writes; others skip to avoid corruption.
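One way to run the first self-check, assuming the default DistributedSampler and an initialized process group (sampler and dataset come from your training script):

```python
import torch
import torch.distributed as dist

# Gather every rank's sample indices and verify the union covers the dataset exactly once.
# Assumes the default DistributedSampler, which pads shards to equal length.
local_ids = torch.tensor(list(sampler), device="cuda")
gathered = [torch.empty_like(local_ids) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, local_ids)
unique = torch.cat(gathered).unique().numel()
if dist.get_rank() == 0:
    print(f"unique samples this epoch: {unique} / {len(dataset)}")
```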
Practical projects
- Project A: Train ResNet on a public dataset with 1, 2, and 4 GPUs. Measure speedup, efficiency, and final accuracy; plot throughput vs GPUs.
- Project B: Take a medium ViT that barely fits and try pipeline parallel across 2 GPUs. Tune micro-batches to reduce bubbles; compare step time.
- Project C: Create a reproducible distributed training template: seeding, DDP setup, mixed precision, checkpointing, and a metrics logger.
Mini challenge
Your 4-GPU job shows 40% of step time in all-reduce. Propose two changes to reduce it, then predict the new step time if you cut comm by half and overlap 50% of the remainder with compute.
Example reasoning
Switch to FP16 gradients and enable overlap. If comm is 120 ms, halving it gives 60 ms; overlapping 50% of that hides another 30 ms. Net visible comm = 30 ms.
Who this is for
- Engineers training CV models who need faster iteration or larger batches.
- Researchers scaling experiments across GPUs or nodes.
- MLOps-minded developers building robust training pipelines.
Prerequisites
- Comfort with training CNNs/Transformers on a single GPU.
- Basic PyTorch or similar framework knowledge.
- Understanding of batches, learning rate, and checkpoints.
Learning path
- Before: GPU fundamentals, mixed precision, optimization basics.
- This subskill: data vs model/pipeline parallel, sync vs async, comm costs, LR scaling, sharding, and robustness.
- After: advanced sharding (optimizer/state), ZeRO-style partitioning, pipeline schedules, multi-node orchestration, and autoscaling.
Next steps
- Implement DDP on a small project and record speedup and accuracy.
- Try FP16 gradients and measure reduced comm time.
- Take the quick test to confirm your understanding.