Why this matters
As a Computer Vision Engineer, you will often train heavy models (ResNets, YOLO, ViT) on large datasets. Single-GPU training can be too slow or impossible due to memory limits. Distributed training lets you:
- Shorten training time from days to hours for experiments and releases.
- Fit larger models and batches into memory to improve accuracy and stability.
- Scale from one machine to multi-node clusters when datasets grow.
- Keep teams productive by running multiple experiments in parallel.
Concept explained simply
Distributed training splits the work across multiple GPUs or machines. The two main ways:
Data parallelism (most common)
Each GPU gets a different shard of the batch, runs a forward/backward pass, then gradients are averaged so all model replicas stay in sync.
- Easy to adopt (e.g., PyTorch DDP, Horovod).
- Scales well until communication becomes the bottleneck.
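A minimal PyTorch DDP sketch of this pattern, assuming a torchrun launch with one process per GPU; the tiny linear model and random batch are placeholders for your real components:

```python
# Minimal data-parallel sketch with PyTorch DDP.
# Assumed launch: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")             # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 10).cuda(local_rank)   # placeholder model
model = DDP(model, device_ids=[local_rank])          # gradients get all-reduced automatically

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(256, 512, device=f"cuda:{local_rank}")    # this rank's shard of the batch
y = torch.randint(0, 10, (256,), device=f"cuda:{local_rank}")

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()                                      # DDP syncs gradients during backward
optimizer.step()
dist.destroy_process_group()
```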
Model/pipeline parallelism
When a single model cannot fit into one device’s memory:
- Model (tensor) parallel: split layers or matrices across devices.
- Pipeline parallel: split layers into stages; micro-batches flow through stages like an assembly line.
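A toy sketch of splitting a model into two stages on two GPUs (layer sizes are illustrative; a real pipeline also needs micro-batch scheduling, shown in a later example):

```python
import torch
import torch.nn as nn

# Two-stage split: first half of the layers on cuda:0, second half on cuda:1.
class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))   # activations move between devices

model = TwoStageNet()
out = model(torch.randn(32, 1024))           # output lives on cuda:1
```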
Synchronous vs asynchronous
- Synchronous (default): all workers compute gradients, then synchronize. Stable convergence, predictable.
- Asynchronous: workers push/pull parameters independently (e.g., parameter servers). Faster in some cases, but gradients can be stale.
Communication primitives
- All-reduce: aggregates (e.g., sums) tensors across workers and distributes the result back to all.
- Broadcast: sends a tensor from one worker to others.
- All-gather: gathers tensors from all workers.
Ring all-reduce communication cost per worker is roughly 2*(N-1)/N * tensor_size.
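With torch.distributed, these primitives look roughly like the sketch below (it assumes the process group is already initialized, one process per GPU):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called.
t = torch.ones(4, device="cuda") * dist.get_rank()

dist.all_reduce(t, op=dist.ReduceOp.SUM)    # every rank now holds the element-wise sum
dist.broadcast(t, src=0)                    # every rank now holds rank 0's tensor

gathered = [torch.empty_like(t) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, t)                # every rank gets a list of every rank's tensor
```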
Global batch size and LR scaling
If each GPU uses batch B, and there are N GPUs, global batch is N*B. A common rule is linear LR scaling: if you 4x the global batch, 4x the learning rate (use warmup to stabilize).
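As a small sketch, with illustrative base values:

```python
def scaled_lr(base_lr, base_global_batch, new_global_batch):
    # Linear scaling rule: LR grows in proportion to the global batch size.
    return base_lr * new_global_batch / base_global_batch

def warmup_lr(target_lr, step, warmup_steps):
    # Ramp linearly from near zero to the target LR over the first warmup_steps.
    return target_lr * min(1.0, (step + 1) / warmup_steps)

lr = scaled_lr(base_lr=0.1, base_global_batch=256, new_global_batch=1024)  # -> 0.4
print(warmup_lr(lr, step=0, warmup_steps=500))                             # small LR at the start
```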
Data loading and sharding
Each worker should see a unique slice of data each epoch. Use per-worker seeds and shuffling; avoid duplicate samples to prevent skewed gradients.
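In PyTorch, DistributedSampler is the usual way to shard; dataset and num_epochs below are placeholders from your own script. Note set_epoch, which reshuffles each epoch consistently across ranks:

```python
from torch.utils.data import DataLoader, DistributedSampler

# 'dataset' stands in for your CV dataset object.
sampler = DistributedSampler(dataset, shuffle=True)        # unique slice per rank
loader = DataLoader(dataset, batch_size=256, sampler=sampler, num_workers=8)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)   # different shuffle each epoch, identical across ranks
    for images, targets in loader:
        ...
```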
Mixed precision and gradient accumulation
- Mixed precision uses float16/bfloat16 math to speed up training and reduce memory; FP16 typically needs loss scaling to prevent gradient underflow.
- Gradient accumulation simulates larger batches by summing gradients over multiple micro-steps before synchronizing.
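A hedged sketch combining both in PyTorch; model, loader, criterion, and optimizer are assumed to exist in your script:

```python
import torch

scaler = torch.cuda.amp.GradScaler()     # loss scaling to avoid FP16 underflow
accum_steps = 4                          # simulate a 4x larger batch

optimizer.zero_grad(set_to_none=True)
for i, (images, targets) in enumerate(loader):
    with torch.cuda.amp.autocast():      # FP16/BF16 math where it is safe
        loss = criterion(model(images.cuda()), targets.cuda()) / accum_steps
    scaler.scale(loss).backward()        # accumulate scaled gradients
    # With DDP, wrapping the non-final micro-steps in model.no_sync()
    # skips the per-step gradient all-reduce and reduces sync frequency.
    if (i + 1) % accum_steps == 0:
        scaler.step(optimizer)           # unscale, then take the optimizer step
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```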
Fault tolerance and reproducibility
- Checkpointing: save model/optimizer states often so you can resume after failures.
- Determinism: set seeds, control dataloader order, and choose deterministic ops when possible.
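A minimal checkpoint/resume sketch (the path and the rank-0 guard are illustrative):

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, epoch, path="ckpt.pt"):
    # Only rank 0 writes, so processes don't race on the same file.
    # For a DDP-wrapped model, save model.module.state_dict() to drop the "module." prefix.
    if dist.get_rank() == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, path)

def load_checkpoint(model, optimizer, path="ckpt.pt"):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"] + 1             # epoch to resume from
```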
Mental model
Imagine building cars in a factory:
- Data parallelism: several identical lines build the same car model; each line works on different cars, then the results are combined.
- Pipeline parallelism: one car moves through specialized stations (stages); many cars are in flight simultaneously.
- Communication is the conveyor belt; if it’s slow or narrow, everything backs up.
Worked examples
Example 1: Scale classification from 1 to 4 GPUs (DDP)
- Baseline: 1 GPU, batch 256, throughput 200 img/s, dataset 1.28M images.
- Move to 4 GPUs with the same per-GPU batch 256: global batch = 1024.
- Steps/epoch = 1,280,000 / 1024 = 1250.
- Estimate throughput: 4 × 200 ≈ 800 img/s. With mixed precision (+~30%) and 10% comm overhead: ≈ 800 × 1.3 × 0.9 ≈ 936 img/s (ballpark).
- Time/epoch ≈ 1,280,000 / 936 ≈ 1368 s ≈ 22.8 min. For 90 epochs ≈ 34 hours.
- LR: if baseline LR was 0.1, with global batch 4× larger, try LR ≈ 0.4 with warmup.
Numbers are approximations. Validate with a short dry run.
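The same arithmetic as a quick script, with the rough factors above written out as assumptions:

```python
dataset = 1_280_000
gpus, per_gpu_batch, single_gpu_ips = 4, 256, 200
amp_boost, comm_penalty = 1.3, 0.9              # rough assumptions; measure on your hardware

global_batch = gpus * per_gpu_batch                               # 1024
steps_per_epoch = dataset // global_batch                         # 1250
throughput = gpus * single_gpu_ips * amp_boost * comm_penalty     # ~936 img/s
epoch_seconds = dataset / throughput                              # ~1368 s (~22.8 min)
print(90 * epoch_seconds / 3600)                                  # ~34 hours for 90 epochs
```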
Example 2: Model doesn’t fit — pipeline parallel
- You have a ViT that OOMs on a single GPU.
- Split the model into two stages (e.g., encoder blocks 1–6 on GPU0, 7–12 on GPU1).
- Use micro-batches so while GPU0 processes micro-batch k+1, GPU1 processes k.
- Expect some pipeline bubble (idle time) at start/end; increase micro-batches to reduce it.
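A toy sketch of the micro-batch idea across two stages. It runs the stages sequentially for clarity; real pipeline schedules overlap them so both GPUs stay busy:

```python
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(768, 768), nn.GELU()).to("cuda:0")  # e.g., blocks 1-6
stage1 = nn.Sequential(nn.Linear(768, 768), nn.GELU()).to("cuda:1")  # e.g., blocks 7-12

batch = torch.randn(64, 768)
outputs = []
for micro in batch.chunk(8):                 # more micro-batches -> smaller pipeline bubble
    h = stage0(micro.to("cuda:0"))
    outputs.append(stage1(h.to("cuda:1")))   # a real schedule overlaps this with stage0's next micro-batch
out = torch.cat(outputs)
```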
Example 3: Estimate communication vs compute
- 4 GPUs, gradients total 100 MB (FP32). Ring all-reduce per GPU cost ≈ 2*(4-1)/4*100 = 150 MB.
- On a 10 Gbps link (~1.25 GB/s), 150 MB takes ~0.12 s (120 ms).
- If compute per step is 300 ms, comm ≈ 28% of step time. Using FP16 halves comm to ~60 ms (~17%).
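The same estimate as code; the bandwidth, gradient size, and compute time are the assumptions above:

```python
def ring_allreduce_mb(total_mb, n_gpus):
    return 2 * (n_gpus - 1) / n_gpus * total_mb

grad_mb, bandwidth_mb_s, compute_ms = 100, 1250, 300     # FP32 gradients, ~10 Gbps link
comm_ms = ring_allreduce_mb(grad_mb, 4) / bandwidth_mb_s * 1000   # ~120 ms
print(comm_ms / (comm_ms + compute_ms))                           # ~0.29 of step time
fp16_ms = comm_ms / 2                                             # ~60 ms with FP16 gradients
print(fp16_ms / (fp16_ms + compute_ms))                           # ~0.17
```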
How to choose a strategy
- If your model fits on one GPU and you want speed: data parallel (DDP).
- If your model does NOT fit on one GPU: model or pipeline parallel (or hybrid).
- If network is slow: reduce gradient size (mixed precision), overlap comm with compute, or use gradient accumulation to reduce sync frequency.
- Start simple, measure, then optimize the bottleneck you see.
Performance quick math
- Global batch = GPUs × per-GPU batch.
- Steps per epoch = dataset_size / global_batch.
- Speedup = T1 / TN; Efficiency = Speedup / N.
- Ring all-reduce bytes per worker ≈ 2*(N-1)/N * tensor_bytes.
- Linear LR scaling: new_LR ≈ base_LR × (new_global_batch / base_global_batch). Use warmup.
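The same rules as small helpers (the example numbers are only illustrations):

```python
def global_batch(gpus, per_gpu_batch):
    return gpus * per_gpu_batch

def steps_per_epoch(dataset_size, gb):
    return dataset_size // gb

def speedup_and_efficiency(t1, tn, n):
    s = t1 / tn
    return s, s / n

def ring_allreduce_bytes(tensor_bytes, n):
    return 2 * (n - 1) / n * tensor_bytes

print(steps_per_epoch(1_280_000, global_batch(4, 256)))   # 1250
print(speedup_and_efficiency(t1=100.0, tn=28.0, n=4))     # ~3.57x speedup, ~0.89 efficiency
```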
Hands-on DDP setup checklist
- Seed everything (Python, framework, dataloader workers).
- Shard data: each worker sees a unique subset each epoch.
- Wrap model in DDP and move it to the right device.
- Use mixed precision and gradient scaler.
- Log per-worker metrics and averaged metrics; verify equal steps per worker.
- Checkpoint with rank 0 only; make resume robust.
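A condensed skeleton that follows this checklist; the toy dataset, tiny model, and hyperparameters are placeholders to swap for your real components:

```python
# Assumed launch: torchrun --nproc_per_node=4 train_ddp.py
import os, random
import numpy as np
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Seed everything; the per-rank offset diversifies augmentation randomness,
    # while DistributedSampler handles consistent sharding on its own.
    seed = 42 + rank
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)

    # Toy stand-ins so the skeleton runs; swap in your real dataset and model.
    dataset = TensorDataset(torch.randn(4096, 3, 32, 32),
                            torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=256, sampler=sampler, num_workers=4)

    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.4, momentum=0.9)
    scaler = torch.cuda.amp.GradScaler()

    for epoch in range(3):
        sampler.set_epoch(epoch)
        for images, targets in loader:
            optimizer.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast():
                loss = torch.nn.functional.cross_entropy(
                    model(images.cuda(local_rank)), targets.cuda(local_rank))
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        # Average the last loss across ranks for a global metric.
        avg = loss.detach().clone()
        dist.all_reduce(avg, op=dist.ReduceOp.SUM)
        avg /= dist.get_world_size()
        if rank == 0:                               # checkpoint from rank 0 only
            print(f"epoch {epoch} avg loss {avg.item():.3f}")
            torch.save({"model": model.module.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "epoch": epoch}, "ckpt.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```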
Exercises
Complete these to lock in the concepts, then check your answers against the hints and self-checks.
Exercise 1 — Plan a 4-GPU run with LR scaling
Given: 1 GPU baseline with batch 256, throughput 200 img/s, dataset 1.28M, 90 epochs. Move to 4 GPUs (DDP), keep per-GPU batch 256, use mixed precision.
- Compute global batch and steps per epoch.
- Estimate images/sec with a +30% AMP boost and 10% comm overhead.
- Estimate time per epoch and total time.
- Suggest a new LR from a base 0.1 using linear scaling.
Hint
- Global batch = N × per-GPU batch.
- Steps/epoch = dataset_size / global_batch.
- Throughput estimate = N × single-GPU throughput × 1.3 × 0.9.
- LR scales with global batch ratio.
Exercise 2 — Communication budget
4 GPUs, 10 Gbps link (~1.25 GB/s). Gradient size 25M params in FP32. Estimate per-step ring all-reduce time and the effect of FP16 gradients.
- Compute gradient size in MB.
- Compute ring all-reduce bytes per GPU.
- Compute comm time per step for FP32 and FP16.
Hint
- Size = params × bytes_per_param.
- Ring cost per worker = 2*(N-1)/N × size.
- Time = bytes / bandwidth.
- Self-check: Your numbers should be close to those in the solutions; small differences due to rounding are fine.
- Track assumptions (e.g., bandwidth efficiency) explicitly.
Common mistakes and self-check
- Duplicated data across workers. Self-check: count unique sample IDs per epoch; should match dataset size.
- Mismatched seeds causing skewed shuffles. Self-check: log first 5 sample IDs per worker; verify no duplicates across workers in the same step.
- Wrong LR after scaling batch size. Self-check: compare training loss curves vs baseline; too high LR often explodes early.
- Underutilized GPUs due to slow dataloaders. Self-check: monitor GPU utilization; if low, increase workers, use caching, or prefetching.
- Saving checkpoints from every rank. Self-check: ensure only one rank writes; others skip to avoid corruption.
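One way to run the first self-check, assuming the default DistributedSampler and an initialized process group (sampler and dataset come from your training script):

```python
import torch
import torch.distributed as dist

# Gather every rank's sample indices and verify the union covers the dataset exactly once.
# Assumes the default DistributedSampler, which pads shards to equal length.
local_ids = torch.tensor(list(sampler), device="cuda")
gathered = [torch.empty_like(local_ids) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, local_ids)
unique = torch.cat(gathered).unique().numel()
if dist.get_rank() == 0:
    print(f"unique samples this epoch: {unique} / {len(dataset)}")
```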
Practical projects
- Project A: Train ResNet on a public dataset with 1, 2, and 4 GPUs. Measure speedup, efficiency, and final accuracy; plot throughput vs GPUs.
- Project B: Take a medium ViT that barely fits and try pipeline parallel across 2 GPUs. Tune micro-batches to reduce bubbles; compare step time.
- Project C: Create a reproducible distributed training template: seeding, DDP setup, mixed precision, checkpointing, and a metrics logger.
Mini challenge
Your 4-GPU job shows 40% of step time in all-reduce. Propose two changes to reduce it, then predict the new step time if you cut comm by half and overlap 50% of the remainder with compute.
Example reasoning
Switch to FP16 gradients and enable overlap. If comm is 120 ms, halving it gives 60 ms; overlapping 50% of that hides another 30 ms. Net visible comm = 30 ms.
Who this is for
- Engineers training CV models who need faster iteration or larger batches.
- Researchers scaling experiments across GPUs or nodes.
- MLOps-minded developers building robust training pipelines.
Prerequisites
- Comfort with training CNNs/Transformers on a single GPU.
- Basic PyTorch or similar framework knowledge.
- Understanding of batches, learning rate, and checkpoints.
Learning path
- Before: GPU fundamentals, mixed precision, optimization basics.
- This subskill: data vs model/pipeline parallel, sync vs async, comm costs, LR scaling, sharding, and robustness.
- After: advanced sharding (optimizer/state), ZeRO-style partitioning, pipeline schedules, multi-node orchestration, and autoscaling.
Next steps
- Implement DDP on a small project and record speedup and accuracy.
- Try FP16 gradients and measure reduced comm time.
- Take the quick test to confirm your understanding.