
Distributed Training Basics

Learn Distributed Training Basics for free with explanations, exercises, and a quick test (for Computer Vision Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

As a Computer Vision Engineer, you will often train heavy models (ResNets, YOLO, ViT) on large datasets. Single-GPU training can be too slow or impossible due to memory limits. Distributed training lets you:

  • Shorten training time from days to hours for experiments and releases.
  • Fit larger models and batches into memory to improve accuracy and stability.
  • Scale from one machine to multi-node clusters when datasets grow.
  • Keep teams productive by running multiple experiments in parallel.

Concept explained simply

Distributed training splits the work across multiple GPUs or machines. There are two main approaches:

Data parallelism (most common)

Each GPU gets a different shard of the batch, runs a forward/backward pass, then gradients are averaged so all model replicas stay in sync.

  • Easy to adopt (e.g., PyTorch DDP, Horovod).
  • Scales well until communication becomes the bottleneck.
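
Below is a minimal sketch of the gradient averaging that data parallelism relies on. This is roughly what DDP automates for you; in practice you wrap the model in DistributedDataParallel rather than doing this by hand. The helper name is illustrative, and the snippet assumes the process group is already initialized (e.g., the script was launched with torchrun).

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across workers after backward() (what DDP does for you)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this parameter's gradient across all replicas, then divide
            # so every replica ends up with the same averaged gradient.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```
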
Model/pipeline parallelism

When a single model cannot fit into one device’s memory:

  • Model (tensor) parallel: split layers or matrices across devices.
  • Pipeline parallel: split layers into stages; micro-batches flow through stages like an assembly line.
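
A minimal sketch of plain model parallelism with two stages on two GPUs (the layer split and sizes are made up for illustration, and two CUDA devices are assumed). Note that this alone gives no pipelining; pipeline schedules add micro-batching on top so both GPUs stay busy.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Split a model across two GPUs; activations are moved between devices."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage0(x.to("cuda:0"))      # stage 0 runs on GPU 0
        return self.stage1(x.to("cuda:1"))   # activations hop to GPU 1 for stage 1

model = TwoStageModel()
out = model(torch.randn(32, 1024))           # output tensor lives on cuda:1
```
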
Synchronous vs asynchronous
  • Synchronous (default): all workers compute gradients, then synchronize. Stable convergence, predictable.
  • Asynchronous: workers push/pull parameters independently (e.g., parameter servers). Faster in some cases, but gradients can be stale.
Communication primitives
  • All-reduce: aggregates (e.g., sums) tensors across workers and distributes the result back to all.
  • Broadcast: sends a tensor from one worker to others.
  • All-gather: gathers tensors from all workers.

Ring all-reduce communication cost per worker is roughly 2*(N-1)/N * tensor_size.
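
A tiny runnable sketch of these primitives with torch.distributed. It assumes the script is launched with torchrun (e.g., torchrun --nproc_per_node=4 on a hypothetical demo file) and uses the CPU-friendly gloo backend purely for illustration.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")          # use "nccl" for GPU training
rank, world = dist.get_rank(), dist.get_world_size()

t = torch.tensor([float(rank)])
dist.all_reduce(t, op=dist.ReduceOp.SUM)         # every rank now holds the sum of all ranks

b = torch.tensor([42.0]) if rank == 0 else torch.zeros(1)
dist.broadcast(b, src=0)                         # every rank now holds rank 0's value

gathered = [torch.zeros(1) for _ in range(world)]
dist.all_gather(gathered, torch.tensor([float(rank)]))  # each rank gets every rank's tensor

dist.destroy_process_group()
```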

Global batch size and LR scaling

If each GPU uses batch B, and there are N GPUs, global batch is N*B. A common rule is linear LR scaling: if you 4x the global batch, 4x the learning rate (use warmup to stabilize).
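
A minimal sketch of linear LR scaling plus a linear warmup schedule in PyTorch; the stand-in model, optimizer settings, and warmup length are illustrative.

```python
import torch

base_lr, base_global_batch = 0.1, 256
gpus, per_gpu_batch = 4, 256
scaled_lr = base_lr * (gpus * per_gpu_batch) / base_global_batch   # 0.4

model = torch.nn.Linear(10, 10)                                    # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)

warmup_steps = 500
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
# Call sched.step() once per optimizer step: the LR ramps linearly from
# scaled_lr / warmup_steps up to scaled_lr, then stays there.
```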

Data loading and sharding

Each worker should see a unique slice of data each epoch. Use per-worker seeds and shuffling; avoid duplicate samples to prevent skewed gradients.
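
A minimal sketch of per-worker sharding with PyTorch's DistributedSampler. The dummy dataset and loader settings are illustrative, and the process group is assumed to be initialized (e.g., via torchrun).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(10_000, 3, 32, 32), torch.randint(0, 10, (10_000,)))
sampler = DistributedSampler(dataset, shuffle=True)       # each rank gets a unique shard
loader = DataLoader(dataset, batch_size=256, sampler=sampler,
                    num_workers=4, pin_memory=True)

for epoch in range(3):
    sampler.set_epoch(epoch)   # reshuffle each epoch, consistently across ranks
    for images, labels in loader:
        pass                   # forward/backward goes here
```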

Mixed precision and gradient accumulation
  • Mixed precision uses float16/bfloat16 math to speed up training and reduce memory, with loss scaling to prevent underflow.
  • Gradient accumulation simulates larger batches by summing gradients over multiple micro-steps before synchronizing.
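
A minimal single-GPU sketch combining mixed precision (autocast + GradScaler) with gradient accumulation; the batch contents and accumulation length are illustrative. With DDP you would also wrap the non-final micro-steps in model.no_sync() to skip the intermediate all-reduce.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4

batches = [(torch.randn(64, 1024), torch.randint(0, 10, (64,))) for _ in range(8)]
for step, (x, y) in enumerate(batches):
    x, y = x.cuda(), y.cuda()
    with torch.cuda.amp.autocast():                     # FP16/BF16 forward pass
        loss = F.cross_entropy(model(x), y) / accum_steps
    scaler.scale(loss).backward()                       # gradients accumulate across micro-steps
    if (step + 1) % accum_steps == 0:
        scaler.step(opt)                                # unscale, then optimizer step
        scaler.update()
        opt.zero_grad(set_to_none=True)
```
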
Fault tolerance and reproducibility
  • Checkpointing: save model/optimizer states often so you can resume after failures.
  • Determinism: set seeds, control dataloader order, and choose deterministic ops when possible.
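
A minimal seeding helper and rank-0 checkpoint pattern, as a sketch; fully deterministic runs may also need torch.use_deterministic_algorithms(True), usually at a performance cost.

```python
import random
import numpy as np
import torch
import torch.distributed as dist

def set_seed(seed: int, rank: int = 0) -> None:
    """Reproducible but distinct random stream per worker."""
    seed = seed + rank
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def save_checkpoint(model, optimizer, epoch: int, path: str = "ckpt.pt") -> None:
    """Only rank 0 writes; other ranks skip to avoid corrupting the file."""
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, path)
```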

Mental model

Imagine building cars in a factory:

  • Data parallelism: several identical lines build the same car model; each line works on different cars, then the results are combined.
  • Pipeline parallelism: one car moves through specialized stations (stages); many cars are in flight simultaneously.
  • Communication is the conveyor belt; if it’s slow or narrow, everything backs up.

Worked examples

Example 1: Scale classification from 1 to 4 GPUs (DDP)
  1. Baseline: 1 GPU, batch 256, throughput 200 img/s, dataset 1.28M images.
  2. Move to 4 GPUs with the same per-GPU batch 256: global batch = 1024.
  3. Steps/epoch = 1,280,000 / 1024 = 1250.
  4. Estimate throughput: 4 × 200 ≈ 800 img/s. With mixed precision (+~30%) and 10% comm overhead: ≈ 800 × 1.3 × 0.9 ≈ 936 img/s (ballpark).
  5. Time/epoch ≈ 1,280,000 / 936 ≈ 1368 s ≈ 22.8 min. For 90 epochs ≈ 34 hours.
  6. LR: if baseline LR was 0.1, with global batch 4× larger, try LR ≈ 0.4 with warmup.

Numbers are approximations. Validate with a short dry run.
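
The same estimate as plain Python, so you can plug in your own numbers; the 1.3 AMP boost and 0.9 communication factor are the assumptions stated above.

```python
gpus, per_gpu_batch, single_gpu_ips = 4, 256, 200
dataset_size, epochs = 1_280_000, 90

global_batch = gpus * per_gpu_batch                    # 1024
steps_per_epoch = dataset_size // global_batch         # 1250
throughput = gpus * single_gpu_ips * 1.3 * 0.9         # ≈ 936 img/s
epoch_seconds = dataset_size / throughput              # ≈ 1368 s ≈ 22.8 min
total_hours = epoch_seconds * epochs / 3600            # ≈ 34 h
new_lr = 0.1 * global_batch / 256                      # 0.4 (linear scaling; add warmup)
print(global_batch, steps_per_epoch, round(throughput),
      round(epoch_seconds), round(total_hours, 1), new_lr)
```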

Example 2: Model doesn’t fit — pipeline parallel
  1. You have a ViT that OOMs on a single GPU.
  2. Split the model into two stages (e.g., encoder blocks 1–6 on GPU0, 7–12 on GPU1).
  3. Use micro-batches so while GPU0 processes micro-batch k+1, GPU1 processes k.
  4. Expect some pipeline bubble (idle time) at start/end; increase micro-batches to reduce it.
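
A rough way to size that bubble, assuming a GPipe-style schedule where the idle fraction is about (stages − 1) / (micro_batches + stages − 1):

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Approximate idle fraction of a GPipe-style pipeline schedule."""
    return (stages - 1) / (micro_batches + stages - 1)

print(bubble_fraction(2, 4))    # 0.20 -> 20% idle with 4 micro-batches
print(bubble_fraction(2, 16))   # ~0.06 -> more micro-batches shrink the bubble
```
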
Example 3: Estimate communication vs compute
  1. 4 GPUs, gradients total 100 MB (FP32). Ring all-reduce per GPU cost ≈ 2*(4-1)/4*100 = 150 MB.
  2. On a 10 Gbps link (~1.25 GB/s), 150 MB takes ~0.12 s (120 ms).
  3. If compute per step is 300 ms, comm ≈ 28% of step time. Using FP16 halves comm to ~60 ms (~17%).
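
The same communication estimate in code; the link bandwidth and compute time per step are the assumptions from the example.

```python
n_gpus, grad_mb, bandwidth_mb_s = 4, 100.0, 1250.0      # 10 Gbps ≈ 1.25 GB/s
compute_s = 0.300

ring_mb = 2 * (n_gpus - 1) / n_gpus * grad_mb           # 150 MB per GPU
t_fp32 = ring_mb / bandwidth_mb_s                       # ≈ 0.12 s
t_fp16 = t_fp32 / 2                                     # ≈ 0.06 s (half the bytes)

print(t_fp32, t_fp32 / (compute_s + t_fp32))            # ~0.12 s, ~28% of the step
print(t_fp16, t_fp16 / (compute_s + t_fp16))            # ~0.06 s, ~17% of the step
```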

How to choose a strategy

  • If your model fits on one GPU and you want speed: data parallel (DDP).
  • If your model does NOT fit on one GPU: model or pipeline parallel (or hybrid).
  • If network is slow: reduce gradient size (mixed precision), overlap comm with compute, or use gradient accumulation to reduce sync frequency.
  • Start simple, measure, then optimize the bottleneck you see.

Performance quick math

  • Global batch = GPUs × per-GPU batch.
  • Steps per epoch = dataset_size / global_batch.
  • Speedup = T1 / TN; Efficiency = Speedup / N.
  • Ring all-reduce bytes per worker ≈ 2*(N-1)/N * tensor_bytes.
  • Linear LR scaling: new_LR ≈ base_LR × (new_global_batch / base_global_batch). Use warmup.
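
Speedup and efficiency as small helpers, with illustrative timings (AMP excluded here so the 1-GPU and 4-GPU runs are compared like for like):

```python
def speedup(t1_s: float, tn_s: float) -> float:
    return t1_s / tn_s

def efficiency(t1_s: float, tn_s: float, n_gpus: int) -> float:
    return speedup(t1_s, tn_s) / n_gpus

# 1 GPU at 200 img/s -> 6400 s/epoch; 4 GPUs at 720 img/s (10% comm overhead) -> ~1778 s/epoch
print(speedup(6400, 1778), efficiency(6400, 1778, 4))   # ~3.6x speedup, ~0.9 efficiency
```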

Hands-on DDP setup checklist

  • Seed everything (Python, framework, dataloader workers).
  • Shard data: each worker sees a unique subset each epoch.
  • Wrap model in DDP and move it to the right device.
  • Use mixed precision and gradient scaler.
  • Log per-worker metrics and averaged metrics; verify equal steps per worker.
  • Checkpoint with rank 0 only; make resume robust.
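
Putting the checklist together, here is a minimal DDP training sketch; the layer sizes, dummy dataset, hyperparameters, and checkpoint path are all illustrative. Launch it with something like torchrun --nproc_per_node=4.

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
    torch.cuda.set_device(local_rank)
    torch.manual_seed(42 + dist.get_rank())            # per-rank seed

    model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.4, momentum=0.9)
    scaler = torch.cuda.amp.GradScaler()

    dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset, shuffle=True)   # unique shard per rank
    loader = DataLoader(dataset, batch_size=256, sampler=sampler, num_workers=4)

    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
            with torch.cuda.amp.autocast():
                loss = F.cross_entropy(model(x), y)
            opt.zero_grad(set_to_none=True)
            scaler.scale(loss).backward()               # DDP all-reduces gradients here
            scaler.step(opt)
            scaler.update()
        if dist.get_rank() == 0:                        # checkpoint from rank 0 only
            torch.save({"model": model.module.state_dict(),
                        "opt": opt.state_dict(),
                        "epoch": epoch}, "ckpt.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```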

Exercises

Complete these to lock in the concepts. A quick test is available below; progress is saved only for logged-in users (anyone can take the test).

Exercise 1 — Plan a 4-GPU run with LR scaling

Given: 1 GPU baseline with batch 256, throughput 200 img/s, dataset 1.28M, 90 epochs. Move to 4 GPUs (DDP), keep per-GPU batch 256, use mixed precision.

  • Compute global batch and steps per epoch.
  • Estimate images/sec with a +30% AMP boost and 10% comm overhead.
  • Estimate time per epoch and total time.
  • Suggest a new LR from a base 0.1 using linear scaling.
Hint
  • Global batch = N × per-GPU batch.
  • Steps/epoch = dataset_size / global_batch.
  • Throughput estimate = N × single-GPU throughput × 1.3 × 0.9.
  • LR scales with global batch ratio.

Exercise 2 — Communication budget

4 GPUs, 10 Gbps link (~1.25 GB/s). Gradient size 25M params in FP32. Estimate per-step ring all-reduce time and the effect of FP16 gradients.

  • Compute gradient size in MB.
  • Compute ring all-reduce bytes per GPU.
  • Compute comm time per step for FP32 and FP16.
Hint
  • Size = params × bytes_per_param.
  • Ring cost per worker = 2*(N-1)/N × size.
  • Time = bytes / bandwidth.
  • Self-check: Your numbers should be close to those in the solutions; small differences due to rounding are fine.
  • Track assumptions (e.g., bandwidth efficiency) explicitly.

Common mistakes and self-check

  • Duplicated data across workers. Self-check: count unique sample IDs per epoch; should match dataset size.
  • Mismatched seeds causing skewed shuffles. Self-check: log first 5 sample IDs per worker; verify no duplicates across workers in the same step.
  • Wrong LR after scaling batch size. Self-check: compare training loss curves vs baseline; too high LR often explodes early.
  • Underutilized GPUs due to slow dataloaders. Self-check: monitor GPU utilization; if low, increase workers, use caching, or prefetching.
  • Saving checkpoints from every rank. Self-check: ensure only one rank writes; others skip to avoid corruption.

Practical projects

  • Project A: Train ResNet on a public dataset with 1, 2, and 4 GPUs. Measure speedup, efficiency, and final accuracy; plot throughput vs GPUs.
  • Project B: Take a medium ViT that barely fits and try pipeline parallel across 2 GPUs. Tune micro-batches to reduce bubbles; compare step time.
  • Project C: Create a reproducible distributed training template: seeding, DDP setup, mixed precision, checkpointing, and a metrics logger.

Mini challenge

Your 4-GPU job shows 40% of step time in all-reduce. Propose two changes to reduce it, then predict the new step time if you cut comm by half and overlap 50% of the remainder with compute.

Example reasoning

Change to FP16 gradients and enable overlap. If comm is 120 ms, half to 60 ms; overlap 50% means 30 ms is hidden. Net visible comm = 30 ms.
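
A quick arithmetic check, assuming comm is 120 ms as above and therefore 40% of a 300 ms step:

```python
step_ms, comm_ms = 300.0, 120.0        # comm = 40% of the step
compute_ms = step_ms - comm_ms         # 180 ms
halved_comm = comm_ms / 2              # FP16 gradients: 60 ms
visible_comm = halved_comm * 0.5       # overlap hides half: 30 ms visible
print(compute_ms + visible_comm)       # ~210 ms predicted step time
```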

Who this is for

  • Engineers training CV models who need faster iteration or larger batches.
  • Researchers scaling experiments across GPUs or nodes.
  • MLOps-minded developers building robust training pipelines.

Prerequisites

  • Comfort with training CNNs/Transformers on a single GPU.
  • Basic PyTorch or similar framework knowledge.
  • Understanding of batches, learning rate, and checkpoints.

Learning path

  • Before: GPU fundamentals, mixed precision, optimization basics.
  • This subskill: data vs model/pipeline parallel, sync vs async, comm costs, LR scaling, sharding, and robustness.
  • After: advanced sharding (optimizer/state), ZeRO-style partitioning, pipeline schedules, multi-node orchestration, and autoscaling.

Next steps

  • Implement DDP on a small project and record speedup and accuracy.
  • Try FP16 gradients and measure reduced comm time.
  • Take the Quick Test below to confirm understanding. Everyone can take it; progress is saved for logged-in users.

Expected output for Exercise 1 (self-check)
Global batch: 1024; Steps/epoch: 1250; Throughput ≈ 900–1000 img/s; Time/epoch ≈ 22–23 min; Total ≈ 33–35 h; New LR ≈ 0.4 with warmup.

Distributed Training Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

