
Distributed Training Basics

Learn Distributed Training Basics for free with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Why this matters

As an Applied Scientist, you often need to train models faster, train larger models, or both. Distributed training lets you scale across multiple GPUs or machines to:

  • Fit larger models that exceed a single GPU’s memory.
  • Reduce training time to meet deadlines and iterate faster.
  • Run bigger experiments (larger batch sizes) for more stable gradients.
  • Deploy efficient training pipelines that save compute cost.

Real tasks you’ll do: choose a parallelism strategy, estimate speedup and costs, configure batch size and learning rate for multi-GPU, and debug convergence issues after scaling.

Concept explained simply

Simple idea

Distributed training splits the work of training a model across multiple devices. You either split the data (each device sees different samples), split the model (each device holds different parts of the model), or both. Devices periodically communicate to agree on updates.

Mental model

Think of a team writing a book:

  • Data parallelism: Everyone edits a full copy of the book, then merges changes after each round.
  • Model parallelism: Each person owns a chapter; pages flow from one person to the next.
  • Pipeline parallelism: While one person edits chapter 1, another edits chapter 2, etc., in a staggered fashion to keep everyone busy.

Core building blocks

Data parallelism (DP)

Each GPU holds a full copy of the model and processes a different mini-batch; gradients are averaged across GPUs (usually via all-reduce). Simple and common.

  • Pros: Easy to use, scales well to many GPUs when communication is fast.
  • Cons: Model must fit on one GPU; communication cost grows with parameter count.
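Below is a minimal sketch of data parallelism with PyTorch DistributedDataParallel. It assumes a launch via torchrun (one process per GPU); the tiny linear model, random dataset, and hyperparameters are placeholders for illustration, not part of this lesson.

```python
# Minimal DDP sketch. Launch with: torchrun --nproc_per_node=8 train_ddp.py
# Model, dataset, and hyperparameters are illustrative placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group("nccl")                      # NCCL backend for GPU collectives
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = DDP(torch.nn.Linear(128, 10).cuda())         # gradients are all-reduced during backward()

    data = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)                   # each rank sees a different shard of the data
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)                         # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x.cuda()), y.cuda()).backward()
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```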

Model parallelism (MP)

Split model layers or tensors across devices so a single forward/backward pass spans multiple GPUs.

  • Pros: Fits very large models.
  • Cons: More complex, higher communication overhead, harder to debug.
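The simplest form of model parallelism is splitting layers across two GPUs and moving activations between them, sketched below with a placeholder two-layer model; real setups use tensor- or layer-splitting frameworks rather than hand-placed layers.

```python
# Naive layer-wise model parallelism across two GPUs (illustrative placeholder model).
import torch

class TwoStageModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = torch.nn.Linear(1024, 4096).to("cuda:0")   # first half lives on GPU 0
        self.stage2 = torch.nn.Linear(4096, 10).to("cuda:1")     # second half lives on GPU 1

    def forward(self, x):
        x = torch.relu(self.stage1(x.to("cuda:0")))
        return self.stage2(x.to("cuda:1"))                       # activations hop between devices

model = TwoStageModel()
out = model(torch.randn(32, 1024))   # a single forward pass spans both GPUs
```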

Pipeline parallelism (PP)

Partition the model into stages and send micro-batches through in a pipeline to keep GPUs utilized.

  • Pros: Good utilization for deep models; memory scales across stages.
  • Cons: Pipeline bubbles reduce efficiency; needs tuning of micro-batches.
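To see why micro-batch count matters, the sketch below uses the common GPipe-style bubble estimate (p - 1) / (m + p - 1), where p is the number of stages and m the number of micro-batches; the specific numbers are illustrative.

```python
# Rough pipeline-bubble fraction for a GPipe-style schedule: (p - 1) / (m + p - 1)
def bubble_fraction(p: int, m: int) -> float:
    """p = pipeline stages, m = micro-batches per global batch."""
    return (p - 1) / (m + p - 1)

print(bubble_fraction(p=4, m=4))    # ~0.43 -- almost half the pipeline time is idle
print(bubble_fraction(p=4, m=32))   # ~0.086 -- more micro-batches shrink the bubble
```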

Sharded training (e.g., ZeRO)

Shard optimizer states, gradients, and sometimes parameters across data-parallel ranks to reduce per-GPU memory.

  • Stage 1: Shard optimizer states.
  • Stage 2: Shard optimizer states + gradients.
  • Stage 3: Also shard model parameters.
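One way to get ZeRO-style sharding in plain PyTorch is FullyShardedDataParallel (FSDP): SHARD_GRAD_OP is roughly Stage 2, FULL_SHARD roughly Stage 3. A minimal sketch, assuming a torchrun launch and a placeholder model:

```python
# ZeRO-style sharding via PyTorch FSDP (assumes launch with torchrun; placeholder model).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.Linear(4096, 1024)).cuda()

# SHARD_GRAD_OP shards gradients + optimizer states (ZeRO-2-like);
# ShardingStrategy.FULL_SHARD would also shard the parameters (ZeRO-3-like).
model = FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```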

Synchronization patterns

Synchronous SGD: all workers average gradients every step (deterministic, stable). Asynchronous SGD: workers push and pull parameters on their own schedules (more tolerant of stragglers and flaky workers, but stale gradients can hurt convergence).

Communication backbones

Collective ops like all-reduce, reduce-scatter, and all-gather dominate communication overhead. Interconnect topology (NVLink, PCIe, InfiniBand) and libraries (e.g., NCCL) matter. Ring all-reduce is bandwidth-efficient and widely used.
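A quick way to verify the collective backend (also an item on the checklist later) is a tiny all-reduce sanity test. A minimal sketch, assuming torchrun with one process per GPU:

```python
# All-reduce sanity check. Launch with: torchrun --nproc_per_node=<num_gpus> check_allreduce.py
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

t = torch.ones(1, device="cuda") * rank          # each rank contributes its rank id
dist.all_reduce(t, op=dist.ReduceOp.SUM)         # every rank ends up with 0 + 1 + ... + (world_size - 1)
expected = sum(range(dist.get_world_size()))
print(f"rank {rank}: got {t.item():.0f}, expected {expected}")
dist.destroy_process_group()
```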

Batch size, LR scaling, and stability

Global batch = per-GPU batch × number of GPUs × gradient accumulation steps. A common rule is linear LR scaling with warmup when increasing global batch. Watch for loss spikes; adjust LR, warmup, and regularization.
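The same rule written as small helper functions (the function names are illustrative; the values mirror the worked example later in this lesson):

```python
# Global batch and linear LR scaling (helper names are illustrative).
def global_batch(per_gpu_batch: int, num_gpus: int, grad_accum: int) -> int:
    return per_gpu_batch * num_gpus * grad_accum

def linearly_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    return base_lr * new_batch / base_batch

gb = global_batch(per_gpu_batch=32, num_gpus=8, grad_accum=4)          # 1024
lr = linearly_scaled_lr(base_lr=0.001, base_batch=256, new_batch=gb)   # 0.004
print(gb, lr)
```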

Precision and memory

Mixed precision (FP16/BF16) speeds up training and reduces memory. Activation checkpointing trades compute for memory. Gradient clipping helps stability when scaling up.
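A hedged sketch of mixed precision with loss scaling and gradient clipping in PyTorch; the model and batch are placeholders, and a real loop would iterate over a DataLoader (with BF16 the loss scaler is usually unnecessary).

```python
# Mixed precision (FP16 autocast + loss scaling) with gradient clipping; placeholder model and batch.
import torch

model = torch.nn.Linear(128, 10).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()            # dynamic loss scaling for FP16 stability
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(32, 128, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = loss_fn(model(x), y)

scaler.scale(loss).backward()
scaler.unscale_(opt)                            # unscale before clipping so the threshold is meaningful
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(opt)
scaler.update()
opt.zero_grad()
```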

Throughput vs. efficiency

Speedup is limited by serial or communication-bound parts (Amdahl’s law). Aim to overlap compute and communication, keep batches large enough for high GPU utilization, and minimize cross-node transfers.
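A back-of-envelope Amdahl's-law estimate, where p is the parallelizable fraction of step time and n the number of GPUs (the numbers below are illustrative):

```python
# Amdahl's-law speedup estimate: serial or communication-bound work caps the gain.
def amdahl_speedup(p: float, n: int) -> float:
    """p = parallelizable fraction of step time, n = number of GPUs."""
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(0.95, 4))   # ~3.5x, not 4x
print(amdahl_speedup(0.95, 8))   # ~5.9x, not 8x
```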

Worked examples

Example 1: Compute global batch and LR scaling

Setup: 8 GPUs, per-GPU batch = 32, gradient accumulation = 4. Base config: global batch 256 with LR 0.001 and linear scaling/warmup.

  1. Global batch = 8 × 32 × 4 = 1024.
  2. Scaled LR = 0.001 × (1024 / 256) = 0.004.
  3. With a 500-step linear warmup, LR at step 250 ≈ 0.002 (halfway to the target).

Interpretation: Keep the optimizer stable by linearly increasing LR during warmup; monitor loss for spikes.
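One way to implement this warmup is a LambdaLR schedule that ramps the LR up to the scaled target over 500 steps; the optimizer and model below are placeholders, and a real loop would run forward/backward before each step.

```python
# Linear warmup to the scaled LR (0.004) over 500 steps via LambdaLR; placeholder model/optimizer.
import torch

model = torch.nn.Linear(10, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.004)          # target LR after linear scaling
warmup_steps = 500
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lr_lambda=lambda step: min((step + 1) / warmup_steps, 1.0)
)

for step in range(warmup_steps):
    opt.step()                        # forward/backward would precede this in a real loop
    sched.step()
    if step + 1 == 250:
        print(sched.get_last_lr())    # ~[0.002], halfway through warmup
```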

Example 2: Pick a parallelism strategy for a big model

Model: 3B params, BF16. You have 4 × 24 GB GPUs. Pure DP keeps a full copy of the parameters, gradients, and optimizer states on every GPU, so it won't fit. Use sharding (ZeRO-2 or -3) and/or pipeline parallelism.

  • Option A: DP + ZeRO-2 + activation checkpointing.
  • Option B: 2-way PP × 2-way DP + ZeRO-1 for balance.

Decision: Start with Option A (simpler). If utilization is poor or memory still tight, move to Option B.
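A back-of-envelope check of why pure DP fails here, assuming BF16 weights and gradients plus FP32 Adam moments and master weights (activations and memory fragmentation are ignored):

```python
# Rough per-GPU memory estimate for a 3B-parameter model with Adam (activations ignored).
params = 3e9
BF16, FP32 = 2, 4                                  # bytes per value

weights  = params * BF16                           # ~6 GB
grads    = params * BF16                           # ~6 GB
adam_m_v = 2 * params * FP32                       # ~24 GB (FP32 first/second moments)
master   = params * FP32                           # ~12 GB FP32 master weights (mixed precision)

per_replica = (weights + grads + adam_m_v + master) / 1e9
print(f"pure DP: ~{per_replica:.0f} GB per GPU")   # ~48 GB, far above 24 GB

# ZeRO-2-style sharding keeps full weights but splits grads + optimizer states across 4 GPUs.
zero2 = (weights + (grads + adam_m_v + master) / 4) / 1e9
print(f"ZeRO-2 over 4 GPUs: ~{zero2:.1f} GB per GPU (before activations)")   # ~16.5 GB
```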

Example 3: Diagnose instability after scaling

Symptom: Loss diverges when moving from 1 to 8 GPUs.

  • Check LR scaling and warmup: use linear scaling and warmup for a few hundred steps.
  • Enable gradient clipping (e.g., 1.0).
  • Ensure consistent seeds, and use synchronized BatchNorm (SyncBN) or switch to GroupNorm.
  • Verify mixed precision settings and loss scaling are correct.

Likely fix: Reduce LR or increase warmup; add clipping; confirm all-reduce is configured correctly.
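Two of these fixes are one-liners in PyTorch, sketched below with a placeholder model; for SyncBatchNorm to take effect you still need an initialized process group and a DDP-wrapped model.

```python
# Common stabilizers when scaling out: synchronized BatchNorm and gradient clipping (placeholder model).
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.BatchNorm2d(16)).cuda()

# Convert BatchNorm layers to SyncBatchNorm before wrapping in DDP,
# so statistics are averaged across ranks instead of computed per rank.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# In the training loop: clip the global gradient norm after backward(), before optimizer.step().
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```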

Practical setup checklist

  • [ ] Confirm the model fits, using mixed precision and activation checkpointing if needed.
  • [ ] Decide strategy: DP first; add ZeRO or PP/MP if memory-bound.
  • [ ] Compute global batch and apply LR scaling + warmup.
  • [ ] Enable gradient clipping and deterministic seeds.
  • [ ] Verify collective backend works (all-reduce sanity test).
  • [ ] Set a sensible checkpoint frequency for fault tolerance (see the sketch after this list).
  • [ ] Record throughput (samples/sec) and utilization to track efficiency.
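For the checkpointing item above, a minimal rank-0-only save sketch; it assumes an initialized process group, and the helper name and path are illustrative.

```python
# Periodic checkpointing from rank 0 only (assumes an initialized process group).
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    if dist.get_rank() == 0:                        # one writer avoids redundant copies of the same file
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, path)
    dist.barrier()                                   # keep all ranks in sync around the I/O
```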

Exercises

Exercise 1 (ex1): Scale batch size and learning rate

You have 8 GPUs, per-GPU batch 32, gradient accumulation 4. Base LR is 0.001 at global batch 256. You use linear LR scaling with a 500-step warmup to the target LR. Calculate:

  • Global batch size
  • Target LR after scaling
  • LR at step 250 during warmup

Need a hint?
  • Global batch = GPUs × per-GPU batch × accumulation.
  • Linear scaling: new LR = base LR × (new global batch / base global batch). Warmup is linear from 0 to target.

Exercise 2 (ex2): Choose a feasible parallelism plan

Model: 3B parameters, BF16, Adam optimizer. Hardware: 4 GPUs, 24 GB each. Goal: fit training and maintain good throughput. Propose a feasible plan that includes parallelism type(s), sharding level, and at least one memory-saving technique. Briefly justify.

Need a hint?
  • Sharding optimizer states (ZeRO-2) drastically reduces memory per GPU.
  • Activation checkpointing trades compute for memory; pipeline parallelism reduces per-GPU model footprint.

Common mistakes and self-check

Incorrect LR after scaling

Symptom: sudden loss spikes. Self-check: recompute global batch and confirm LR scaling and warmup schedule.

Ignoring communication bottlenecks

Symptom: low GPU utilization. Self-check: measure all-reduce time, try larger micro-batches, overlap compute/comm.

No gradient clipping at large batch

Symptom: occasional divergence. Self-check: enable clipping (e.g., 1.0) and compare stability.

BatchNorm sync issues

Symptom: different results across ranks. Self-check: use synchronized BN or switch to GroupNorm/LayerNorm.

Under-checkpointing

Symptom: lost progress on failure. Self-check: checkpoint every N steps and test restore.

Mini challenge

Your experiment must finish 2× faster than single-GPU training. You have 4 GPUs on one node. Propose a configuration (parallelism, global batch, LR policy, precision, and any overlap tricks) that likely achieves near-2× speedup while keeping training stable. Explain your choices in 3–5 sentences.

Who this is for

  • Applied Scientists and ML Engineers who train deep models on GPUs.
  • Researchers moving from single-GPU notebooks to multi-GPU experiments.

Prerequisites

  • Comfortable with training loops, optimizers (SGD/Adam), and LR schedules.
  • Basic understanding of tensors, layers, and backpropagation.
  • Familiarity with mixed precision training.

Learning path

  1. Start: Data parallelism and all-reduce; compute global batch and LR scaling.
  2. Add: Mixed precision + gradient clipping + activation checkpointing.
  3. Advance: Sharding (ZeRO), pipeline/model parallelism basics.
  4. Optimize: Overlap compute/comm, tune micro-batches, monitor utilization.
  5. Harden: Checkpointing, determinism, and reproducibility across runs.

Practical projects

  • Project 1: Convert a single-GPU image classifier to 4-GPU data parallel. Report speedup, utilization, and final accuracy.
  • Project 2: Train a medium transformer with ZeRO-2 on 4 GPUs. Compare memory usage with and without activation checkpointing.
  • Project 3: Implement a 2-stage pipeline for a deep model. Tune micro-batch count to minimize pipeline bubbles.

Next steps

  • Run the exercises and validate your numbers/plan.
  • Complete the quick test to check understanding. Everyone can take it; only logged-in users have saved progress.
  • Apply the checklist to your current training job and log improvements.

Quick Test

Answer short questions to verify your understanding. Your progress is saved only if you are logged in.

Practice Exercises

2 exercises to complete (the full prompts and hints are under Exercises above).

Expected output for Exercise 1: Global batch: 1024; Target LR: 0.004; LR at step 250: 0.002

Distributed Training Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

