Who this is for
NLP Engineers training models beyond a single GPU, fine-tuning LLMs or sequence models, or preparing to scale experiments efficiently and reliably.
Prerequisites
- Comfortable with training loops (forward, backward, optimizer step).
- Understands batching, learning rate, and evaluation metrics.
- Basic familiarity with mixed precision and gradient accumulation.
Why this matters
Real-world NLP models often exceed single-GPU memory and time budgets. You will need distributed training to:
- Fine-tune large language models or multilingual encoders across many GPUs.
- Cut training time from days to hours for rapid iteration.
- Fit models that would otherwise OOM by sharding and streaming states.
- Keep experiments reproducible and cost-effective at scale.
Concept explained simply
Distributed training means many workers train one model together. Each worker sees a slice of the data and contributes gradients that are combined to update shared weights.
Mental model
Imagine a study group solving a big problem set:
- Data Parallelism: Everyone has the same problem sheet (model) but different questions (mini-batch). They compare answers (gradients) and agree on a final solution (averaged update).
- Model/Tensor Parallelism: The sheet is too big to hold by one person, so the group splits the sheet itself (model layers or tensors across devices).
- Pipeline Parallelism: The sheet is broken into sections (layers). Each person handles a section and passes work to the next, keeping the pipeline full with micro-batches.
Key terms and building blocks
- Global batch size: per_gpu_batch × number_of_gpus × grad_accum_steps.
- DDP (Distributed Data Parallel): Replicate model on each GPU; average gradients synchronously (often via AllReduce).
- AllReduce: Collective operation that sums/averages tensors across workers; ring AllReduce is a common algorithm (see the sketch after this list).
- Synchronous vs Asynchronous: Sync waits for all workers each step; async can be faster but risks stale gradients.
- Model parallelism: Split parameters across devices when a single device can't hold them.
- Pipeline parallelism: Split layers into stages and stream micro-batches through them.
- ZeRO/Optimizer state sharding: Shard optimizer states, gradients, and parameters across workers to save memory.
- Mixed precision: Use lower-precision math to speed up training and reduce memory, with loss scaling to keep stability.
- Communication/compute ratio: If communication time is too high relative to compute, scaling efficiency drops.
- Topology awareness: Intra-node links are faster than inter-node; place highly chatty partitions close together.
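To make the AllReduce term above concrete, here is a minimal sketch of averaging a gradient tensor across workers, assuming PyTorch with an already-initialized process group (as in the hands-on recipe below); the helper name is just for illustration:

import torch
import torch.distributed as dist

def average_gradient(grad: torch.Tensor) -> torch.Tensor:
    """Sum a tensor across all ranks, then divide by world size.

    This is the core of what DDP does for each gradient bucket.
    Assumes dist.init_process_group() has already been called.
    """
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # in-place sum across workers
    grad /= dist.get_world_size()                # average, as in synchronous SGD
    return grad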
Worked examples
Example 1 — Effective batch and LR scaling
You have 8 GPUs. Per-GPU batch = 16. Gradient accumulation = 2. Base learning rate is tuned for global batch 256.
- Global batch = 8 × 16 × 2 = 256.
- Since the new global batch equals the reference (256), learning rate can stay the same.
Rule of thumb: if you multiply global batch by k, try scaling LR by k (then fine-tune).
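A few lines of Python make this arithmetic explicit; the helper name and the base LR of 3e-4 are illustrative, and the scaling follows the linear rule of thumb above:

def scaled_lr(base_lr, ref_global_batch, per_gpu_batch, num_gpus, grad_accum):
    # Global batch = per-GPU batch x number of GPUs x accumulation steps
    global_batch = per_gpu_batch * num_gpus * grad_accum
    # Linear scaling rule of thumb: scale LR by the same factor as the batch
    return global_batch, base_lr * (global_batch / ref_global_batch)

print(scaled_lr(3e-4, 256, per_gpu_batch=16, num_gpus=8, grad_accum=2))
# -> (256, 0.0003): same global batch as the reference, so the LR is unchanged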
Example 2 — Estimate AllReduce time
Ring AllReduce rough time: T ≈ 2 × (N−1)/N × (message_size / bandwidth) + (N−1) × latency.
- N=4 GPUs, message=400 MB, bandwidth=25 GB/s, latency=5 μs.
- (message_size / bandwidth) = 0.4 / 25 = 0.016 s.
- Factor = 2 × 3/4 = 1.5 → 1.5 × 0.016 = 0.024 s.
- Latency term = 3 × 5e−6 ≈ 0.000015 s.
- Total ≈ 0.024015 s ≈ 24 ms per gradient bucket.
Takeaway: Larger buckets amortize latency but increase memory; balance is key.
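The same estimate as a small helper, implementing the rough formula above with consistent units (bytes, bytes/s, seconds); the function name is illustrative:

def ring_allreduce_time(n_gpus, message_bytes, bandwidth_bytes_per_s, latency_s):
    # Bandwidth term: each byte crosses the ring roughly 2 * (N-1)/N times
    bw_term = 2 * (n_gpus - 1) / n_gpus * (message_bytes / bandwidth_bytes_per_s)
    # Latency term from the rough formula above: (N-1) ring steps
    return bw_term + (n_gpus - 1) * latency_s

# 400 MB bucket, 25 GB/s links, 5 microsecond latency, 4 GPUs
print(ring_allreduce_time(4, 400e6, 25e9, 5e-6))  # ~0.024 s, i.e. ~24 ms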
Example 3 — Memory fit with sharding
Model = 2B parameters. FP16 params (2 bytes) + FP16 grads (2 bytes) + FP32 master weights (4 bytes) + Adam m (4 bytes) + v (4 bytes) ≈ 16 bytes/param total → 32 GB.
- With ZeRO stage-3 across 8 GPUs, shard factor ≈ 8 → 32 GB / 8 = 4 GB for model+states per GPU.
- Add activations/buffers ≈ 3 GB → ≈ 7 GB/GPU.
- On 16 GB GPUs, this fits with headroom for dataloader and CUDA context.
Takeaway: Rough accounting helps avoid runtime OOMs.
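The same accounting as a quick calculator; the bytes-per-parameter figure and the 3 GB overhead are the rough numbers from the example, so adjust them for your optimizer and precision:

def per_gpu_memory_gb(n_params, bytes_per_param=16, shard_factor=1, overhead_gb=3.0):
    # Params + grads + master weights + Adam states, sharded ZeRO-3 style
    states_gb = n_params * bytes_per_param / shard_factor / 1e9
    # Activations, buffers, dataloader, CUDA context lumped into a rough overhead
    return states_gb + overhead_gb

print(per_gpu_memory_gb(2e9, shard_factor=8))  # 7.0 GB per GPU -> fits on 16 GB cards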
Hands-on: minimal distributed recipe (PyTorch)
# Minimal synchronous distributed data parallel (DDP) training loop in PyTorch.
# build_model, dataset, per_gpu_bs, grad_accum, lr, and E are placeholders for
# your own model, data, and hyperparameters; launch one process per GPU with
# torchrun, which sets LOCAL_RANK for each process.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

model = build_model().to(device)
model = DDP(model, device_ids=[local_rank])   # replicate model, sync gradients
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scaler = torch.cuda.amp.GradScaler()          # if using mixed precision

sampler = DistributedSampler(dataset)         # one shard of the data per rank
loader = DataLoader(dataset, batch_size=per_gpu_bs, sampler=sampler)

for epoch in range(E):
    sampler.set_epoch(epoch)                  # ensure different shuffles per epoch
    for step, batch in enumerate(loader):
        batch = {k: v.to(device) for k, v in batch.items()}  # assumes dict batches
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(**batch).loss / grad_accum  # scale loss for accumulation
        scaler.scale(loss).backward()
        if (step + 1) % grad_accum == 0:
            scaler.step(optimizer)            # unscales grads, then optimizer step
            scaler.update()
            optimizer.zero_grad(set_to_none=True)

dist.barrier()                                # wait for all ranks before teardown
dist.destroy_process_group()
Notes
- Use DistributedSampler to avoid sample duplication across ranks.
- Call set_epoch(epoch) for proper shuffling each epoch.
- Use gradient accumulation to reach your target global batch without exceeding memory (see the no_sync sketch after these notes).
- Zero grads with set_to_none=True to reduce memory writes.
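One optional refinement not shown in the recipe: with DDP, every backward() triggers a gradient AllReduce, so accumulation-only steps pay for communication they do not need. DDP's no_sync() context manager skips that sync on non-update steps. A minimal sketch of the inner loop under that assumption, reusing the placeholders from the recipe above:

from contextlib import nullcontext

for step, batch in enumerate(loader):
    batch = {k: v.to(device) for k, v in batch.items()}
    is_update_step = (step + 1) % grad_accum == 0
    # Skip the gradient AllReduce on accumulation-only steps; sync only when updating
    sync_ctx = nullcontext() if is_update_step else model.no_sync()
    with sync_ctx:
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(**batch).loss / grad_accum
        scaler.scale(loss).backward()
    if is_update_step:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)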
Common mistakes and self-checks
- Forgetting DistributedSampler: leads to overlapping samples across GPUs. Self-check: Count unique sample IDs processed per epoch across ranks (see the sketch after this list).
- Incorrect global batch math: LR scaled but batch not, or vice versa. Self-check: Log per-GPU batch, grad_accum, world_size, computed global batch, and LR each run.
- AllReduce on every tiny tensor: too many small communications. Fix: use gradient bucketing and fused ops.
- No set_epoch in sampler: shuffling repeats each epoch. Fix: sampler.set_epoch(epoch).
- Mismatched seeds or dropout across ranks during eval: leads to inconsistent metrics. Fix: set deterministic seeds and eval mode.
- OOM from unsharded optimizer states: use ZeRO/sharding or CPU offload; enable mixed precision.
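For the DistributedSampler self-check above, a minimal sketch of counting overlap across ranks; it assumes an initialized process group and that each example carries an integer ID, and the helper name is illustrative:

import torch.distributed as dist

def check_no_duplicate_samples(seen_ids):
    """seen_ids: set of sample IDs this rank processed in one epoch."""
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, seen_ids)   # collect every rank's ID set
    total = sum(len(s) for s in gathered)
    unique = len(set().union(*gathered))
    # With a correct DistributedSampler setup, total == unique
    # (up to the padding duplicates the sampler adds to even out shard sizes)
    return total, unique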
Exercises
These mirror the interactive exercises below. Try them here first, then submit your answers in the Exercise cards.
- Compute global batch and LR. 8 GPUs, per-GPU batch 16, grad accumulation 2. Base LR tuned for global batch 256. What are the global batch and recommended LR?
- Estimate AllReduce time. N=4, message=400 MB, bandwidth=25 GB/s, latency=5 μs. Approximate time per AllReduce?
- Memory fit with ZeRO. 2B params, ~16 bytes/param total, ZeRO-3 over 8 GPUs, plus 3 GB overhead per GPU. On 16 GB GPUs, does it fit? What’s the estimated memory per GPU?
Self-checklist
- I can compute global batch and propose LR scaling.
- I can estimate communication time and reason about bucket sizes.
- I can budget memory with and without sharding.
- I know when to choose data, model/tensor, or pipeline parallelism.
- I can explain why DistributedSampler and set_epoch are required.
Mini challenge
You have 6 GPUs with 16 GB each and want to fine-tune a 1B-parameter model using Adam in mixed precision within 2 hours:
- Propose a parallelism plan (data parallel only? add sharding? any pipeline?).
- Pick a global batch and LR scaling rule. Justify.
- Suggest 3 logging metrics to verify scaling efficiency (e.g., step time, AllReduce time, GPU utilization).
Suggested direction
Start with DDP + ZeRO-2 or ZeRO-3, choose global batch via per-GPU memory tests, and track comm/compute ratio. If step time stalls, investigate bucket sizes and overlapping comm/compute.
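If bucket size is the suspect, DDP exposes it via its bucket_cap_mb argument (default 25 MB). A quick sketch for comparing step times under different bucket caps, using CUDA events for GPU-accurate timing; build_model, device, local_rank, and batch are the placeholders from the recipe above:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Larger buckets amortize latency; smaller buckets overlap better with compute
model = DDP(build_model().to(device), device_ids=[local_rank], bucket_cap_mb=50)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
loss = model(**batch).loss
loss.backward()                     # gradient AllReduce overlaps with backward
end.record()
torch.cuda.synchronize()
print(f"step time: {start.elapsed_time(end):.1f} ms")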
Practical projects
- Scale a sequence classifier from 1 to 4 GPUs, maintain accuracy while reducing wall-clock time by at least 2×.
- Sharded fine-tune of a 1B-parameter model; report memory before/after sharding and final validation score.
- Implement gradient bucketing and measure impact on step time for small vs large buckets.
Learning path
- Before: Mixed Precision Fundamentals → Efficient Dataloading.
- Now: Distributed Training Basics (this lesson).
- Next: Advanced Sharding and Pipeline Parallelism → Fault Tolerance and Checkpointing → Multi-node Performance Tuning.
Next steps
- Try the exercises below, then take the Quick Test.
- Note: The Quick Test is available to everyone; only logged-in users get saved progress and completion tracking.