Who this is for
NLP Engineers training models beyond a single GPU, fine-tuning LLMs or sequence models, or preparing to scale experiments efficiently and reliably.
Prerequisites
- Comfortable with training loops (forward, backward, optimizer step).
- Understands batching, learning rate, and evaluation metrics.
- Basic familiarity with mixed precision and gradient accumulation.
Why this matters
Real-world NLP models often exceed single-GPU memory and time budgets. You will need distributed training to:
- Fine-tune large language models or multilingual encoders across many GPUs.
- Cut training time from days to hours for rapid iteration.
- Fit models that would otherwise OOM by sharding and streaming states.
- Keep experiments reproducible and cost-effective at scale.
Concept explained simply
Distributed training means many workers train one model together. Each worker sees a slice of the data and contributes gradients that are combined to update shared weights.
Mental model
Imagine a study group solving a big problem set:
- Data Parallelism: Everyone has the same problem sheet (model) but different questions (mini-batch). They compare answers (gradients) and agree on a final solution (averaged update).
- Model/Tensor Parallelism: The sheet is too big to hold by one person, so the group splits the sheet itself (model layers or tensors across devices).
- Pipeline Parallelism: The sheet is broken into sections (layers). Each person handles a section and passes work to the next, keeping the pipeline full with micro-batches.
Key terms and building blocks
- Global batch size: per_gpu_batch × number_of_gpus × grad_accum_steps.
- DDP (Distributed Data Parallel): Replicate model on each GPU; average gradients synchronously (often via AllReduce).
- AllReduce: Collective operation that sums/averages tensors across workers; ring AllReduce is a common algorithm (see the sketch after this list).
- Synchronous vs Asynchronous: Sync waits for all workers each step; async can be faster but risks stale gradients.
- Model parallelism: Split parameters across devices when a single device can't hold them.
- Pipeline parallelism: Split layers into stages and stream micro-batches through them.
- ZeRO/Optimizer state sharding: Shard optimizer states, gradients, and parameters across workers to save memory.
- Mixed precision: Use lower-precision math to speed up training and reduce memory, with loss scaling to keep stability.
- Communication/compute ratio: If communication time is too high relative to compute, scaling efficiency drops.
- Topology awareness: Intra-node links are faster than inter-node; place highly chatty partitions close together.
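To make the AllReduce term above concrete, here is a minimal sketch of averaging a gradient tensor across workers, assuming PyTorch with an already-initialized process group (as in the hands-on recipe below); the helper name is just for illustration:

import torch
import torch.distributed as dist

def average_gradient(grad: torch.Tensor) -> torch.Tensor:
    """Sum a tensor across all ranks, then divide by world size.

    This is the core of what DDP does for each gradient bucket.
    Assumes dist.init_process_group() has already been called.
    """
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # in-place sum across workers
    grad /= dist.get_world_size()                # average, as in synchronous SGD
    return grad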
Worked examples
Example 1 — Effective batch and LR scaling
You have 8 GPUs. Per-GPU batch = 16. Gradient accumulation = 2. Base learning rate is tuned for global batch 256.
- Global batch = 8 × 16 × 2 = 256.
- Since the new global batch equals the reference (256), learning rate can stay the same.
Rule of thumb: if you multiply global batch by k, try scaling LR by k (then fine-tune).
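A few lines of Python make this arithmetic explicit; the helper name and the base LR of 3e-4 are illustrative, and the scaling follows the linear rule of thumb above:

def scaled_lr(base_lr, ref_global_batch, per_gpu_batch, num_gpus, grad_accum):
    # Global batch = per-GPU batch x number of GPUs x accumulation steps
    global_batch = per_gpu_batch * num_gpus * grad_accum
    # Linear scaling rule of thumb: scale LR by the same factor as the batch
    return global_batch, base_lr * (global_batch / ref_global_batch)

print(scaled_lr(3e-4, 256, per_gpu_batch=16, num_gpus=8, grad_accum=2))
# -> (256, 0.0003): same global batch as the reference, so the LR is unchanged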
Example 2 — Estimate AllReduce time
Ring AllReduce rough time: T ≈ 2 × (N−1)/N × (message_size / bandwidth) + (N−1) × latency.
- N=4 GPUs, message=400 MB, bandwidth=25 GB/s, latency=5 μs.
- (message_size / bandwidth) = 0.4 / 25 = 0.016 s.
- Factor = 2 × 3/4 = 1.5 → 1.5 × 0.016 = 0.024 s.
- Latency term = 3 × 5e−6 ≈ 0.000015 s.
- Total ≈ 0.024015 s ≈ 24 ms per gradient bucket.
Takeaway: Larger buckets amortize latency but increase memory; balance is key.
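The same estimate as a small helper, implementing the rough formula above with consistent units (bytes, bytes/s, seconds); the function name is illustrative:

def ring_allreduce_time(n_gpus, message_bytes, bandwidth_bytes_per_s, latency_s):
    # Bandwidth term: each byte crosses the ring roughly 2 * (N-1)/N times
    bw_term = 2 * (n_gpus - 1) / n_gpus * (message_bytes / bandwidth_bytes_per_s)
    # Latency term from the rough formula above: (N-1) ring steps
    return bw_term + (n_gpus - 1) * latency_s

# 400 MB bucket, 25 GB/s links, 5 microsecond latency, 4 GPUs
print(ring_allreduce_time(4, 400e6, 25e9, 5e-6))  # ~0.024 s, i.e. ~24 ms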
Example 3 — Memory fit with sharding
Model = 2B parameters. FP16 params (2 bytes) + FP16 grads (2 bytes) + FP32 master weights (4 bytes) + Adam m (4 bytes) + v (4 bytes) ≈ 16 bytes/param total → 32 GB.
- With ZeRO stage-3 across 8 GPUs, shard factor ≈ 8 → 32 GB / 8 = 4 GB for model+states per GPU.
- Add activations/buffers ≈ 3 GB → ≈ 7 GB/GPU.
- On 16 GB GPUs, this fits with headroom for dataloader and CUDA context.
Takeaway: Rough accounting helps avoid runtime OOMs.
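The same accounting as a quick calculator; the bytes-per-parameter figure and the 3 GB overhead are the rough numbers from the example, so adjust them for your optimizer and precision:

def per_gpu_memory_gb(n_params, bytes_per_param=16, shard_factor=1, overhead_gb=3.0):
    # Params + grads + master weights + Adam states, sharded ZeRO-3 style
    states_gb = n_params * bytes_per_param / shard_factor / 1e9
    # Activations, buffers, dataloader, CUDA context lumped into a rough overhead
    return states_gb + overhead_gb

print(per_gpu_memory_gb(2e9, shard_factor=8))  # 7.0 GB per GPU -> fits on 16 GB cards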
Hands-on: minimal distributed recipe (PyTorch)
# Minimal synchronous distributed data parallel (DDP) training loop in PyTorch.
# build_model, dataset, per_gpu_bs, grad_accum, lr, and E are placeholders for
# your own model, data, and hyperparameters; launch one process per GPU with
# torchrun, which sets LOCAL_RANK for each process.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

model = build_model().to(device)
model = DDP(model, device_ids=[local_rank])   # replicate model, sync gradients
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scaler = torch.cuda.amp.GradScaler()          # if using mixed precision

sampler = DistributedSampler(dataset)         # one shard of the data per rank
loader = DataLoader(dataset, batch_size=per_gpu_bs, sampler=sampler)

for epoch in range(E):
    sampler.set_epoch(epoch)                  # ensure different shuffles per epoch
    for step, batch in enumerate(loader):
        batch = {k: v.to(device) for k, v in batch.items()}  # assumes dict batches
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(**batch).loss / grad_accum  # scale loss for accumulation
        scaler.scale(loss).backward()
        if (step + 1) % grad_accum == 0:
            scaler.step(optimizer)            # unscales grads, then optimizer step
            scaler.update()
            optimizer.zero_grad(set_to_none=True)

dist.barrier()                                # wait for all ranks before teardown
dist.destroy_process_group()
Notes
- Use DistributedSampler to avoid sample duplication across ranks.
- Call set_epoch(epoch) for proper shuffling each epoch.
- Use gradient accumulation to reach your target global batch without exceeding memory (see the no_sync sketch after these notes).
- Zero grads with set_to_none=True to reduce memory writes.
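One optional refinement not shown in the recipe: with DDP, every backward() triggers a gradient AllReduce, so accumulation-only steps pay for communication they do not need. DDP's no_sync() context manager skips that sync on non-update steps. A minimal sketch of the inner loop under that assumption, reusing the placeholders from the recipe above:

from contextlib import nullcontext

for step, batch in enumerate(loader):
    batch = {k: v.to(device) for k, v in batch.items()}
    is_update_step = (step + 1) % grad_accum == 0
    # Skip the gradient AllReduce on accumulation-only steps; sync only when updating
    sync_ctx = nullcontext() if is_update_step else model.no_sync()
    with sync_ctx:
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(**batch).loss / grad_accum
        scaler.scale(loss).backward()
    if is_update_step:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)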
Common mistakes and self-checks
- Forgetting DistributedSampler: leads to overlapping samples across GPUs. Self-check: Count unique sample IDs processed per epoch across ranks (see the sketch after this list).
- Incorrect global batch math: LR scaled but batch not, or vice versa. Self-check: Log per-GPU batch, grad_accum, world_size, computed global batch, and LR each run.
- AllReduce on every tiny tensor: too many small communications. Fix: use gradient bucketing and fused ops.
- No set_epoch in sampler: shuffling repeats each epoch. Fix: sampler.set_epoch(epoch).
- Mismatched seeds or dropout across ranks during eval: leads to inconsistent metrics. Fix: set deterministic seeds and eval mode.
- OOM from unsharded optimizer states: use ZeRO/sharding or CPU offload; enable mixed precision.
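For the DistributedSampler self-check above, a minimal sketch of counting overlap across ranks; it assumes an initialized process group and that each example carries an integer ID, and the helper name is illustrative:

import torch.distributed as dist

def check_no_duplicate_samples(seen_ids):
    """seen_ids: set of sample IDs this rank processed in one epoch."""
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, seen_ids)   # collect every rank's ID set
    total = sum(len(s) for s in gathered)
    unique = len(set().union(*gathered))
    # With a correct DistributedSampler setup, total == unique
    # (up to the padding duplicates the sampler adds to even out shard sizes)
    return total, unique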
Exercises
These mirror the interactive exercises below. Try them here first, then submit your answers in the Exercise cards.
- Compute global batch and LR. 8 GPUs, per-GPU batch 16, grad accumulation 2. Base LR tuned for global batch 256. What are the global batch and recommended LR?
- Estimate AllReduce time. N=4, message=400 MB, bandwidth=25 GB/s, latency=5 μs. Approximate time per AllReduce?
- Memory fit with ZeRO. 2B params, ~16 bytes/param total, ZeRO-3 over 8 GPUs, plus 3 GB overhead per GPU. On 16 GB GPUs, does it fit? What’s the estimated memory per GPU?
Self-checklist
- I can compute global batch and propose LR scaling.
- I can estimate communication time and reason about bucket sizes.
- I can budget memory with and without sharding.
- I know when to choose data, model/tensor, or pipeline parallelism.
- I can explain why DistributedSampler and set_epoch are required.
Mini challenge
You have 6 GPUs with 16 GB each and want to fine-tune a 1B-parameter model using Adam in mixed precision within 2 hours:
- Propose a parallelism plan (data parallel only? add sharding? any pipeline?).
- Pick a global batch and LR scaling rule. Justify.
- Suggest 3 logging metrics to verify scaling efficiency (e.g., step time, AllReduce time, GPU utilization).
Suggested direction
Start with DDP + ZeRO-2 or ZeRO-3, choose global batch via per-GPU memory tests, and track comm/compute ratio. If step time stalls, investigate bucket sizes and overlapping comm/compute.
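If bucket size is the suspect, DDP exposes it via its bucket_cap_mb argument (default 25 MB). A quick sketch for comparing step times under different bucket caps, using CUDA events for GPU-accurate timing; build_model, device, local_rank, and batch are the placeholders from the recipe above:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Larger buckets amortize latency; smaller buckets overlap better with compute
model = DDP(build_model().to(device), device_ids=[local_rank], bucket_cap_mb=50)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
loss = model(**batch).loss
loss.backward()                     # gradient AllReduce overlaps with backward
end.record()
torch.cuda.synchronize()
print(f"step time: {start.elapsed_time(end):.1f} ms")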
Practical projects
- Scale a sequence classifier from 1 to 4 GPUs, maintain accuracy while reducing wall-clock time by at least 2×.
- Sharded fine-tune of a 1B-parameter model; report memory before/after sharding and final validation score.
- Implement gradient bucketing and measure impact on step time for small vs large buckets.
Learning path
- Before: Mixed Precision Fundamentals → Efficient Dataloading.
- Now: Distributed Training Basics (this lesson).
- Next: Advanced Sharding and Pipeline Parallelism → Fault Tolerance and Checkpointing → Multi-node Performance Tuning.
Next steps
- Try the exercises below, then take the Quick Test.
- Note: The Quick Test is available to everyone; only logged-in users get saved progress and completion tracking.