
Distributed Training Basics

Learn Distributed Training Basics for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Who this is for

NLP Engineers training models beyond a single GPU, fine-tuning LLMs or sequence models, or preparing to scale experiments efficiently and reliably.

Prerequisites

  • Comfortable with training loops (forward, backward, optimizer step).
  • Understands batching, learning rate, and evaluation metrics.
  • Basic familiarity with mixed precision and gradient accumulation.

Why this matters

Real-world NLP models often exceed single-GPU memory and time budgets. You will need distributed training to:

  • Fine-tune large language models or multilingual encoders across many GPUs.
  • Cut training time from days to hours for rapid iteration.
  • Fit models that would otherwise OOM by sharding and streaming states.
  • Keep experiments reproducible and cost-effective at scale.

Concept explained simply

Distributed training means many workers train one model together. Each worker sees a slice of the data and contributes gradients that are combined to update shared weights.

Mental model

Imagine a study group solving a big problem set:

  • Data Parallelism: Everyone has the same problem sheet (model) but different questions (mini-batch). They compare answers (gradients) and agree on a final solution (averaged update).
  • Model/Tensor Parallelism: The sheet is too big to hold by one person, so the group splits the sheet itself (model layers or tensors across devices).
  • Pipeline Parallelism: The sheet is broken into sections (layers). Each person handles a section and passes work to the next, keeping the pipeline full with micro-batches.

Key terms and building blocks

  • Global batch size: per_gpu_batch × number_of_gpus × grad_accum_steps.
  • DDP (Distributed Data Parallel): Replicate model on each GPU; average gradients synchronously (often via AllReduce).
  • AllReduce: Collective operation that sums/averages tensors across workers; ring AllReduce is a common algorithm.
  • Synchronous vs Asynchronous: Sync waits for all workers each step; async can be faster but risks stale gradients.
  • Model parallelism: Split parameters across devices when a single device can't hold them.
  • Pipeline parallelism: Split layers into stages and stream micro-batches through them.
  • ZeRO/Optimizer state sharding: Shard optimizer states, gradients, and parameters across workers to save memory.
  • Mixed precision: Use lower-precision math to speed up training and reduce memory, with loss scaling to keep stability.
  • Communication/compute ratio: If communication time is too high relative to compute, scaling efficiency drops.
  • Topology awareness: Intra-node links are faster than inter-node; place highly chatty partitions close together.

Worked examples

Example 1 — Effective batch and LR scaling

You have 8 GPUs. Per-GPU batch = 16. Gradient accumulation = 2. Base learning rate is tuned for global batch 256.

  • Global batch = 8 × 16 × 2 = 256.
  • Since the new global batch equals the reference (256), learning rate can stay the same.

Rule of thumb: if you multiply global batch by k, try scaling LR by k (then fine-tune).
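
The same arithmetic as a tiny Python sketch (the base LR of 3e-4 below is just a placeholder value for illustration):

# Sketch: global batch size and linear LR scaling (values from Example 1; base LR is hypothetical)
def global_batch(per_gpu_batch, num_gpus, grad_accum_steps):
    return per_gpu_batch * num_gpus * grad_accum_steps

def linear_scaled_lr(base_lr, base_global_batch, new_global_batch):
    # Linear scaling rule: multiply LR by the same factor as the global batch, then fine-tune.
    return base_lr * new_global_batch / base_global_batch

gb = global_batch(per_gpu_batch=16, num_gpus=8, grad_accum_steps=2)               # 256
lr = linear_scaled_lr(base_lr=3e-4, base_global_batch=256, new_global_batch=gb)   # unchanged: 3e-4
print(gb, lr)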

Example 2 — Estimate AllReduce time

Ring AllReduce rough time: T ≈ 2 × (N−1)/N × (message_size / bandwidth) + (N−1) × latency.

  • N=4 GPUs, message=400 MB, bandwidth=25 GB/s, latency=5 μs.
  • (message_size / bandwidth) = 0.4 / 25 = 0.016 s.
  • Factor = 2 × 3/4 = 1.5 → 1.5 × 0.016 = 0.024 s.
  • Latency term = 3 × 5e−6 ≈ 0.000015 s.
  • Total ≈ 0.024015 s ≈ 24 ms per gradient bucket.

Takeaway: Larger buckets amortize latency but increase memory; balance is key.
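
For quick what-if checks, here is the same estimate as a small Python helper (a rough model only; real timings depend on overlap with compute, bucket sizes, and link topology):

# Sketch: rough ring AllReduce time, T ≈ 2*(N-1)/N * (bytes/bandwidth) + (N-1)*latency
def ring_allreduce_seconds(num_workers, message_bytes, bandwidth_bytes_per_s, latency_s):
    bandwidth_term = 2 * (num_workers - 1) / num_workers * (message_bytes / bandwidth_bytes_per_s)
    latency_term = (num_workers - 1) * latency_s
    return bandwidth_term + latency_term

# Example 2: 4 GPUs, 400 MB message, 25 GB/s links, 5 µs latency -> ≈ 24 ms
print(f"{ring_allreduce_seconds(4, 400e6, 25e9, 5e-6) * 1e3:.3f} ms")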

Example 3 — Memory fit with sharding

Model = 2B parameters. FP16 params (2 bytes) + FP16 grads (2 bytes) + FP32 master weights (4 bytes) + Adam m (4 bytes) + v (4 bytes) ≈ 16 bytes/param total → 32 GB.

  • With ZeRO stage-3 across 8 GPUs, shard factor ≈ 8 → 32 GB / 8 = 4 GB for model+states per GPU.
  • Add activations/buffers ≈ 3 GB → ≈ 7 GB/GPU.
  • On 16 GB GPUs, this fits with headroom for dataloader and CUDA context.

Takeaway: Rough accounting helps avoid runtime OOMs.
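
A back-of-the-envelope calculator for this accounting, assuming Adam in mixed precision at about 16 bytes per parameter (real frameworks add buffers and fragmentation on top):

# Sketch: estimated memory per GPU when params/grads/optimizer states are sharded (ZeRO-3 style)
def memory_per_gpu_gb(num_params, bytes_per_param=16, shard_factor=1, overhead_gb=3.0):
    # bytes_per_param ≈ 2 (fp16 params) + 2 (fp16 grads) + 4 (fp32 master) + 4 (Adam m) + 4 (Adam v)
    model_and_states_gb = num_params * bytes_per_param / 1e9
    return model_and_states_gb / shard_factor + overhead_gb

print(memory_per_gpu_gb(2e9, shard_factor=8))  # ≈ 7.0 GB, fits a 16 GB GPU with headroom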

Hands-on: minimal distributed recipe (PyTorch)

# Minimal synchronous distributed data parallel (DDP) training loop in PyTorch.
# Assumes build_model, dataset, lr, per_gpu_bs, grad_accum, and num_epochs are defined elsewhere.
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py (torchrun sets LOCAL_RANK per process)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

model = build_model().to(device)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scaler = torch.cuda.amp.GradScaler()   # mixed-precision loss scaling
sampler = DistributedSampler(dataset)  # gives each rank a disjoint shard of the data
loader = DataLoader(dataset, batch_size=per_gpu_bs, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # ensure a different shuffle each epoch
    for step, batch in enumerate(loader):
        with torch.cuda.amp.autocast():
            loss = model(batch).loss / grad_accum  # assumes the model returns an object with .loss
        # Tip: wrap non-update steps in model.no_sync() to skip redundant communication.
        scaler.scale(loss).backward()  # DDP all-reduces gradients during backward
        if (step + 1) % grad_accum == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)

dist.barrier()
dist.destroy_process_group()
Notes
  • Use DistributedSampler to avoid sample duplication across ranks.
  • Call set_epoch(epoch) for proper shuffling each epoch.
  • Use gradient accumulation to reach your target global batch without exceeding memory.
  • Zero grads with set_to_none=True to reduce memory writes.

Common mistakes and self-checks

  • Forgetting DistributedSampler: leads to overlapping samples across GPUs. Self-check: count unique sample IDs processed per epoch across ranks (a check sketch follows this list).
  • Incorrect global batch math: LR scaled but batch not, or vice versa. Self-check: Log per-GPU batch, grad_accum, world_size, computed global batch, and LR each run.
  • AllReduce on every tiny tensor: too many small communications. Fix: use gradient bucketing and fused ops.
  • No set_epoch in sampler: shuffling repeats each epoch. Fix: sampler.set_epoch(epoch).
  • Mismatched seeds or dropout across ranks during eval: leads to inconsistent metrics. Fix: set deterministic seeds and eval mode.
  • OOM from unsharded optimizer states: use ZeRO/sharding or CPU offload; enable mixed precision.
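
One way to run the first self-check above: gather each rank's sample IDs and count duplicates. A minimal sketch, assuming the process group is initialized and local_sample_ids is however you collect the IDs seen by the current rank:

# Sketch: verify that ranks process disjoint samples in an epoch (call on every rank)
import torch.distributed as dist

def check_no_overlap(local_sample_ids):
    world_size = dist.get_world_size()
    gathered = [None] * world_size
    dist.all_gather_object(gathered, list(local_sample_ids))  # collect every rank's IDs
    all_ids = [sid for rank_ids in gathered for sid in rank_ids]
    if dist.get_rank() == 0:
        duplicates = len(all_ids) - len(set(all_ids))
        print(f"total={len(all_ids)} duplicates={duplicates}")  # expect duplicates == 0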

Exercises

These mirror the interactive exercises below. Try them here first, then submit your answers in the Exercise cards.

  1. Compute global batch and LR. 8 GPUs, per-GPU batch 16, grad accumulation 2. Base LR tuned for global batch 256. What are the global batch and recommended LR?
  2. Estimate AllReduce time. N=4, message=400 MB, bandwidth=25 GB/s, latency=5 μs. Approximate time per AllReduce?
  3. Memory fit with ZeRO. 2B params, ~16 bytes/param total, ZeRO-3 over 8 GPUs, plus 3 GB overhead per GPU. On 16 GB GPUs, does it fit? What’s the estimated memory per GPU?
Self-checklist
  • I can compute global batch and propose LR scaling.
  • I can estimate communication time and reason about bucket sizes.
  • I can budget memory with and without sharding.
  • I know when to choose data, model/tensor, or pipeline parallelism.
  • I can explain why DistributedSampler and set_epoch are required.

Mini challenge

You have 6 GPUs with 16 GB each and want to fine-tune a 1B-parameter model using Adam in mixed precision within 2 hours:

  • Propose a parallelism plan (data parallel only? add sharding? any pipeline?).
  • Pick a global batch and LR scaling rule. Justify.
  • Suggest 3 logging metrics to verify scaling efficiency (e.g., step time, AllReduce time, GPU utilization).
Suggested direction

Start with DDP + ZeRO-2 or ZeRO-3, choose global batch via per-GPU memory tests, and track comm/compute ratio. If step time stalls, investigate bucket sizes and overlapping comm/compute.
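
A minimal sketch of the kind of per-step timing that makes these metrics visible. It omits mixed precision and gradient accumulation for clarity, the backward timing is only a proxy for communication cost because DDP overlaps AllReduce with the backward pass, and tokens_in_batch is a placeholder for whatever throughput unit you track:

# Sketch: per-step timing for scaling-efficiency logging (step time, backward+comm time, throughput)
import time
import torch

def timed_step(model, batch, optimizer, tokens_in_batch):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    loss = model(batch).loss
    torch.cuda.synchronize()
    t_fwd = time.perf_counter()
    loss.backward()  # with DDP, gradient AllReduce overlaps with this call
    torch.cuda.synchronize()
    t_bwd = time.perf_counter()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    t_end = time.perf_counter()
    return {
        "step_s": t_end - t0,
        "fwd_s": t_fwd - t0,
        "bwd_plus_comm_s": t_bwd - t_fwd,
        "tokens_per_s": tokens_in_batch / (t_end - t0),
    }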

Practical projects

  • Scale a sequence classifier from 1 to 4 GPUs, maintaining accuracy while cutting wall-clock time by at least 2×.
  • Run a sharded fine-tune of a 1B-parameter model; report memory before and after sharding and the final validation score.
  • Implement gradient bucketing and measure its impact on step time for small vs large buckets (a starting-point sketch follows this list).
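
A starting point for the bucketing project, reusing build_model, device, and local_rank from the recipe above; bucket_cap_mb is DDP's gradient-bucket size knob (default 25 MB):

# Sketch: compare DDP step time across gradient bucket sizes
from torch.nn.parallel import DistributedDataParallel as DDP

for bucket_mb in (1, 25, 100):
    ddp_model = DDP(build_model().to(device), device_ids=[local_rank], bucket_cap_mb=bucket_mb)
    # run a few hundred timed training steps here and log the median step time per bucket size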

Learning path

  • Before: Mixed Precision Fundamentals → Efficient Dataloading.
  • Now: Distributed Training Basics (this lesson).
  • Next: Advanced Sharding and Pipeline Parallelism → Fault Tolerance and Checkpointing → Multi-node Performance Tuning.

Next steps

  • Try the exercises below, then take the Quick Test.
  • Note: The Quick Test is available to everyone; only logged-in users get saved progress and completion tracking.

Practice Exercises

3 exercises to complete

Instructions

You run DDP on 8 GPUs. Per-GPU batch size is 16. Gradient accumulation is 2. The base learning rate (LR) was tuned for a global batch of 256. Compute:

  1. The global batch size.
  2. The recommended LR using linear scaling from the 256 reference.
Expected Output
Global batch = 256; Recommended LR = base LR (unchanged).

Distributed Training Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

