Why this matters
NLP models like BERT and Llama are memory-hungry. Choosing the right batch size and using gradient accumulation lets you:
- Train larger models on limited GPUs without out-of-memory (OOM) errors.
- Stabilize training by simulating a larger batch size for smoother gradients.
- Speed up experimentation by balancing throughput, memory, and convergence.
- Reproduce results across single- and multi-GPU setups using consistent effective batch size.
Concept explained simply
Two key ideas:
- Micro-batch size: how many samples you process before calling backward().
- Gradient accumulation: sum gradients over several micro-batches, then take one optimizer step. This simulates a larger batch without holding all samples in memory.
Formula: effective/global batch size
Global batch size = micro_batch_size × accumulation_steps × number_of_devices.
Example: micro 8 × accum 4 × 2 GPUs = global 64.
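To make the arithmetic reusable, here is a minimal sketch of a helper (the name compute_accumulation_steps is ours, not from any library) that derives the accumulation steps needed for a target global batch, rounding up when the target is not evenly divisible:

import math

def compute_accumulation_steps(target_global_batch, micro_batch_size, num_devices=1):
    # accumulation_steps = global / (micro × devices), rounded up if not divisible
    return math.ceil(target_global_batch / (micro_batch_size * num_devices))

# The example above: micro 8, 2 GPUs, target global 64 -> 4 accumulation steps
assert compute_accumulation_steps(64, 8, 2) == 4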
Mental model
Imagine pouring water (gradients) from small cups (micro-batches) into a bucket. When the bucket is full (after accumulation_steps), you pour it out once (optimizer.step). Bigger bucket = smoother updates.
When to scale the loss
To average gradients across accumulated micro-batches, divide the loss by accumulation_steps before calling backward() (or equivalently average gradients after).
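In code this is a single line before the backward pass; a minimal sketch, assuming accumulation_steps is defined and loss is the current micro-batch loss tensor:

loss = loss / accumulation_steps  # average this micro-batch's contribution
loss.backward()                   # gradients accumulate in param.grad until optimizer.step()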
Worked examples
Example 1: Single GPU fine-tuning BERT
Goal: global batch 64 on a single 12GB GPU. You can fit micro-batch 8.
- Given: devices=1, micro=8, want global=64.
- accumulation_steps = 64 / (8 × 1) = 8.
- Call optimizer.step() every 8 micro-batches.
Example 2: Two GPUs with DDP
Goal: global 128. Each GPU fits micro 4.
- Given: devices=2, micro=4, want global=128.
- accumulation_steps = 128 / (4 × 2) = 16.
- Each GPU processes 4 × 16 = 64 samples per optimizer step; combined = 128.
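With DDP, every loss.backward() normally all-reduces gradients across GPUs. During accumulation you can optionally skip that communication on intermediate micro-batches using DDP's no_sync() context manager; a sketch, assuming ddp_model is a DistributedDataParallel wrapper that returns the loss and accum = 16:

import contextlib

for step, (inputs, labels) in enumerate(loader, start=1):
    # sync gradients only on the micro-batch that completes an optimizer step
    maybe_no_sync = ddp_model.no_sync() if step % accum != 0 else contextlib.nullcontext()
    with maybe_no_sync:
        loss = ddp_model(inputs, labels) / accum
        loss.backward()
    if step % accum == 0:
        optimizer.step()
        optimizer.zero_grad()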
Example 3: Longer sequences force smaller micro-batch
Baseline: micro 16 at sequence length 128. You switch to length 512 and now only micro 4 fits. You want to keep global 128 on 1 GPU.
- accumulation_steps = 128 / (4 × 1) = 32.
- If training is too slow, consider mixed precision and gradient checkpointing to free memory for a larger micro-batch and fewer accumulation steps.
How to apply (step-by-step)
- Pick a target global batch size that stabilizes training (typical NLP fine-tuning: 32–256).
- Find the largest micro-batch size that fits in memory without OOM.
- Compute accumulation_steps using the formula. Round up if not divisible.
- In your loop:
  - Call loss = loss / accumulation_steps, then loss.backward().
  - Every accumulation_steps iterations: clip gradients (optional), call optimizer.step(), then optimizer.zero_grad(), and step the LR scheduler once.
- Log steps in terms of optimizer steps (not micro-batches).
Minimal PyTorch pattern
accum = 8
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # no-op when use_amp is False
optimizer.zero_grad()
for step, batch in enumerate(loader, start=1):
    inputs, labels = batch
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = model(inputs, labels)  # assumes the model returns the loss directly
    loss = loss / accum               # average over accumulated micro-batches
    scaler.scale(loss).backward()     # gradients accumulate in .grad
    if step % accum == 0:
        if clip_val is not None:
            scaler.unscale_(optimizer)  # unscale before clipping raw gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_val)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()      # one scheduler step per optimizer step
        optimizer.zero_grad()
Practical parameter cheatsheet
- Global batch (fine-tuning Transformers): 32–256.
- Micro-batch: as large as fits; often 4–32 for sequence length 128–512.
- Accumulation steps: 1–32+ (higher if sequences are long).
- Learning rate: increase roughly proportionally with global batch; verify with a short LR range test.
- Warmup: define in optimizer steps (not micro-batches). E.g., 500 optimizer steps, not 500 iterations with accumulation.
- Gradient clipping: 0.5–1.0 commonly stabilizes fine-tuning.
- Mixed precision + gradient checkpointing can reduce accumulation needs.
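One way to keep these choices together is a small config object; a sketch with hypothetical names, applying the linear LR scaling rule relative to a reference batch (always verify with a short LR range test):

import math
from dataclasses import dataclass

@dataclass
class TrainConfig:
    target_global_batch: int = 64
    micro_batch_size: int = 8           # largest size that fits without OOM
    num_devices: int = 1
    base_lr: float = 2e-5               # tuned at reference_global_batch
    reference_global_batch: int = 32
    warmup_optimizer_steps: int = 500   # counted in optimizer steps, not micro-batches
    clip_val: float = 1.0

    @property
    def accumulation_steps(self):
        return math.ceil(self.target_global_batch / (self.micro_batch_size * self.num_devices))

    @property
    def scaled_lr(self):
        # linear scaling: LR grows roughly proportionally with global batch
        return self.base_lr * self.target_global_batch / self.reference_global_batch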
Common mistakes and self-checks
- Mistake: Not dividing loss by accumulation steps. Self-check: Log effective LR and gradient norms; if they explode after enabling accumulation, you likely forgot this.
- Mistake: Scheduling LR/warmup per micro-batch. Self-check: Ensure scheduler.step() runs only after optimizer.step().
- Mistake: Zeroing grads every micro-batch. Self-check: Grads should accumulate across micro-batches; only zero after optimizer.step().
- Mistake: Mismatched global batch between single- and multi-GPU runs. Self-check: Recompute using the formula whenever devices change.
- Mistake: Assuming BatchNorm will behave the same with tiny micro-batches. Note: Transformers use LayerNorm, so this is less of an issue in NLP.
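A quick way to run the first two self-checks is to log the global gradient norm and the current LR right before each optimizer step; a sketch, where opt_step is your own optimizer-step counter:

# passing max_norm=inf returns the total gradient norm without actually clipping
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
current_lr = optimizer.param_groups[0]["lr"]
print(f"optimizer step {opt_step}: grad_norm={grad_norm:.3f} lr={current_lr:.2e}")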
Quick self-diagnostics checklist
- Loss divided by accumulation_steps (or gradients averaged) before backward
- optimizer.step() and scheduler.step() only after N micro-batches
- optimizer.zero_grad() immediately after optimizer.step()
- Global batch recomputed when changing devices or micro-batch
- Warmup and logging defined in optimizer steps
Exercises
Do these, then check solutions below.
Exercise 1 — Compute accumulation steps and LR
You have 1 GPU. You can fit micro-batch = 8. You want global batch = 128. Your baseline learning rate for global 32 was 2e-5. Find accumulation_steps and a reasonable new LR.
Hint
- Use the global batch formula.
- LR often scales linearly with global batch (verify stability).
Exercise 2 — Write the accumulation loop
Write a minimal training loop that correctly implements gradient accumulation with loss scaling, gradient clipping, and stepping the scheduler only after an optimizer step.
Hint
- Divide loss by accumulation_steps before backward.
- Zero grads only after optimizer.step().
Checklist before you run
- Loss division or gradient averaging in place
- Optimizer and scheduler step frequency aligned
- Gradient clipping after unscale (if AMP)
- Logging based on optimizer steps
Mini challenge
Your NER fine-tuning OOMs at sequence length 512. You currently use micro 8, accum 4, 1 GPU, LR 3e-5. Propose a plan to keep global batch ≈ 64 while avoiding OOM, and list two checks to ensure training stays stable.
Possible approach
- Reduce micro to 2, increase accum to 32 (global = 2 × 32 × 1 = 64).
- Enable mixed precision and gradient checkpointing.
- Keep LR similar; verify with a short LR range test.
- Monitor gradient norms and validation loss for stability.
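If the model comes from Hugging Face Transformers, the memory-saving part of this plan could look like the sketch below; gradient_checkpointing_enable() is a Transformers method and may not exist on other model classes:

model.gradient_checkpointing_enable()   # trade extra compute for activation memory
scaler = torch.cuda.amp.GradScaler()    # mixed precision loss scaling
accum = 32                              # micro 2 × accum 32 × 1 GPU = global 64
for step, batch in enumerate(loader, start=1):
    with torch.cuda.amp.autocast():
        loss = model(**batch).loss / accum
    scaler.scale(loss).backward()
    if step % accum == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()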
Who this is for
- NLP Engineers fine-tuning or pretraining Transformer models.
- Data Scientists moving from CPU-only prototypes to GPU training.
Prerequisites
- Basic PyTorch or similar DL framework knowledge.
- Understanding of training loops, optimizers, and LR scheduling.
Learning path
- Start: Batch size and gradient accumulation (this page).
- Next: Mixed precision training and gradient scaling.
- Then: Distributed data parallel and gradient checkpointing.
- Finally: Scheduling strategies and throughput optimization.
Practical projects
- Project 1: Fine-tune a Transformer classifier while sweeping micro-batch and accumulation to match a target global batch; plot loss vs. wall-clock time.
- Project 2: Long-sequence QA: compare 128 vs. 512 token contexts; adjust accumulation to hold global batch constant; report convergence and stability.
- Project 3: Multi-GPU replication: reproduce single-GPU results on 2–4 GPUs by recomputing accumulation and verifying identical validation metrics.
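For Project 1, one way to enumerate sweep settings that hold the global batch constant is a small sketch like this (the candidate micro-batch sizes are just an example):

target_global, num_devices = 64, 1
candidate_micro = [4, 8, 16, 32]
sweep = [(m, target_global // (m * num_devices))
         for m in candidate_micro
         if target_global % (m * num_devices) == 0]
# -> [(4, 16), (8, 8), (16, 4), (32, 2)]: each (micro, accum) pair keeps global batch at 64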
Next steps
- Adopt mixed precision and gradient checkpointing to reduce memory.
- Profile data pipelines to avoid starving GPUs.
- Experiment with LR scaling rules and warmup in optimizer steps.
Quick Test