Why this matters
NLP models like BERT and Llama are memory-hungry. Choosing the right batch size and using gradient accumulation lets you:
- Train larger models on limited GPUs without out-of-memory (OOM) errors.
- Stabilize training by simulating a larger batch size for smoother gradients.
- Speed up experimentation by balancing throughput, memory, and convergence.
- Reproduce results across single- and multi-GPU setups using consistent effective batch size.
Concept explained simply
Two key ideas:
- Micro-batch size: how many samples you process before calling backward().
- Gradient accumulation: sum gradients over several micro-batches, then take one optimizer step. This simulates a larger batch without holding all samples in memory.
Formula: effective/global batch size
Global batch size = micro_batch_size × accumulation_steps × number_of_devices.
Example: micro 8 × accum 4 × 2 GPUs = global 64.
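To make the arithmetic reusable, here is a minimal sketch of a helper (the name compute_accumulation_steps is ours, not from any library) that derives the accumulation steps needed for a target global batch, rounding up when the target is not evenly divisible:

import math

def compute_accumulation_steps(target_global_batch, micro_batch_size, num_devices=1):
    # accumulation_steps = global / (micro × devices), rounded up if not divisible
    return math.ceil(target_global_batch / (micro_batch_size * num_devices))

# The example above: micro 8, 2 GPUs, target global 64 -> 4 accumulation steps
assert compute_accumulation_steps(64, 8, 2) == 4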
Mental model
Imagine pouring water (gradients) from small cups (micro-batches) into a bucket. When the bucket is full (after accumulation_steps), you pour it out once (optimizer.step). Bigger bucket = smoother updates.
When to scale the loss
To average gradients across accumulated micro-batches, divide the loss by accumulation_steps before calling backward() (or equivalently average gradients after).
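In code this is a single line before the backward pass; a minimal sketch, assuming accumulation_steps is defined and loss is the current micro-batch loss tensor:

loss = loss / accumulation_steps  # average this micro-batch's contribution
loss.backward()                   # gradients accumulate in param.grad until optimizer.step()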
Worked examples
Example 1: Single GPU fine-tuning BERT
Goal: global batch 64 on a single 12GB GPU. You can fit micro-batch 8.
- Given: devices=1, micro=8, want global=64.
- accumulation_steps = 64 / (8 × 1) = 8.
- Call optimizer.step() every 8 micro-batches.
Example 2: Two GPUs with DDP
Goal: global 128. Each GPU fits micro 4.
- Given: devices=2, micro=4, want global=128.
- accumulation_steps = 128 / (4 × 2) = 16.
- Each GPU processes 4 × 16 = 64 samples per optimizer step; combined = 128.
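With DDP, every loss.backward() normally all-reduces gradients across GPUs. During accumulation you can optionally skip that communication on intermediate micro-batches using DDP's no_sync() context manager; a sketch, assuming ddp_model is a DistributedDataParallel wrapper that returns the loss and accum = 16:

import contextlib

for step, (inputs, labels) in enumerate(loader, start=1):
    # sync gradients only on the micro-batch that completes an optimizer step
    maybe_no_sync = ddp_model.no_sync() if step % accum != 0 else contextlib.nullcontext()
    with maybe_no_sync:
        loss = ddp_model(inputs, labels) / accum
        loss.backward()
    if step % accum == 0:
        optimizer.step()
        optimizer.zero_grad()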
Example 3: Longer sequences force smaller micro-batch
Baseline: micro 16 at sequence length 128. You switch to length 512 and now only micro 4 fits. You want to keep global 128 on 1 GPU.
- accumulation_steps = 128 / (4 × 1) = 32.
- If training is too slow, consider mixed precision and gradient checkpointing to free memory for a larger micro-batch and fewer accumulation steps.
How to apply (step-by-step)
- Pick a target global batch size that stabilizes training (typical NLP fine-tuning: 32–256).
- Find the largest micro-batch size that fits in memory without OOM.
- Compute accumulation_steps using the formula. Round up if not divisible.
- In your loop:
  - Call loss = loss / accumulation_steps, then loss.backward().
  - Every accumulation_steps iterations: clip gradients (optional), call optimizer.step(), then optimizer.zero_grad(), and step the LR scheduler once.
- Log steps in terms of optimizer steps (not micro-batches).
Minimal PyTorch pattern
accum = 8
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # no-op when use_amp is False
optimizer.zero_grad()
for step, batch in enumerate(loader, start=1):
    inputs, labels = batch
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = model(inputs, labels)  # assumes the model returns the loss directly
    loss = loss / accum               # average over accumulated micro-batches
    scaler.scale(loss).backward()     # gradients accumulate in .grad
    if step % accum == 0:
        if clip_val is not None:
            scaler.unscale_(optimizer)  # unscale before clipping raw gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_val)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()      # one scheduler step per optimizer step
        optimizer.zero_grad()
Practical parameter cheatsheet
- Global batch (fine-tuning Transformers): 32–256.
- Micro-batch: as large as fits; often 4–32 for sequence length 128–512.
- Accumulation steps: 1–32+ (higher if sequences are long).
- Learning rate: increase roughly proportionally with global batch; verify with a short LR range test.
- Warmup: define in optimizer steps (not micro-batches). E.g., 500 optimizer steps, not 500 iterations with accumulation.
- Gradient clipping: 0.5–1.0 commonly stabilizes fine-tuning.
- Mixed precision + gradient checkpointing can reduce accumulation needs.
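One way to keep these choices together is a small config object; a sketch with hypothetical names, applying the linear LR scaling rule relative to a reference batch (always verify with a short LR range test):

import math
from dataclasses import dataclass

@dataclass
class TrainConfig:
    target_global_batch: int = 64
    micro_batch_size: int = 8           # largest size that fits without OOM
    num_devices: int = 1
    base_lr: float = 2e-5               # tuned at reference_global_batch
    reference_global_batch: int = 32
    warmup_optimizer_steps: int = 500   # counted in optimizer steps, not micro-batches
    clip_val: float = 1.0

    @property
    def accumulation_steps(self):
        return math.ceil(self.target_global_batch / (self.micro_batch_size * self.num_devices))

    @property
    def scaled_lr(self):
        # linear scaling: LR grows roughly proportionally with global batch
        return self.base_lr * self.target_global_batch / self.reference_global_batch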
Common mistakes and self-checks
- Mistake: Not dividing loss by accumulation steps. Self-check: Log effective LR and gradient norms; if they explode after enabling accumulation, you likely forgot this.
- Mistake: Scheduling LR/warmup per micro-batch. Self-check: Ensure scheduler.step() runs only after optimizer.step().
- Mistake: Zeroing grads every micro-batch. Self-check: Grads should accumulate across micro-batches; only zero after optimizer.step().
- Mistake: Mismatched global batch between single- and multi-GPU runs. Self-check: Recompute using the formula whenever devices change.
- Mistake: Assuming BatchNorm will behave the same with tiny micro-batches. Note: Transformers use LayerNorm, so this is less of an issue in NLP.
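A quick way to run the first two self-checks is to log the global gradient norm and the current LR right before each optimizer step; a sketch, where opt_step is your own optimizer-step counter:

# passing max_norm=inf returns the total gradient norm without actually clipping
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
current_lr = optimizer.param_groups[0]["lr"]
print(f"optimizer step {opt_step}: grad_norm={grad_norm:.3f} lr={current_lr:.2e}")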
Quick self-diagnostics checklist
- Loss divided by accumulation_steps (or gradients averaged) before backward
- optimizer.step() and scheduler.step() only after N micro-batches
- optimizer.zero_grad() immediately after optimizer.step()
- Global batch recomputed when changing devices or micro-batch
- Warmup and logging defined in optimizer steps
Exercises
Do these, then check solutions below.
Exercise 1 — Compute accumulation steps and LR
You have 1 GPU. You can fit micro-batch = 8. You want global batch = 128. Your baseline learning rate for global 32 was 2e-5. Find accumulation_steps and a reasonable new LR.
Hint
- Use the global batch formula.
- LR often scales linearly with global batch (verify stability).
Exercise 2 — Write the accumulation loop
Write a minimal training loop that correctly implements gradient accumulation with loss scaling, gradient clipping, and stepping the scheduler only after an optimizer step.
Hint
- Divide loss by accumulation_steps before backward.
- Zero grads only after optimizer.step().
Checklist before you run
- Loss division or gradient averaging in place
- Optimizer and scheduler step frequency aligned
- Gradient clipping after unscale (if AMP)
- Logging based on optimizer steps
Mini challenge
Your NER fine-tuning OOMs at sequence length 512. You currently use micro 8, accum 4, 1 GPU, LR 3e-5. Propose a plan to keep global batch ≈ 64 while avoiding OOM, and list two checks to ensure training stays stable.
Possible approach
- Reduce micro to 2, increase accum to 32 (global = 2 × 32 × 1 = 64).
- Enable mixed precision and gradient checkpointing.
- Keep LR similar; verify with a short LR range test.
- Monitor gradient norms and validation loss for stability.
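If the model comes from Hugging Face Transformers, the memory-saving part of this plan could look like the sketch below; gradient_checkpointing_enable() is a Transformers method and may not exist on other model classes:

model.gradient_checkpointing_enable()   # trade extra compute for activation memory
scaler = torch.cuda.amp.GradScaler()    # mixed precision loss scaling
accum = 32                              # micro 2 × accum 32 × 1 GPU = global 64
for step, batch in enumerate(loader, start=1):
    with torch.cuda.amp.autocast():
        loss = model(**batch).loss / accum
    scaler.scale(loss).backward()
    if step % accum == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()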
Who this is for
- NLP Engineers fine-tuning or pretraining Transformer models.
- Data Scientists moving from CPU-only prototypes to GPU training.
Prerequisites
- Basic PyTorch or similar DL framework knowledge.
- Understanding of training loops, optimizers, and LR scheduling.
Learning path
- Start: Batch size and gradient accumulation (this page).
- Next: Mixed precision training and gradient scaling.
- Then: Distributed data parallel and gradient checkpointing.
- Finally: Scheduling strategies and throughput optimization.
Practical projects
- Project 1: Fine-tune a Transformer classifier while sweeping micro-batch and accumulation to match a target global batch; plot loss vs. wall-clock time.
- Project 2: Long-sequence QA: compare 128 vs. 512 token contexts; adjust accumulation to hold global batch constant; report convergence and stability.
- Project 3: Multi-GPU replication: reproduce single-GPU results on 2–4 GPUs by recomputing accumulation and verifying identical validation metrics.
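For Project 1, one way to enumerate sweep settings that hold the global batch constant is a small sketch like this (the candidate micro-batch sizes are just an example):

target_global, num_devices = 64, 1
candidate_micro = [4, 8, 16, 32]
sweep = [(m, target_global // (m * num_devices))
         for m in candidate_micro
         if target_global % (m * num_devices) == 0]
# -> [(4, 16), (8, 8), (16, 4), (32, 2)]: each (micro, accum) pair keeps global batch at 64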
Next steps
- Adopt mixed precision and gradient checkpointing to reduce memory.
- Profile data pipelines to avoid starving GPUs.
- Experiment with LR scaling rules and warmup in optimizer steps.
Quick Test