
Batch Size And Gradient Accumulation

Learn Batch Size And Gradient Accumulation for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

NLP models like BERT and Llama are memory-hungry. Choosing the right batch size and using gradient accumulation lets you:

  • Train larger models on limited GPUs without out-of-memory (OOM) errors.
  • Stabilize training by simulating a larger batch size for smoother gradients.
  • Speed up experimentation by balancing throughput, memory, and convergence.
  • Reproduce results across single- and multi-GPU setups using consistent effective batch size.

Concept explained simply

Two key ideas:

  • Micro-batch size: how many samples you process before calling backward().
  • Gradient accumulation: sum gradients over several micro-batches, then take one optimizer step. This simulates a larger batch without holding all samples in memory.

Formula: effective/global batch size

Global batch size = micro_batch_size × accumulation_steps × number_of_devices.

Example: micro 8 × accum 4 × 2 GPUs = global 64.
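
A tiny helper makes the arithmetic explicit (a sketch with made-up function names; math.ceil handles targets that do not divide evenly):

import math

def global_batch(micro_batch, accum_steps, devices=1):
    """Effective batch size seen by each optimizer step."""
    return micro_batch * accum_steps * devices

def accum_steps_for(target_global, micro_batch, devices=1):
    """Accumulation steps needed to reach a target global batch (rounded up)."""
    return math.ceil(target_global / (micro_batch * devices))

print(global_batch(8, 4, 2))        # 64, matching the example above
print(accum_steps_for(64, 8, 1))    # 8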

Mental model

Imagine pouring water (gradients) from small cups (micro-batches) into a bucket. When the bucket is full (after accumulation_steps), you pour it out once (optimizer.step). Bigger bucket = smoother updates.

When to scale the loss

To average gradients across accumulated micro-batches, divide the loss by accumulation_steps before calling backward() (or equivalently average gradients after).
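
In code, the scaling is a single line before backward() (a fragment only; compute_loss is a hypothetical helper, loader/model/optimizer are assumed defined, and the full loop appears under "Minimal PyTorch pattern" below):

accumulation_steps = 8
for i, batch in enumerate(loader, start=1):
    loss = compute_loss(model, batch)          # hypothetical helper returning a scalar loss
    (loss / accumulation_steps).backward()     # divide before backward so gradients average out
    if i % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()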

Worked examples

Example 1: Single GPU fine-tuning BERT

Goal: global batch 64 on a single 12GB GPU. You can fit micro-batch 8.

  • Given: devices=1, micro=8, want global=64.
  • accumulation_steps = 64 / (8 × 1) = 8.
  • Call optimizer.step() every 8 micro-batches.

Example 2: Two GPUs with DDP

Goal: global 128. Each GPU fits micro 4.

  • Given: devices=2, micro=4, want global=128.
  • accumulation_steps = 128 / (4 × 2) = 16.
  • Each GPU processes 4 × 16 = 64 samples per optimizer step; combined = 128.
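
With DDP, every backward() normally triggers a gradient all-reduce across GPUs. A common optimization is to skip that sync on micro-batches that do not end an accumulation window. A minimal sketch, assuming ddp_model is already wrapped in DistributedDataParallel and loader/optimizer exist:

from contextlib import nullcontext

accum = 16   # 128 / (4 * 2) from this example

for step, (inputs, labels) in enumerate(loader, start=1):
    is_update_step = (step % accum == 0)
    # no_sync() defers the cross-GPU all-reduce until the window's last micro-batch.
    ctx = nullcontext() if is_update_step else ddp_model.no_sync()
    with ctx:
        loss = ddp_model(inputs, labels) / accum   # model assumed to return the loss
        loss.backward()
    if is_update_step:
        optimizer.step()
        optimizer.zero_grad()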

Example 3: Longer sequences force smaller micro-batch

Baseline: micro 16 at sequence length 128. You switch to length 512 and now only micro 4 fits. You want to keep global 128 on 1 GPU.

  • accumulation_steps = 128 / (4 × 1) = 32.
  • Consider mixed precision and gradient checkpointing to reduce steps if training is too slow.
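
Both memory savers are usually one-liners. A sketch assuming a Hugging Face Transformers model (bert-base-uncased here; swap in your own checkpoint):

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.gradient_checkpointing_enable()   # trade extra compute for much lower activation memory
model.cuda()

scaler = torch.cuda.amp.GradScaler()    # pair with torch.cuda.amp.autocast in the training loop

With the memory this frees, you can often raise the micro-batch and lower accumulation_steps to win back wall-clock speed.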

How to apply (step-by-step)

  1. Pick a target global batch size that stabilizes training (typical NLP fine-tuning: 32–256).
  2. Find the largest micro-batch size that fits in memory without OOM.
  3. Compute accumulation_steps using the formula. Round up if not divisible.
  4. In your loop:
    • Call loss = loss / accumulation_steps, then loss.backward().
    • Every accumulation_steps iterations: clip gradients (optional), optimizer.step(), optimizer.zero_grad(), and update LR scheduler once.
  5. Log steps in terms of optimizer steps (not micro-batches).

Minimal PyTorch pattern
import torch

# model, optimizer, scheduler, and loader are assumed to be defined already.
accum = 8            # micro-batches per optimizer step
use_amp = True       # mixed precision on/off
clip_val = 1.0       # max gradient norm; set to None to skip clipping
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

optimizer.zero_grad()
for step, batch in enumerate(loader, start=1):
    inputs, labels = batch
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = model(inputs, labels)   # model is assumed to return the loss
        loss = loss / accum            # average gradients over the accumulation window
    scaler.scale(loss).backward()      # gradients accumulate; no zero_grad here
    if step % accum == 0:
        if clip_val is not None:
            scaler.unscale_(optimizer)  # unscale before clipping under AMP
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_val)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()               # one scheduler step per optimizer step
        optimizer.zero_grad()
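
If len(loader) is not divisible by accum, the gradients accumulated after the last full window are never applied; either drop that remainder deliberately or take one extra optimizer step after the loop.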

Practical parameter cheatsheet

  • Global batch (fine-tuning Transformers): 32–256.
  • Micro-batch: as large as fits; often 4–32 for sequence length 128–512.
  • Accumulation steps: 1–32+ (higher if sequences are long).
  • Learning rate: increase roughly proportionally with global batch; verify with a short LR range test.
  • Warmup: define in optimizer steps (not micro-batches), e.g., 500 optimizer steps rather than 500 micro-batch iterations.
  • Gradient clipping: 0.5–1.0 commonly stabilizes fine-tuning.
  • Mixed precision + gradient checkpointing can reduce accumulation needs.
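
Putting the cheatsheet together, a hedged sketch (assuming the transformers scheduler helper and that model, loader, and num_epochs already exist; the 2e-5-at-global-32 baseline is just an example):

import torch
from transformers import get_linear_schedule_with_warmup

micro_batch, accum, devices = 8, 8, 1
global_batch = micro_batch * accum * devices   # 64
lr = 2e-5 * (global_batch / 32)                # linear scaling from a base of 32; verify it stays stable

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
steps_per_epoch = len(loader) // accum         # count optimizer steps, not micro-batches
total_steps = steps_per_epoch * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=total_steps
)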

Common mistakes and self-checks

  • Mistake: Not dividing loss by accumulation steps. Self-check: Log effective LR and gradient norms; if they explode after enabling accumulation, you likely forgot this.
  • Mistake: Scheduling LR/warmup per micro-batch. Self-check: Ensure scheduler.step() runs only after optimizer.step().
  • Mistake: Zeroing grads every micro-batch. Self-check: Grads should accumulate across micro-batches; only zero after optimizer.step.
  • Mistake: Mismatched global batch between single- and multi-GPU runs. Self-check: Recompute using the formula whenever devices change.
  • Mistake: Assuming BatchNorm will behave the same with tiny micro-batches. Note: Transformers use LayerNorm, so this is less of an issue in NLP.
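
One way to run the gradient-norm self-check is a small hypothetical helper, called once per optimizer step (under AMP, call it after scaler.unscale_(optimizer) so norms are in real units):

import torch

def log_step_stats(model, optimizer, opt_step):
    """Print the total gradient L2 norm and the current learning rate."""
    norms = [p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None]
    total_norm = torch.norm(torch.stack(norms)).item()
    lr = optimizer.param_groups[0]["lr"]
    print(f"optimizer step {opt_step}: grad_norm={total_norm:.3f}, lr={lr:.2e}")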

Quick self-diagnostics checklist
  • Loss divided by accumulation_steps (or gradients averaged) before backward
  • optimizer.step() and scheduler.step() only after N micro-batches
  • optimizer.zero_grad() immediately after optimizer.step()
  • Global batch recomputed when changing devices or micro-batch
  • Warmup and logging defined in optimizer steps

Exercises

Do these, then check solutions below. The same tasks appear in the Exercises panel for tracking.

Exercise 1 — Compute accumulation steps and LR

You have 1 GPU. You can fit micro-batch = 8. You want global batch = 128. Your baseline learning rate for global 32 was 2e-5. Find accumulation_steps and a reasonable new LR.

Hint
  • Use the global batch formula.
  • LR often scales linearly with global batch (verify stability).

Exercise 2 — Write the accumulation loop

Write a minimal training loop that correctly implements gradient accumulation with loss scaling, gradient clipping, and stepping the scheduler only after an optimizer step.

Hint
  • Divide loss by accumulation_steps before backward.
  • Zero grads only after optimizer.step().

Checklist before you run
  • Loss division or gradient averaging in place
  • Optimizer and scheduler step frequency aligned
  • Gradient clipping after unscale (if AMP)
  • Logging based on optimizer steps

Mini challenge

Your NER fine-tuning OOMs at sequence length 512. You currently use micro 8, accum 8, 1 GPU, LR 3e-5. Propose a plan to keep global batch ≈ 64 while avoiding OOM, and list two checks to ensure training stays stable.

Possible approach
  • Reduce micro to 2, increase accum to 32 (global = 2 × 32 × 1 = 64).
  • Enable mixed precision and gradient checkpointing.
  • Keep LR similar; verify with a short LR range test.
  • Monitor gradient norms and validation loss for stability.

Who this is for

  • NLP Engineers fine-tuning or pretraining Transformer models.
  • Data Scientists moving from CPU-only prototypes to GPU training.

Prerequisites

  • Basic PyTorch or similar DL framework knowledge.
  • Understanding of training loops, optimizers, and LR scheduling.

Learning path

  • Start: Batch size and gradient accumulation (this page).
  • Next: Mixed precision training and gradient scaling.
  • Then: Distributed data parallel and gradient checkpointing.
  • Finally: Scheduling strategies and throughput optimization.

Practical projects

  • Project 1: Fine-tune a Transformer classifier while sweeping micro-batch and accumulation to match a target global batch; plot loss vs. wall-clock time.
  • Project 2: Long-sequence QA: compare 128 vs. 512 token contexts; adjust accumulation to hold global batch constant; report convergence and stability.
  • Project 3: Multi-GPU replication: reproduce single-GPU results on 2–4 GPUs by recomputing accumulation and verifying identical validation metrics.

Next steps

  • Adopt mixed precision and gradient checkpointing to reduce memory.
  • Profile data pipelines to avoid starving GPUs.
  • Experiment with LR scaling rules and warmup in optimizer steps.

Quick Test

The quick test below is available to everyone.

Practice Exercises

2 exercises to complete

Instructions

You have 1 GPU. You can fit micro-batch = 8. You want global batch = 128. Your baseline learning rate for global 32 was 2e-5. Compute:

  • accumulation_steps
  • a reasonable new LR using linear scaling

Expected Output
accumulation_steps = 16; new_lr ≈ 8e-5
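
The arithmetic can be double-checked in a couple of lines:

import math

accumulation_steps = math.ceil(128 / (8 * 1))   # 16
new_lr = 2e-5 * (128 / 32)                      # 8e-05 under linear scaling
print(accumulation_steps, new_lr)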

Batch Size And Gradient Accumulation — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

