Why this skill matters for an NLP Engineer
Training and optimization turn ideas into high-performing NLP models. As an NLP Engineer, you will fine-tune transformers, optimize data pipelines, reduce GPU cost, and ship models that converge faster and generalize better. Mastering this skill lets you:
- Cut training time and cost with efficient loops, batching, and mixed precision.
- Improve model quality using tuning, early stopping, and robust evaluation.
- Scale to large datasets/teams with distributed training and reproducible configs.
What you will learn
- Build fast, fault-tolerant training loops with clear logging and checkpoints.
- Use mixed precision safely to speed up training without hurting accuracy.
- Grow effective batch size via gradient accumulation and tune it for stability.
- Apply early stopping and model selection to avoid overfitting.
- Run basic hyperparameter searches that actually move metrics.
- Start distributed training the right way and keep it reproducible.
Prerequisites
- Comfortable with Python and PyTorch (modules, optimizers, DataLoader).
- Know standard NLP tasks (classification, token classification, seq2seq).
- Basic GPU concepts (CUDA device, VRAM limits).
Learning path (milestones)
- Efficient training loop: Clean epochs, metrics, logging, checkpointing.
- Batching & accumulation: Increase throughput while staying within VRAM.
- Mixed precision: Autocast + GradScaler for faster training.
- Early stopping & selection: Track validation and save the best.
- Hyperparameter tuning: Small, disciplined search that fits your budget.
- Distributed training basics: DDP setup and common pitfalls.
- Reproducibility: Seeds, configs, and deterministic flags.
- GPU cost management: Monitor utilization, schedule jobs, avoid idle time.
Worked examples
1) A clean, efficient PyTorch training loop (NLP classification)
import torch, time
from torch import nn
from torch.utils.data import DataLoader

# Dummy dataset: each item is a dict with input_ids, attention_mask, and labels
class ToyTextDataset(torch.utils.data.Dataset):
    def __init__(self, n=1024, seq_len=128):
        self.X = torch.randint(0, 1000, (n, seq_len))
        self.attn = torch.ones(n, seq_len)
        self.y = torch.randint(0, 2, (n,))

    def __len__(self):
        return len(self.y)

    def __getitem__(self, i):
        return {
            'input_ids': self.X[i],
            'attention_mask': self.attn[i],
            'labels': self.y[i]
        }

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Simple model (embedding + mean-pool + linear classifier)
class TinyClassifier(nn.Module):
    def __init__(self, vocab=1000, d_model=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.fc = nn.Linear(d_model, 2)

    def forward(self, input_ids, attention_mask=None, labels=None):
        x = self.emb(input_ids)   # [B, L, D]
        x = x.mean(dim=1)         # mean-pool -> [B, D]
        logits = self.fc(x)
        loss = None
        if labels is not None:
            loss = nn.CrossEntropyLoss()(logits, labels)
        return logits, loss

train_dl = DataLoader(ToyTextDataset(2048), batch_size=64, shuffle=True, pin_memory=True)
val_dl = DataLoader(ToyTextDataset(512), batch_size=64, shuffle=False, pin_memory=True)

model = TinyClassifier().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=5)

best_val = float('inf')
best_path = 'best.pt'

for epoch in range(5):
    model.train()
    t0 = time.time()
    total_loss = 0.0
    for batch in train_dl:
        ids = batch['input_ids'].to(device, non_blocking=True)
        mask = batch['attention_mask'].to(device, non_blocking=True)
        y = batch['labels'].to(device, non_blocking=True)
        opt.zero_grad(set_to_none=True)
        _, loss = model(ids, mask, y)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        opt.step()
        total_loss += loss.item()
    scheduler.step()

    # Validation
    model.eval()
    val_loss, correct, count = 0.0, 0, 0
    with torch.inference_mode():
        for batch in val_dl:
            ids = batch['input_ids'].to(device, non_blocking=True)
            mask = batch['attention_mask'].to(device, non_blocking=True)
            y = batch['labels'].to(device, non_blocking=True)
            logits, loss = model(ids, mask, y)
            val_loss += loss.item()
            pred = logits.argmax(dim=-1)
            correct += (pred == y).sum().item()
            count += y.numel()

    avg_train = total_loss / len(train_dl)
    avg_val = val_loss / len(val_dl)
    acc = correct / count
    print(f"Epoch {epoch+1}: train {avg_train:.4f} | val {avg_val:.4f} | acc {acc:.3f} | time {time.time()-t0:.1f}s")

    # Checkpoint best
    if avg_val < best_val:
        best_val = avg_val
        torch.save({'model': model.state_dict(), 'val_loss': best_val}, best_path)
        print('Saved best checkpoint')
Notes: use pin_memory, non_blocking transfers, set_to_none=True for zero_grad, and gradient clipping for stability.
2) Mixed precision with autocast and GradScaler
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler(enabled=torch.cuda.is_available())

for epoch in range(5):
    model.train()
    for batch in train_dl:
        ids = batch['input_ids'].to(device, non_blocking=True)
        mask = batch['attention_mask'].to(device, non_blocking=True)
        y = batch['labels'].to(device, non_blocking=True)
        opt.zero_grad(set_to_none=True)
        with autocast(enabled=torch.cuda.is_available()):
            _, loss = model(ids, mask, y)
        scaler.scale(loss).backward()
        scaler.unscale_(opt)  # unscale before clipping so the norm is measured correctly
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(opt)
        scaler.update()
Use autocast on the forward pass; scale the loss for backward and unscale before clipping and the optimizer step. Disable AMP on CPU or if it causes numerical instability.
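On recent PyTorch releases the torch.cuda.amp entry points are deprecated in favor of the torch.amp namespace. A minimal sketch of the same pattern, assuming PyTorch 2.3 or newer (where GradScaler accepts a device string):

# Same AMP pattern via torch.amp (assumes PyTorch >= 2.3)
scaler = torch.amp.GradScaler('cuda', enabled=torch.cuda.is_available())
for batch in train_dl:
    ids = batch['input_ids'].to(device, non_blocking=True)
    mask = batch['attention_mask'].to(device, non_blocking=True)
    y = batch['labels'].to(device, non_blocking=True)
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type='cuda', enabled=torch.cuda.is_available()):
        _, loss = model(ids, mask, y)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()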
3) Gradient accumulation for larger effective batch size
# Reuses scaler and autocast from the mixed-precision setup above
accum_steps = 4  # effective_batch = batch_size * accum_steps
step = 0

for epoch in range(3):
    model.train()
    opt.zero_grad(set_to_none=True)
    for batch in train_dl:
        ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        y = batch['labels'].to(device)
        with autocast(enabled=torch.cuda.is_available()):
            _, loss = model(ids, mask, y)
        loss = loss / accum_steps  # keep gradient magnitudes consistent
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.unscale_(opt)
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(opt)
            scaler.update()
            opt.zero_grad(set_to_none=True)
        step += 1
Divide loss by accum_steps to keep gradient magnitudes consistent.
4) Early stopping and best-model selection
patience = 2
best_val = float('inf')
wait = 0

for epoch in range(20):
    # train ...
    # compute avg_val ...
    if avg_val + 1e-6 < best_val:
        best_val = avg_val
        torch.save({'model': model.state_dict(), 'epoch': epoch, 'val': best_val}, 'best.pt')
        wait = 0
        print('New best, saved!')
    else:
        wait += 1
        if wait >= patience:
            print('Early stopping triggered')
            break

# Load best for final evaluation
state = torch.load('best.pt', map_location=device)
model.load_state_dict(state['model'])
Early stopping with patience prevents overfitting and saves compute. Always reload the best checkpoint before final reporting.
5) Small, budget-aware hyperparameter search
import itertools

search_space = {
    'lr': [5e-5, 1e-4, 3e-4],
    'batch': [16, 32],
    'wd': [0.0, 0.01]
}
trials = list(itertools.product(search_space['lr'], search_space['batch'], search_space['wd']))
results = []

for lr, bsz, wd in trials:
    # Re-init model/opt each trial for fairness
    model = TinyClassifier().to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    train_dl = DataLoader(ToyTextDataset(1024), batch_size=bsz, shuffle=True)

    # Short training for comparison
    for epoch in range(2):
        model.train()
        for batch in train_dl:
            ids = batch['input_ids'].to(device)
            mask = batch['attention_mask'].to(device)
            y = batch['labels'].to(device)
            opt.zero_grad(set_to_none=True)
            _, loss = model(ids, mask, y)
            loss.backward()
            opt.step()

    # Quick validation (val_dl from the first example)
    model.eval()
    val_loss = 0.0
    with torch.inference_mode():
        for batch in val_dl:
            ids = batch['input_ids'].to(device)
            mask = batch['attention_mask'].to(device)
            y = batch['labels'].to(device)
            _, loss = model(ids, mask, y)
            val_loss += loss.item()
    avg_val = val_loss / len(val_dl)
    results.append({'lr': lr, 'batch': bsz, 'wd': wd, 'val': avg_val})
    print(results[-1])

best = min(results, key=lambda r: r['val'])
print('Best trial:', best)
Use short, consistent budgets to compare trials fairly.
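If the full grid exceeds your budget, random search over the same ranges often finds good settings in fewer trials (the drills below suggest 6-9). A minimal sketch; the sampling ranges are illustrative, not prescribed:

import random
random.seed(0)

def sample_trial():
    return {
        'lr': 10 ** random.uniform(-5, -3.5),  # log-uniform between 1e-5 and ~3e-4
        'batch': random.choice([16, 32]),
        'wd': random.choice([0.0, 0.01]),
    }

trials = [sample_trial() for _ in range(8)]  # 6-9 trials fits a small budget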
6) DistributedDataParallel (DDP) skeleton
# Launch with: torchrun --nproc_per_node=NUM_GPUS train.py
import os, torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

local_rank = int(os.environ.get('LOCAL_RANK', 0))
dist.init_process_group('nccl')
torch.cuda.set_device(local_rank)
device = torch.device('cuda', local_rank)

# TinyClassifier and ToyTextDataset come from the first worked example
model = TinyClassifier().to(device)
model = DDP(model, device_ids=[local_rank])
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Use DistributedSampler so each rank sees a unique data subset
train_ds = ToyTextDataset(4096)
train_sampler = DistributedSampler(train_ds)
train_dl = DataLoader(train_ds, batch_size=32, sampler=train_sampler)

for epoch in range(3):
    train_sampler.set_epoch(epoch)  # reshuffle the split across ranks each epoch
    model.train()
    for batch in train_dl:
        ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        y = batch['labels'].to(device)
        opt.zero_grad(set_to_none=True)
        _, loss = model(ids, mask, y)
        loss.backward()
        opt.step()

dist.barrier()
dist.destroy_process_group()
Key points: use a DistributedSampler, call set_epoch every epoch, and set the CUDA device from LOCAL_RANK.
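One more common pitfall: every DDP process runs the same script, so guard checkpointing (and other file writes) so that only rank 0 saves. A minimal sketch assuming the DDP-wrapped model above; the filename is illustrative:

# Save once per job, from rank 0 only; unwrap DDP via model.module
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), 'ddp_best.pt')
dist.barrier()  # keep other ranks from racing ahead while rank 0 writes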
Drills and exercises
- Replace CrossEntropy with label smoothing and compare validation accuracy.
- Benchmark training with and without pin_memory and non_blocking transfers.
- Find the largest batch size that fits in GPU memory; then match its effective size using gradient accumulation.
- Turn on AMP and verify comparable or better validation metrics after full training.
- Add gradient clipping and observe its effect on loss spikes.
- Implement early stopping and show fewer epochs with similar final accuracy.
- Run a 6–9 trial random search; record best hyperparameters and time spent.
- Seed everything and confirm run-to-run stability (same metrics within tolerance).
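For the seeding drill, a minimal helper covering the Python, NumPy, and PyTorch RNGs (the set_seed name is illustrative; the cuDNN flags trade some speed for determinism):

import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: more deterministic (and usually slower) cuDNN behavior
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)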
Common mistakes and debugging tips
- Forgetting model.eval(): Validation metrics drift. Always switch to eval and use inference_mode.
- Mixed devices: Tensors on CPU and model on GPU cause slowdowns or errors. Move all inputs to the same device.
- No loss scaling with AMP: Underflowed gradients. Use GradScaler.
- Skipping DistributedSampler: Data duplication in DDP. Use sampler and call set_epoch each epoch.
- Not averaging loss over accumulation: Exploding gradients. Divide loss by accum_steps.
- Learning rate too high: Divergence or loss spikes. Start modest (e.g., 3e-4) and add a scheduler.
- Overfitting without early stopping: Keep best checkpoint and monitor validation loss, not just accuracy.
Quick debugging checklist
- Print a single batch: shapes, dtypes, device.
- Run 1 step with a tiny subset to verify forward/backward works.
- Check gradients for NaN/Inf; lower LR or add clipping.
- Profile data loading: if GPU utilization is low, increase num_workers or enable prefetching (see the sketch after this checklist).
- Log LR per step; confirm schedulers behave as expected.
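For the data-loading and LR items above, a sketch of throughput-oriented DataLoader settings and per-step LR logging, reusing ToyTextDataset and scheduler from the first example; the worker counts are starting points to benchmark, not universal values:

# persistent_workers and prefetch_factor require num_workers > 0
train_dl = DataLoader(
    ToyTextDataset(2048),
    batch_size=64,
    shuffle=True,
    num_workers=4,            # parallel CPU workers feeding the GPU
    pin_memory=True,          # pairs with .to(device, non_blocking=True)
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=2,        # batches prefetched per worker
)

# Log the scheduler's learning rate each optimizer step
print('lr:', scheduler.get_last_lr()[0])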
Mini project: Fine-tune a small text classifier under a tight GPU budget
Goal: Train a compact classifier on a labeled CSV with two columns (text, label), keeping GPU usage predictable and cost-effective.
- Data: Load your CSV, tokenize to fixed length, and create train/val splits.
- Loop: Implement the efficient training template with logging and checkpoints.
- Batching: Start with the largest batch that fits; then test gradient accumulation to match or exceed it.
- Precision: Enable AMP; verify metrics vs. FP32.
- Early stop: Use patience=2–3 and save the best model.
- Tuning: Try 6–9 trials over lr, weight decay, and max sequence length.
- Repro: Fix seeds and save a config file (JSON/YAML) with all hyperparameters (a config sketch follows this list).
- Report: Summarize best config, final accuracy/F1, total training time, and GPU utilization notes.
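For the repro step, a minimal config-tracking sketch; the keys and filename pattern are illustrative and should mirror whatever hyperparameters your run actually uses:

import json, time

config = {
    'lr': 3e-4,
    'batch_size': 32,
    'weight_decay': 0.01,
    'max_seq_len': 128,
    'seed': 42,
    'amp': True,
}
run_id = time.strftime('%Y%m%d-%H%M%S')
with open(f'config-{run_id}.json', 'w') as f:
    json.dump(config, f, indent=2)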
Mini tasks
- Add gradient clipping at max_norm=1.0 and compare stability.
- Log GPU memory at the end of each epoch with torch.cuda.max_memory_allocated (a sketch follows these tasks).
- Export a small TorchScript or state_dict artifact with a versioned filename.
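For the memory-logging task, a sketch to place at the end of each training epoch; it assumes a CUDA device and resets the peak counter so every epoch reports its own maximum:

# Inside the epoch loop, after validation
if torch.cuda.is_available():
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f'Epoch {epoch+1}: peak GPU memory {peak_gib:.2f} GiB')
    torch.cuda.reset_peak_memory_stats()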
Subskills
- Efficient Training Loops: Clean structure, logging, schedulers, and checkpoints for smooth runs.
- Mixed Precision Basics: Use autocast and GradScaler safely to speed up training.
- Batch Size And Gradient Accumulation: Increase throughput while fitting memory limits.
- Checkpointing And Early Stopping: Save best models and stop before overfitting.
- Hyperparameter Tuning Basics: Small, disciplined searches that improve validation metrics.
- Distributed Training Basics: Start with DDP correctly and avoid data duplication.
- Managing GPUs And Cost: Keep utilization high, reduce idle time, and track spend.
- Reproducibility With Seeds And Configs: Stable results with fixed seeds and versioned configs.
Next steps
- Adopt a simple config system to track experiments consistently.
- Introduce better schedulers (OneCycle, cosine with warmup) after the basics are solid; a OneCycle sketch follows this list.
- Explore parameter-efficient fine-tuning for larger language models to cut compute.
- Move to multi-GPU DDP when single-GPU training is stable and reproducible.
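When you move beyond plain cosine annealing, OneCycleLR is a simple next step; note that it is stepped once per optimizer step, not once per epoch. A minimal sketch assuming the model, opt, device, and train_dl from the first worked example:

epochs = 5
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=3e-4, total_steps=epochs * len(train_dl))

for epoch in range(epochs):
    model.train()
    for batch in train_dl:
        ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        y = batch['labels'].to(device)
        opt.zero_grad(set_to_none=True)
        _, loss = model(ids, mask, y)
        loss.backward()
        opt.step()
        scheduler.step()  # OneCycle is stepped every batch, not every epoch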
Skill exam
Test your understanding with a short exam at the end of this page. Anyone can take it. If you log in, your progress and results will be saved.