Why this skill matters for NLP Engineers
Transformers like BERT, GPT, and T5 power modern NLP. As an NLP Engineer, you will map business problems to model families, fine-tune efficiently on limited data, and deploy models that are reproducible and cost-effective. Mastering fine-tuning, prompt strategies, and parameter-efficient methods (like LoRA) unlocks tasks such as classification, NER, summarization, question answering, and generation.
What you will learn
- Pick the right base model (BERT-like, GPT-like, T5/BART) for your task and constraints.
- Fine-tune for classification, token-level tasks (NER), and seq2seq tasks (summarization, translation).
- Use prompting and few-shot techniques; decide when fine-tuning is worth it.
- Apply parameter-efficient tuning (LoRA) to cut compute and memory costs.
- Manage seeds, checkpoints, and configs to ensure reproducibility.
Who this is for
- Junior to mid-level ML/NLP practitioners moving beyond basic pipelines.
- Data scientists transitioning into production NLP roles.
- Engineers shipping text understanding or generation features.
Prerequisites
- Python basics and PyTorch familiarity.
- Comfort with datasets, train/val/test splits, and evaluation metrics.
- Basic understanding of attention and tokens/subwords.
Learning path
1) Model families
Understand BERT (encoder-only), GPT (decoder-only), and T5/BART (encoder–decoder) and their best-fit tasks; see the loading sketch after this list.
2) Choose a base model
Match models to dataset size, latency, and licensing constraints.
3) Fine-tune core tasks
Implement classification, NER, and seq2seq fine-tuning with standard trainers.
4) Prompting vs fine-tuning
Prototype with prompts; escalate to fine-tuning when needed.
5) Parameter-efficient tuning
Apply LoRA to reduce GPU memory and speed up training.
6) Reproducibility
Save seeds, checkpoints, configs, and training metadata.
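As a quick orientation for step 1, here is a minimal sketch of how each family is typically loaded with Hugging Face Auto classes (the checkpoint names are common public examples, not requirements):
from transformers import (AutoModelForSequenceClassification,
                          AutoModelForCausalLM,
                          AutoModelForSeq2SeqLM)

# Encoder-only (BERT-like): understanding tasks such as classification and NER
encoder = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Decoder-only (GPT-like): autoregressive generation and prompting
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
# Encoder-decoder (T5/BART): text-to-text tasks such as summarization and translation
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")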
Roadmap with practical milestones
- Milestone 1: Run a zero-shot or few-shot baseline with prompting.
- Milestone 2: Fine-tune a compact encoder (e.g., MiniLM/DistilBERT) for text classification; hit >85% accuracy on a clean dataset.
- Milestone 3: Train an NER model, verify subword label alignment, and reach entity-level F1 >80%.
- Milestone 4: Fine-tune a T5/BART summarizer with ROUGE evaluation.
- Milestone 5: Re-run experiments deterministically and reproduce results within a small tolerance.
- Milestone 6: Apply LoRA and demonstrate comparable performance with lower compute cost.
Worked examples
Example 1: Quick baseline via prompting (zero/few-shot)
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# Zero-shot classification using NLI as a proxy: each candidate label is scored as an entailment hypothesis
result = classifier(
    "The restaurant was slow but the food was excellent.",
    candidate_labels=["positive", "negative", "neutral"],
)
print(result)
Why: Establish a quick baseline and decide if fine-tuning is necessary.
Example 2: Fine-tune a text classifier (encoder-only)
import torch
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, set_seed)
set_seed(42)
name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(name)
ds = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)
tok = ds.map(tokenize, batched=True)
tok = tok.rename_column("label", "labels")
tok.set_format(type="torch", columns=["input_ids","attention_mask","labels"])
args = TrainingArguments(
    output_dir="./ckpt-cls",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
from evaluate import load as load_metric
acc = load_metric("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(-1)
    return acc.compute(predictions=preds, references=labels)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tok["train"].shuffle(seed=42).select(range(20000)),  # shuffle first: IMDB train is sorted by label
    eval_dataset=tok["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("./ckpt-cls/best")
Tip: Start with a small subset to iterate quickly; then scale.
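For instance, an even smaller shuffled slice (the sizes here are illustrative) keeps debug iterations to minutes before you scale back up:
# Illustrative: iterate on a small shuffled slice, then swap in the full split
small_train = tok["train"].shuffle(seed=42).select(range(2000))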
Example 3: Token classification (NER) with subword alignment
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, TrainingArguments, Trainer)
name = "bert-base-cased"
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=9)
tokenizer = AutoTokenizer.from_pretrained(name)
ds = load_dataset("conll2003")
label_list = ds["train"].features["ner_tags"].feature.names
def tokenize_and_align(examples):
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i in range(len(examples["tokens"])):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        previous = None
        for wid in word_ids:
            if wid is None:
                label_ids.append(-100)  # special tokens: ignored by the loss
            elif wid != previous:
                label_ids.append(examples["ner_tags"][i][wid])
                previous = wid
            else:
                # later subwords of the same word; simplest: repeat the same label id
                label_ids.append(examples["ner_tags"][i][wid])
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized
tok = ds.map(tokenize_and_align, batched=True)
collator = DataCollatorForTokenClassification(tokenizer)
args = TrainingArguments(
    output_dir="./ckpt-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tok["train"],
    eval_dataset=tok["validation"],
    tokenizer=tokenizer,
    data_collator=collator,
)
trainer.train()
Key: Mask ignored tokens with -100 and align labels to subwords.
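Example 3 trains without metrics; here is a hedged sketch of entity-level precision/recall/F1 using the evaluate library's seqeval metric (assumes the seqeval package is installed and reuses label_list from above). Pass it as compute_metrics to the Trainer above to track Milestone 3's F1 target.
import numpy as np
from evaluate import load as load_metric

seqeval = load_metric("seqeval")

def compute_ner_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Drop positions masked with -100 and map label ids back to tag names
    true_preds = [[label_list[p] for p, l in zip(pred, lab) if l != -100]
                  for pred, lab in zip(preds, labels)]
    true_labels = [[label_list[l] for p, l in zip(pred, lab) if l != -100]
                   for pred, lab in zip(preds, labels)]
    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {"precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"]}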
Example 4: Seq2seq fine-tuning (summarization with T5)
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)
name = "t5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
ds = load_dataset("xsum")
def preprocess(batch):
    inputs = ["summarize: " + x for x in batch["document"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    # text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
tok = ds.map(preprocess, batched=True)
collator = DataCollatorForSeq2Seq(tokenizer, model=model)
args = Seq2SeqTrainingArguments(  # predict_with_generate requires the seq2seq variant
    output_dir="./ckpt-t5",
    evaluation_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tok["train"].select(range(5000)),
    eval_dataset=tok["validation"].select(range(1000)),
    tokenizer=tokenizer,
    data_collator=collator,
)
trainer.train()
Seq2seq fine-tuning needs generation-aware evaluation (hence Seq2SeqTrainer with predict_with_generate) and input/target max lengths chosen to fit your data.
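To score summaries, here is a hedged sketch of a ROUGE compute_metrics function using the evaluate library (assumes the rouge_score package is installed); pass it as compute_metrics to the Seq2SeqTrainer above.
import numpy as np
from evaluate import load as load_metric

rouge = load_metric("rouge")

def compute_rouge(eval_pred):
    preds, labels = eval_pred
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)  # undo label padding before decoding
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    scores = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    return {k: round(v, 4) for k, v in scores.items()}  # rouge1, rouge2, rougeL, rougeLsum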
Example 5: Parameter-efficient tuning with LoRA
# Requires the 'peft' library
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
name = "gpt2"
base = AutoModelForCausalLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, target_modules=["c_attn"], bias="none")
model = get_peft_model(base, config)
text = ["Q: What is LoRA? A:", "Q: Define transfer learning. A:"]
labels = text
def tokenize(examples):
tokens = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)
tokens["labels"] = tokens["input_ids"].copy()
return tokens
ds = load_dataset("json", data_files={"train": [{"text": t} for t in text]})
ds = ds.map(tokenize, batched=True)
ds.set_format(type="torch", columns=["input_ids","attention_mask","labels"])
args = TrainingArguments(
    output_dir="./ckpt-lora",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    logging_steps=10,
)
trainer = Trainer(model=model, args=args, train_dataset=ds, tokenizer=tokenizer)
trainer.train()
model.save_pretrained("./ckpt-lora/final")
LoRA trains small low-rank update matrices alongside frozen base weights, shrinking the trainable parameter count and optimizer state, which reduces memory use and training cost.
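Two useful follow-ups, sketched with the peft API and reusing Example 5's imports: confirm how few parameters are actually trained, and reload the saved adapter onto a fresh frozen base model.
# Typically well under 1% of parameters are trainable with r=8
model.print_trainable_parameters()

# Later: reload the adapter on top of a fresh base model
from peft import PeftModel
fresh_base = AutoModelForCausalLM.from_pretrained("gpt2")
restored = PeftModel.from_pretrained(fresh_base, "./ckpt-lora/final")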
Example 6: Reproducibility and checkpointing
import os, json, random, numpy as np, torch
from transformers import set_seed
# 1) Set seeds
seed = 1234
set_seed(seed)
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
# 2) Save run config
run_cfg = {
    "seed": seed,
    "model": "distilbert-base-uncased",
    "lr": 2e-5,
    "epochs": 2,
    "max_length": 256,
}
os.makedirs("./ckpt-meta", exist_ok=True)
with open("./ckpt-meta/run_config.json", "w") as f:
json.dump(run_cfg, f, indent=2)
# 3) After training, save tokenizer + model
# tokenizer.save_pretrained("./ckpt-meta")
# model.save_pretrained("./ckpt-meta")
# 4) Keep a frozen copy of validation predictions for later diffing
# np.save("./ckpt-meta/val_preds.npy", preds)
Goal: make it easy to recreate, compare, and audit runs.
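A minimal sketch of the audit step, assuming you saved validation predictions as above (the second file name is hypothetical, standing in for a re-run with the same seed and config):
import numpy as np

ref = np.load("./ckpt-meta/val_preds.npy")        # original run
new = np.load("./ckpt-meta/val_preds_rerun.npy")  # hypothetical re-run with the same seed/config
print("max abs diff:", np.abs(ref - new).max())
assert np.allclose(ref, new, atol=1e-4), "Runs diverged beyond tolerance"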
Drills and exercises
- Run a zero-shot baseline on your dataset; record accuracy/F1 and a 2–3 sentence error analysis.
- Fine-tune a small encoder for classification; sweep learning rates [1e-5, 2e-5, 5e-5] and compare results.
- Implement NER with proper subword label alignment; verify no label leakage in padding tokens.
- Fine-tune a T5 model on a small summarization subset; compute ROUGE-1/2/L.
- Repeat an experiment with the same seed and confirm near-identical metrics and predictions.
- Apply LoRA to a causal LM and measure GPU memory vs full fine-tuning.
Common mistakes and debugging tips
Mistake: Misaligned labels in token tasks
Symptoms: poor F1, warnings about ignored indices. Fix: use word_ids to map word-level labels to subwords; either label only the first subword (set -100 on the rest) or repeat the label on every subword, but apply one scheme consistently.
Mistake: Overfitting during fine-tuning
Symptoms: training accuracy climbs while validation stalls. Fix: lower learning rate, add weight decay, increase dropout, early stop, or freeze lower layers.
Mistake: Truncating important content
Symptoms: seq2seq outputs miss key info. Fix: increase max_length for inputs, use sliding window or Longformer/LED for long docs.
Mistake: Non-reproducible results
Symptoms: metric swings across runs. Fix: set global seeds, fix data splits, log hyperparams, save checkpoints and predictions, control data shuffling.
Debugging playbook
- Sanity-check on 100 samples; ensure loss decreases.
- Overfit a tiny subset (e.g., 64 examples) to near-zero loss; if it fails, inspect data/labels (see the sketch after this list).
- Print a few decoded inputs/labels to verify truncation and tokenization.
- Log class distribution; consider class weights or resampling if imbalanced.
- Use gradient clipping (e.g., 1.0) to stabilize training.
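A hedged sketch of the tiny-subset overfit check with gradient clipping, reusing Example 2's model, tokenizer, and tokenized dataset (max_grad_norm is the TrainingArguments knob for clipping):
# Training loss should approach zero within a few dozen steps; if not,
# inspect the data, labels, and tokenization before touching hyperparameters.
tiny = tok["train"].shuffle(seed=42).select(range(64))
debug_args = TrainingArguments(
    output_dir="./ckpt-debug",
    num_train_epochs=30,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    max_grad_norm=1.0,  # gradient clipping
    logging_steps=5,
    report_to="none",
)
Trainer(model=model, args=debug_args, train_dataset=tiny, tokenizer=tokenizer).train()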
Mini project: Unified customer feedback assistant
Build an end-to-end pipeline:
- Task 1: Classify feedback into categories (bug, feature, praise, urgent).
- Task 2: Extract entities (product, version, platform) with NER.
- Task 3: Generate a one-sentence summary or suggested reply.
Requirements:
- Start with a zero-shot baseline; then fine-tune for each task.
- Apply LoRA to reduce compute for the generation model.
- Save seeds, configs, and checkpoints; rerun to verify reproducibility.
- Deliver a short report with metrics (accuracy/F1/ROUGE) and 5 example outputs.
Implementation hints
- Use an encoder (DistilBERT/MiniLM) for classification and NER; T5-small or BART-base for summarization.
- Share a single tokenizer per model family when possible; pin versions.
- Keep inference batch sizes small to meet latency targets.
Practical projects
- Support ticket triage: intent classification + NER for routing.
- Meeting notes summarizer: seq2seq summarization with length control.
- FAQ assistant: start with a few-shot prompting baseline, then upgrade to retrieval-augmented generation with a fine-tuned model.
Subskills
Model Families: BERT, GPT, T5
Understand encoder vs decoder vs encoder–decoder and their trade-offs.
Choosing Base Model For Task
Balance accuracy, size, latency, and licensing constraints.
Fine-Tuning For Classification
Build robust text classifiers with evaluation and regularization.
Fine-Tuning For Token Tasks (NER)
Handle subword alignment, span reconstruction, and masking.
Fine-Tuning For Seq2seq Tasks
Prepare inputs/targets, generation configs, and metrics.
Prompting Versus Fine-Tuning Tradeoffs
Prototype quickly; fine-tune when you need accuracy, control, or privacy.
Parameter-Efficient Tuning (LoRA)
Adapt large models on modest hardware with minimal memory.
Managing Checkpoints And Reproducibility
Lock seeds, save configs, and make runs auditable.
Next steps
- Choose one of the practical projects and scope a 1–2 week MVP.
- Create a reproducible run template that sets seeds, logs hyperparams, and saves checkpoints.
- Attempt LoRA on a model at least 2× larger than your baseline to experience the trade-offs.