Who this is for
- Junior to mid-level NLP engineers who need production-ready text classifiers (sentiment, intent, topic).
- ML practitioners moving from traditional models to transformers.
- Data scientists who want robust baselines that are easy to iterate and deploy.
Prerequisites
- Python basics and ability to read stack traces.
- Intro to PyTorch or a high-level library (e.g., Transformers).
- Understanding of classification metrics: accuracy, precision, recall, F1.
Why this matters
As an NLP Engineer you will routinely:
- Classify support tickets to the right team (intent routing).
- Moderate user content (toxicity, hate speech, spam).
- Categorize user feedback with labels (sentiment, aspect category).
- Tag emails or chats to trigger automations (lead qualification, escalation).
Fine-tuning transformer encoders (e.g., BERT, RoBERTa, DistilBERT) gives you strong baselines with minimal feature engineering and fast iteration cycles.
Concept explained simply
A transformer pre-trained on massive amounts of text already encodes rich relationships between words. You add a small classification head on top and train it on your labeled examples, and the model learns to map its rich text representation to your labels.
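You can see that small head directly. A minimal inspection sketch, assuming the DistilBERT checkpoint used later in this module (the classifier attribute name is specific to DistilBERT's implementation):
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
print(model.classifier)  # Linear(in_features=768, out_features=2, bias=True) -- the head fine-tuning trains from scratch
print(sum(p.numel() for p in model.classifier.parameters()))  # ~1.5k new parameters vs ~66M pre-trained weights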
Mental model
Imagine a Swiss Army knife (the pre-trained model). Fine-tuning chooses one tool (a small classifier head) and practices using it for your task until it becomes skilled.
Worked examples
Example 1: Binary sentiment classification (DistilBERT)
import numpy as np
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
# 1) Toy data
train_texts = [
"Amazing product, loved it!",
"Terrible quality, very disappointed",
"Works as expected",
"Not worth the price",
"I would buy again",
]
train_labels = [1, 0, 1, 0, 1] # 1=positive, 0=negative
val_texts = ["Absolutely fantastic", "Bad purchase"]
val_labels = [1, 0]
ds = DatasetDict({
"train": Dataset.from_dict({"text": train_texts, "label": train_labels}),
"validation": Dataset.from_dict({"text": val_texts, "label": val_labels})
})
# 2) Tokenize
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
tok_ds = ds.map(tokenize, batched=True)
# 3) Model
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=2)
# 4) Metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
# 5) Train
args = TrainingArguments(
output_dir="out-sentiment",
evaluation_strategy="epoch",
save_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
num_train_epochs=3,
weight_decay=0.01,
load_best_model_at_end=True,
metric_for_best_model="f1"
)
trainer = Trainer(
model=model,
args=args,
train_dataset=tok_ds["train"],
eval_dataset=tok_ds["validation"],
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
trainer.train()
Notes:
- Use macro-F1 instead when classes are imbalanced.
- Set a random seed to improve reproducibility (both notes are sketched below).
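A minimal sketch of both notes, reusing the Example 1 setup; compute_metrics_macro is just an illustrative name for a drop-in replacement of the metrics function above:
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import set_seed
set_seed(42)  # seeds Python's random, NumPy, and PyTorch before model creation and training
def compute_metrics_macro(eval_pred):
    # Same as Example 1's compute_metrics, but macro averaging weights all classes equally
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
    return {"accuracy": accuracy_score(labels, preds), "macro_f1": f1, "macro_precision": precision, "macro_recall": recall}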
Example 2: Multi-class intent (10 classes) with class imbalance
import torch
import numpy as np
from datasets import DatasetDict, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
# Assume you have intents 0..9 and imbalanced counts
train_texts = ["reset password", "refund request", "track order", "cancel order", ...]
train_labels = [2, 5, 1, 5, ...] # 10 classes
val_texts = ["where is my package?", "how to login?"]
val_labels = [1, 2]
# Build datasets
train_ds = Dataset.from_dict({"text": train_texts, "label": train_labels})
val_ds = Dataset.from_dict({"text": val_texts, "label": val_labels})
ds = DatasetDict({"train": train_ds, "validation": val_ds})
model_ckpt = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
def tok(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
tok_ds = ds.map(tok, batched=True)
num_labels = 10
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=num_labels)
# Compute class weights from training labels
labels_tensor = torch.tensor(train_labels)
class_counts = torch.bincount(labels_tensor, minlength=num_labels).float()
class_weights = (1.0 / (class_counts + 1e-6))
class_weights = class_weights / class_weights.mean()
class_weights = class_weights.to(torch.float32)
# Custom Trainer to apply weighted loss
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments (e.g., num_items_in_batch) passed by newer Trainer versions
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights.to(logits.device))
        loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
def metrics(p):
    logits, labels = p
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    prec, rec, f1, _ = precision_recall_fscore_support(labels, preds, average="macro")
    return {"accuracy": acc, "macro_f1": f1, "macro_precision": prec, "macro_recall": rec}
args = TrainingArguments(
output_dir="out-intent",
learning_rate=3e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
num_train_epochs=4,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="macro_f1",
)
trainer = WeightedTrainer(
model=model,
args=args,
train_dataset=tok_ds["train"],
eval_dataset=tok_ds["validation"],
tokenizer=tokenizer,
compute_metrics=metrics,
)
trainer.train()
Tip: Use macro-F1 for multi-class imbalance to value all classes equally.
Example 3: Multi-label toxicity (zero, one, or many labels)
import numpy as np
import torch
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import f1_score, precision_recall_fscore_support
# Each sample has a binary vector of labels (e.g., [toxic, threat, insult])
train_texts = ["you are awful", "great work", "i will find you", "nice post"]
train_labels = [[1,0,1], [0,0,0], [1,1,0], [0,0,0]]
val_texts = ["stupid idea", "thank you"]
val_labels = [[1,0,0], [0,0,0]]
ds = DatasetDict({
"train": Dataset.from_dict({"text": train_texts, "labels": train_labels}),
"validation": Dataset.from_dict({"text": val_texts, "labels": val_labels})
})
ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
def tok(batch):
    x = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    x["labels"] = batch["labels"]
    return x
tok_ds = ds.map(tok, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
ckpt, num_labels=3, problem_type="multi_label_classification"
)
# Metrics for multi-label: threshold sigmoid outputs at 0.5
def compute_metrics(p):
    logits, labels = p
    probs = 1 / (1 + np.exp(-logits))
    preds = (probs >= 0.5).astype(int)
    micro = f1_score(labels, preds, average="micro", zero_division=0)
    macro = f1_score(labels, preds, average="macro", zero_division=0)
    return {"micro_f1": micro, "macro_f1": macro}
args = TrainingArguments(
output_dir="out-toxicity",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
num_train_epochs=3,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="micro_f1",
)
trainer = Trainer(
model=model,
args=args,
train_dataset=tok_ds["train"],
eval_dataset=tok_ds["validation"],
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
trainer.train()
Tip: Tune the decision threshold per label (e.g., 0.3–0.6) to optimize F1 on validation.
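One way to do this, sketched below: get validation probabilities from trainer.predict, then grid-search a threshold per label. The helper name best_thresholds and the 0.30–0.60 grid are assumptions for illustration, not part of any library.
import numpy as np
from sklearn.metrics import f1_score
def best_thresholds(probs, labels, grid=np.arange(0.30, 0.61, 0.05)):
    # probs: (n_samples, n_labels) sigmoid outputs on validation; labels: matching 0/1 matrix
    chosen = []
    for j in range(probs.shape[1]):
        scores = [f1_score(labels[:, j], (probs[:, j] >= t).astype(int), zero_division=0) for t in grid]
        chosen.append(float(grid[int(np.argmax(scores))]))
    return chosen
# Usage with Example 3's trainer and validation split:
# pred = trainer.predict(tok_ds["validation"])
# probs = 1 / (1 + np.exp(-pred.predictions))
# print(best_thresholds(probs, np.array(pred.label_ids)))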
Hyperparameters that matter
- Learning rate: start small (2e-5 to 5e-5). Too high causes instability; too low slows learning.
- Epochs: 3–5 is common; use early stopping based on validation F1.
- Batch size: 16–32. Use gradient accumulation if GPU memory is limited.
- Max sequence length: 128–256 usually good. Longer costs more compute and may not help.
- Warmup ratio: 0.05–0.1 stabilizes early updates.
- Weight decay: 0.01 reduces overfitting.
- Layer freezing: optionally freeze lower layers for tiny datasets; unfreeze them later if the model underfits (see the sketch after this list).
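A sketch combining several of these knobs, assuming the DistilBERT checkpoint from Example 1 (the parameter names are specific to DistilBERT; other encoders name their layers differently):
from transformers import AutoModelForSequenceClassification, TrainingArguments
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
# Optional for tiny datasets: freeze the embeddings and the first 4 of DistilBERT's 6 transformer blocks
for name, param in model.named_parameters():
    if name.startswith("distilbert.embeddings") or any(f"transformer.layer.{i}." in name for i in range(4)):
        param.requires_grad = False
args = TrainingArguments(
    output_dir="out-hparams",
    learning_rate=2e-5,             # start small; raise cautiously if learning is too slow
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # effective batch size 16 when GPU memory is tight
    warmup_ratio=0.1,               # stabilizes the first updates
    weight_decay=0.01,
)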
Evaluation and error analysis
- Report: accuracy + F1 (macro-F1 for imbalance).
- Inspect per-class precision/recall to find blind spots.
- Build a confusion matrix (multi-class) or per-label PR curves (multi-label); see the sketch after this list.
- Slice metrics by text length, language, user segment, or source.
- Manually review 20–50 top false positives/negatives; write rules or augment data based on patterns.
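A sketch of the per-class report, confusion matrix, and one metric slice, assuming the trained trainer, tok_ds, and ds objects from Example 2:
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
pred = trainer.predict(tok_ds["validation"])        # logits and true labels for the validation split
preds = np.argmax(pred.predictions, axis=-1)
print(classification_report(pred.label_ids, preds, zero_division=0))  # per-class precision/recall/F1
print(confusion_matrix(pred.label_ids, preds))      # rows = true class, columns = predicted class
# Slice metrics by text length (or any other attribute you can compute per example)
lengths = np.array([len(t.split()) for t in ds["validation"]["text"]])
mask = lengths > np.median(lengths)
print("accuracy on longer texts:", (preds[mask] == np.array(pred.label_ids)[mask]).mean())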
Exercises
Complete these hands-on tasks. After each, tick the checklist to self-verify.
Exercise 1 (ex1): Label mapping and tokenization sanity-check
What to do
- Create a label map for binary sentiment: {"neg":0, "pos":1}.
- Tokenize two sentences with max_length=16 and verify attention masks and truncation.
- Confirm that labels are integers and shapes align with inputs.
Exercise 2 (ex2): Fine-tune a small classifier
What to do
- Use a small sample dataset (e.g., 200 rows, 80/20 split).
- Fine-tune DistilBERT for 1–2 epochs.
- Track validation F1, and ensure it improves over a random baseline.
- [ ] I verified tokenization shapes (input_ids and attention_mask sizes match).
- [ ] I confirmed label mapping is consistent across splits.
- [ ] I ran training and saw training loss decreasing.
- [ ] I recorded validation metrics and noted at least one improvement action.
Practical projects
- Customer feedback router: multi-class intent classifier with 8–12 categories and macro-F1 target.
- Light content moderation: multi-label classifier for toxic categories with threshold tuning per label.
- Support prioritization: binary classifier for "urgent" vs "normal" using domain-specific data.
Common mistakes and how to self-check
- Label leakage: overlapping texts across train/val splits. Self-check: hash texts and ensure no duplicates across splits (see the sketch after this list).
- Wrong problem type: Treating multi-label as multi-class. Self-check: does any example have multiple labels? If yes, use multi-label setup.
- Poor thresholding (multi-label): Using 0.5 blindly. Self-check: search thresholds to maximize F1 on validation.
- Overlong sequences: a max length of 512 for short texts wastes compute. Self-check: plot the token-length distribution and pick roughly the 95th percentile (P95).
- Unstable training: LR too high. Self-check: try 5e-5 → 3e-5 → 2e-5 and compare metrics.
- Ignoring class imbalance: Self-check: compute class frequencies; consider weighted loss or stratified sampling.
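Two of those self-checks as a sketch, assuming the texts and tokenizer from Example 1 (text_hashes is just an illustrative helper):
import hashlib
import numpy as np
def text_hashes(texts):
    # Normalize lightly so trivial variants of the same text still collide
    return {hashlib.md5(t.strip().lower().encode("utf-8")).hexdigest() for t in texts}
# Leakage check: any overlap between train and validation texts?
overlap = text_hashes(train_texts) & text_hashes(val_texts)
print(f"{len(overlap)} duplicated texts across train/validation")
# Sequence-length check: choose max_length near the 95th percentile of token counts
lengths = [len(tokenizer(t)["input_ids"]) for t in train_texts]
print("P95 token length:", int(np.percentile(lengths, 95)))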
Mini challenge
You are given short app reviews with labels {negative, neutral, positive}. Improve macro-F1 by at least +3 points over a DistilBERT baseline. Consider:
- Trying RoBERTa-base or domain-specific variants.
- Adjusting max_length (64 vs 128).
- Tuning learning rate (2e-5 to 5e-5) and epochs (2–5).
- Using class weights if neutral is underrepresented.
Next steps
- Move to more advanced tasks: sequence labeling (NER), QA, and contrastive learning for retrieval.
- Experiment with parameter-efficient fine-tuning (LoRA, adapters) for faster iteration; a minimal LoRA sketch follows this list.
- Take the quick test below to check your understanding.
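As a starting point for the parameter-efficient fine-tuning bullet above, a minimal LoRA sketch with the peft library, assuming the RoBERTa checkpoint from Example 2 (the target module names query and value match RoBERTa's attention projections; other encoders use different names):
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=10)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # sequence classification
    r=8,                                # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # RoBERTa attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # typically only a small fraction of the full model
# Train as before: pass this model to Trainer / WeightedTrainer.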