
Fine Tuning For Classification

Learn fine-tuning for classification for free, with explanations, exercises, and a quick test aimed at NLP engineers.

Published: January 5, 2026 | Updated: January 5, 2026

Who this is for

  • Junior to mid-level NLP engineers who need production-ready text classifiers (sentiment, intent, topic).
  • ML practitioners moving from traditional models to transformers.
  • Data scientists who want robust baselines that are easy to iterate and deploy.

Prerequisites

  • Python basics and ability to read stack traces.
  • Intro to PyTorch or a high-level library (e.g., Transformers).
  • Understanding of classification metrics: accuracy, precision, recall, F1.

Why this matters

As an NLP Engineer you will routinely:

  • Classify support tickets to the right team (intent routing).
  • Moderate user content (toxicity, hate speech, spam).
  • Summarize user feedback with labels (sentiment, aspect category).
  • Tag emails or chats to trigger automations (lead qualification, escalation).

Fine-tuning transformer encoders (e.g., BERT, RoBERTa, DistilBERT) gives you strong baselines with minimal feature engineering and fast iteration cycles.

Concept explained simply

A transformer pre-trained on massive text already understands word relationships. You add a small classification head on top and train it on your labeled examples. The model learns to map its rich text representation to your labels.
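A minimal sketch of the idea in plain PyTorch (illustrative only; the module names and first-token pooling are assumptions, not the exact Hugging Face internals):

import torch.nn as nn
from transformers import AutoModel

class TextClassifier(nn.Module):
  """Illustrative sketch: a pre-trained encoder body plus a small, newly initialized head."""
  def __init__(self, checkpoint, num_labels):
    super().__init__()
    self.encoder = AutoModel.from_pretrained(checkpoint)  # pre-trained transformer body
    self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)  # new classification head

  def forward(self, input_ids, attention_mask):
    outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
    pooled = outputs.last_hidden_state[:, 0]  # first-token ([CLS]-style) summary of the sequence
    return self.head(pooled)  # one logit per label

In practice, AutoModelForSequenceClassification builds an equivalent head for you; that is what the examples below use.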

Mental model

Imagine a Swiss Army knife (the pre-trained model). Fine-tuning chooses one tool (a small classifier head) and practices using it for your task until it becomes skilled.

Learning path

Step 1: Frame the problem (binary vs multi-class vs multi-label). Define metrics and acceptance criteria.
Step 2: Prepare data (train/validation/test split, label mapping, handle class imbalance); see the split sketch after this list.
Step 3: Choose a base model (DistilBERT for speed, BERT/RoBERTa for accuracy, domain-specific if available).
Step 4: Tokenize with truncation/padding and a sensible max_length (e.g., 128–256).
Step 5: Fine-tune with a small learning rate (2e-5–5e-5), batch size 16–32, 3–5 epochs, AdamW, weight decay 0.01, warmup 5–10%.
Step 6: Evaluate (accuracy + F1; macro-F1 for class imbalance; per-class metrics; confusion matrix).
Step 7: Error analysis and iteration (threshold tuning for multi-label, augment data, adjust max_length, try better base model).
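A minimal sketch of Step 2, assuming your raw data is two parallel lists of texts and string labels (the data below is a placeholder); stratified splitting keeps class proportions similar across train, validation, and test.

from sklearn.model_selection import train_test_split

# Placeholder data; replace with your own texts and string labels.
texts = ["reset password", "refund request", "track order", "cancel order"] * 25
raw_labels = ["account", "billing", "shipping", "billing"] * 25

# Label mapping: strings -> integer ids the model expects.
label2id = {name: i for i, name in enumerate(sorted(set(raw_labels)))}
labels = [label2id[y] for y in raw_labels]

# Hold out 20% for test, then 10% of the remainder for validation, preserving class ratios.
train_texts, test_texts, train_labels, test_labels = train_test_split(
  texts, labels, test_size=0.2, random_state=42, stratify=labels
)
train_texts, val_texts, train_labels, val_labels = train_test_split(
  train_texts, train_labels, test_size=0.1, random_state=42, stratify=train_labels
)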

Worked examples

Example 1: Binary sentiment classification (DistilBERT)
import numpy as np
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# 1) Toy data
train_texts = [
  "Amazing product, loved it!",
  "Terrible quality, very disappointed",
  "Works as expected",
  "Not worth the price",
  "I would buy again",
]
train_labels = [1, 0, 1, 0, 1]  # 1=positive, 0=negative
val_texts = ["Absolutely fantastic", "Bad purchase"]
val_labels = [1, 0]

ds = DatasetDict({
  "train": Dataset.from_dict({"text": train_texts, "label": train_labels}),
  "validation": Dataset.from_dict({"text": val_texts, "label": val_labels})
})

# 2) Tokenize
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize(batch):
  return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tok_ds = ds.map(tokenize, batched=True)

# 3) Model
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=2)

# 4) Metrics

def compute_metrics(eval_pred):
  logits, labels = eval_pred
  preds = np.argmax(logits, axis=-1)
  precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
  acc = accuracy_score(labels, preds)
  return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

# 5) Train
args = TrainingArguments(
  output_dir="out-sentiment",
  evaluation_strategy="epoch",
  save_strategy="epoch",
  learning_rate=2e-5,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=32,
  num_train_epochs=3,
  weight_decay=0.01,
  load_best_model_at_end=True,
  metric_for_best_model="f1"
)

trainer = Trainer(
  model=model,
  args=args,
  train_dataset=tok_ds["train"],
  eval_dataset=tok_ds["validation"],
  tokenizer=tokenizer,
  compute_metrics=compute_metrics,
)

trainer.train()

Notes:

  • Use macro-F1 instead when classes are imbalanced.
  • Set a random seed to improve reproducibility.
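For the seed note above, a minimal sketch (transformers' set_seed covers Python, NumPy, and PyTorch in one call):

from transformers import set_seed

set_seed(42)  # seeds Python's random module, NumPy, and PyTorch (CPU and CUDA)
# TrainingArguments also accepts seed=42 if you want the Trainer to reseed its own components.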

Example 2: Multi-class intent (10 classes) with class imbalance
import torch
import numpy as np
from datasets import DatasetDict, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Assume you have intents 0..9 and imbalanced counts
train_texts = ["reset password", "refund request", "track order", "cancel order", ...]
train_labels = [2, 5, 1, 5, ...]  # 10 classes
val_texts = ["where is my package?", "how to login?"]
val_labels = [1, 2]

# Build datasets
train_ds = Dataset.from_dict({"text": train_texts, "label": train_labels})
val_ds = Dataset.from_dict({"text": val_texts, "label": val_labels})
ds = DatasetDict({"train": train_ds, "validation": val_ds})

model_ckpt = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tok(batch):
  return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tok_ds = ds.map(tok, batched=True)

num_labels = 10
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=num_labels)

# Compute class weights from training labels
labels_tensor = torch.tensor(train_labels)
class_counts = torch.bincount(labels_tensor, minlength=num_labels).float()
class_weights = (1.0 / (class_counts + 1e-6))
class_weights = class_weights / class_weights.mean()
class_weights = class_weights.to(torch.float32)

# Custom Trainer that applies the class-weighted cross-entropy loss
class WeightedTrainer(Trainer):
  # **kwargs absorbs extra arguments (e.g., num_items_in_batch) passed by newer Trainer versions
  def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
    labels = inputs.get("labels")
    outputs = model(**inputs)
    logits = outputs.get("logits")
    loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights.to(logits.device))
    loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
    return (loss, outputs) if return_outputs else loss

def metrics(p):
  logits, labels = p
  preds = np.argmax(logits, axis=-1)
  acc = accuracy_score(labels, preds)
  prec, rec, f1, _ = precision_recall_fscore_support(labels, preds, average="macro")
  return {"accuracy": acc, "macro_f1": f1, "macro_precision": prec, "macro_recall": rec}

args = TrainingArguments(
  output_dir="out-intent",
  learning_rate=3e-5,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=32,
  num_train_epochs=4,
  evaluation_strategy="epoch",
  save_strategy="epoch",
  load_best_model_at_end=True,
  metric_for_best_model="macro_f1",
)

trainer = WeightedTrainer(
  model=model,
  args=args,
  train_dataset=tok_ds["train"],
  eval_dataset=tok_ds["validation"],
  tokenizer=tokenizer,
  compute_metrics=metrics,
)

trainer.train()

Tip: For imbalanced multi-class problems, use macro-F1 so that every class counts equally.

Example 3: Multi-label toxicity (zero, one, or many labels)
import numpy as np
import torch
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Each sample has a binary vector of labels (e.g., [toxic, threat, insult]).
# Multi-label targets must be floats: problem_type="multi_label_classification" uses BCEWithLogitsLoss.
train_texts = ["you are awful", "great work", "i will find you", "nice post"]
train_labels = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
val_texts = ["stupid idea", "thank you"]
val_labels = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

ds = DatasetDict({
  "train": Dataset.from_dict({"text": train_texts, "labels": train_labels}),
  "validation": Dataset.from_dict({"text": val_texts, "labels": val_labels})
})

ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(ckpt)

def tok(batch):
  x = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
  x["labels"] = batch["labels"]
  return x

tok_ds = ds.map(tok, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
  ckpt, num_labels=3, problem_type="multi_label_classification"
)

# Metrics for multi-label: threshold sigmoid outputs at 0.5

def compute_metrics(p):
  logits, labels = p
  probs = 1 / (1 + np.exp(-logits))
  preds = (probs >= 0.5).astype(int)
  micro = f1_score(labels, preds, average="micro", zero_division=0)
  macro = f1_score(labels, preds, average="macro", zero_division=0)
  return {"micro_f1": micro, "macro_f1": macro}

args = TrainingArguments(
  output_dir="out-toxicity",
  learning_rate=2e-5,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=32,
  num_train_epochs=3,
  evaluation_strategy="epoch",
  save_strategy="epoch",
  load_best_model_at_end=True,
  metric_for_best_model="micro_f1",
)

trainer = Trainer(
  model=model,
  args=args,
  train_dataset=tok_ds["train"],
  eval_dataset=tok_ds["validation"],
  tokenizer=tokenizer,
  compute_metrics=compute_metrics,
)

trainer.train()

Tip: Tune the decision threshold per label (e.g., 0.3–0.6) to optimize F1 on validation.
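A minimal sketch of that per-label search, assuming you already have validation logits and a binary label matrix as NumPy arrays (for example from trainer.predict on the validation split):

import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(logits, labels, grid=np.arange(0.30, 0.65, 0.05)):
  """For each label independently, pick the threshold that maximizes F1 on validation."""
  probs = 1 / (1 + np.exp(-logits))  # sigmoid
  best = []
  for j in range(labels.shape[1]):  # one threshold per label column
    scores = [f1_score(labels[:, j], (probs[:, j] >= t).astype(int), zero_division=0) for t in grid]
    best.append(float(grid[int(np.argmax(scores))]))
  return best

# Example usage:
# pred = trainer.predict(tok_ds["validation"])
# thresholds = tune_thresholds(pred.predictions, np.array(val_labels))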

Hyperparameters that matter

  • Learning rate: start small (2e-5 to 5e-5). Too high causes instability; too low slows learning.
  • Epochs: 3–5 is common; monitor early stopping by validation F1.
  • Batch size: 16–32. Use gradient accumulation if GPU memory is limited (see the sketch after this list).
  • Max sequence length: 128–256 usually good. Longer costs more compute and may not help.
  • Warmup ratio: 0.05–0.1 stabilizes early updates.
  • Weight decay: 0.01 reduces overfitting.
  • Layer freezing: optionally freeze lower layers for tiny datasets; unfreeze later if underfitting.
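A minimal sketch combining several of these knobs; the frozen-module names assume a DistilBERT checkpoint and are illustrative, so adapt them to your base model:

from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Optionally freeze the embeddings and the first few encoder layers (useful on tiny datasets).
for param in model.distilbert.embeddings.parameters():
  param.requires_grad = False
for layer in model.distilbert.transformer.layer[:3]:  # DistilBERT has 6 layers in total
  for param in layer.parameters():
    param.requires_grad = False

args = TrainingArguments(
  output_dir="out-tuned",
  learning_rate=2e-5,
  num_train_epochs=3,
  per_device_train_batch_size=8,
  gradient_accumulation_steps=4,  # effective batch size 32 on a memory-limited GPU
  warmup_ratio=0.1,  # ~10% of steps for learning-rate warmup
  weight_decay=0.01,
)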

Evaluation and error analysis

  • Report: accuracy + F1 (macro-F1 for imbalance).
  • Inspect per-class precision/recall to find blind spots.
  • Build a confusion matrix (multi-class) or per-label PR curves (multi-label); see the sketch after this list.
  • Slice metrics by text length, language, user segment, or source.
  • Manually review 20–50 top false positives/negatives; write rules or augment data based on patterns.
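A minimal sketch of the per-class inspection, assuming a trained trainer and the tokenized validation split from the examples above:

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Logits and gold labels for the validation set.
pred = trainer.predict(tok_ds["validation"])
y_true = pred.label_ids
y_pred = np.argmax(pred.predictions, axis=-1)

# Per-class precision/recall/F1 highlights weak classes.
print(classification_report(y_true, y_pred, digits=3, zero_division=0))

# Rows are true classes, columns are predictions; large off-diagonal cells are common confusions.
print(confusion_matrix(y_true, y_pred))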

Exercises

Complete these hands-on tasks. After each, tick the checklist to self-verify.

  1. Exercise 1 (ex1): Label mapping and tokenization sanity-check
    What to do
    • Create a label map for binary sentiment: {"neg":0, "pos":1}.
    • Tokenize two sentences with max_length=16 and verify attention masks and truncation.
    • Confirm that labels are integers and shapes align with inputs.
  2. Exercise 2 (ex2): Fine-tune a small classifier
    What to do
    • Use a small sample dataset (e.g., 200 rows, 80/20 split).
    • Fine-tune DistilBERT for 1–2 epochs.
    • Track validation F1, and ensure it improves over a random baseline.
  Checklist
  • [ ] I verified tokenization shapes (input_ids and attention_mask sizes match).
  • [ ] I confirmed label mapping is consistent across splits.
  • [ ] I ran training and saw training loss decreasing.
  • [ ] I recorded validation metrics and noted at least one improvement action.

Practical projects

  • Customer feedback router: multi-class intent classifier with 8–12 categories and macro-F1 target.
  • Light content moderation: multi-label classifier for toxic categories with threshold tuning per label.
  • Support prioritization: binary classifier for "urgent" vs "normal" using domain-specific data.

Common mistakes and how to self-check

  • Label leakage: Overlapping texts across train/val. Self-check: hash texts and ensure no duplicates across splits (see the sketch after this list).
  • Wrong problem type: Treating multi-label as multi-class. Self-check: does any example have multiple labels? If yes, use multi-label setup.
  • Poor thresholding (multi-label): Using 0.5 blindly. Self-check: search thresholds to maximize F1 on validation.
  • Too long sequences: Max length 512 for short texts wastes compute. Self-check: plot length distribution; pick P95.
  • Unstable training: LR too high. Self-check: try 5e-5 → 3e-5 → 2e-5 and compare metrics.
  • Ignoring class imbalance: Self-check: compute class frequencies; consider weighted loss or stratified sampling.
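A minimal sketch of two of these self-checks (cross-split duplicates and the token-length percentile), assuming train_texts/val_texts lists and the tokenizer from the examples above:

import hashlib
import numpy as np

# 1) Leakage check: hash normalized texts and look for overlap across splits.
def text_hashes(texts):
  return {hashlib.md5(t.strip().lower().encode("utf-8")).hexdigest() for t in texts}

overlap = text_hashes(train_texts) & text_hashes(val_texts)
print(f"duplicate texts shared by train and validation: {len(overlap)}")

# 2) Length check: token-length distribution to pick max_length (e.g., the 95th percentile).
lengths = [len(tokenizer.encode(t)) for t in train_texts]
print("p95 token length:", int(np.percentile(lengths, 95)))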

Mini challenge

You are given short app reviews with labels {negative, neutral, positive}. Improve macro-F1 by at least +3 points over a DistilBERT baseline. Consider:

  • Trying RoBERTa-base or domain-specific variants.
  • Adjusting max_length (64 vs 128).
  • Tuning learning rate (2e-5 to 5e-5) and epochs (2–5).
  • Using class weights if neutral is underrepresented.

Next steps

  • Move to more advanced tasks: sequence labeling (NER), QA, and contrastive learning for retrieval.
  • Experiment with parameter-efficient fine-tuning (LoRA, adapters) for faster iteration; see the LoRA sketch after this list.
  • Take the quick test below to check your understanding. Note: anyone can take it; only logged-in users have saved progress.
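As a starting point for parameter-efficient fine-tuning, a minimal LoRA sketch using the peft library (an extra dependency; the target_modules names below match BERT-style encoders and may differ for other architectures):

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

lora_config = LoraConfig(
  task_type=TaskType.SEQ_CLS,  # keeps the classification head trainable
  r=8,  # rank of the low-rank update matrices
  lora_alpha=16,
  lora_dropout=0.1,
  target_modules=["query", "value"],  # attach LoRA to the attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically around 1% of the full model
# Train with the same Trainer setup as in the examples above.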

Practice Exercises

2 exercises to complete

Instructions

Create a label map {"neg":0, "pos":1}. Take two sentences: "I loved it!" (pos), "Not good" (neg). Tokenize with a transformer tokenizer at max_length=16 with truncation and padding. Verify that:

  • input_ids and attention_mask are length 16.
  • Labels are integers [1, 0] matching the two sentences.
  • Special tokens ([CLS], [SEP]) appear as expected for your model family.
Expected Output
Two feature dictionaries with input_ids and attention_mask of length 16 each; labels [1, 0]. No errors during batching.

Fine Tuning For Classification — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
