
Fine-Tuning and Transfer Learning

Learn fine-tuning and transfer learning for free with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Why this matters

As an Applied Scientist, you rarely train models from scratch. You adapt strong pretrained models to your domain under constraints (limited labels, compute, and time). Transfer learning and fine-tuning let you:

  • Ship performant models fast by reusing learned representations.
  • Handle domain shift (e.g., adapting a generic language model to support tickets).
  • Reduce labeling needs with parameter-efficient techniques.
  • Control risk: avoid catastrophic forgetting and preserve general skills.
Real tasks you will face
  • Fine-tune a text encoder to classify support tickets into priority/intent.
  • Adapt an image model for your product catalog with only a few hundred labels.
  • Instruction-tune a small LLM with LoRA to answer domain FAQs safely.

Concept explained simply

Transfer learning = start from a model trained elsewhere; fine-tuning = adapt some or all of its weights to your task.

  • Feature extraction: freeze the backbone; train a small head on top.
  • Partial fine-tuning: unfreeze top N layers after the head stabilizes.
  • Full fine-tuning: update all layers (needs more data/compute).
  • Parameter-efficient fine-tuning (PEFT): add small trainable modules (e.g., adapters, LoRA, prompt/prefix tuning) and keep the base frozen.
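
As a minimal sketch of the first option, feature extraction, here is how it might look in PyTorch with a torchvision backbone (the model choice and class count are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

# Feature extraction: freeze the pretrained backbone, train only a new head.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False  # backbone stays frozen

num_classes = 5  # hypothetical task size
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # new trainable head

# Only the head's parameters receive gradients.
optimizer = torch.optim.AdamW(
    (p for p in backbone.parameters() if p.requires_grad), lr=1e-3
)
```

Partial fine-tuning follows the same pattern: flip requires_grad back to True for the top layers and add them to the optimizer at a smaller learning rate.
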
Mental model

Imagine inheriting a polyglot brain. You first attach a small classifier (head) to check if it can already do your task. If not, you “open up” deeper layers slowly, using tiny learning rates so you don’t erase general knowledge. When compute is tight, you keep the big brain frozen and only train small add-ons (adapters/LoRA).

Choosing a fine-tuning strategy

  • Very small dataset (≤ 1k) and similar domain: feature extraction, train head; maybe unfreeze last 1–4 layers later.
  • Small-to-medium (1k–20k) or modest shift: partial fine-tune or PEFT (adapters/LoRA).
  • Large dataset (≥ 50k) and significant shift: full fine-tuning if compute allows; otherwise PEFT with careful regularization.
  • Tight VRAM/latency constraints: PEFT (LoRA/adapters) for speed and portability.
Heuristics you can apply
  • Start frozen → measure → gradually unfreeze if underfitting.
  • Use discriminative learning rates (smaller for lower layers); see the sketch after this list.
  • Prefer PEFT when you must deploy multiple domain variants or share base weights.
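
A minimal sketch of discriminative learning rates via optimizer parameter groups, assuming PyTorch; the modules here are toy stand-ins for a real backbone and head:

```python
import torch
import torch.nn as nn

# Toy stand-in: any model with a pretrained "backbone" and a fresh "head".
model = nn.ModuleDict({
    "backbone": nn.Linear(768, 768),  # placeholder for a pretrained encoder
    "head": nn.Linear(768, 5),        # new task head (5 classes, hypothetical)
})

# Smaller LR for pretrained weights, larger LR for the randomly initialized head.
optimizer = torch.optim.AdamW(
    [
        {"params": model["backbone"].parameters(), "lr": 2e-5},
        {"params": model["head"].parameters(), "lr": 2e-4},
    ],
    weight_decay=0.01,
)
```
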

Worked examples

Example 1: Text classification with a pretrained encoder

  1. Base: a pretrained transformer encoder.
  2. Data: 3k labeled support tickets, 5 classes, imbalanced.
  3. Plan: freeze backbone, train classification head. If performance plateaus, unfreeze the last 2–4 layers.
  4. Hyperparams: LR head 2e-4; LR backbone 2e-5 after unfreezing; warmup 5%; weight decay 0.01; batch 16; max epochs 5–10 with early stopping on macro F1.
  5. Imbalance: class-weighted loss or weighted sampler; macro F1 + per-class recall.
  6. Validation: stratified 5-fold CV to stabilize metrics.
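
One way the plan above could be wired up, assuming the Hugging Face transformers library; the checkpoint name, class weights, and the DistilBERT-specific layer path are illustrative assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"  # placeholder encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

# Stage 1: freeze the backbone and train only the classification head.
for param in model.base_model.parameters():
    param.requires_grad = False

# Class-weighted loss for the imbalanced tickets (weights are illustrative;
# derive them from inverse class frequencies in practice).
class_weights = torch.tensor([0.5, 1.0, 1.5, 2.0, 2.5])
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4, weight_decay=0.01
)

# Stage 2 (only if the head plateaus): unfreeze the last two encoder layers
# and give them a much smaller learning rate.
top_layers = model.base_model.transformer.layer[-2:]  # DistilBERT-specific path
for param in top_layers.parameters():
    param.requires_grad = True
optimizer.add_param_group({"params": top_layers.parameters(), "lr": 2e-5})
```
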
Expected outcome

Strong macro F1 quickly, with minimal overfitting if learning rates stay small and unfreezing is gradual.

Example 2: Image classification for a niche catalog

  1. Base: pretrained CNN/ViT.
  2. Data: 800 labeled images, 12 classes.
  3. Plan: feature extraction first (freeze all), train a linear head. If it underfits, unfreeze the last block with LR 1e-5, keeping the head at LR 1e-3.
  4. Augmentation: color jitter, random crop, mixup/cutmix to combat overfitting.
  5. Schedule: cosine LR with warmup; early stopping on macro F1/top-1.
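
A sketch of the two-stage recipe with torchvision (ResNet-50 chosen purely for illustration; a ViT works the same way with its own layer names):

```python
import torch
import torch.nn as nn
from torchvision import models

# Stage 1: feature extraction on a pretrained backbone.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 12)  # 12 catalog classes

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3, weight_decay=0.01)

# Stage 2 (only if underfitting): unfreeze the last residual block at a tiny LR.
for param in model.layer4.parameters():
    param.requires_grad = True
optimizer.add_param_group({"params": model.layer4.parameters(), "lr": 1e-5})
```
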
Expected outcome

Good accuracy with few labels; unfreezing last block yields +2–5% top-1 in many cases.

Example 3: LoRA instruction-tuning a small LLM

  1. Base: 7B parameter LLM; target: domain QA responses.
  2. Data: 50k instruction-response pairs + 2k safety refusals.
  3. Plan: LoRA on attention projections; rank r=8–16; alpha 16–32; dropout 0.05–0.1.
  4. Training: context length set to cover typical prompts; LR 1e-4 to 2e-4 for LoRA params; warmup 2–5%; epochs 1–3; gradient checkpointing and mixed precision to fit memory.
  5. Evaluation: exact match/ROUGE on held-out; refusal rate on unsafe prompts; human spot checks.
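
Using the Hugging Face peft library (one common LoRA implementation; the checkpoint and module names below are Llama-style placeholders), steps 1–3 might look like:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder 7B checkpoint; swap in whatever base model you actually use.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # low-rank dimension (plan says 8-16)
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # base stays frozen; only LoRA weights train
```
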
Expected outcome

Domain-aligned answers with small trainable footprint; base model remains reusable.

Hyperparameters that matter

  • Learning rate: tiny for pretrained layers (e.g., 1e-5 to 5e-5 for encoders), larger for new heads (1e-4 to 1e-3).
  • Warmup: 2–10% of total steps to avoid sudden large updates.
  • Weight decay: 0.01 typical; tune for overfitting control.
  • Batch size: scale LR with effective batch; use gradient accumulation if needed.
  • Scheduler: cosine or linear decay are common, stable choices (warmup wiring shown in the sketch below).
  • Freezing schedule: head-only → unfreeze top N layers when validation underfits.
  • Regularization: dropout, data augmentation, label smoothing, mixup/cutmix (vision).
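
As a small sketch of how warmup and a decay schedule fit together, assuming the scheduler helper from transformers (the optimizer and step counts are placeholders):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder parameters just to show the wiring.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=2e-5, weight_decay=0.01)

num_training_steps = 10_000                        # hypothetical total steps
num_warmup_steps = int(0.05 * num_training_steps)  # 5% warmup, from the range above

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)
# Call scheduler.step() once per optimizer step inside the training loop.
```
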

Data preparation essentials

  • Splits: stratified train/val/test; remove near-duplicates so they do not straddle splits.
  • Text: minimal cleaning; keep domain terms; ensure tokenizer fits domain (extend vocab only if necessary).
  • Vision: consistent resolution, normalization; augmentations appropriate to task.
  • Leakage check: no test-derived thresholds; compute normalization on train only.
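
A minimal leakage-safe split-and-normalize sketch, assuming scikit-learn and synthetic placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 16)            # placeholder features
y = np.random.randint(0, 5, size=1000)  # placeholder labels, 5 classes

# Stratified split preserves class proportions; the test set stays untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit normalization statistics on the training split only, then apply everywhere.
mean, std = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std  # no test statistics leak into preprocessing
```
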
Class imbalance checklist
  • Use class weights or resampling.
  • Track macro F1 and per-class recall.
  • Consider focal loss for extreme imbalance.
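
Both loss weighting and resampling in one short PyTorch sketch (labels are synthetic placeholders):

```python
import numpy as np
import torch

y_train = np.random.randint(0, 5, size=3000)  # placeholder labels

# Inverse-frequency class weights: rarer classes get larger weights.
counts = np.maximum(np.bincount(y_train, minlength=5), 1)
weights = counts.sum() / (len(counts) * counts)
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))

# Alternative: oversample minority classes with a weighted sampler.
sample_weights = torch.tensor(weights[y_train], dtype=torch.double)
sampler = torch.utils.data.WeightedRandomSampler(
    sample_weights, num_samples=len(y_train), replacement=True
)
```
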

Evaluation and monitoring

  • Baselines: zero-shot or head-only baseline before deeper tuning.
  • Metrics: macro F1, AUROC for imbalance; calibration (ECE) if needed.
  • Validation: k-fold for small data; keep a final untouched test set.
  • Reproducibility: fix seeds; log versions, hyperparams, and metrics.
  • Drift: monitor post-deployment; set up periodic re-eval and data freshness checks.
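
For the imbalance-aware metrics above, a quick scikit-learn sketch (predictions are synthetic placeholders):

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

# Placeholder predictions for a 5-class problem.
y_true = np.random.randint(0, 5, size=500)
y_pred = np.random.randint(0, 5, size=500)

macro_f1 = f1_score(y_true, y_pred, average="macro")           # all classes weighted equally
per_class_recall = recall_score(y_true, y_pred, average=None)  # one value per class
print(f"macro F1: {macro_f1:.3f}; per-class recall: {per_class_recall}")
```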

Practical projects

  • Project 1: Fine-tune a text encoder for multi-label tagging of support tickets; deliver a confusion matrix and error analysis report.
  • Project 2: Adapt a vision backbone to classify product categories with ≤ 1k images using partial unfreezing and mixup; compare head-only vs partial FT.
  • Project 3: LoRA-tune a small LLM to answer internal FAQ; evaluate with a small human-rated rubric for helpfulness and safety.

Exercises

These exercises are also available in the interactive panel. Your progress saves if you are logged in; otherwise, you can still complete them for practice.

Exercise 1

You have 3k labeled support tickets across 5 classes with moderate imbalance. Compute is limited. Propose a fine-tuning plan: model choice, freezing schedule, learning rates, regularization, and evaluation.

  • Checklist:
    • Head-only start and when to unfreeze
    • Discriminative LRs
    • Imbalance handling
    • Early stopping criteria
    • Metrics and validation strategy
Hint

Start simple with a frozen backbone, then unfreeze gradually if underfitting.

Expected outcome (high level)

A concise plan that justifies head-first, small LR for backbone, macro F1 focus, and stratified CV.

Exercise 2

You must adapt a 7B LLM to domain QA with 50k instruction-response pairs. Memory is tight. Describe a PEFT approach (modules, ranks, hyperparams), safety evaluation, and deployment considerations.

  • Checklist:
    • Which layers to target
    • LoRA rank/alpha/dropout
    • LR, warmup, epochs
    • Safety/guardrail evaluation
    • Merging or on-the-fly adapters
Hint

Prefer LoRA on attention projections; keep base frozen.

Expected outcome (high level)

A LoRA plan with r around 8–16, LR around 1e-4 to 2e-4 for the adapters, and clear safety checks.

Common mistakes and self-check

  • Too high LR on pretrained layers → catastrophic forgetting. Self-check: compare zero-shot vs fine-tuned performance; large drop on unrelated tasks is a red flag.
  • Unfreezing too much too soon → overfitting. Self-check: widening train–val gap early.
  • Ignoring imbalance → poor minority recall. Self-check: per-class metrics.
  • Data leakage from deduplicating or normalizing on the full dataset. Self-check: re-run splits after dedup; fit scalers on train only.
  • Tokenizer or image preprocessing mismatch with base model. Self-check: use the base model’s exact preprocessing pipeline.
  • No baseline. Self-check: always log head-only results before deeper tuning.

Mini challenge

Pick a domain (e.g., legal Q&A or rare-species image classification) and draft a one-page fine-tuning plan that includes:

  • Base model and why
  • Data split and leakage checks
  • Chosen strategy (feature extraction, partial FT, or PEFT)
  • LRs, scheduler, warmup
  • Regularization and augmentation
  • Metrics and validation
  • Safety or fairness checks (if applicable)
  • Deployment plan (latency/memory)
  • Rollback criteria
  • Monitoring signals post-launch

Who this is for

  • Applied Scientists and ML Engineers shipping models under real-world constraints.
  • Data Scientists moving from training from scratch to adapting foundation models.

Prerequisites

  • Comfort with supervised learning and overfitting/underfitting concepts.
  • Basic understanding of deep learning architectures (transformers/CNNs).
  • Familiarity with metrics like precision/recall/F1 and cross-validation.

Learning path

  • Start: review overfitting, regularization, and evaluation basics.
  • Learn transfer types: feature extraction → partial FT → full FT → PEFT.
  • Practice: run a head-only baseline; then unfreeze gradually.
  • Advance: apply LoRA/adapters for text; mixup/cutmix for vision.
  • Polish: monitoring, drift, and safe deployment.

Next steps

  • Turn one worked example into a reproducible project with clear metrics.
  • Prepare an ablation: head-only vs partial FT vs PEFT; document trade-offs.
  • Take the quick test below to lock in the concepts. Anyone can take it; only logged-in users get saved progress.

Fine-Tuning and Transfer Learning — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

