
Distillation Basics

Learn Distillation Basics for free with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Why this matters

Knowledge distillation lets you compress large, accurate models (teachers) into smaller, faster models (students) while keeping most of the performance. As an Applied Scientist, this is critical when you need to meet latency or memory budgets, ship on-device models, reduce inference costs, or scale serving throughput.

Real tasks you will face
  • Meeting a 20–30 ms latency SLO for recommendation or ranking.
  • Deploying an on-device NLP classifier under 10 MB while keeping >95% of teacher accuracy.
  • Reducing cloud GPU serving costs by replacing teachers with distilled students.
  • Creating fast A/B test variants to validate product hypotheses quickly.

Who this is for

  • Applied Scientists and ML Engineers shipping models to production.
  • Researchers needing compact baselines for ablations.
  • Data Scientists exploring efficiency techniques beyond pruning and quantization.

Prerequisites

  • Comfort with supervised learning and cross-entropy loss.
  • Basic understanding of softmax, logits, and KL divergence.
  • Ability to train a baseline model and evaluate metrics (accuracy, latency, memory).

Learning path

  1. Understand teacher–student setup and temperature scaling.
  2. Learn the standard KD loss and when to mix with hard labels.
  3. Practice on a small image or text task; measure speed/accuracy trade-offs.
  4. Explore feature and intermediate-layer distillation.
  5. Combine with quantization/pruning for extra gains.

Concept explained simply

Distillation trains a small student to imitate a big teacher. Instead of only using the one-hot label ("cat"), the student also learns from the teacher’s softened probability distribution over classes (e.g., cat 0.62, dog 0.25, fox 0.13). These “soft targets” carry the teacher’s “dark knowledge”: how similar it considers the wrong classes to be.
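
To see what softening does, apply softmax at a couple of temperatures to the same logits. The sketch below uses made-up three-class logits chosen so that T = 1 roughly reproduces the cat/dog/fox example above; the values are illustrative assumptions, not outputs of any real teacher.

```python
# Minimal sketch: softmax with temperature on hypothetical (cat, dog, fox) logits.
import math

def softmax(logits, T=1.0):
    """Standard softmax over logits / T; T > 1 flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [3.0, 2.1, 1.4]  # assumed logits for (cat, dog, fox)

print(softmax(teacher_logits, T=1.0))  # ~[0.62, 0.25, 0.13]: peaked
print(softmax(teacher_logits, T=4.0))  # ~[0.41, 0.32, 0.27]: softened, similarities visible
```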

Mental model

Think of the teacher as a knowledgeable mentor who not only gives the right answers but also explains how close other answers are. The student learns faster and generalizes better by studying both the final answer and the mentor’s reasoning.

Core ideas and the standard loss

  • Teacher–student: Train student parameters to mimic teacher behavior.
  • Temperature T: Use softmax(logits / T) with T > 1 to soften distributions.
  • Loss (common choice): L = α * CE(y, student) + (1 − α) * T^2 * KL(softmax(z_t/T) || softmax(z_s/T))
  • Why T^2: Keeps gradient magnitudes comparable across temperatures.
  • Variants: Feature distillation (match hidden layers), data-free distillation (synthetic data), multi-teacher distillation (average or ensemble behavior).
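
As a concrete reference, here is a minimal PyTorch sketch of the standard loss above. The default T = 2 and α = 0.3 simply mirror the worked example below; treat them as assumptions to tune, not recommendations.

```python
# Minimal PyTorch sketch of L = alpha * CE + (1 - alpha) * T^2 * KL(p_t || p_s).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.3):
    # Hard-label cross-entropy at T = 1.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target term: KL between temperature-softened teacher and student.
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    kl = F.kl_div(log_p_s, p_t, reduction="batchmean")  # KL(p_t || p_s)
    return alpha * ce + (1.0 - alpha) * (T ** 2) * kl

# Illustrative usage with random logits: batch of 4 examples, 3 classes.
z_s, z_t = torch.randn(4, 3), torch.randn(4, 3)
y = torch.tensor([0, 2, 1, 0])
print(kd_loss(z_s, z_t, y))
```
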
When to use which variant
  • Logit distillation: Default for classification when you have labeled data.
  • Feature distillation: Helpful when architectures differ a lot, or you have limited labels.
  • Data-free: When original training data is unavailable; requires synthetic data generation.

Worked examples

Example 1: Toy classification KD step

Teacher logits z_t = [2.0, 0.5, -1.0], Student logits z_s = [1.0, 0.2, -0.5], True class = 0, T = 2, α = 0.3.

  • Soft teacher p_t(T=2) ≈ [0.590, 0.279, 0.132]
  • Soft student p_s(T=2) ≈ [0.467, 0.313, 0.221]
  • Hard student p_s(T=1) ≈ [0.598, 0.269, 0.134]; CE_hard ≈ 0.514
  • KL(p_t || p_s) ≈ 0.038; T^2 * KL ≈ 0.151
  • Total L ≈ 0.3*0.514 + 0.7*0.151 ≈ 0.260
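
If you want to sanity-check these numbers, the plain-Python script below reproduces Example 1 using natural logarithms; the printed values should match the ones above to roughly three decimals.

```python
# Reproduce Worked Example 1: soft targets, hard CE, KL, and the total KD loss.
import math

def softmax(z, T=1.0):
    e = [math.exp(v / T) for v in z]
    s = sum(e)
    return [x / s for x in e]

z_t, z_s, y, T, alpha = [2.0, 0.5, -1.0], [1.0, 0.2, -0.5], 0, 2.0, 0.3

p_t = softmax(z_t, T)                               # ~[0.590, 0.279, 0.132]
p_s = softmax(z_s, T)                               # ~[0.467, 0.313, 0.221]
ce_hard = -math.log(softmax(z_s, 1.0)[y])           # ~0.514
kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))  # ~0.038
total = alpha * ce_hard + (1 - alpha) * T**2 * kl   # ~0.260
print(p_t, p_s, ce_hard, T**2 * kl, total)
```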

Example 2: Compressing a vision model

Goal: Replace ResNet-50 teacher with MobileNetV3 student.

  • Target: 2× faster inference, <1% top-1 drop.
  • Plan: α=0.5, T in {2, 4}, mix strong augmentations; early stop on best accuracy–latency Pareto.
  • Outcome: Student is 2.2× faster, −0.8% top-1, memory −60%.

Example 3: On-device intent classification

Teacher: Large transformer classifier. Student: 6-layer distilled model.

  • Constraint: On-device CPU, <15 ms, <8 MB.
  • Plan: α=0.3, T=3, intermediate layer matching (2 points), quantization-aware training.
  • Outcome: 94.5% of teacher accuracy, meets latency and size.
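
Intermediate-layer matching, as used in this example's plan, is typically implemented as an auxiliary loss that pulls student hidden states toward the teacher's. The sketch below is a hedged illustration: the hidden sizes (768 and 256), the single linear projection, and the weight beta are assumptions, not the exact recipe used here.

```python
# Hedged sketch of one intermediate-layer matching term (feature distillation).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_dim, student_dim, beta = 768, 256, 1.0      # assumed sizes and weight
proj = nn.Linear(student_dim, teacher_dim)          # trained jointly with the student

def feature_match_loss(student_hidden, teacher_hidden):
    """MSE between projected student features and detached teacher features."""
    return F.mse_loss(proj(student_hidden), teacher_hidden.detach())

# Illustrative hidden states from one matched layer pair: batch 8, sequence 16.
h_s = torch.randn(8, 16, student_dim)
h_t = torch.randn(8, 16, teacher_dim)
aux_loss = beta * feature_match_loss(h_s, h_t)      # add to the logit KD loss
print(aux_loss.item())
```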

Implementation playbook (step cards)

Step 1 — Prepare

  • Choose teacher and define constraints (latency, memory, cost).
  • Pick student architecture sized to target device.
  • Decide metrics: accuracy/F1, latency p95, memory.

Step 2 — Configure KD

  • Start with α in [0.2, 0.7] and T in [2, 5].
  • Turn label smoothing off initially; re-introduce it later if needed.
  • Add early stopping on a validation Pareto score (e.g., accuracy − λ·latency).
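
A hedged sketch of the Pareto-style stopping rule from the last bullet is below; the λ value, patience, and per-epoch metrics are assumptions for illustration.

```python
# Hedged sketch: early stopping on score = accuracy - lambda * p95 latency (ms).
def pareto_score(accuracy, latency_ms, lam=0.002):
    """Higher is better: trades accuracy against p95 latency."""
    return accuracy - lam * latency_ms

best_score, patience, bad_epochs = float("-inf"), 5, 0
for acc, p95_ms in [(0.910, 18.0), (0.915, 21.0), (0.916, 27.0)]:  # per-epoch eval results
    score = pareto_score(acc, p95_ms)
    if score > best_score:
        best_score, bad_epochs = score, 0   # would also save the checkpoint here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                           # no Pareto improvement for `patience` epochs
```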

Step 3 — Train

  • Compute teacher logits offline if possible to save time.
  • Mix hard and soft losses per batch.
  • Log T^2·KL and CE separately to understand training dynamics.
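
Putting Step 3 together, here is a hedged PyTorch sketch of a single training step that consumes precomputed (offline) teacher logits and logs the CE and T^2·KL terms separately; the function name and arguments are illustrative, and `student` and `optimizer` are whatever you already use.

```python
# Hedged sketch of one KD training step with offline teacher logits.
import torch
import torch.nn.functional as F

def kd_train_step(student, optimizer, x, y, teacher_logits, T=2.0, alpha=0.3):
    student_logits = student(x)
    ce = F.cross_entropy(student_logits, y)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),   # precomputed, no teacher forward pass
        reduction="batchmean",
    )
    soft = (T ** 2) * kl
    loss = alpha * ce + (1.0 - alpha) * soft
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Log both terms separately to see which one dominates training.
    return {"ce": ce.item(), "t2_kl": soft.item(), "loss": loss.item()}
```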

Step 4 — Evaluate & Iterate

  • Compare to teacher on the same eval set.
  • Profile latency on target hardware, not just dev machine.
  • Tune α, T, and regularization; consider feature distillation if plateauing.
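
For the latency part of Step 4, a minimal sketch of p95 measurement is below. The warmup and run counts are assumptions, and in practice you would run this against the exported model in the target runtime on the target device, not the training setup.

```python
# Minimal sketch: p95 latency (ms) of a model on a fixed example input.
import statistics
import time
import torch

def p95_latency_ms(model, example_input, warmup=20, runs=200):
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):                  # warm up caches and kernels
            model(example_input)
        times_ms = []
        for _ in range(runs):
            t0 = time.perf_counter()
            model(example_input)
            times_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.quantiles(times_ms, n=100)[94]  # 95th percentile
```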

Exercises

Exercise 1 — Compute KD on a toy example

Replicate the values in Worked Example 1. Then change T from 2 to 4 and note how the KL term changes.

  • Inputs: z_t = [2.0, 0.5, -1.0], z_s = [1.0, 0.2, -0.5], y=0, α=0.3, T ∈ {2,4}.
  • Deliverables: p_t(T), p_s(T), CE_hard, T^2·KL, total L for both T values.
Hints
  • Softmax with temperature: softmax(z/T).
  • KL(p||q)=Σ p log(p/q). Use natural logs for consistency.
  • Don’t forget the T^2 factor.

Exercise 2 — Design a KD plan for on-device

Create a one-page plan to distill a sentiment classifier for mobile.

  • Constraints: <12 ms CPU, <6 MB, accuracy drop ≤1%.
  • Include: student choice, α, T, any feature-matching, data augmentation, stopping criteria, eval metrics.
Hints
  • Start with T in [2–4] and α around 0.3–0.5.
  • Quantization-aware training pairs well with KD.
  • Track accuracy and p95 latency together.

Exercise checklist

  • [ ] I computed both soft and hard distributions correctly.
  • [ ] I included T^2 in the KD loss.
  • [ ] My plan lists constraints, metrics, and tuning ranges.
  • [ ] I defined a clear stopping rule and success criteria.

Common mistakes and self-check

  • Forgetting T^2 scaling — check your logs for vanishing KL gradients at high T.
  • Only using soft labels — combine with CE to ground the student.
  • Mismatched evaluation — profile latency on the real device or target VM.
  • Over-regularizing — if both CE and KL stall high, reduce dropout/weight decay.
  • Ignoring distribution shift — validate on production-like traffic.
Self-check prompts
  • Does the student beat a non-distilled baseline of the same size?
  • Do α and T sweeps show a stable optimum?
  • Is accuracy–latency trade-off clearly measured and documented?

Practical projects

  1. Image classification KD: ResNet teacher → MobileNet student. Goal: <1% top-1 drop, 2× faster. Include T/α sweep and Pareto plot.
  2. NLP intent KD: Large transformer → 6-layer student with 2 intermediate matches. Add quantization-aware training; measure CPU latency.
  3. Tabular ranking KD: GBDT teacher → shallow neural student via soft targets. Measure NDCG@k and throughput.

Mini challenge

Compress any teacher to a student that achieves at least 90% of teacher accuracy while halving latency. Provide: metrics table, confusion matrices, and a 3-sentence write-up of what α and T worked best and why.

Next steps

  • Explore feature and layer-wise distillation to close the final accuracy gap.
  • Combine KD with quantization/pruning to meet tight memory budgets.
  • Automate α and T tuning with small grid or Bayesian search.
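
As a starting point for the last bullet, here is a hedged sketch of a small grid sweep over α and T; `train_and_eval` is a hypothetical callback that trains a student with the given settings and returns (accuracy, p95 latency in ms).

```python
# Hedged sketch: grid sweep over (alpha, T), ranked by a Pareto-style score.
from itertools import product

def sweep(train_and_eval, alphas=(0.2, 0.3, 0.5, 0.7), temps=(2, 3, 4, 5), lam=0.002):
    results = []
    for alpha, T in product(alphas, temps):
        acc, p95_ms = train_and_eval(alpha=alpha, T=T)   # hypothetical callback
        results.append({"alpha": alpha, "T": T, "acc": acc, "p95_ms": p95_ms,
                        "score": acc - lam * p95_ms})
    return max(results, key=lambda r: r["score"])        # best trade-off found
```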


Practice Exercises

2 exercises to complete

Instructions

Given: Teacher logits z_t = [2.0, 0.5, -1.0], Student logits z_s = [1.0, 0.2, -0.5], true class y=0, α=0.3, T=2.

  1. Compute p_t = softmax(z_t/T) and p_s = softmax(z_s/T).
  2. Compute student hard probabilities at T=1 and CE(y, p_s(T=1)).
  3. Compute KL(p_t || p_s) and then T^2·KL.
  4. Compute total loss L = α·CE + (1−α)·T^2·KL.
  5. Repeat steps 1–4 for T=4 and compare the KL term.
Expected Output
Numeric approximations for p_t, p_s, CE_hard, T^2·KL, and total L for both T=2 and T=4, with brief notes on how the KL term changes with T.

Distillation Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

