Who this is for
- NLP engineers choosing a transformer base model before fine-tuning.
- ML practitioners balancing quality vs. latency, memory, and licensing.
- Students who want a simple, repeatable decision framework.
Prerequisites
- Basic understanding of transformer architectures (encoder, decoder, encoder-decoder).
- Familiarity with common NLP tasks (classification, NER, QA, summarization, translation, generation).
- Ability to run and fine-tune models on GPU/accelerator or CPU.
Learning path
- Learn task-architecture mapping.
- Quantify constraints (latency, memory, context length, languages, license, data size).
- Pick candidate families and sizes.
- Run a tiny bake-off with a fixed evaluation set.
- Decide and document trade-offs.
Why this matters
Choosing the right base model saves weeks of trial and error and reduces cloud costs. Typical real-world tasks include:
- Shipping a high-precision product-review classifier under 50 ms latency.
- Summarizing long customer tickets within memory limits.
- Extracting entities from medical or legal text with domain constraints.
- Serving multilingual chat experiences with limited GPUs.
Concept explained simply
Pick the base model that is naturally suited to your task and that fits your constraints. When the base model matches your task shape, fine-tuning becomes easier, cheaper, and more robust.
Mental model
- Encoder-only (BERT-like): Best for understanding tasks (classification, NER, sentence/paragraph-level tasks). Usually faster and smaller.
- Encoder-decoder (T5/BART-like): Best for text-to-text tasks (summarization, translation, style transfer). Balanced for conditional generation.
- Decoder-only (GPT-like): Best for open-ended generation (chat, drafting) and long-form reasoning. Often heavier for simple classification.
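To make the mapping concrete, here is a minimal sketch of how the three families correspond to common model classes in the Hugging Face transformers library (an assumption; any comparable toolkit works). The checkpoints named below are illustrative examples, not recommendations.

```python
# Illustrative mapping of the three families to Hugging Face `transformers`
# model classes (library choice and checkpoint names are assumptions).
from transformers import (
    AutoModelForSequenceClassification,  # encoder-only: understanding tasks
    AutoModelForSeq2SeqLM,               # encoder-decoder: text-to-text tasks
    AutoModelForCausalLM,                # decoder-only: open-ended generation
)

# Encoder-only (BERT-like): classification, NER, sentence-level tasks.
encoder = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

# Encoder-decoder (T5/BART-like): summarization, translation, style transfer.
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Decoder-only (GPT-like): chat, drafting, long-form generation.
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
```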
Practical decision framework (9 criteria)
- Task shape: understanding vs. conditional generation vs. open generation.
- Latency target: p50/p95 in ms; smaller encoders excel at low-latency classification.
- Memory/compute: available GPU/CPU, precision (fp16, int8), batch size.
- Context length: max tokens needed end-to-end, including prompt and output.
- Languages: mono vs. multilingual; need for specific scripts.
- Domain: general vs. biomedical, legal, finance; prefer models pre-trained in-domain if available.
- Data size: small fine-tune sets favor models closer to your task; large sets can adapt broader models.
- Licensing: commercial vs. research use; model and training data licenses must match your use.
- Serving mode: batch offline vs. real-time interactive.
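One hypothetical way to keep these nine criteria in front of you during selection is to record them as a structured checklist and fill it in once per project. The field names and example values below are illustrative, not a formal schema.

```python
# Hypothetical checklist capturing the nine criteria; field names and example
# values are illustrative only.
from dataclasses import dataclass

@dataclass
class ModelRequirements:
    task_shape: str          # "understanding" | "conditional_generation" | "open_generation"
    p95_latency_ms: int      # latency target at p95
    gpu_memory_gb: float     # available accelerator memory
    precision: str           # "fp16", "int8", ...
    max_context_tokens: int  # prompt plus output, end to end
    languages: list          # e.g. ["en", "de", "es"]
    domain: str              # "general", "biomedical", "legal", ...
    train_examples: int      # size of the fine-tuning set
    license_use: str         # "commercial" or "research"
    serving_mode: str        # "batch" or "realtime"

reqs = ModelRequirements(
    task_shape="understanding",
    p95_latency_ms=80,
    gpu_memory_gb=16.0,
    precision="int8",
    max_context_tokens=512,
    languages=["en", "de", "es"],
    domain="general",
    train_examples=30_000,
    license_use="commercial",
    serving_mode="realtime",
)
```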
Licensing quick guide
- Check model license: allows commercial use? Attribution required?
- Check training data terms if provided.
- If the terms are unclear, prefer a model with a clearly permissive commercial license.
Latency cheat sheet (rule-of-thumb)
- Encoder-only small/medium (100M–400M params): tens of ms on a single modern GPU.
- Encoder-decoder base (200M–800M): 50–200 ms for short outputs; depends on output length.
- Decoder-only large (7B+): 100 ms to seconds; strongly depends on generation length.
These are rough guides; measure on your hardware.
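Because the numbers above vary widely with hardware, batching, and sequence length, it is worth measuring p50/p95 yourself. Below is a minimal timing sketch, assuming PyTorch, NumPy, and the transformers library are installed; the checkpoint, input text, and batch size are placeholders.

```python
# Minimal latency measurement sketch (assumes torch, numpy, and transformers).
# The checkpoint, input text, and batch size are placeholders.
import time
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased"  # placeholder candidate
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

batch = tok(["a sample product review"] * 8, padding=True, return_tensors="pt").to(device)

latencies = []
with torch.no_grad():
    for _ in range(60):
        start = time.perf_counter()
        model(**batch)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for GPU work before stopping the clock
        latencies.append((time.perf_counter() - start) * 1000)

latencies = latencies[10:]  # discard warm-up iterations
print(f"p50 {np.percentile(latencies, 50):.1f} ms, p95 {np.percentile(latencies, 95):.1f} ms")
```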
Worked examples
Example 1: Multilingual sentiment for e-commerce
- Task: Product review classification (10 languages), p95 < 80 ms.
- Constraints: 1 GPU at inference, small fine-tune dataset, commercial use.
- Choice: Encoder-only multilingual model, small/medium (150M–300M).
- Why: Understanding task, multilingual need, low latency, and small data all favor encoders.
- Context: Short (reviews), so long context not required.
- License: Select permissive commercial license.
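A quick, cheap check for this example is whether a candidate multilingual tokenizer covers all target languages reasonably; unusually long token sequences or many unknown tokens for a script are warning signs. A minimal sketch, assuming the transformers library; the checkpoint is one example of a small/medium multilingual encoder.

```python
# Tokenizer coverage check for a multilingual candidate (checkpoint is an
# illustrative example, not a recommendation).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

samples = {
    "en": "Great phone, the battery lasts two days.",
    "de": "Tolles Handy, der Akku hält zwei Tage.",
    "ja": "素晴らしい携帯電話、バッテリーは2日間持ちます。",
}

for lang, text in samples.items():
    ids = tokenizer(text)["input_ids"]
    unk = tokenizer.unk_token_id
    n_unk = ids.count(unk) if unk is not None else 0
    # Very long sequences or many unknown tokens suggest poor script coverage.
    print(f"{lang}: {len(ids)} tokens, {n_unk} unknown")
```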
Example 2: Legal clause classification
- Task: Multi-label classification of contract clauses; high precision needed.
- Constraints: CPU inference acceptable, batch offline processing, domain-specific.
- Choice: Encoder-only legal-domain model if available; otherwise a general encoder fine-tuned with domain data. Size 300M–400M.
- Why: Understanding-focused task in a specialist domain, which suits encoders; CPU batch processing is a good fit at this size.
- Context: Medium length; sliding window if needed.
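For the sliding-window idea, the tokenizer can emit overlapping windows directly. A minimal sketch, assuming the transformers library; the checkpoint, window length, and stride are illustrative.

```python
# Split a long contract into overlapping 512-token windows so an encoder with
# a fixed input limit can score every clause (window size and stride are
# illustrative).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_contract = "..."  # full contract text goes here

windows = tokenizer(
    long_contract,
    max_length=512,
    truncation=True,
    return_overflowing_tokens=True,  # emit extra windows instead of dropping text
    stride=64,                       # 64-token overlap between consecutive windows
)
print(f"{len(windows['input_ids'])} windows of up to 512 tokens")
```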
Example 3: Ticket summarization with long context
- Task: Summarize support conversations up to 12k tokens.
- Constraints: Real-time preview < 300 ms for short tickets, up to 1 s for long ones.
- Choice: Encoder-decoder with extended context window; consider long-sequence variants. Medium size (500M–1B) depending on hardware.
- Why: Conditional generation fits encoder-decoder; long context required.
- Serving: Two-tier strategy: small model for short inputs; larger for long ones.
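The two-tier strategy can be as simple as counting tokens and routing. The sketch below is hypothetical: the small checkpoint, the 2,000-token threshold, and the omitted long-context model are all assumptions to replace with your own choices.

```python
# Hypothetical two-tier router: short tickets go to a small summarizer, long
# ones would go to a long-context variant (not loaded in this sketch).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")            # placeholder
small_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # fast path
# long_model = ...  # a long-context encoder-decoder would be loaded here

def summarize(ticket: str, threshold: int = 2000) -> str:
    ids = tokenizer("summarize: " + ticket, return_tensors="pt").input_ids
    if ids.shape[1] <= threshold:
        out = small_model.generate(ids, max_new_tokens=80)
    else:
        # Route to the larger long-context model (omitted for brevity).
        raise NotImplementedError("long-context path not shown in this sketch")
    return tokenizer.decode(out[0], skip_special_tokens=True)
```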
Example 4: Chat-style drafting assistant
- Task: Open-ended drafting with tone control; no tool use required.
- Constraints: 1–2 GPUs, target < 800 ms for short replies.
- Choice: Decoder-only instruction-tuned base; 3B–7B for balanced quality/latency.
- Why: Open-ended generation and dialogue quality are strengths of decoder-only models.
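Since latency for decoder-only models is dominated by the number of tokens generated, capping generation length is the main lever for hitting the < 800 ms target. A minimal sketch, assuming torch and transformers; the checkpoint is a small placeholder rather than a 3B–7B instruction-tuned base.

```python
# Cap generation length to keep reply latency predictable (checkpoint is a
# small placeholder, not a production instruction-tuned model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; swap in your instruction-tuned decoder-only base
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = "Draft a short, friendly reply confirming the meeting on Tuesday:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=60,                    # latency scales with tokens generated
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,  # gpt2 has no pad token by default
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```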
Sizing and memory tips
- Parameter memory (rough): fp16 ≈ 2 bytes/param; int8 ≈ 1 byte/param. Activations add 30–100% depending on sequence length and batch size.
- If memory-bound: try int8/4-bit quantization, gradient checkpointing (for training), or smaller variants.
- For throughput: favor encoders for classification; for generation, limit max new tokens.
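To make the rule of thumb concrete, the short calculation below estimates parameter memory for a few sizes; treat the outputs as order-of-magnitude guides only.

```python
# Rough parameter-memory estimates from the rule of thumb above (order of
# magnitude only; activations add roughly 30-100% on top).
def param_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

for n in (350e6, 1e9, 7e9):
    fp16 = param_memory_gb(n, 2)  # fp16 ~ 2 bytes per parameter
    int8 = param_memory_gb(n, 1)  # int8 ~ 1 byte per parameter
    print(f"{n / 1e9:.2f}B params: ~{fp16:.1f} GB fp16, ~{int8:.1f} GB int8")
```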
Exercises
Do these before the quick test. Everyone can take the test; only logged-in users have their progress saved.
Exercise 1: Map a real scenario to a base model
Scenario: You need to detect hate speech in short social posts across 5 languages. Hard latency p95 < 60 ms, single mid-range GPU for serving, small labeled dataset (30k examples), commercial product, explainability helpful.
- Choose an architecture family.
- Pick an approximate parameter range.
- State required context length.
- Specify multilingual/tokenizer considerations.
- Note licensing requirements.
- Give a 3–4 sentence rationale.
Write your answer as a short design note.
Exercise 2: Latency and memory planning
Target: Classification on 256-token inputs, batch size 8, p95 < 70 ms on a 16 GB GPU.
- Estimate feasible model size using the memory rule-of-thumb.
- Pick a precision/quantization plan.
- Propose back-off options if latency exceeds target.
Deliverable: a 5–7 line plan describing model family, size, precision, and mitigation steps.
Exercise checklist
- Your choice matches the task shape (understanding vs. generation).
- Latency and memory are justified with numbers or rules-of-thumb.
- Languages and licensing are explicitly addressed.
- Context length and serving mode are considered.
Common mistakes and self-check
- Using decoder-only for simple classification when an encoder would be faster/cheaper. Self-check: does the model need to generate text?
- Ignoring license fit. Self-check: write your intended use and confirm it is allowed by the model license.
- Underestimating context length. Self-check: measure real token counts from your data.
- Over-sizing the model for small datasets. Self-check: try smaller variants first and compare validation metrics.
- Benchmarking only on averages. Self-check: examine p95 latency and worst-case long inputs.
Practical projects
- Build a multilingual toxicity classifier with an encoder-only base; document latency vs. accuracy across 3 sizes.
- Create a summarizer with an encoder-decoder base; compare short vs. long-context variants on your support tickets.
- Prototype a drafting assistant with a decoder-only base; measure response quality vs. generation length limits.
Next steps
- Run a small bake-off: 2–3 candidate bases, same dataset and metrics (see the harness sketch after this list).
- Record trade-offs (quality, latency, memory, cost, license) in a one-page decision note.
- Proceed to fine-tuning with careful evaluation and safety checks.
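A bake-off does not need heavy tooling; the hypothetical harness below runs every candidate on the same evaluation set and records the task metric and p95 latency side by side. `load_eval_set` and `score_model` are placeholders you would implement for your own task and metric.

```python
# Hypothetical bake-off harness; `load_eval_set` and `score_model` are
# placeholders supplied by you.
import time
import numpy as np

def run_bakeoff(candidates, load_eval_set, score_model):
    eval_set = load_eval_set()  # same data for every candidate
    results = []
    for name in candidates:
        latencies, scores = [], []
        for example in eval_set:
            start = time.perf_counter()
            scores.append(score_model(name, example))  # task metric per example
            latencies.append((time.perf_counter() - start) * 1000)
        results.append({
            "model": name,
            "metric": float(np.mean(scores)),
            "p95_ms": float(np.percentile(latencies, 95)),
        })
    return results
```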
Mini challenge
You must build a multilingual FAQ answerer that outputs short answers from a provided paragraph (no open-ended generation), p95 < 120 ms, 8k context, commercial license. Pick a base model family and size range, and justify in 5 lines. Hint: the model should condition on provided text and output short spans.