Why this matters
As an NLP Engineer, you will often decide whether to rely on smart prompting of a general model or to fine-tune a model for your task. The right choice impacts cost, latency, accuracy, safety, and maintenance. Real tasks include:
- Building a domain QA assistant with consistent tone and legal-safe outputs
- High-volume entity extraction from documents
- Customer support routing and summarization
- Generating structured outputs for downstream systems
Who this is for
- Engineers evaluating LLM approaches for production use
- Data scientists designing experiments and evaluations
- Technical product managers balancing speed, cost, and risk
Prerequisites
- Basic understanding of transformer models, tokens, and inference
- Comfort with evaluation metrics (accuracy, F1, BLEU/ROUGE, or task-specific metrics)
- Familiarity with prompt design patterns (zero-shot, few-shot, instruction prompts)
Concept explained simply
Two broad ways to tailor LLMs to your task:
- Prompting: You craft instructions and examples in the prompt to steer a general model. Fast to iterate, no training needed.
- Fine-tuning: You train the model (fully or with parameter-efficient methods like LoRA/adapters) on task data to bake behavior into weights. More setup, but consistent and efficient at scale.
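To make the prompting option concrete, here is a minimal sketch of a few-shot classification prompt. The `call_model` function is a placeholder for whatever client your provider exposes, and the labels and examples are illustrative only.

```python
# Minimal few-shot prompting sketch. `call_model` is a placeholder client
# function (prompt in, text out); labels and examples are illustrative.

FEW_SHOT_PROMPT = """You are an email triage assistant.
Classify the email into one of: billing, technical, other.
Reply with the label only.

Email: "My invoice shows the wrong amount."
Label: billing

Email: "The app crashes when I upload a file."
Label: technical

Email: "{email}"
Label:"""


def classify(email: str, call_model) -> str:
    """Steer a general model with instructions plus examples; no training involved."""
    prompt = FEW_SHOT_PROMPT.format(email=email)
    return call_model(prompt).strip()
```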
Mental model
Think of prompting as renting an expert for each question: you pay per call, give context each time, and results may vary with phrasing. Fine-tuning is hiring and training your own specialist: onboarding costs up front, but after that they respond faster, cost less per task, and answer in a consistent house style.
Key decision factors
- Quality & difficulty: If the base model already performs well with clear instructions, prompting may be enough. If your domain is niche or needs strict structure, fine-tuning often wins.
- Data availability: If you have thousands of labeled examples or logs, fine-tuning becomes attractive. If you have little data, start with prompting.
- Cost profile: Prompting has near-zero setup but higher marginal cost per request; fine-tuning has upfront cost but can reduce per-request cost at scale.
- Latency & throughput: Fine-tuned, smaller models can be much faster. Prompted large models may be slower due to long contexts.
- Context length & retrieval: If you must inject long documents each time, consider retrieval to reduce prompt size, or fine-tune to internalize format/style.
- Privacy & compliance: Sensitive data in prompts may raise risk. Fine-tuning on sanitized datasets can minimize repeated exposure.
- Controllability & consistency: Fine-tuning improves consistency of behavior and formatting; prompted outputs can drift with small wording changes.
- Maintenance & updates: Prompts are easy to update; fine-tunes require new training runs. Choose based on update cadence.
- Safety & guardrails: Fine-tuning (and post-processing) can embed safety behaviors; prompts alone may be easier to jailbreak.
- Multilingual & style: Persistent style or multilingual norms are easier to encode via fine-tuning.
Rules of thumb
- Start simple: prompt + retrieval baseline. Measure.
- If you need strict structure, high volume, or specialized domain jargon, move toward fine-tuning.
- When in doubt: prototype both quickly on a subset and compare cost-quality-latency.
Worked examples
Example 1: Customer email triage
Goal: Route emails into 12 categories and extract ticket priority.
- Baseline: Prompt a general model with label definitions + few-shot examples.
- Observation: Good accuracy but slower and costs grow with volume.
- Shift: Train a small model via LoRA with 8k labeled emails; serve behind a lightweight API.
- Outcome: Faster, cheaper per request, consistent labels; occasional domain drift handled with periodic fine-tune updates.
Example 2: Policy-compliant summarization
Goal: Summaries must exclude PII and follow a fixed template.
- Prompt-only: Sometimes includes sensitive details despite instructions.
- Fine-tune: Instruction-tune on compliant summaries + negative examples; add output schema checks.
- Result: Higher compliance and format consistency.
Example 3: Structured data extraction
Goal: Extract company name, date, and amount into JSON.
- Prompt-only: Works but occasional field swaps.
- Fine-tune: Small model + JSON schema validation; add a small set of hard negatives.
- Result: Near-perfect structure adherence; latency reduced.
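As a sketch of the validation-plus-retry pattern in this example, the snippet below assumes a hypothetical `call_model` endpoint (prompted or fine-tuned) and the `jsonschema` package; the field names mirror the goal above.

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Expected output shape for the extraction task; field names are illustrative.
SCHEMA = {
    "type": "object",
    "properties": {
        "company": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
    },
    "required": ["company", "date", "amount"],
}


def extract(document: str, call_model, max_retries: int = 2) -> dict:
    """Ask the model for JSON, validate against SCHEMA, and retry on failure.
    `call_model` stands in for either a prompted or a fine-tuned endpoint."""
    prompt = f"Extract company, date, and amount as JSON.\n\nDocument:\n{document}\n\nJSON:"
    for attempt in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            parsed = json.loads(raw)
            validate(instance=parsed, schema=SCHEMA)
            return parsed
        except (json.JSONDecodeError, ValidationError):
            if attempt == max_retries:
                raise  # give up after the last retry; callers can log and fall back
```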
Example 4: Domain QA with long references
Use retrieval to fetch relevant paragraphs; start with prompting. If style and consistency remain issues, fine-tune on Q/A pairs with citations and chain-of-thought hidden from output (reason privately, report cleanly).
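One lightweight way to "fetch relevant paragraphs" is a lexical retriever. The sketch below uses scikit-learn's TF-IDF as a stand-in for a production vector store; the function name is ours, not a library API.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_k_passages(question: str, passages: list[str], k: int = 3) -> list[str]:
    """Rank reference passages by TF-IDF similarity to the question and return
    the top k, which go into the prompt instead of the full document."""
    vectorizer = TfidfVectorizer().fit(passages)
    passage_vecs = vectorizer.transform(passages)
    question_vec = vectorizer.transform([question])
    scores = cosine_similarity(question_vec, passage_vecs).ravel()
    top = scores.argsort()[::-1][:k]
    return [passages[i] for i in top]
```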
Cost and latency rough math
Make simple estimates before building:
- Prompting per-request cost (example numbers): cost = input_tokens * c_in + output_tokens * c_out. Example: 800 in, 200 out, c_in = $0.003 per 1k, c_out = $0.006 per 1k gives ~ $0.0036 per request. For 10k/day, ~$36/day. Varies by provider; treat as rough ranges.
- Fine-tuning costs: one-time training + serving. If serving a small fine-tuned model costs ~$0.0008 per 1k tokens of compute and you use 300 tokens total, cost/request ~ $0.00024. For 10k/day, ~$2.40/day. Training might cost tens to hundreds of dollars depending on data and setup. Varies by hardware and provider; treat as rough ranges.
- Latency: Prompting large models with long contexts may be 500ms–several seconds. Fine-tuned small models can be sub-200ms on moderate hardware. Numbers vary; benchmark your stack.
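The same rough math as two small helpers, using the example numbers above; swap in your provider's actual rates before trusting the output.

```python
def prompting_daily_cost(in_tokens, out_tokens, c_in_per_1k, c_out_per_1k, requests_per_day):
    """Rough daily API cost of the prompted baseline."""
    per_request = in_tokens / 1000 * c_in_per_1k + out_tokens / 1000 * c_out_per_1k
    return per_request * requests_per_day


def finetuned_daily_cost(total_tokens, compute_per_1k, requests_per_day):
    """Rough daily serving cost of a small fine-tuned model (training cost excluded)."""
    return total_tokens / 1000 * compute_per_1k * requests_per_day


# Example numbers from above; treat them as rough ranges, not quotes.
print(prompting_daily_cost(800, 200, 0.003, 0.006, 10_000))  # ~36.0
print(finetuned_daily_cost(300, 0.0008, 10_000))             # ~2.4
```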
Throughput planning checklist
- Estimate peak RPS and P95 latency target
- Size context to fit within budget
- Batch where possible (safe for independent requests)
- Set timeouts and retries conservatively
Evaluation plan template
- Define success: exact match, F1 on entities, format accuracy, human rating, refusal rate
- Create a frozen test set with easy, medium, hard cases
- Compare: Prompt baseline vs fine-tune (same test set)
- Track: cost/request, tokens/request, P95 latency, throughput
- Safety: test jailbreaks, prompt injection, PII leaks
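A minimal harness for the compare step, assuming each system is a callable from input text to prediction and that exact match is a reasonable metric; substitute F1, format checks, or human ratings as your task requires.

```python
import statistics
import time


def evaluate(system, test_set):
    """Run one system over a frozen test set and report accuracy plus latency.
    Each test item is a dict with "input" and "expected" keys (an assumption)."""
    correct, latencies = 0, []
    for example in test_set:
        start = time.perf_counter()
        prediction = system(example["input"])
        latencies.append(time.perf_counter() - start)
        correct += int(prediction == example["expected"])
    latencies.sort()
    return {
        "accuracy": correct / len(test_set),
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
        "mean_latency_s": statistics.mean(latencies),
    }


# Compare the prompt baseline and the fine-tuned model on the same frozen set:
# results = {name: evaluate(fn, frozen_test_set) for name, fn in systems.items()}
```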
Mini evaluation tips
- Use a small manual audit set for qualitative checks
- Add adversarial and out-of-domain examples
- Log disagreements and create error buckets
Exercises
Do these now. A quick test waits at the end.
Exercise 1: Choose the approach
For each scenario, pick Prompting or Fine-tuning (or Hybrid) and justify briefly:
- A. Daily 20k requests: extract 3 fields from invoices with strict JSON output
- B. Weekly 50 requests: convert meeting notes to friendly summaries in company tone
- C. Coding Q/A assistant for internal frameworks with proprietary APIs
- D. Multilingual sentiment classification for 8 languages; you have 2k labels total
- Checklist:
  - State the chosen approach per scenario
  - Mention at least 2 decision factors per choice
  - Note expected cost/latency profile in one line
Guidance
- A: High volume + strict format tends to favor fine-tuning.
- B: Low volume + style needs can start with prompting.
- C: Proprietary domain knowledge may need retrieval and possibly fine-tuning.
- D: Multilingual with limited data can start with prompting plus carefully curated few-shot examples, then expand the label set for fine-tuning.
Exercise 2: Back-of-the-envelope costs
Assume:
- Prompt baseline: 700 input tokens + 150 output tokens
- Costs: $0.003 per 1k input tokens; $0.006 per 1k output tokens
- Fine-tuned model serving: $0.0008 per 1k tokens total
- Traffic: 12k requests/day
Compute daily cost for prompting vs fine-tuned serving. Then estimate break-even training budget if fine-tuning costs X dollars upfront.
- Checklist:
  - Show formulas
  - Show both daily costs
  - Solve for X where 30 days of savings = X
Solution
Prompt cost/request = 0.7k * 0.003 + 0.15k * 0.006 = 0.0021 + 0.0009 = $0.003. Daily = 12,000 * 0.003 = $36.
Fine-tuned cost/request = 0.85k * 0.0008 = $0.00068. Daily = 12,000 * 0.00068 ≈ $8.16.
Daily savings ≈ $36.00 - $8.16 = $27.84. Over 30 days: ≈ $835.20. Break-even training budget X ≈ $835 if you want payback in ~1 month. Varies by provider and setup; treat as rough ranges.
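The same arithmetic as a few lines of code, so you can replay it with your own traffic and rates.

```python
def break_even_days(prompt_daily, finetuned_daily, training_cost):
    """Days of traffic needed before cumulative savings cover the training spend."""
    return training_cost / (prompt_daily - finetuned_daily)


# Numbers from the worked solution above.
prompt_daily = 12_000 * (0.7 * 0.003 + 0.15 * 0.006)        # $36.00/day
finetuned_daily = 12_000 * 0.85 * 0.0008                    # $8.16/day
print(break_even_days(prompt_daily, finetuned_daily, 835))  # ~30 days
```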
Common mistakes and self-check
- Over-indexing on a single metric: Also check structure adherence, safety, latency.
- Ignoring data drift: Re-evaluate regularly; stale prompts or fine-tunes degrade.
- Long prompts with irrelevant context: Trim or use retrieval to keep costs down.
- No schema validation: For structured outputs, validate and retry on schema errors.
- Skipping safety tests: Always include refusal/PII/jailbreak checks.
Self-check
- Can you explain your choice with at least three factors?
- Do you have a test set that represents production?
- Do you know your per-request cost and P95 latency?
Practical projects
- Prompt baseline with guardrails
  - Design a clear instruction prompt; add 3–5 few-shot examples
  - Add JSON schema validation and simple retries
  - Measure accuracy, format errors, cost, latency
- Parameter-efficient fine-tune (LoRA); see the training sketch after this list
  - Create a 5k-example dataset from logs; anonymize PII
  - Train with a small learning rate; evaluate on the frozen test set
  - Compare cost/latency to the prompt baseline
- Hybrid RAG + fine-tune
  - Use retrieval for facts; fine-tune for style/format
  - Test with adversarial inputs and long documents
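For the LoRA project, here is a minimal training sketch using Hugging Face transformers, datasets, and peft. The DistilBERT base model, the `emails.jsonl` file of text/label records (integer labels 0–11), the target modules, and the hyperparameters are assumptions to adapt, not a recipe.

```python
# Minimal LoRA fine-tuning sketch; model, file names, and hyperparameters are assumptions.
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE = "distilbert-base-uncased"  # small base model chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=12)

# Wrap the base model with low-rank adapters; only adapter (and head) weights train.
lora = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, lora_dropout=0.1,
                  target_modules=["q_lin", "v_lin"])  # DistilBERT attention projections
model = get_peft_model(model, lora)

# Hypothetical JSONL of anonymized logs with "text" and integer "label" fields.
data = load_dataset("json", data_files="emails.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                     padding="max_length", max_length=256),
                batched=True)
data = data.train_test_split(test_size=0.1)

args = TrainingArguments(output_dir="lora-triage", learning_rate=2e-4,
                         per_device_train_batch_size=16, num_train_epochs=3)
Trainer(model=model, args=args,
        train_dataset=data["train"], eval_dataset=data["test"]).train()
```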
Learning path
- Start: Prompt patterns and evaluation basics
- Next: Retrieval-augmented generation (RAG)
- Then: Parameter-efficient fine-tuning (LoRA/adapters) and dataset curation
- Finally: Safety testing, monitoring, and retraining cadence
Next steps
- Pick one of the practical projects and run a 1-week spike
- Set up an A/B evaluation with a frozen test set
- Decide on a 30-day plan for either prompt hardening or a small fine-tune
Mini challenge
You must build a multilingual contract clause extractor for English, Spanish, and German, with 30k requests/day and JSON output. Outline:
- Your chosen approach (prompting, fine-tuning, or hybrid)
- Data you need and how you will get it
- Evaluation metrics and safety checks
- Estimated cost and latency targets
Quick Test