Why this matters
Prompt engineers ship features that must be fast, affordable, and reliable. Measuring cost, latency, and quality lets you set clear targets, run fair comparisons, and iterate confidently without breaking budgets or user experience.
- [ ] Launch a summarizer without exceeding budget per user
- [ ] Keep P95 latency under your product’s responsiveness target
- [ ] Improve output quality while proving the change actually helps
Concept explained simply
Think of an LLM feature as a machine with three dials:
- Cost: how much you pay per call and per month
- Latency: how long the user waits (focus on P50/P95)
- Quality: how often the output meets your definition of success
Mental model: The triangle. Tightening one side (e.g., faster latency) may loosen another (e.g., quality) unless you change the design. Measure all three together to avoid regressions.
Core metrics and quick formulas
Cost
- Per-call token cost ≈ (input_tokens / 1,000,000 × input_price_per_1M) + (output_tokens / 1,000,000 × output_price_per_1M); if your provider quotes prices per 1K tokens, divide by 1,000 instead (see the sketch after this list)
- Total cost ≈ per-call cost × number of calls
- Also consider: embeddings, reranking, tool calls, retries, and context expansion
- Guardrails:
- [ ] Max input tokens cap
- [ ] Max output tokens cap
- [ ] Refuse very long inputs or truncate safely
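A minimal Python sketch of the cost formula, assuming the hypothetical per-1M-token prices used in the worked examples below; swap in your provider's actual rates and your own average token counts.

```python
# Minimal cost sketch. Prices and token counts are hypothetical placeholders;
# substitute your provider's actual per-1M-token rates.
INPUT_PRICE_PER_1M = 0.50   # USD per 1M input tokens (hypothetical)
OUTPUT_PRICE_PER_1M = 1.50  # USD per 1M output tokens (hypothetical)

def per_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one call, ignoring retries and tool calls."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M

def projected_cost(input_tokens: int, output_tokens: int, calls: int) -> float:
    """Total cost for a given number of calls at the same average token usage."""
    return per_call_cost(input_tokens, output_tokens) * calls

if __name__ == "__main__":
    print(f"per call:  ${per_call_cost(1200, 200):.6f}")            # ≈ $0.0009
    print(f"30 days:   ${projected_cost(1200, 200, 5000 * 30):.2f}")
```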
Latency
- Measure P50 (typical) and P95 (tail). Track end-to-end latency: queue + retrieval + generation + post-processing (see the percentile sketch after this list)
- Latency budget example: P95 ≤ 1200 ms, P50 ≤ 600 ms
- Common levers:
- Shorter prompts and outputs
- Faster retrieval or caching frequent results
- Streaming partial results to improve perceived latency
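A minimal sketch for computing P50/P95 from measured end-to-end latencies using Python's standard library. The sample values are hypothetical, and different tools interpolate percentiles slightly differently, so pick one method and use it consistently.

```python
# P50/P95 from a list of end-to-end latencies (ms). Sample values are hypothetical.
from statistics import quantiles

def p50_p95(latencies_ms: list[float]) -> tuple[float, float]:
    # quantiles(..., n=100) returns the interpolated 1st..99th percentiles.
    qs = quantiles(latencies_ms, n=100, method="inclusive")
    return qs[49], qs[94]

samples = [480, 495, 510, 530, 560, 600, 640, 720, 850, 1400]  # hypothetical measurements
p50, p95 = p50_p95(samples)
print(f"P50={p50:.0f} ms, P95={p95:.0f} ms, meets SLO (P95 <= 1200 ms): {p95 <= 1200}")
```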
Quality
- Closed tasks (exact answers): accuracy, pass rate
- Open tasks (subjective): rubric pass rate, pairwise win rate, or preference score
- Safety and reliability: hallucination rate, refusal rate (when appropriate), policy violations
- Build a small labeled test set and freeze it so comparisons stay unbiased (see the pass-rate sketch after this list)
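A minimal pass-rate sketch over rubric labels, assuming each output has already been judged Yes/No on three criteria matching the rubric template later on this page; the sample labels are made up.

```python
# Pass-rate over binary rubric labels. Labels are hypothetical examples.
from dataclasses import dataclass

@dataclass
class RubricResult:
    relevance: bool  # directly answers the question
    grounding: bool  # no unsupported claims
    clarity: bool    # follows requested style or constraints

    def passed(self) -> bool:
        # Binary pass: every rubric item must hold.
        return self.relevance and self.grounding and self.clarity

def pass_rate(results: list[RubricResult]) -> float:
    return sum(r.passed() for r in results) / len(results)

labels = [RubricResult(True, True, True),
          RubricResult(True, False, True),
          RubricResult(True, True, False),
          RubricResult(True, True, True)]
print(f"pass rate: {pass_rate(labels):.0%}")  # 50% on this toy sample
```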
Worked examples
Example 1 — Article summarizer
- Setup: Avg input 1200 tokens, output 200 tokens. Prices: input $0.50 per 1M, output $1.50 per 1M (hypothetical)
- Cost per call ≈ (1200/1,000,000 × 0.50) + (200/1,000,000 × 1.50) = $0.0006 + $0.0003 = $0.0009
- Latency: P50 520 ms, P95 980 ms; meets the P95 budget (≤ 1200 ms)
- Quality rubric (binary checks; see the automation sketch after this example):
- Contains 3–5 key points
- No new facts invented
- Plain-language, ≤ 120 words
- Quality result: 82/100 pass → 82% pass rate
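A sketch of automating the two mechanical checks from this rubric (key-point count and word limit). The "no new facts" check usually still needs a human or model-based judge, so it is passed in as a label, and the bullet format is a hypothetical assumption about the output.

```python
# Automatable checks for the summarizer rubric. Assumes key points are rendered
# as "-" or "*" bullet lines (a hypothetical output format); adjust to your own.
import re

def count_key_points(summary: str) -> int:
    return sum(1 for line in summary.splitlines() if line.strip().startswith(("-", "*")))

def within_word_limit(summary: str, limit: int = 120) -> bool:
    return len(re.findall(r"\w+", summary)) <= limit

def rubric_pass(summary: str, no_new_facts: bool) -> bool:
    # no_new_facts comes from a human or model-based judge, not from this code.
    return 3 <= count_key_points(summary) <= 5 and no_new_facts and within_word_limit(summary)

example = "- point one\n- point two\n- point three"
print(rubric_pass(example, no_new_facts=True))  # True
```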
Example 2 — RAG Q&A
- Retrieval: 10 docs, 200 ms; Generation: 700 ms; Post-processing: 50 ms → P50 ≈ 950 ms (see the stage-timing sketch after this example)
- Cost: Embeddings cached (0 for repeat), LLM: input 800, output 180 → ≈ $0.0004 + $0.00027 = $0.00067
- Quality: Faithfulness check (no claims unsupported by retrieved text) and Answer completeness (covers the question)
- Trade-off: Reducing top_k from 10 to 5 cut latency by ~120 ms with minimal quality change
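A minimal stage-timing sketch for a RAG pipeline, so tail latency can be attributed to retrieval, generation, or post-processing. The retrieve, generate, and postprocess functions are hypothetical stand-ins for your own pipeline steps.

```python
# Times each RAG stage separately so P50/P95 can be attributed to a stage.
# retrieve(), generate(), and postprocess() are hypothetical placeholders.
import time

def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000  # elapsed ms

def answer(question, retrieve, generate, postprocess, top_k=10):
    docs, t_retrieve = timed(retrieve, question, top_k)
    draft, t_generate = timed(generate, question, docs)
    final, t_post = timed(postprocess, draft)
    timings = {"retrieve_ms": t_retrieve, "generate_ms": t_generate,
               "post_ms": t_post, "total_ms": t_retrieve + t_generate + t_post}
    return final, timings
```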
Example 3 — Tool-using agent
- Flow: Plan (LLM) → Tool call (HTTP 300 ms) → Synthesize (LLM)
- Latency ≈ plan 350 ms + tool 300 ms + synth 600 ms = 1250 ms (P50). Tail driven by tool spikes
- Cost: Two LLM steps; enforce short system prompts and response caps to control spend
- Mitigation: Pre-warm the tool, cache frequent tool responses (sketch below), and stream synthesis for better perceived speed
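A sketch of caching frequent tool responses with a simple TTL, using a hypothetical fetch_weather stub in place of a real HTTP tool; real tools need their own invalidation rules.

```python
# TTL cache for a frequently repeated tool call. fetch_weather() is a
# hypothetical stand-in for a real ~300 ms HTTP tool.
import time
from functools import lru_cache

TTL_SECONDS = 300  # cached responses are reused for up to 5 minutes

def fetch_weather(city: str) -> str:
    time.sleep(0.3)  # simulate the ~300 ms HTTP call
    return f"weather for {city}"

@lru_cache(maxsize=1024)
def _cached(city: str, ttl_bucket: int) -> str:
    # ttl_bucket changes every TTL_SECONDS, so stale entries stop being hit.
    return fetch_weather(city)

def tool_call(city: str) -> str:
    return _cached(city, int(time.time() // TTL_SECONDS))

tool_call("Paris")  # ~300 ms
tool_call("Paris")  # near-instant cache hit within the TTL window
```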
Designing your evaluation loop
- Define targets: Cost per call ceiling, P95 latency, quality acceptance criteria
- Build a test set: 50–200 items with clear expected outcomes or a rubric
- Freeze baseline: Keep prompts, parameters, and test set fixed
- Run offline eval: Compute cost estimates, latency stats, and quality metrics (see the skeleton after this list)
- Iterate: Change one thing at a time; re-run; compare deltas
- Pre-release check: Budget gate (monthly projection), latency SLO, quality threshold
- Online monitor: Sample real traffic; watch P95 and error tags; roll back on regressions
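A skeleton of the offline eval plus pre-release gates, assuming hypothetical call_model and grade functions and example thresholds; adapt the targets to your own cost ceiling, SLO, and quality bar.

```python
# Offline eval skeleton with budget, latency, and quality gates.
# call_model() and grade() are hypothetical placeholders; thresholds are examples.
from statistics import quantiles

COST_CEILING_PER_CALL = 0.002  # USD, example target
P95_SLO_MS = 1200
QUALITY_THRESHOLD = 0.80

def evaluate(test_set, call_model, grade):
    # Assumes test_set has at least 2 items (e.g., the 50-200 recommended above).
    costs, latencies, passes = [], [], []
    for item in test_set:
        output, cost_usd, latency_ms = call_model(item["input"])
        costs.append(cost_usd)
        latencies.append(latency_ms)
        passes.append(grade(item, output))  # True/False against the rubric
    p95 = quantiles(latencies, n=100, method="inclusive")[94]
    report = {
        "avg_cost": sum(costs) / len(costs),
        "p95_ms": p95,
        "pass_rate": sum(passes) / len(passes),
    }
    report["release_ok"] = (report["avg_cost"] <= COST_CEILING_PER_CALL
                            and p95 <= P95_SLO_MS
                            and report["pass_rate"] >= QUALITY_THRESHOLD)
    return report
```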
Exercises
Do these in a notebook or spreadsheet. Re-run after each model or prompt change.
Exercise 1 — Plan cost and quality
You handle 5,000 calls/day. Avg input 800 tokens, output 200 tokens. Prices (hypothetical): input $0.50 per 1M tokens; output $1.50 per 1M tokens.
- [ ] Compute per-call cost
- [ ] Project 30-day cost
- [ ] Draft a 3-point rubric for quality pass/fail
Exercise 2 — Compute P50 and P95 latency
Given latencies (ms): 450, 520, 610, 480, 530, 700, 820, 560, 490, 510, 1500, 580, 640, 730, 540, 505, 520, 900, 610, 480
- [ ] Find P50 and P95
- [ ] Check against SLO: P95 ≤ 1200 ms
Rubric template (copy/paste)
- Relevance: Directly answers the user question (Yes/No)
- Grounding: No unsupported claims (Yes/No)
- Clarity/Format: Follows requested style or constraints (Yes/No)
Common mistakes and self-check
- Only tracking averages. Fix: Track P50 and P95 for latency and cost per call
- Changing multiple variables at once. Fix: Single-change iterations with a frozen test set
- Ignoring retries/timeouts. Fix: Include them in end-to-end metrics
- Unbounded context growth. Fix: Token caps and prompt audits
- Vague quality criteria. Fix: Clear rubric and examples; measure pass rate
- Too-small test sets. Fix: Start small (50–100), grow over time, sample by segment
Practical projects
- [ ] Build a small offline evaluator that: runs a fixed test set, logs input/output tokens, and computes per-call cost
- [ ] Create a latency dashboard that reports P50/P95 by endpoint and highlights tail causes (tool calls, long inputs)
- [ ] Define quality rubrics for two tasks (e.g., summary and Q&A) and measure pass rate across three prompt variants
Mini challenge
Your feature’s P95 latency is 1400 ms; the budget requires P95 ≤ 1200 ms without dropping quality. Choose two changes to test:
- [ ] Reduce top_k retrieval from 10 to 6
- [ ] Shorten few-shot examples by 40%
- [ ] Add streaming (improves perceived speed, not P95)
Predict the impact on cost, latency, and quality before testing.
Who this is for
- Prompt engineers and ML practitioners who ship LLM features
- Product-minded data scientists who own quality and performance
Prerequisites
- Basic understanding of prompts, tokens, and model I/O
- Comfort with spreadsheets or simple scripts to compute metrics
Learning path
- Define your task and success metrics
- Assemble a labeled test set and rubric
- Measure baseline cost, latency, and quality
- Iterate on prompts/models; re-measure and compare
- Establish SLOs and budget gates; monitor in production
Next steps
- [ ] Add pairwise preference testing for open-ended tasks
- [ ] Segment metrics by input length and user tier
- [ ] Tag and track common failure modes (e.g., hallucination, formatting)