Why this matters
Your users feel latency, your finance team sees cost, and your customers judge quality. As an AI Product Manager, you must balance all three in every feature launch, A/B test, and incident response.
- Launch decisions: pick a model, prompt, and routing that meet a latency SLO while staying within budget.
- Experiment design: test cheaper models with fallbacks to protect quality.
- Incident playbooks: degrade gracefully when models slow down or budgets are hit.
Concept explained simply
Every AI response has three prices you pay:
- Latency: time to first useful token and time to complete. Users leave if it is slow.
- Cost: money per call (model tokens + infra). Finance cares; budgets are real.
- Quality: task success rate (did it solve the user’s problem?).
Mental model: the triangle
Imagine a triangle with Latency, Cost, and Quality at each corner. Moving closer to one often pulls you away from the others. Your job is to pick a point that meets the product goal (SLOs, budget, satisfaction) for the given context.
Quick definitions
- p50/p95/p99 latency: median/95th/99th percentile of response time; design to p95 (a small percentile sketch follows this list).
- TTFT (time to first token): speed to start streaming; impacts perceived performance.
- TTC (time to complete): total time until done; impacts workflows.
- Direct cost: tokens × price; Indirect cost: infra, retries, monitoring.
- Quality metrics: task success rate, exact match, win-rate, human-rated scores, safety pass rate.
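To see how those percentiles fall out of raw measurements, here is a minimal sketch using Python's standard library; the sample latencies are invented for illustration.

```python
# Compute p50/p95/p99 from raw completion times (TTC). Numbers are illustrative only.
from statistics import quantiles

# Hypothetical per-request completion times in seconds.
ttc_samples = [0.8, 1.1, 0.9, 1.4, 2.7, 1.0, 1.2, 3.9, 1.1, 0.95,
               1.3, 1.05, 2.2, 1.15, 0.85, 1.6, 1.25, 0.9, 1.35, 5.2]

# quantiles(..., n=100) returns the 1st..99th percentile cut points.
pct = quantiles(ttc_samples, n=100)
p50, p95, p99 = pct[49], pct[94], pct[98]

print(f"p50={p50:.2f}s  p95={p95:.2f}s  p99={p99:.2f}s")
# Design reviews should look at p95 (and p99 for critical flows), not just the median.
```

In production you would pull these numbers from your tracing or telemetry backend, broken down by route and model, but the percentile math is the same.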
Practical levers to balance tradeoffs
- Prompt optimization: shorter, more specific prompts reduce tokens and may improve quality.
- Retrieval: provide targeted context instead of long prompts; cache common contexts.
- Model routing: default to a cheaper/faster model; escalate to a stronger model on hard cases (see the sketch after this list).
- Streaming UI: show partial results to reduce perceived latency.
- Batching and parallelism: batch embeddings/classifications; parallel tools with timeouts.
- Quantization/distillation: smaller or distilled models for speed/cost-sensitive tasks.
- Caching: memoize deterministic prompts or post-processed outputs; set TTLs.
- Early exit: stop on confidence or after top evidence; cap tokens with max output length.
- Fallbacks: deterministic rules or templates when models time out or exceed budget.
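To make the routing, timeout, and fallback levers concrete, here is a minimal sketch of a confidence-gated router. The model clients, thresholds, and timeouts are placeholders, not a specific vendor API.

```python
# Minimal routing sketch: try the cheap model first, escalate on low confidence,
# and fall back to a deterministic template on timeout. All model calls are placeholders.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

SMALL_TIMEOUT_S = 1.5
LARGE_TIMEOUT_S = 3.0
CONFIDENCE_THRESHOLD = 0.7   # tune against labeled examples

_pool = ThreadPoolExecutor(max_workers=4)

def call_small_model(prompt: str) -> tuple[str, float]:
    """Placeholder for a cheap/fast model; returns (answer, confidence)."""
    return "small-model answer", 0.9

def call_large_model(prompt: str) -> str:
    """Placeholder for a stronger, slower model."""
    return "large-model answer"

def fallback(prompt: str) -> str:
    """Deterministic template used when models time out or the budget is hit."""
    return "Sorry, I couldn't answer that right now. Here are our help articles: ..."

def answer(prompt: str) -> str:
    try:
        text, confidence = _pool.submit(call_small_model, prompt).result(timeout=SMALL_TIMEOUT_S)
        if confidence >= CONFIDENCE_THRESHOLD:
            return text                  # fast, cheap path handles most traffic
    except TimeoutError:
        pass                             # treat a slow small model like low confidence
    try:
        return _pool.submit(call_large_model, prompt).result(timeout=LARGE_TIMEOUT_S)
    except TimeoutError:
        return fallback(prompt)

print(answer("How do I reset my password?"))
```

The point of the structure: the cheap path serves most requests, and every slow or uncertain branch has an explicit exit (escalate, then template) so latency and cost stay bounded.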
Simple cost and latency estimators
Rough planning math (treat these as estimates; a code sketch follows this list):
- Cost per request ≈ (input_tokens × price_in) + (output_tokens × price_out) + infra_overhead
- Monthly cost ≈ requests_per_month × cost_per_request
- Latency_total ≈ model_latency + retrieval_latency + tool_latency + network_overhead
- p95 latency: for parallel steps, roughly the max of the step p95s; for sequential steps, summing the step p95s gives a conservative upper bound (the true p95 is usually a bit lower).
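The same planning math as small Python helpers; the token counts, prices, and step latencies plugged in below are assumptions to replace with your own.

```python
# Back-of-the-envelope estimators for the formulas above. All inputs are assumptions.

def cost_per_request(input_tokens: int, output_tokens: int,
                     price_in_per_tok: float, price_out_per_tok: float,
                     infra_overhead: float = 0.0) -> float:
    return input_tokens * price_in_per_tok + output_tokens * price_out_per_tok + infra_overhead

def monthly_cost(requests_per_month: int, per_request: float) -> float:
    return requests_per_month * per_request

def sequential_latency(*step_latencies_s: float) -> float:
    """Sequential steps add up (retrieval + model + tools + network)."""
    return sum(step_latencies_s)

# Made-up example: 800 input / 250 output tokens at $2 / $6 per million tokens.
per_req = cost_per_request(800, 250, 2e-6, 6e-6, infra_overhead=0.0005)
print(f"cost/request ≈ ${per_req:.4f}, monthly ≈ ${monthly_cost(500_000, per_req):,.0f}")
print(f"total latency ≈ {sequential_latency(0.25, 1.4, 0.3, 0.1):.2f}s")
```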
Worked examples (with decisions)
1) Support bot: accuracy vs cost
Goal: 90% task success, p95 < 3.0s, budget ≤ $0.015/request.
- Option A: High-end model, 1,000 in + 300 out tokens (≈$0.020). Latency p95 ≈ 2.5s. Quality ≈ 92%.
- Option B: Mid-tier model, 600 in + 200 out tokens (≈$0.007). Latency p95 ≈ 1.8s. Quality ≈ 86%.
- Option C: Route: try B; escalate to A on low confidence (≈30% of cases). Expected cost ≈ 0.7×$0.007 + 0.3×$0.020 ≈ $0.011 if escalated requests are billed only for A, or ≈ $0.013 if they also pay for B's attempt; either way it fits the $0.015 budget. Latency p95 ≈ 2.5s, provided escalation is decided early (e.g., from a fast confidence signal) rather than after B fully completes. Quality ≈ 90–91%.
Pick C. It meets the quality and budget targets and, with early escalation, the latency SLO.
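Here is the Option C expected-cost math written out so you can vary the escalation rate; it shows both accounting assumptions mentioned above.

```python
# Expected cost of the routed option (C) under two accounting assumptions.
COST_B = 0.007   # mid-tier model, $/request
COST_A = 0.020   # high-end model, $/request
ESCALATION_RATE = 0.30

# Assumption 1: escalated requests are billed only for the strong model.
exclusive = (1 - ESCALATION_RATE) * COST_B + ESCALATION_RATE * COST_A

# Assumption 2: escalated requests pay for B's attempt *and* A's call.
try_then_escalate = COST_B + ESCALATION_RATE * COST_A

print(f"exclusive routing:   ${exclusive:.4f}/request")        # ≈ $0.0109
print(f"try B then escalate: ${try_then_escalate:.4f}/request")  # ≈ $0.0130
# Both sit under the $0.015/request budget at a 30% escalation rate.
```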
2) Moderation pipeline: speed first
Goal: 95% of content decided in under 200ms (p95 < 200ms); strict safety; budget ≤ $0.002/request.
- Small classifier: 120ms p95, $0.0004, 97% recall.
- LLM fallback: 2.0s p95, $0.008, 99.5% recall.
- Route: small classifier always; escalate to LLM only when borderline (≈5% cases).
Expected cost ≈ $0.0004 (the classifier runs on everything) + 0.05×$0.008 ≈ $0.0008. p95 latency stays on the classifier path (≈120–200ms) as long as escalations stay at or below roughly 5%; if they creep higher, p95 jumps toward the LLM path. Safety is preserved via the fallback.
3) Voice assistant: perceived speed
Goal: the assistant must feel instant. TTFT < 300ms, overall TTC < 2.5s.
- Enable streaming: show words as they arrive; TTFT drops to 200–300ms.
- Parallelize tools: call the weather API and calendar lookup together; cap each at 600ms with fallbacks (sketched in code after this example).
- Short prompts: pre-load instructions; use retrieval snippets instead of long system prompts.
Outcome: perceived speed improves while quality stays stable and costs drop from fewer tokens.
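A minimal sketch of the "parallelize tools with timeouts" idea using asyncio; the weather and calendar calls are stand-ins for real APIs, and the 600ms cap is this example's assumption.

```python
# Run independent tool calls in parallel, each capped at 600ms, with cheap fallbacks.
import asyncio

TOOL_TIMEOUT_S = 0.6

async def fetch_weather(city: str) -> str:
    await asyncio.sleep(0.2)            # stand-in for a real API call
    return f"Sunny in {city}"

async def fetch_calendar(user_id: str) -> str:
    await asyncio.sleep(0.9)            # deliberately slower than the cap
    return "Next meeting at 3pm"

async def with_timeout(coro, fallback: str) -> str:
    try:
        return await asyncio.wait_for(coro, timeout=TOOL_TIMEOUT_S)
    except asyncio.TimeoutError:
        return fallback                  # degrade gracefully instead of blocking the answer

async def gather_context() -> dict:
    weather, calendar = await asyncio.gather(
        with_timeout(fetch_weather("Lisbon"), "Weather unavailable"),
        with_timeout(fetch_calendar("user-123"), "Calendar unavailable"),
    )
    return {"weather": weather, "calendar": calendar}

print(asyncio.run(gather_context()))
# The user waits roughly for the slowest capped tool, not the sum of all tools.
```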
Decision framework you can reuse
- Define SLOs: p95 latency, target quality, and budget per request (a config-check sketch follows this list).
- Map user flows: where does latency matter (first paint vs final answer)?
- Choose baseline: start with the smallest model that meets quality in a pilot.
- Add safety nets: routing, fallbacks, timeouts, streaming.
- Instrument: log latency breakdown, token usage, and quality labels.
- Iterate: reduce tokens, cache, then consider model upgrades only if needed.
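One lightweight way to operationalize the framework: encode the SLOs in a small config object and gate releases on measured numbers. The class and thresholds below are illustrative, not a standard.

```python
# Illustrative SLO definition and a release-gate check against measured numbers.
from dataclasses import dataclass

@dataclass(frozen=True)
class Slo:
    p95_latency_s: float
    min_task_success: float
    max_cost_per_request: float

def slo_violations(slo: Slo, measured_p95_s: float, measured_success: float,
                   measured_cost: float) -> list[str]:
    """Return the list of violated targets (empty means the release passes)."""
    violations = []
    if measured_p95_s > slo.p95_latency_s:
        violations.append(f"p95 {measured_p95_s:.2f}s > {slo.p95_latency_s:.2f}s")
    if measured_success < slo.min_task_success:
        violations.append(f"success {measured_success:.0%} < {slo.min_task_success:.0%}")
    if measured_cost > slo.max_cost_per_request:
        violations.append(f"cost ${measured_cost:.4f} > ${slo.max_cost_per_request:.4f}")
    return violations

support_bot_slo = Slo(p95_latency_s=3.0, min_task_success=0.90, max_cost_per_request=0.015)
print(slo_violations(support_bot_slo, measured_p95_s=2.5, measured_success=0.91, measured_cost=0.013))
```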
What to measure every release
- Latency: TTFT, TTC, p50/p95/p99 by route and model.
- Cost: tokens in/out, cache hit rate, per-request cost distribution.
- Quality: task success, human ratings, safety pass rate, escalation rate.
Data and monitoring basics
- Collect per-step timings (retrieval, model, tools) to find bottlenecks; a timing sketch follows this list.
- Log prompts lightly but safely; store token counts instead of full text where necessary.
- Tag traffic cohorts: new vs returning users, device type, network region.
- Run canary releases for new prompts/routes; watch p95 and error rates closely.
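A minimal per-step timing sketch for the bottleneck hunt; in practice these measurements would go to your tracing or metrics backend rather than an in-memory dict.

```python
# Time each pipeline step so latency breakdowns (retrieval vs model vs tools) are visible.
import time
from contextlib import contextmanager

timings_ms: dict[str, float] = {}

@contextmanager
def timed(step: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[step] = (time.perf_counter() - start) * 1000

# Stand-in pipeline steps; replace the sleeps with real retrieval/model/tool calls.
with timed("retrieval"):
    time.sleep(0.05)
with timed("model"):
    time.sleep(0.30)
with timed("tools"):
    time.sleep(0.08)

total = sum(timings_ms.values())
for step, ms in sorted(timings_ms.items(), key=lambda kv: -kv[1]):
    print(f"{step:10s} {ms:7.1f} ms  ({ms / total:.0%} of total)")
```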
Exercises
Do these now. Everyone can take the quick test, but progress is saved only if you are logged in.
Exercise 1: Budget and SLO check
Traffic: 500,000 requests/month. Latency target: p95 ≤ 2.2s. Budget: ≤ $6,000/month.
- Option X (fast model): 400 in + 150 out tokens, $0.008/request, p95 1.6s, success 88%.
- Option Y (strong model): 900 in + 300 out tokens, $0.018/request, p95 2.4s, success 92%.
- Hybrid: try X; escalate 25% to Y on low confidence.
- Q1: What is expected monthly cost of Hybrid?
- Q2: Does Hybrid meet p95 ≤ 2.2s? Explain.
Show example approach
Compute the expected cost per request: 0.75×X + 0.25×Y if escalated requests are billed only for Y, or X + 0.25×Y if they also pay for X's attempt; then multiply by monthly traffic. For p95, note that 25% of requests may hit Y's p95; with a share that large, the overall p95 is usually dominated by the slower branch.
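One possible calculation with the exercise's numbers, if you want to check the arithmetic in code; it prints both cost-accounting variants described above.

```python
# Exercise 1, one possible calculation (numbers taken from the exercise).
REQUESTS_PER_MONTH = 500_000
COST_X, COST_Y = 0.008, 0.018
ESCALATION_RATE = 0.25

per_request_exclusive = (1 - ESCALATION_RATE) * COST_X + ESCALATION_RATE * COST_Y
per_request_both_calls = COST_X + ESCALATION_RATE * COST_Y

for label, per_request in [("exclusive", per_request_exclusive),
                           ("X attempt + Y on escalation", per_request_both_calls)]:
    monthly = per_request * REQUESTS_PER_MONTH
    print(f"{label}: ${per_request:.4f}/request → ${monthly:,.0f}/month "
          f"({'within' if monthly <= 6_000 else 'over'} the $6,000 budget)")
```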
Exercise 2: Design a routing policy
Goal: search query rewrite with p95 ≤ 400ms, success ≥ 85%, cost ≤ $0.003/request.
- Small rewrite model: 220ms p95, $0.0006, 80% success.
- Classifier gate: 60ms p95, $0.0002, flags hard queries 30% of time.
- Large rewrite model: 950ms p95, $0.006, 92% success.
Propose a routing policy using these components. Estimate its cost, p95, and success rate.
Hint
Consider: classifier -> small model for easy queries; escalate hard cases to the large model. Think about whether to run the classifier in parallel with the small model or sequentially before it.
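If it helps, here is a generic, simplified estimator you could adapt; it assumes the classifier runs first on every query and that success rates blend linearly across the two paths, which ignores that flagged queries tend to be harder.

```python
# Rough blended estimate for a gate -> small/large routing policy (simplified model:
# the classifier runs on every query; hard_share of queries go to the large model).
def estimate_policy(gate_cost, gate_p95_s, small_cost, small_p95_s, small_success,
                    large_cost, large_p95_s, large_success, hard_share):
    cost = gate_cost + (1 - hard_share) * small_cost + hard_share * large_cost
    success = (1 - hard_share) * small_success + hard_share * large_success
    # Crude p95 heuristic: if more than ~5% of traffic takes the slow path, that path dominates p95.
    fast_path_p95 = gate_p95_s + small_p95_s
    slow_path_p95 = gate_p95_s + large_p95_s
    p95 = slow_path_p95 if hard_share > 0.05 else fast_path_p95
    return cost, p95, success

# Plug in the exercise's component numbers and your own hard_share to evaluate a policy.
```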
Common mistakes and self-check
- Designing to p50 only. Self-check: do you track p95 by route? If not, fix it.
- Overusing a strong model without routing. Self-check: measure escalation rate and cost per request.
- Ignoring perceived latency. Self-check: instrument TTFT and enable streaming if applicable.
- Long prompts bloating cost. Self-check: token audits and prompt trims every release.
- No cache strategy. Self-check: report cache hit rate and savings weekly.
Practical projects
- Build a cost dashboard: tokens in/out, per-route cost, monthly forecast with alerts.
- Latency heatmap: show p50/p95 by step (retrieval, model, tools) and by geography.
- Routing A/B: compare single-model vs routed system with fallbacks; measure quality and cost.
Mini challenge
You have p95 = 3.1s against a target of 2.5s and cost = $0.014 against a target of $0.012; quality is on target. You may change only prompts, caching, and routing (no new models). Propose three changes that together are likely to meet both targets.
See one possible answer
- Trim prompt context by 40% using retrieval snippets and deduped instructions (cuts tokens, speeds decoding).
- Add a cache for the top 20% of repeat queries with a 24h TTL (reduces both latency and cost; a minimal sketch follows this list).
- Add low-confidence routing: default to the mid-tier model; escalate roughly 20% to the strong model; enable streaming for perceived speed.
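The cache idea above as a minimal in-process sketch with a 24-hour TTL; a real deployment would more likely use a shared store and normalize queries before keying, but the effect on cost and latency is the same.

```python
# Minimal TTL cache for repeat queries: identical prompts within 24h skip the model call.
import time
from typing import Callable

TTL_SECONDS = 24 * 60 * 60
_cache: dict[str, tuple[float, str]] = {}   # prompt -> (expiry_timestamp, answer)

def cached_answer(prompt: str, generate: Callable[[str], str]) -> str:
    now = time.time()
    hit = _cache.get(prompt)
    if hit and hit[0] > now:
        return hit[1]                        # cache hit: no tokens spent, near-zero latency
    answer = generate(prompt)                # cache miss: pay the model once
    _cache[prompt] = (now + TTL_SECONDS, answer)
    return answer

# Usage with a placeholder generator; swap in your real model call.
print(cached_answer("reset password steps", lambda p: "1) Open settings 2) ..."))
print(cached_answer("reset password steps", lambda p: "not called on a hit"))
```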
Who this is for
- AI Product Managers shipping LLM features and assistants.
- Data/Product folks owning budgets, SLOs, and user satisfaction.
Prerequisites
- Basic understanding of prompts, tokens, and model APIs.
- Comfort with interpreting p50/p95 latency and simple cost math.
Learning path
- Understand latency, cost, and quality metrics (this lesson).
- Set SLOs and build a simple cost/latency dashboard.
- Implement routing, caching, and streaming in a pilot feature.
- Run A/B tests; iterate prompts and retrieval.
Next steps
- Complete the exercises above and take the quick test below.
- Pick one active feature and propose a routing + caching plan in one page.
- Schedule a token audit for your top prompts this week.
Progress saving note
The quick test is available to everyone. Only logged-in users will have their progress saved.