Why this matters
As an Applied Scientist, you ship models that run under real constraints. Users expect fast responses, finance expects predictable spend, and product expects high-quality outcomes. Balancing cost, latency, and quality is how you move from a good model in a notebook to a reliable feature in production.
- Real task 1: Choose between a larger model with +2% quality and a smaller model that is 2× cheaper and 40% faster.
- Real task 2: Keep p95 latency under 300 ms while traffic doubles on a fixed budget.
- Real task 3: Design a cascade with a lightweight first stage and a high-accuracy second stage only when needed.
Key outcomes you enable
- Stable SLAs (p95/p99 latency) during peak load
- Predictable unit economics (cost per request or per user)
- Quality that moves a business metric (CTR, conversion, retention, trust)
Concept explained simply
Think of three sliders you can move: cost, latency, and quality. Moving one slider often nudges the others. Your job is to find a setting that meets constraints (SLA, budget) while maximizing business value.
- Cost: All compute and usage fees per request or per period. Includes GPUs/CPUs, serverless invocations, memory, and model usage (e.g., tokens). Hidden costs include retries, cold starts, and precomputation jobs.
- Latency: Time from request to response, usually tracked as p50, p90, p95, p99. Tail latency (p95/p99) often matters more than averages.
- Quality: Task-specific metrics (accuracy, F1, ROC-AUC, NDCG, BLEU, human ratings). Tie it to a business outcome whenever possible.
Mental model: Pareto frontier
Plot model options by quality and latency (or cost). Points on the Pareto frontier are those where you cannot improve one dimension without worsening another. Choose a point that satisfies constraints and maximizes your utility, e.g.: Utility = wQ * Quality − wC * Cost − wL * LatencyPenalty. The weights come from business impact.
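A minimal sketch of this selection step; the variant numbers, weights, and constraints below are illustrative, not from any benchmark:

```python
# Sketch: keep only Pareto-optimal variants, drop infeasible ones, pick by utility.
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    quality: float   # task metric, higher is better
    cost: float      # $ per request, lower is better
    p95_ms: float    # tail latency, lower is better

def dominates(a: Variant, b: Variant) -> bool:
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    at_least = a.quality >= b.quality and a.cost <= b.cost and a.p95_ms <= b.p95_ms
    strictly = a.quality > b.quality or a.cost < b.cost or a.p95_ms < b.p95_ms
    return at_least and strictly

def pareto_frontier(variants):
    return [v for v in variants if not any(dominates(o, v) for o in variants if o is not v)]

def utility(v: Variant, w_q=1.0, w_c=100.0, w_l=0.001) -> float:
    # Weights encode what a quality point, a dollar, and a millisecond are worth.
    return w_q * v.quality - w_c * v.cost - w_l * v.p95_ms

variants = [
    Variant("small", quality=0.84, cost=0.0022, p95_ms=260),
    Variant("base",  quality=0.86, cost=0.0040, p95_ms=380),
    Variant("large", quality=0.88, cost=0.0060, p95_ms=520),
]
feasible = [v for v in pareto_frontier(variants) if v.p95_ms <= 400 and v.cost <= 0.003]
print(max(feasible, key=utility) if feasible else "no feasible variant")
```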
Measure and baseline
- Define constraints: e.g., p95 ≤ 300 ms, cost ≤ $0.002 per request.
- Choose a single quality metric per scenario (plus tie-breakers).
- Instrument: log per-request latency, retries, and outcome quality signals (or offline quality if online isn’t available).
- Establish a baseline variant and a repeatable evaluation harness (offline and canary/A-B online).
What to log on every request
- request_id, model_variant, features, batch_size, cache_hit
- latency_ms_total, service_breakdown_ms if available
- quality_proxy (if online), or offline label join later
- cost_proxy: estimated tokens, compute-seconds, or fixed price per call
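A sketch of what one such log line could look like; the field names follow the list above but are an assumption, not a required schema:

```python
# Sketch: emit one JSON line per request so latency, cost, and quality can be joined later.
import json
import time
import uuid

def log_request(model_variant: str, latency_ms: float, cache_hit: bool, batch_size: int,
                cost_proxy: float, quality_proxy=None, breakdown_ms=None) -> str:
    record = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "model_variant": model_variant,
        "batch_size": batch_size,
        "cache_hit": cache_hit,
        "latency_ms_total": latency_ms,
        "service_breakdown_ms": breakdown_ms or {},  # e.g. {"retrieval": 12, "model": 85}
        "quality_proxy": quality_proxy,              # None if quality is only known offline
        "cost_proxy": cost_proxy,                    # estimated tokens, compute-seconds, or $
    }
    return json.dumps(record)

print(log_request("distilled-v2", 212.4, cache_hit=False, batch_size=4, cost_proxy=0.0021))
```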
Worked examples (3)
Example 1 — LLM classifier endpoint
Goal: Keep p95 ≤ 400 ms and cost ≤ $0.003 per request while maximizing F1.
- Variant A (base): F1 = 0.86, p95 = 380 ms, cost = $0.004
- Variant B (quantized + distilled): F1 = 0.84, p95 = 260 ms, cost = $0.0022
- Variant C (bigger model): F1 = 0.88, p95 = 520 ms, cost = $0.006
Decision: B meets both constraints and gives up only 0.02 F1 versus A. If that loss hurts the business significantly, consider a cascade: run B first and escalate to C only for low-confidence cases. If 15% of requests escalate, the blended cost counting only the stage that serves each request is 0.85 × $0.0022 + 0.15 × $0.006 = $0.00277; if escalated requests also pay for B's first pass (the usual situation), it rises to $0.0022 + 0.15 × $0.006 ≈ $0.0031, slightly over budget, so the escalation rate matters. Latency stays near B for 85% of traffic; the tail grows for escalations, so track p95 carefully.
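A quick sanity check of those blended numbers, assuming the 15% escalation rate above:

```python
# Sketch: blended cost of a two-stage cascade (Example 1 numbers).
# In a real cascade the escalated requests usually pay for BOTH stages,
# which is why the second figure is higher.
escalation = 0.15
cost_b, cost_c = 0.0022, 0.0060

only_serving_stage = (1 - escalation) * cost_b + escalation * cost_c  # ≈ $0.00277
both_stages_on_escalation = cost_b + escalation * cost_c              # ≈ $0.00310

print(round(only_serving_stage, 5), round(both_stages_on_escalation, 5))
```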
Example 2 — Search two-stage ranking
Goal: Maximize NDCG@10 subject to p95 ≤ 200 ms.
- Candidate generation (CG): p95 = 40 ms, cost = $0.0002, recall high.
- Reranker R1 (small): +0.015 NDCG, +60 ms, +$0.0005.
- Reranker R2 (large): +0.03 NDCG, +140 ms, +$0.0013.
The p95 budget left after CG is 160 ms. R2 leaves only a 20 ms buffer, which is risky under load; R1 leaves 100 ms. A practical approach: route only high-value head queries (excluding the long tail) or high-uncertainty cases to R2 and send the rest to R1. This keeps average quality high while keeping tail latency controlled.
Example 3 — Real-time fraud scoring
Goal: p95 ≤ 150 ms, minimize loss = chargebacks + false positive friction.
- Rules engine: p95 = 8 ms, F1 = 0.70, negligible cost.
- ML model: p95 = 90 ms, F1 = 0.83, cost = $0.0008.
- Extra feature join (external service): adds p95 = 70 ms, improves F1 to 0.86, cost = +$0.0006.
Running the full pipeline on every request puts p95 around 168 ms (8 + 90 + 70), which violates the SLA (summing component p95s overstates the true tail, but not by enough to feel safe). Design a cascade: apply the rules first; if they produce a confident decision, return immediately; otherwise call the ML model, and call the external feature service only for the ~25% of ambiguous cases. Most requests then finish in under ~100 ms, but the slow path still sets the tail: unless escalations are rare (roughly under 5% of traffic), p95 will likely still breach 150 ms, so push the escalation rate down, cache external responses, or fetch the external features in parallel with the model call. Monitor the tail closely.
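One rough way to check whether such a cascade clears the tail SLA before shipping is to simulate it. The latency distributions and routing fractions below are assumptions for illustration, not measurements:

```python
# Sketch: Monte Carlo estimate of end-to-end p95 for a rules -> ML -> external-join cascade.
# The lognormal parameters, the 30% rules-only share, and the 25% escalation rate are
# illustrative assumptions; replace them with your measured distributions.
import math
import random

def lognormal_ms(p50, p95):
    """Sample a latency whose 50th/95th percentiles roughly match the given values."""
    mu = math.log(p50)
    sigma = (math.log(p95) - mu) / 1.645  # 95th percentile of a standard normal
    return random.lognormvariate(mu, sigma)

def one_request(rules_only=0.30, escalate=0.25):
    total = lognormal_ms(4, 8)            # rules engine
    if random.random() < rules_only:      # rules decide outright
        return total
    total += lognormal_ms(45, 90)         # ML model
    if random.random() < escalate:        # ambiguous: external feature join
        total += lognormal_ms(35, 70)
    return total

random.seed(0)
samples = sorted(one_request() for _ in range(50_000))
print(f"estimated p95 ≈ {samples[int(0.95 * len(samples))]:.0f} ms")
```

Rerun with a lower escalation rate (or a cached external call) to see when the estimate drops back under 150 ms.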
Practical playbook
- Set constraints and utility: Write down the SLA (p95), the budget per request or per day, and the quality metric tied to business impact.
- Generate candidate variants: Create at least 3 options that span different parts of the tradeoff space (e.g., quantized, distilled, bigger model). Include a cascade option.
- Benchmark: Measure p50/p95/p99, throughput at target QPS, and cost per request on realistic traffic.
- Select + guardrail: Pick the best variant under constraints. Add guardrails: timeouts, max tokens, dynamic batch sizes, circuit breakers (see the sketch after this list).
- Iterate with targeted levers: Latency levers include caching, batching, concurrency, precompute, and approximate search. Cost levers include quantization, smaller models, early stopping, pruning, and feature reuse. Quality levers include distillation, better prompts/features, retrieval, and reranking.
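A minimal sketch of the timeout and circuit-breaker guardrails from the select + guardrail step. `call_model`, the thresholds, and the thread-pool setup are placeholders rather than any specific library's API:

```python
# Sketch of two day-one guardrails: a hard timeout and a simple circuit breaker.
import time
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=32)

class CircuitBreaker:
    """Opens after repeated failures so the service fails fast instead of queueing."""
    def __init__(self, max_failures=5, cooldown_s=30.0):
        self.max_failures, self.cooldown_s = max_failures, cooldown_s
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at > self.cooldown_s:
            self.failures, self.opened_at = 0, None  # half-open: probe recovery
            return True
        return False

    def record(self, ok: bool):
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def guarded_call(call_model, payload, timeout_s=0.3, fallback=None):
    """Run call_model(payload) with a hard timeout; degrade to `fallback` on failure."""
    if not breaker.allow():
        return fallback                         # fail fast while the circuit is open
    future = _executor.submit(call_model, payload)
    try:
        result = future.result(timeout=timeout_s)
        breaker.record(ok=True)
        return result
    except Exception:
        # Note: a timed-out worker keeps running; pair this with server-side timeouts.
        breaker.record(ok=False)
        return fallback
```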
Cost levers
- Quantize/distill models
- Shorten prompts; cap tokens; early stop (LLM)
- Cache frequent results or embeddings
- Use cheaper hardware where feasible; autoscale to zero
- Batch requests where latency budget allows
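The "cache frequent results or embeddings" lever can be as simple as memoizing a paid call. A sketch, with `embed` standing in for a real embedding or model call:

```python
# Sketch: cache embeddings for repeated inputs so each unique text is paid for once.
from functools import lru_cache

def embed(text: str) -> list[float]:
    # Placeholder for a real (paid) embedding or model call.
    return [float(len(text)), float(text.count(" "))]

@lru_cache(maxsize=100_000)
def cached_embed(text: str) -> tuple:
    # Tuples are hashable and immutable, so they are safe to memoize.
    return tuple(embed(text))

for q in ["refund status", "refund status", "reset password"]:
    cached_embed(q)

# The hit rate feeds straight into unit economics:
# effective cost per request ≈ miss_rate × cost_per_call
print(cached_embed.cache_info())
```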
Latency levers
- Keep hot paths in memory; avoid cold starts
- Right-size batch: too large hurts tail latency
- Parallelize IO; precompute heavy features
- Use approximate methods (e.g., ANN) with tunable recall
- Use cascades and confidence thresholds
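To make the approximate-search lever concrete, here is a sketch of trading recall against latency by tuning an IVF index's nprobe. It assumes the faiss-cpu package is installed and uses random vectors purely for illustration:

```python
# Sketch: recall/latency tradeoff of an approximate nearest-neighbor index.
import time
import numpy as np
import faiss  # assumption: faiss-cpu is installed

d, n = 128, 100_000
rng = np.random.default_rng(0)
xb = rng.standard_normal((n, d)).astype("float32")
xq = rng.standard_normal((1_000, d)).astype("float32")

exact = faiss.IndexFlatL2(d)
exact.add(xb)
_, gt = exact.search(xq, 10)            # ground-truth neighbors for recall@10

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024)  # 1024 coarse clusters
index.train(xb)
index.add(xb)

for nprobe in (1, 8, 32):
    index.nprobe = nprobe               # more probes -> higher recall, higher latency
    t0 = time.perf_counter()
    _, ann = index.search(xq, 10)
    dt_ms = (time.perf_counter() - t0) * 1000 / len(xq)
    recall = np.mean([len(set(a) & set(g)) / 10 for a, g in zip(ann, gt)])
    print(f"nprobe={nprobe:>3}  recall@10={recall:.3f}  ~{dt_ms:.3f} ms/query")
```

Pick the smallest nprobe that meets your recall target at the load you expect.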
Quality levers
- Improve training data; reduce label noise
- Add retrieval or reranking where it counts
- Calibrate probabilities; adjust decision thresholds
- Use ensembles only where they pay off
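For the calibration/threshold lever, a sketch of choosing a decision threshold that minimizes expected business loss rather than defaulting to 0.5. The scores, labels, and the two cost numbers are illustrative:

```python
# Sketch: sweep thresholds and keep the one with the lowest expected loss.
import numpy as np

def best_threshold(scores, labels, cost_false_neg=25.0, cost_false_pos=1.5):
    thresholds = np.linspace(0.01, 0.99, 99)
    losses = []
    for t in thresholds:
        pred = scores >= t
        fn = np.sum(~pred & (labels == 1))   # missed positives (e.g., fraud slips through)
        fp = np.sum(pred & (labels == 0))    # false alarms (e.g., customer friction)
        losses.append(fn * cost_false_neg + fp * cost_false_pos)
    return thresholds[int(np.argmin(losses))]

rng = np.random.default_rng(0)
labels = (rng.random(10_000) < 0.03).astype(int)                        # ~3% positives
scores = np.clip(labels * 0.55 + rng.normal(0.30, 0.15, 10_000), 0, 1)  # noisy model scores
print(f"loss-minimizing threshold ≈ {best_threshold(scores, labels):.2f}")
```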
Common mistakes and self-check
- Mistake: Optimizing average latency instead of tail. Self-check: Are you tracking p95/p99 at peak?
- Mistake: Ignoring retries and timeouts. Self-check: Do logs include retry counts and their cost/latency impact?
- Mistake: Always choosing the biggest model. Self-check: Is there a cascade that achieves similar quality for less cost/latency?
- Mistake: Not validating quality deltas online. Self-check: Do you run canaries/A-B for significant changes?
- Mistake: Missing hidden costs (precompute, storage, data egress). Self-check: Is cost per request end-to-end, not just model call?
Exercises (hands-on)
Do these in a spreadsheet or notebook. They mirror the graded exercises below.
Exercise 1 — Choose a variant under constraints
Traffic: 100k requests/day. Constraints: p95 ≤ 300 ms; daily budget ≤ $200.
- Variant X: quality = 0.88, p95 = 320 ms, cost = $0.0018/req
- Variant Y: quality = 0.86, p95 = 240 ms, cost = $0.0016/req
- Variant Z: quality = 0.90, p95 = 280 ms, cost = $0.0024/req
Questions: Which variants are feasible? If none dominates, propose a cascade and compute blended cost if 20% of requests escalate from Y to Z.
Exercise 2 — LLM token budgeting
Two prompt templates:
- P1: 1200 input tokens, 80 output tokens on average, quality = 0.89
- P2: 700 input tokens, 60 output tokens on average, quality = 0.87
Cost per 1k tokens: input = $0.001, output = $0.002. Latency: roughly 1.1 ms per generated output token (treat input processing as negligible for this exercise). SLA: p95 ≤ 450 ms.
Tasks: Compute cost/request and expected p95 for P1 and P2. Which meets SLA and minimizes cost while keeping quality ≥ 0.88? Suggest a gating rule using a fast heuristic to route 25% hard cases to P1 and others to P2. Compute blended cost and latency.
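A starting point for these computations; this sketch encodes the exercise's prices and the per-output-token latency assumption, with the 25% routing share as a parameter:

```python
# Sketch for Exercise 2: per-request cost and rough latency for each prompt template,
# plus blended numbers for a 25%/75% routing rule. Values come from the exercise text.
PRICE_IN, PRICE_OUT = 0.001 / 1000, 0.002 / 1000   # $ per input / output token
MS_PER_OUTPUT_TOKEN = 1.1

def cost(in_tok: int, out_tok: int) -> float:
    return in_tok * PRICE_IN + out_tok * PRICE_OUT

def latency_ms(out_tok: int) -> float:
    return out_tok * MS_PER_OUTPUT_TOKEN

p1 = {"cost": cost(1200, 80), "latency_ms": latency_ms(80)}
p2 = {"cost": cost(700, 60),  "latency_ms": latency_ms(60)}

share_p1 = 0.25  # hard cases routed to P1 by the fast heuristic
blended_cost = share_p1 * p1["cost"] + (1 - share_p1) * p2["cost"]
blended_avg_latency = share_p1 * p1["latency_ms"] + (1 - share_p1) * p2["latency_ms"]
# Note: the blend above is an average; p95 sits closer to the slower template's latency
# for the routed slice of traffic.

print(p1, p2)
print(f"blended: ${blended_cost:.5f}/request, ~{blended_avg_latency:.0f} ms average")
```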
Checklist to finish:
- Computed cost per request for each option
- Computed p95 latency estimates
- Identified feasible options under constraints
- Proposed a cascade if needed and computed blended metrics
Note: The Quick Test is available to everyone; only logged-in users will have their progress saved.
Practical projects
- Build a two-stage cascade: Implement a lightweight classifier with confidence threshold; escalate to a larger model only when confidence < τ. Log blended cost/latency/quality.
- LLM endpoint with cost guardrails: Enforce max tokens, dynamic prompt truncation, and caching for frequent prompts. Track unit cost and p95 weekly.
- Retrieval system budget allocator: Tune ANN parameters to trade recall vs latency. Use a reranker for top K only. Plot Pareto frontier and pick a point under SLA.
Who this is for
- Applied Scientists moving models into production
- Data scientists owning online ML features
- ML engineers optimizing inference services
Prerequisites
- Comfort with model evaluation metrics (accuracy, F1, NDCG)
- Basic understanding of latency percentiles and throughput
- Ability to estimate unit costs (tokens, compute time)
Learning path
- Start: Understand constraints and set a baseline
- Next: Explore levers (quantization, batching, cascades)
- Then: Measure on realistic loads; compare variants on a Pareto chart
- Finally: Ship with guardrails and monitor tail latency and cost
Next steps
- Complete the exercises above and take the Quick Test to check understanding.
- Iterate on one project idea and track metrics weekly.
- Remember: everyone can take the test; only logged-in users will see saved progress.
Mini challenge
You must launch a content moderation endpoint with p95 ≤ 250 ms and cost ≤ $0.0015/request. Draft a plan that includes one fast model and one high-accuracy fallback. Specify the confidence threshold, expected escalation rate, and the blended cost/latency/quality you target. List the guardrails you will enable on day one.