Why this matters
As an Applied Scientist, you drive decisions. Clear presentation of results and tradeoffs helps stakeholders pick the best option with eyes open to costs, risks, and benefits.
- Deciding model thresholds: balance customer experience vs. risk containment.
- Choosing between two models: higher accuracy vs. higher latency and compute cost.
- Rolling out experiments: quantifying uplift alongside fairness and safety impacts.
- Planning launches: explaining confidence, assumptions, and what you will monitor post-deploy.
Concept explained simply
Presenting results is more than showing numbers. It is telling a decision story: goal → what you tried → what happened → what it costs → what you recommend → how you will manage risks.
Mental model: The Product-Model Value Triangle
- Value: business/user impact (e.g., revenue, safety, satisfaction).
- Performance: metrics quality (e.g., recall, MAPE, NDCG).
- Cost/Risk: latency, compute $, maintenance, fairness, privacy, operational complexity.
Every decision moves one or more of these corners. Your job is to make those movements explicit and comparable.
Key terms to communicate
- Result: What changed in metrics (e.g., +3.2 pp recall).
- Tradeoff: What you gave up to get it (e.g., +20 ms latency).
- Assumption: Condition needed for the result to hold (e.g., base rate ~1%).
- Confidence: Uncertainty range and why (e.g., 95% CI, power, sensitivity checks).
- Recommendation: What to do next and why now.
A simple, repeatable structure
1) Objective and decision
State the business question and the decision needed.
- Objective: Reduce fraud loss with minimal impact on good users.
- Decision: Pick threshold for v2 model for phase-1 rollout.
2) Data and method (one slide)
- Data scope: timeframe, segments, leakage checks.
- Method: model type, validation, experiment design.
- Guardrails: fairness slices, latency budgets, privacy constraints.
3) Results (headline first)
- Headline: “v2 reduces expected loss by 18% (95% CI: 12–24%).”
- Evidence: core metric, uncertainty, key slices.
- Visuals: PR curve or cost curve; include error bars.
4) Tradeoffs (make costs explicit)
- Latency: +18 ms (within 50 ms budget).
- Compute: +$120/day inference; +2h/week maintenance.
- Fairness: small recall drop on low-activity users (−1.1 pp).
5) Recommendation and plan
- Recommendation: Ship at threshold T1 to 25% traffic for 2 weeks.
- Risk handling: monitor false positives; add slice-specific threshold.
- Decision ask: approve staged rollout and budget for extra compute.
Appendix: full metrics table, ablations, diagnostics, and alternative options.
Worked examples (3)
Example 1 — Ranking model: CTR vs latency
Context: New re-ranker adds +2.1 pp CTR but adds latency.
- Result: CTR +2.1 pp (baseline 8.0% → 10.1%), 95% CI [+1.5, +2.7].
- Tradeoffs: +24 ms p95 latency (budget +40 ms), +$70/day compute.
- Slices: Low-end devices +35 ms; others +18 ms.
- Recommendation: Roll out to 50% of traffic, gating out low-end devices; pursue quantization to recover latency.
One-sentence framing: “If we accept +24 ms latency (within budget), we get a ~26% relative CTR lift and ~$4.2k/week in incremental revenue.”
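A few lines make this framing reproducible. In the sketch below, the CTR and latency figures come from the example; sessions_per_day and revenue_per_click are hypothetical placeholders to replace with your own numbers.

```python
# Example 1 framing. CTR and latency figures are from the example above;
# sessions_per_day and revenue_per_click are hypothetical assumptions.
baseline_ctr, variant_ctr = 0.080, 0.101
added_latency_ms, latency_budget_ms = 24, 40

relative_lift = (variant_ctr - baseline_ctr) / baseline_ctr   # ~0.26 (26%)
within_budget = added_latency_ms <= latency_budget_ms         # True

sessions_per_day = 1_000_000   # hypothetical traffic volume
revenue_per_click = 0.03       # hypothetical dollars per click

extra_clicks_per_day = sessions_per_day * (variant_ctr - baseline_ctr)
weekly_revenue_delta = extra_clicks_per_day * revenue_per_click * 7

print(f"Relative CTR lift: {relative_lift:.1%}, within latency budget: {within_budget}")
print(f"Estimated weekly revenue delta: ~${weekly_revenue_delta:,.0f}")
```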
Example 2 — Forecasting: Accuracy vs maintainability
- ARIMA: MAPE 12.8%; easy to maintain; transparent; retrain weekly (30 min).
- XGBoost: MAPE 10.2%; better accuracy; feature-drift risk; retrain daily (2 h).
- Cost model: the 2.6 pp MAPE improvement ≈ $9k/week inventory savings; extra ops ≈ $1.5k/week (see the sketch after this list).
- Recommendation: XGBoost for high-volume SKUs only; ARIMA for tail.
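A minimal sketch of the value comparison behind this recommendation: the fleet-wide dollar figures come from the cost model above, while the tail-segment figure is a hypothetical illustration of why low-volume SKUs stay on ARIMA.

```python
def weekly_net_value(inventory_savings, extra_ops_cost):
    """Net weekly value of switching a segment from ARIMA to XGBoost."""
    return inventory_savings - extra_ops_cost

# Fleet-wide figures from the example's cost model:
print(weekly_net_value(9_000, 1_500))   # 7500  -> positive: switch high-volume SKUs
# Hypothetical low-volume tail segment with little savings to capture:
print(weekly_net_value(400, 1_500))     # -1100 -> negative: keep ARIMA for the tail
```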
Example 3 — Safety: Precision–Recall tradeoff
- Threshold A: Recall 0.80, Precision 0.30, flags ≈267 items/day; error costs: FN $50, FP $1.
- Threshold B: Recall 0.50, Precision 0.60, flags ≈83 items/day.
- At 10k items/day with 1% harmful (100 harmful items): A → TP 80, FP ≈187, FN 20; cost ≈ 20×$50 + 187×$1 = $1,187. B → TP 50, FP ≈33, FN 50; cost ≈ 50×$50 + 33×$1 = $2,533. (Reproduced in the sketch after this list.)
- Recommendation: Use A, then reduce FPs with rules for known false-positive patterns.
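The comparison above takes only a few lines to reproduce. A minimal Python sketch using the example's recall, precision, volume, base rate, and error costs (variable names are ours):

```python
# Reproduces the Example 3 comparison from the stated recall, precision,
# volume, base rate, and error costs.
ITEMS_PER_DAY = 10_000
BASE_RATE = 0.01      # 1% of items are harmful
COST_FN = 50.0        # dollar cost of missing a harmful item
COST_FP = 1.0         # dollar cost of flagging a benign item

def daily_cost(recall, precision):
    """Expected daily cost of a threshold, given its recall and precision."""
    harmful = ITEMS_PER_DAY * BASE_RATE       # 100 harmful items/day
    tp = recall * harmful
    fn = harmful - tp
    fp = tp * (1 - precision) / precision     # false positives implied by precision
    return fn * COST_FN + fp * COST_FP

print(f"Threshold A: ${daily_cost(0.80, 0.30):,.0f}/day")   # ~ $1,187
print(f"Threshold B: ${daily_cost(0.50, 0.60):,.0f}/day")   # ~ $2,533
```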
Choosing metrics and quantifying tradeoffs
- Classification: prefer PR curves and cost curves when classes are imbalanced (see the threshold-sweep sketch after this list).
- Ranking: report NDCG/MRR, clicks/session, and latency p95/p99.
- Forecasting/regression: MAPE/WAPE with confidence bands and error by segment.
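For imbalanced classification, a cost curve is just a threshold sweep made concrete. Below is a minimal sketch, assuming you have validation labels and model scores; the synthetic data at the bottom is purely illustrative.

```python
import numpy as np

def min_cost_threshold(y_true, scores, cost_fn, cost_fp, n_grid=101):
    """Sweep score thresholds on a validation set and return the threshold
    that minimizes expected error cost, along with that cost."""
    thresholds = np.linspace(scores.min(), scores.max(), n_grid)
    best_t, best_cost = thresholds[0], float("inf")
    for t in thresholds:
        flagged = scores >= t
        fp = int(np.sum(flagged & (y_true == 0)))    # benign items flagged
        fn = int(np.sum(~flagged & (y_true == 1)))   # harmful items missed
        cost = fn * cost_fn + fp * cost_fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Purely illustrative: synthetic scores with a ~1% positive rate.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)
scores = np.clip(0.5 * y + rng.normal(0.2, 0.15, size=y.size), 0, 1)
print(min_cost_threshold(y, scores, cost_fn=50, cost_fp=1))
```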
Quick cost-of-error calculator
Expected cost = (FN × cost_FN) + (FP × cost_FP) + (Latency_ms × cost_per_ms) + (Compute_hours × hourly_cost).
Use this to compare options apples-to-apples.
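A direct translation of the formula into a helper makes options comparable in dollars. The two candidate inputs below are hypothetical, and cost_per_ms and hourly_cost are values you would supply for your own system.

```python
def expected_cost(fn, fp, latency_ms, compute_hours,
                  cost_fn, cost_fp, cost_per_ms, hourly_cost):
    """Expected cost = error costs + latency cost + compute cost."""
    return (fn * cost_fn) + (fp * cost_fp) \
        + (latency_ms * cost_per_ms) + (compute_hours * hourly_cost)

# Hypothetical daily figures for two candidate models (illustrative only):
v1 = expected_cost(fn=35, fp=120, latency_ms=30, compute_hours=24,
                   cost_fn=50, cost_fp=1, cost_per_ms=2, hourly_cost=5)
v2 = expected_cost(fn=20, fp=190, latency_ms=48, compute_hours=30,
                   cost_fn=50, cost_fp=1, cost_per_ms=2, hourly_cost=5)
print(f"v1: ${v1:,.0f}/day   v2: ${v2:,.0f}/day")   # v2 wins despite more FPs
```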
Templates and phrasing that work
- Decision frame: “To achieve [goal], we compare [Option A] vs [Option B]. We recommend [choice] because [evidence], accepting [tradeoff].”
- Uncertainty: “Estimate: +2.1 pp (95% CI +1.5 to +2.7). If seasonality shifts by ±20%, impact remains positive.”
- Risk plan: “We’ll monitor [metric] daily; rollback if it degrades by >X% for Y hours.”
- Fairness: “On segment S, recall is −1.1 pp. We’ll mitigate via [step] before full rollout.”
Exercises
Do these, then compare with the solutions in the exercise toggles.
Exercise 1 — Pick a threshold using cost
- Use the counts and costs from Example 3.
- Compute total daily expected cost for Threshold A and B.
- Choose the threshold and write a 2–3 sentence justification including the tradeoff.
Exercise 2 — 5-slide executive readout
- Draft slides: Objective, Method, Results, Tradeoffs, Recommendation.
- Include one uncertainty statement and one risk mitigation step.
- Write a 1-sentence “If we accept X, we get Y” line.
Self-check checklist
- The decision and success metric are stated up front.
- Tradeoffs include latency/compute and at least one risk/guardrail.
- Uncertainty is quantified (CI, power, or sensitivity).
- Slices/fairness are mentioned if relevant.
- Clear recommendation and rollout plan.
Common mistakes and how to self-check
- Hiding costs: Show compute, latency, maintenance, and fairness together with benefits.
- Metric soup: Lead with 1–2 primary metrics; move the rest to the appendix.
- No uncertainty: Always add intervals or sensitivity results.
- Overgeneralizing: Call out assumptions and where results may not hold.
- Fancy visuals, unclear takeaway: Add a one-line headline on each slide.
Self-audit mini-list
- Can a non-ML stakeholder choose an option after your first 2 minutes?
- Is the tradeoff phrased as “If we accept X, we get Y”?
- Is there a rollback/monitoring plan?
Practical projects
- Cost curve builder: Given precision–recall points, compute total cost across thresholds and pick the minimum-cost threshold.
- Latency-budget pitch: Simulate a 20 ms latency increase and quantify user impact vs. revenue lift; produce the tradeoff slide (a starter sketch follows this list).
- Fairness slice review: Analyze 3 user segments and write a mitigation plan for the worst segment.
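For the latency-budget pitch, a starter sketch might look like the following. The conversion-sensitivity parameter is a pure assumption you would replace with an estimate from your own experiments.

```python
def latency_tradeoff(revenue_lift_per_week, added_latency_ms,
                     weekly_revenue_base, conv_drop_per_100ms=0.005):
    """Net weekly value of a change that lifts revenue but adds latency.
    conv_drop_per_100ms is an assumed sensitivity, not a measured fact."""
    revenue_lost = weekly_revenue_base * conv_drop_per_100ms * (added_latency_ms / 100)
    return revenue_lift_per_week - revenue_lost

# Made-up inputs: $6k/week lift, +20 ms, $500k/week baseline revenue.
print(latency_tradeoff(6_000, 20, 500_000))   # 6000 - 500 = 5500 -> net positive
```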
Who this is for
- Applied Scientists and ML Engineers presenting to PMs, executives, and partner teams.
- Data Scientists moving from analysis to decision ownership.
Prerequisites
- Basic understanding of your task metrics (e.g., PR/ROC, NDCG, MAPE).
- Ability to compute simple business costs of errors.
- Familiarity with your system’s latency and compute budgets.
Learning path
- Learn to translate metrics into business impact (cost-of-error).
- Practice the 5-slide structure with a past project.
- Add uncertainty and slice analysis to your default workflow.
- Rehearse a 60-second executive summary and a 5-minute deep dive.
- Ship with a monitoring and rollback plan.
Next steps
- Complete the exercises and compare with solutions.
- Build the cost curve on your current model and pick a threshold.
- Share your 5-slide draft with a peer and revise based on feedback.
Mini challenge
Write a 4-sentence executive summary for a model that improves recall by 5 pp at the cost of +15 ms latency and +$50/day compute, with a −0.8 pp recall drop for new users. Include a mitigation and a rollout plan.
Example answer
We recommend model v3 at threshold T1: recall improves by 5 pp (95% CI 3–7), increasing weekly fraud prevention by ~$8k. The tradeoff is +15 ms p95 latency and +$50/day compute, both within budget. New users see −0.8 pp recall; we’ll apply a slightly lower threshold for that segment. Roll out to 25% traffic for 2 weeks and monitor recall/latency daily with rollback if recall drops >2 pp for 24 hours.