Who this is for
Applied Scientists, ML Engineers, and Product Data Scientists who design, evaluate, and ship ML features. If you propose experiments, compare models, or decide go/no-go for launches, this is for you.
Prerequisites
- Basic ML knowledge (classification, regression, ranking, generation)
- Comfort with common metrics (e.g., precision/recall, RMSE, NDCG)
- Understanding of train/validation/test splits
Why this matters
In real teams, projects fail less often because of model quality and more often because success targets were never made clear. Well-defined success criteria and realistic baselines help you:
- Align with business impact (conversion, risk reduction, cost savings)
- Prevent overfitting to the wrong metric
- Decide quickly whether to iterate, pivot, or stop
- Communicate progress clearly to non-ML stakeholders
Concept explained simply
Success criteria define what “good enough to ship” means for your model. Baselines define “what we can beat today with simple methods.” Together they create a fair race and a clear finish line.
Mental model
- North Star: the primary metric tied to the user/business outcome.
- Guardrails: metrics that must not get worse (safety, fairness, latency, cost).
- Baseline ladder: start simple (naive/rules/previous system) and climb to the next rung only after you beat the current one.
Common metric choices (quick reference)
- Classification (imbalanced): PR-AUC, Recall@Precision≥X, Cost-weighted error
- Regression/Forecast: RMSE/MAE, sMAPE, WAPE, Bias
- Ranking/Search: NDCG@K, MRR, Recall@K
- Recommendation: HitRate@K, MAP@K, Coverage, Diversity
- Generation: Human acceptance rate, Task success rate, Toxicity/safety score
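For concreteness, here is a minimal sketch of computing a few of these metrics with scikit-learn, assuming NumPy arrays of labels, scores, and graded relevance judgments; the synthetic data is illustrative only.

```python
import numpy as np
from sklearn.metrics import average_precision_score, ndcg_score

rng = np.random.default_rng(0)

# Imbalanced classification: report PR-AUC (average precision) rather than accuracy.
y_true = rng.binomial(1, 0.05, size=1000)                      # ~5% positives
y_score = np.clip(0.6 * y_true + rng.normal(0.3, 0.2, 1000), 0, 1)
print("PR-AUC :", round(average_precision_score(y_true, y_score), 3))

# Regression/forecast: RMSE computed directly from errors.
y = rng.normal(100, 10, size=200)
y_hat = y + rng.normal(0, 5, size=200)
print("RMSE   :", round(float(np.sqrt(np.mean((y - y_hat) ** 2))), 2))

# Ranking: NDCG@10 for one query with graded (0-3) relevance judgments.
relevance = rng.integers(0, 4, size=(1, 50))
scores = rng.normal(size=(1, 50))
print("NDCG@10:", round(ndcg_score(relevance, scores, k=10), 3))
```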
The Success Spec (use this template)
Fill this out before you train the model. Keep it to one page.
- Problem and user outcome: [What user pain or business goal?]
- Primary metric: [Name, definition, direction (higher/lower better)]
- Target threshold: [e.g., PR-AUC ≥ 0.42 or RMSE ≤ 12.0]
- Guardrails: [e.g., Latency p95 ≤ 120ms; Fairness ΔTPR ≤ 5% across groups; Unit cost ≤ $X]
- Offline dataset & split: [Time range, leakage checks, stratification, frozen version]
- Online metric(s) if A/B test: [e.g., CTR, conversion, complaint rate]
- Baseline ladder: [Naive → Rules → Last-prod → Simple ML → SOTA?]
- Decision rule: [Ship if beats last-prod by ≥ X% (CI excludes 0) and passes guardrails]
- Review cadence: [Weekly/biweekly; re-run with fresh data monthly]
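If it helps to keep the spec next to the evaluation code, here is one possible sketch of the template as a small, version-controlled Python object. The field names and the fraud-style values are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SuccessSpec:
    problem: str
    primary_metric: str                 # name + direction, e.g. "recall_at_p90 (higher is better)"
    target: float                       # absolute bar on the primary metric
    guardrails: dict = field(default_factory=dict)  # metric name -> allowed bound
    baseline_ladder: tuple = ()         # ordered, simplest first
    decision_rule: str = ""             # human-readable ship/no-ship rule

spec = SuccessSpec(
    problem="Catch fraudulent transactions earlier without blocking legitimate users",
    primary_metric="recall at precision >= 0.90 (higher is better)",
    target=0.45,
    guardrails={"latency_p95_ms": 100, "fairness_delta_fpr": 0.02, "unit_cost_usd": 0.002},
    baseline_ladder=("majority_non_fraud", "rules_v7", "last_prod_lr"),
    decision_rule="Ship if recall@p90 beats last-prod by >=5pp, CI excludes 0, and guardrails pass",
)
print(spec.decision_rule)
```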
Step-by-step process
- Clarify the outcome: What user action or risk are we changing? Define a measurable proxy.
- Choose a primary metric: Pick the metric that best tracks that outcome. Avoid single-number accuracy on imbalanced problems.
- Set target thresholds: Use historical data, cost of errors, and minimum viable impact to set a bar.
- Add guardrails: Safety, fairness, latency, stability, and cost constraints.
- Design baselines: Majority/naive, heuristic/rules, current production, simple ML.
- Freeze your eval: Define dataset, splits, and evaluation code. Version them.
- Run baselines first: Establish a reality check (see the sketch after this list). If you can’t beat them, revisit features/data.
- Decide with evidence: Compare with confidence intervals; plan an online test if applicable.
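A minimal sketch of the baseline-ladder step, assuming a binary classification task and scikit-learn; the synthetic dataset and the placeholder rule stand in for your real features and heuristics.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Frozen eval split (synthetic, ~5% positives); in practice use a time-based holdout.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

ladder = {
    "naive_majority": DummyClassifier(strategy="prior"),
    "simple_ml_logreg": LogisticRegression(max_iter=1000),
}
for name, model in ladder.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]
    print(f"{name:18s} PR-AUC = {average_precision_score(y_te, scores):.3f}")

# A heuristic "rules" rung is evaluated the same way: produce a score per example
# (even a hard 0/1) and compute the identical metric on the identical split.
rule_scores = (X_te[:, 0] > 1.0).astype(float)   # placeholder rule, not a real heuristic
print(f"{'rules_v0':18s} PR-AUC = {average_precision_score(y_te, rule_scores):.3f}")
```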
Worked examples
Example 1 — Fraud detection (classification, imbalanced)
- Outcome: Reduce the number of fraudulent transactions that are caught too late.
- Primary metric: Recall at Precision ≥ 90% (high precision to avoid blocking legit users).
- Target: Recall ≥ 45% at Precision ≥ 90% on a 3-month rolling holdout.
- Guardrails: p95 latency ≤ 100ms; Fairness ΔFPR ≤ 2% across regions; Unit cost ≤ $0.002/decision.
- Baselines: (1) Majority non-fraud, (2) Rules v7, (3) Last-prod LR model.
- Decision: Ship if recall improves ≥ 5pp over last-prod at same precision and guardrails pass.
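The primary metric in this example, recall at precision ≥ 90%, is easy to compute from a precision-recall curve. A minimal sketch, assuming y_true/y_score arrays from the frozen holdout (synthetic here):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, y_score, min_precision=0.90):
    """Best recall achievable at any threshold whose precision meets the bar."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    ok = precision >= min_precision
    return float(recall[ok].max()) if ok.any() else 0.0

# Synthetic stand-in for the frozen fraud holdout (illustrative only).
rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.02, size=20000)
y_score = np.clip(0.7 * y_true + rng.normal(0.2, 0.15, 20000), 0, 1)
print("Recall @ Precision>=0.90:", round(recall_at_precision(y_true, y_score), 3))
```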
Example 2 — E-commerce search ranking
- Outcome: Better relevance for top results.
- Primary metric: NDCG@10 on judged queries; Secondary: Recall@50.
- Target: +4% relative NDCG@10 vs current ranker; CI excludes 0.
- Guardrails: p95 latency ≤ 150ms; Diversity index ≥ current; Zero severe policy violations.
- Baselines: (1) BM25 tuned, (2) Current LTR ranker, (3) Zero-shot transformer reranker.
- Decision: A/B test only if offline shows ≥4% NDCG@10 and latency budget holds.
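One way to check the "+4% relative NDCG@10" target offline, assuming per-query graded judgments plus scores from the current and candidate rankers; all arrays below are synthetic placeholders.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Synthetic judged queries: 0-3 relevance labels plus scores from two rankers.
rng = np.random.default_rng(2)
n_queries, n_docs = 200, 50
judgments = rng.integers(0, 4, size=(n_queries, n_docs))
current = rng.normal(size=(n_queries, n_docs))
candidate = current + 0.3 * judgments + rng.normal(0, 0.5, size=(n_queries, n_docs))

base = ndcg_score(judgments, current, k=10)        # averaged over queries
cand = ndcg_score(judgments, candidate, k=10)
rel_gain = (cand - base) / base
print(f"NDCG@10 current={base:.3f} candidate={cand:.3f} relative gain={rel_gain:+.1%}")
print("meets the +4% offline bar:", rel_gain >= 0.04)
```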
Example 3 — Support reply suggestion (generation)
- Outcome: Agents resolve tickets faster with accurate, safe suggestions.
- Primary metric: Agent acceptance rate of suggestions.
- Secondary: Resolution time reduction; Safety violations (must be 0).
- Target: +10pp acceptance vs template baseline; Safety = zero critical violations.
- Baselines: (1) No-suggest, (2) Curated templates, (3) Last-prod small model.
- Decision: Roll out if acceptance increases and there are no safety regressions; use a staged canary.
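A sketch of checking the "+10pp acceptance" target with a normal-approximation confidence interval on the difference of two acceptance rates; the counts below are invented for illustration.

```python
import math

def acceptance_lift_ci(accepted_base, n_base, accepted_new, n_new, z=1.96):
    """95% CI for the lift in acceptance rate (new minus baseline), normal approximation."""
    p_base, p_new = accepted_base / n_base, accepted_new / n_new
    se = math.sqrt(p_base * (1 - p_base) / n_base + p_new * (1 - p_new) / n_new)
    lift = p_new - p_base
    return lift, (lift - z * se, lift + z * se)

# Placeholder counts: 1,000 suggestions per arm, 41% vs 54% accepted.
lift, (lo, hi) = acceptance_lift_ci(accepted_base=410, n_base=1000, accepted_new=540, n_new=1000)
print(f"acceptance lift = {lift:+.1%}, 95% CI [{lo:+.1%}, {hi:+.1%}]")
print("meets +10pp target with CI excluding 0:", lift >= 0.10 and lo > 0)
```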
Example 4 — Demand forecasting (regression)
- Outcome: Reduce stockouts and overstock.
- Primary metric: WAPE on weekly forecasts; Secondary: forecast Bias.
- Target: WAPE ≤ 18% (from 22%); Bias within ±2% across categories.
- Baselines: (1) Naive last-week, (2) Seasonal naive, (3) ETS, (4) Last-prod XGBoost.
- Guardrails: Compute cost ≤ current; Stability under holiday shifts.
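WAPE and Bias are simple to compute directly. A minimal sketch, assuming weekly actuals and forecasts as NumPy arrays (synthetic values, with a naive last-week baseline for comparison):

```python
import numpy as np

def wape(actual, forecast):
    """Weighted absolute percentage error: sum of |errors| over sum of |actuals|."""
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

def bias(actual, forecast):
    """Signed relative error: positive means systematic over-forecasting."""
    return (forecast - actual).sum() / actual.sum()

rng = np.random.default_rng(3)
actual = rng.poisson(200, size=52).astype(float)            # weekly demand, synthetic
naive_last_week = np.roll(actual, 1)                        # rung 1 baseline (wraps at week 0)
model = actual * rng.normal(1.0, 0.15, 52)                  # placeholder model forecast

for name, forecast in [("naive_last_week", naive_last_week), ("model", model)]:
    print(f"{name:16s} WAPE={wape(actual, forecast):.1%}  Bias={bias(actual, forecast):+.1%}")
```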
Checklist — before you train
- Primary metric chosen for the real outcome
- Clear target threshold and decision rule
- At least two baseline levels defined
- Guardrails set (safety, fairness, latency, cost)
- Frozen dataset and split strategy
- Plan for offline confidence intervals and online power analysis
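For the last checklist item, a rough sketch of a paired bootstrap CI for an offline metric delta and a normal-approximation sample-size estimate for an online test; the effect sizes and the 5% base rate are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def paired_bootstrap_ci(per_item_base, per_item_cand, n_boot=2000, alpha=0.05, seed=0):
    """CI for the mean metric delta over per-example (or per-query) metric values."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(per_item_cand) - np.asarray(per_item_base)
    boots = [rng.choice(diffs, size=len(diffs), replace=True).mean() for _ in range(n_boot)]
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2])

def n_per_arm(p_base, min_lift, alpha=0.05, power=0.80):
    """Normal-approximation sample size to detect a lift in a conversion-style rate."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_new = p_base + min_lift
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return int(np.ceil(var * (z_a + z_b) ** 2 / min_lift ** 2))

# Offline: CI for a per-query metric delta (synthetic values).
rng = np.random.default_rng(4)
base_vals = rng.normal(0.30, 0.10, 500)
cand_vals = base_vals + 0.01 + rng.normal(0, 0.05, 500)
print("95% CI for metric delta:", np.round(paired_bootstrap_ci(base_vals, cand_vals), 4))

# Online: rough traffic needed per arm to detect +1pp on a 5% conversion rate.
print("n per arm:", n_per_arm(p_base=0.05, min_lift=0.01))
```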
Common mistakes and how to self-check
- Mistake: Using accuracy on imbalanced data. Fix: Use PR-AUC or cost-weighted metrics (see the self-check sketch after this list).
- Mistake: No guardrails. Fix: Add latency, fairness, and safety constraints.
- Mistake: Vague baselines. Fix: Implement a naive and the current production system.
- Mistake: Moving targets mid-project. Fix: Freeze the Success Spec; update only via review.
- Mistake: Overfitting to validation. Fix: Use a time-based holdout; report CIs.
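A quick self-check for the first mistake, assuming a 1%-positive problem and a model that never flags anything; synthetic and illustrative only. Accuracy looks great while recall and PR-AUC reveal the model does nothing useful.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)      # 1% positives
y_pred = np.zeros_like(y_true)               # "always negative" model
print("accuracy:", accuracy_score(y_true, y_pred))                      # 0.99
print("recall  :", recall_score(y_true, y_pred))                        # 0.0
print("PR-AUC  :", round(average_precision_score(y_true, y_pred), 3))   # ~= positive rate
```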
Exercises
ex1 — Spam detection Success Spec
Write a one-page Success Spec for a mailbox spam classifier.
What to include
- Primary metric for imbalanced data
- Target thresholds and decision rule
- Guardrails (latency, safety)
- Baseline ladder
ex2 — Compute metrics and decide
Given counts on a 10k email holdout: TP=720, FP=80, FN=180, TN=9020. Baseline model had Precision=0.85, Recall=0.65. Your model targets: Precision ≥ 0.90, Recall ≥ 0.70. Decide pass/fail vs baseline and targets.
ex3 — Baseline ladder for search
Design a baseline ladder for a product search ranker with strict latency budget. Justify each rung.
Practical projects
- Build a Success Spec and baseline ladder for any public dataset task (classification or ranking). Report results with confidence intervals.
- Take a previous project and retrofit a Success Spec. Identify what would have changed.
- Create a small safety/fairness guardrail evaluation for your task (e.g., group-wise metrics).
Learning path
- Start here: define success and baselines for your current task.
- Next: data splitting, leakage checks, and robust evaluation.
- Then: experiment design (power analysis, CI), and online testing.
- Finally: monitoring drift, regression tests, and model iteration loops.
Next steps
- Write and circulate a one-page Success Spec with your team.
- Implement and run all baselines. Share a one-pager with results and CIs.
- Prepare an A/B test plan with guardrails if offline targets are met.
Mini challenge
In one paragraph, define a ship/no-ship decision rule for a recommendation model that balances relevance and diversity under a 120ms latency cap. Include at least one primary metric and two guardrails.