Why this matters
As an Applied Scientist, you need models that move product metrics, not just benchmark scores. Choosing offline metrics well lets you iterate quickly, catch issues before A/B tests, and create clear launch gates.
- Ship features faster: know which offline metric predicts online success.
- Protect the business: add guardrail metrics (e.g., fairness, safety, latency).
- Make trade-offs explicit: turn costs and risks into measurable targets.
Real tasks you will face
- Define the "north-star" offline metric for a new model (e.g., PR AUC for fraud).
- Pick guardrails that must not regress (e.g., unsafe content rate, group recall).
- Write the evaluation plan used for launch decisions and A/B gating thresholds.
Concept explained simply
Offline evaluation metrics quantify model quality without live traffic. Good selection means your offline results predict what happens online.
Mental model
- North-star metric: the single score you optimize to move your primary product goal.
- Guardrails: metrics that must not get worse (e.g., fairness, calibration, latency, content safety).
- Diagnostics: metrics that explain why (e.g., per-segment recall, calibration error, error bars).
Categories of common metrics
Binary classification
- Threshold-free: ROC AUC (ranking quality), PR AUC (rare positives focus).
- Thresholded: Precision, Recall, F1, Specificity, FPR, TPR.
- Probabilistic: Log loss (cross-entropy), Brier score; Calibration error (ECE).
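A minimal sketch of how these three metric families could be computed with scikit-learn (the library choice, synthetic data, and 0.3 threshold are illustrative assumptions; PR AUC is computed here as average precision, and the ECE helper is one simple equal-width-bin variant):

```python
# Illustrative sketch: threshold-free, thresholded, and probabilistic metrics on toy data.
import numpy as np
from sklearn.metrics import (
    roc_auc_score, average_precision_score,   # threshold-free (PR AUC ~ average precision)
    precision_score, recall_score, f1_score,  # thresholded
    log_loss, brier_score_loss,               # probabilistic
)

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.05, size=5000)                                  # imbalanced labels
y_prob = np.clip(0.05 + 0.4 * y_true + rng.normal(0, 0.15, 5000), 0.001, 0.999)

print("ROC AUC  :", roc_auc_score(y_true, y_prob))
print("PR AUC   :", average_precision_score(y_true, y_prob))

threshold = 0.3                                                            # illustrative operating point
y_pred = (y_prob >= threshold).astype(int)
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))

print("Log loss :", log_loss(y_true, y_prob))
print("Brier    :", brier_score_loss(y_true, y_prob))

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Simple equal-width-bin ECE; many variants exist."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

print("ECE      :", expected_calibration_error(y_true, y_prob))
```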
Ranking / Recommenders / Search
- NDCG@K, MAP@K, MRR, Recall@K, HitRate@K.
- Coverage, diversity/novelty (guardrails), long-tail recall.
Regression / Forecasting
- MAE, RMSE (penalizes large errors more), MedAE (robust).
- WAPE (weighted by actuals), Quantile loss (pinball), P90 absolute error.
- Avoid plain MAPE when actuals can be zero; use WAPE or SMAPE instead.
Cost-sensitive / Utility
- Expected profit or cost using FP/FN costs.
- Uplift/Qini-based metrics for treatment effect models.
Stability & Uncertainty
- Cross-validation variance, bootstrapped confidence intervals.
- Segment-level metrics and error bars.
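A minimal sketch of a percentile bootstrap confidence interval for one offline metric (PR AUC here; the resample count and toy data are illustrative assumptions):

```python
# Illustrative sketch: percentile bootstrap CI by resampling evaluation rows with replacement.
import numpy as np
from sklearn.metrics import average_precision_score

def bootstrap_ci(y_true, y_prob, metric_fn, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if y_true[idx].sum() == 0:          # skip resamples with no positives
            continue
        stats.append(metric_fn(y_true[idx], y_prob[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Usage with toy data:
rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.05, size=2000)
y_prob = np.clip(0.05 + 0.4 * y_true + rng.normal(0, 0.15, 2000), 0, 1)
print("PR AUC 95% CI:", bootstrap_ci(y_true, y_prob, average_precision_score))
```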
Fairness & Safety (guardrails)
- Group TPR/FPR parity, calibration parity, safety violation rate.
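A minimal sketch of a per-group TPR/FPR report for a parity guardrail (the group labels, threshold, and toy data are illustrative assumptions):

```python
# Illustrative sketch: per-group TPR and FPR; large gaps between groups flag a parity concern.
import numpy as np
import pandas as pd

def group_rates(y_true, y_pred, groups):
    df = pd.DataFrame({"y": y_true, "p": y_pred, "g": groups})
    rows = []
    for g, sub in df.groupby("g"):
        tpr = sub.loc[sub.y == 1, "p"].mean() if (sub.y == 1).any() else np.nan
        fpr = sub.loc[sub.y == 0, "p"].mean() if (sub.y == 0).any() else np.nan
        rows.append({"group": g, "TPR": tpr, "FPR": fpr, "n": len(sub)})
    return pd.DataFrame(rows)

# Usage with toy data:
rng = np.random.default_rng(2)
y_true = rng.binomial(1, 0.1, 3000)
y_pred = rng.binomial(1, 0.12, 3000)
groups = rng.choice(["A", "B"], size=3000)
print(group_rates(y_true, y_pred, groups))
```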
Worked examples
Example 1: Fraud classification (rare positives)
Goal: catch fraudulent transactions with minimal good-user friction.
- Data: 1% fraud rate (100 positives in 10,000).
- At threshold 0.8: TP=60, FP=240, FN=40, TN=9,660.
- Precision = 60/(60+240) = 0.20
- Recall = 60/100 = 0.60
- FPR = 240/9,900 ≈ 0.024
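A quick sketch that reproduces the arithmetic above from the confusion-matrix counts:

```python
# Quick check of the confusion-matrix arithmetic above (pure Python, no libraries).
TP, FP, FN, TN = 60, 240, 40, 9660

precision = TP / (TP + FP)          # 60 / 300   = 0.20
recall    = TP / (TP + FN)          # 60 / 100   = 0.60
fpr       = FP / (FP + TN)          # 240 / 9900 ≈ 0.024

print(precision, recall, round(fpr, 3))   # 0.2 0.6 0.024
```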
Metric choice
- North-star: PR AUC (sensitive to rare positives).
- Guardrails: FPR at threshold ≤ 3%; Calibration (ECE) ≤ 0.05.
- Cost metric: assume each FP costs 2 and each FN costs 50, so expected cost = 2×240 + 50×40 = 2,480. Choose the threshold that minimizes expected cost (see the sketch below).
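A minimal sketch of the threshold sweep implied by that cost metric (per-FP cost 2 and per-FN cost 50 as in the example; the scores and labels are synthetic placeholders):

```python
# Illustrative sketch: sweep candidate thresholds and pick the one with the lowest expected cost.
import numpy as np

def best_threshold(y_true, y_prob, cost_fp=2.0, cost_fn=50.0):
    thresholds = np.unique(np.round(y_prob, 3))
    best_t, best_cost = None, float("inf")
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        fp = int(((y_pred == 1) & (y_true == 0)).sum())
        fn = int(((y_pred == 0) & (y_true == 1)).sum())
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Usage with toy data:
rng = np.random.default_rng(3)
y_true = rng.binomial(1, 0.01, 10000)
y_prob = np.clip(0.02 + 0.6 * y_true + rng.normal(0, 0.1, 10000), 0, 1)
print(best_threshold(y_true, y_prob))
```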
Why not accuracy?
With 1% positives, predicting all negatives gives 99% accuracy but zero recall. Accuracy hides failure on rare events.
Example 2: Search ranking (graded relevance)
Goal: prioritize highly relevant results near the top.
- Relevance at positions 1..5: [3,2,0,1,2]. Use DCG with gains 2^rel−1 and discounts log2(position+1).
- DCG@5 = 7/log2(2) + 3/log2(3) + 0/log2(4) + 1/log2(5) + 3/log2(6) = 7/1 + 3/1.585 + 0/2 + 1/2.322 + 3/2.585 ≈ 10.48
- IDCG@5 (sorted [3,2,2,1,0]) ≈ 10.82
- NDCG@5 ≈ 10.48/10.82 ≈ 0.97
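A minimal sketch that reproduces this NDCG@5 computation (2^rel−1 gains, log2(position+1) discounts):

```python
# Illustrative sketch: DCG, ideal DCG, and NDCG for the relevance list above.
import math

def dcg(relevances):
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

rels  = [3, 2, 0, 1, 2]
dcg5  = dcg(rels)                          # ≈ 10.48
idcg5 = dcg(sorted(rels, reverse=True))    # ≈ 10.82
print(round(dcg5, 2), round(idcg5, 2), round(dcg5 / idcg5, 3))   # ≈ 0.969
```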
Metric choice
- North-star: NDCG@10 (position-aware, graded).
- Guardrails: Content safety violation rate ≤ baseline; Long-tail coverage ≥ baseline.
- Diagnostics: Recall@50 per query segment; diversity index.
Example 3: Demand forecasting (heterogeneous scale)
Goal: minimize inventory error across SKUs of very different volumes.
- SKUs: A: actual 100, forecast 110 (|err|=10); B: 1000 vs 900 (|err|=100); C: 0 vs 20 (|err|=20).
- MAE = (10+100+20)/3 ≈ 43.33
- RMSE = sqrt((100 + 10,000 + 400)/3) ≈ 59.16
- WAPE = (10+100+20)/(100+1000+0) ≈ 11.8%
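A minimal sketch reproducing these forecasting metrics on the same three SKUs (the P90 absolute-error line is an illustrative extra):

```python
# Illustrative sketch: MAE, RMSE, WAPE, and a P90 absolute error on three SKUs.
import numpy as np

actual   = np.array([100, 1000, 0])
forecast = np.array([110,  900, 20])
abs_err  = np.abs(actual - forecast)                # [10, 100, 20]

mae  = abs_err.mean()                               # ≈ 43.33
rmse = np.sqrt(((actual - forecast) ** 2).mean())   # ≈ 59.16
wape = abs_err.sum() / actual.sum()                 # ≈ 0.118
p90  = np.percentile(abs_err, 90)                   # high-quantile error (guardrail-style)

print(round(mae, 2), round(rmse, 2), round(wape, 3), p90)
```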
Metric choice
- North-star: WAPE (weights by actual demand).
- Guardrails: P90 absolute error ≤ target; No negative inventory predictions.
- Note: Avoid plain MAPE (division by zero for SKU C). Consider SMAPE if needed.
How to choose metrics (step-by-step)
- Define the outcome: What business KPI should move (revenue, safety, retention)? What is a good offline proxy?
- Pick one north-star: Choose an offline metric that aligns with the outcome (e.g., PR AUC for rare positives, NDCG@K for ranking).
- Add guardrails: Safety, fairness, calibration, latency, and any domain constraints.
- Design evaluation protocol: Time-based or stratified splits; segment-level reporting; bootstrap CIs.
- Cost/utility: If costs are asymmetric, define expected cost and choose thresholds that maximize expected utility.
- Document thresholds and tie-breaking: Fixed K, score thresholds, dedup rules, filtering logic.
- Define gate criteria: What must improve and by how much (with confidence) to ship.
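A minimal sketch of what a launch-gate check could look like once the north-star CI and guardrail deltas are computed (the function name, delta orientation, and tolerances are illustrative assumptions, not a standard API):

```python
# Illustrative sketch: ship only if the north-star improves with confidence and no guardrail
# regresses beyond its allowed slack.
def passes_gate(north_star_ci, baseline_north_star, guardrail_deltas, tolerances):
    """guardrail_deltas: improvement vs baseline per guardrail, oriented so that
    positive = better (e.g., baseline_ece - candidate_ece for ECE).
    tolerances: how much worsening (negative delta) is acceptable per guardrail."""
    lo, _ = north_star_ci
    if lo <= baseline_north_star:                   # require the CI lower bound to beat baseline
        return False, "north-star improvement not significant"
    for name, delta in guardrail_deltas.items():
        if delta < -tolerances.get(name, 0.0):      # regression beyond allowed slack
            return False, f"guardrail regressed: {name}"
    return True, "ship"

# Usage with made-up numbers:
print(passes_gate(
    north_star_ci=(0.512, 0.534),
    baseline_north_star=0.505,
    guardrail_deltas={"ece": 0.002, "fpr_at_threshold": -0.001},
    tolerances={"ece": 0.005, "fpr_at_threshold": 0.002},
))
```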
Mini task: turn business goals into metrics
- Goal: reduce harmful content views in recommendations.
North-star: Recall@K for human-labeled safe items? Not quite. Prefer: treat unsafe content rate as a guardrail to reduce while maintaining NDCG@K on relevance as the north-star.
- Goal: increase signup conversions.
North-star: PR AUC on conversion propensity; threshold chosen by expected profit given acquisition cost.
Common mistakes and self-checks
- Using accuracy for imbalanced data. Self-check: Compare PR AUC and recall at target precision instead.
- Optimizing ROC AUC when positives are rare. Self-check: Inspect PR curves and PR AUC.
- Using MAPE with zeros. Self-check: Switch to SMAPE or WAPE.
- No uncertainty estimates. Self-check: Report bootstrap 95% CIs and per-segment variability.
- Mismatch between offline data and serving distribution. Self-check: Time-split, exclude leakage, align candidate sets.
- No guardrails. Self-check: Add fairness, safety, calibration, latency thresholds.
- Ambiguous definition (e.g., what is a "click"). Self-check: Write metric spec with inclusion/exclusion rules.
- Comparing models at different thresholds. Self-check: Use threshold-free metrics or consistent threshold selection.
Quick self-audit checklist
- North-star, guardrails, diagnostics are explicitly defined.
- Evaluation split mirrors production.
- Confidence intervals reported.
- Segment metrics reviewed (e.g., geography, device, user tenure).
- No leakage from future or labels.
- Clear thresholding and tie-break rules documented.
Exercises
Complete the exercise and compare with the solution when done. Your write-up should include:
- Your chosen north-star metric and guardrails.
- Threshold selection rule (if applicable).
- One paragraph explaining trade-offs.
Practical projects
- Build an evaluation report: For an imbalanced classifier, compute PR AUC, plot precision-recall at different thresholds, add bootstrap CIs, and propose a launch gate.
- Ranking evaluation: Implement NDCG@K, MAP@K, Recall@K across query segments. Compare two candidate generation strategies.
- Forecasting dashboard: Compute WAPE, RMSE, P90 error, and segment results by product tier. Add a simple threshold-based alert for drift.
- Cost-sensitive thresholding: Estimate FP and FN costs from historical logs; choose a threshold that maximizes expected profit.
Learning path
- Before this: Problem definition and target choice → Data splitting and leakage prevention → Baselines and sanity checks.
- This subskill: Pick metrics that align with business outcomes; specify guardrails and diagnostics; define evaluation protocol.
- Next: Online experiment design, variance reduction, sequential testing → Post-launch monitoring and model health → Model iteration and documentation.
Who this is for
- Applied Scientists, ML Engineers, and Data Scientists preparing models for production.
- Product-leaning scientists who must justify launch decisions.
Prerequisites
- Comfort with classification, ranking, and regression basics.
- Understanding of train/validation/test splits and cross-validation.
- Basic probability and statistics (confidence intervals, bootstrapping).
Next steps
- Instrument your code to output all chosen metrics and segment breakdowns.
- Add bootstrap CIs and variance awareness to your reports.
- Draft a one-page metric spec (north-star, guardrails, thresholds, gates).
Mini challenge
You're ranking notifications to maximize daily active users (DAU). Labels are clicks; some notifications can annoy users if too frequent.
- Pick a north-star offline metric and justify.
- Add two guardrails related to user experience and safety.
- Describe how you would ensure the offline evaluation matches the serving distribution.
Suggested direction
NDCG@K as the north-star (position-aware). Guardrails: a notification-fatigue proxy (e.g., hide/dismiss rate) must not increase; unsafe/spam content rate ≤ baseline. Use a time-based split and match the candidate generation and filters used in serving.