
Offline Evaluation Metrics Selection

Learn Offline Evaluation Metrics Selection for free with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Why this matters

As an Applied Scientist, your models must move product metrics, not just benchmark scores. Choosing offline metrics well lets you iterate quickly, catch issues before A/B tests, and create clear launch gates.

  • Ship features faster: know which offline metric predicts online success.
  • Protect the business: add guardrail metrics (e.g., fairness, safety, latency).
  • Make trade-offs explicit: turn costs and risks into measurable targets.
Real tasks you will face
  • Define the "north-star" offline metric for a new model (e.g., PR AUC for fraud).
  • Pick guardrails that must not regress (e.g., unsafe content rate, group recall).
  • Write the evaluation plan used for launch decisions and A/B gating thresholds.

Concept explained simply

Offline evaluation metrics quantify model quality without live traffic. Good selection means your offline results predict what happens online.

Mental model

  • North-star metric: the single score you optimize to move your primary product goal.
  • Guardrails: metrics that must not get worse (e.g., fairness, calibration, latency, content safety).
  • Diagnostics: metrics that explain why (e.g., per-segment recall, calibration error, error bars).
Categories of common metrics

Binary classification

  • Threshold-free: ROC AUC (ranking quality), PR AUC (rare positives focus).
  • Thresholded: Precision, Recall, F1, Specificity, FPR, TPR.
  • Probabilistic: Log loss (cross-entropy), Brier score; Calibration error (ECE).
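To make the probabilistic metrics concrete, here is a minimal NumPy sketch of log loss, Brier score, and a binned expected calibration error (ECE). The 10-bin scheme and the toy label/probability arrays are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def log_loss(y_true, p, eps=1e-12):
    """Mean cross-entropy of predicted probabilities p against binary labels."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def brier_score(y_true, p):
    """Mean squared error between the predicted probability and the 0/1 label."""
    return np.mean((p - y_true) ** 2)

def expected_calibration_error(y_true, p, n_bins=10):
    """Binned ECE: average |mean predicted prob - observed positive rate|,
    weighted by the fraction of examples falling in each probability bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p <= hi) if hi == 1.0 else (p >= lo) & (p < hi)
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y_true[mask].mean())
    return ece

# Toy example (illustrative values only)
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p = np.array([0.1, 0.3, 0.7, 0.2, 0.9, 0.6, 0.4, 0.8])
print(log_loss(y, p), brier_score(y, p), expected_calibration_error(y, p))
```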

Ranking / Recommenders / Search

  • NDCG@K, MAP@K, MRR, Recall@K, HitRate@K.
  • Coverage, diversity/novelty (guardrails), long-tail recall.

Regression / Forecasting

  • MAE, RMSE (penalizes large errors more), MedAE (robust).
  • WAPE (weighted by actuals), Quantile loss (pinball), P90 absolute error.
  • With zero actuals, plain MAPE is undefined; prefer WAPE, or use SMAPE with care (it saturates at its maximum whenever the actual is zero).
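As a concrete reference for the quantile (pinball) loss named above, here is a small sketch; the 0.9 quantile and the toy actual/forecast values are assumptions for illustration.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q=0.9):
    """Quantile (pinball) loss: under-prediction is penalized by q,
    over-prediction by (1 - q), so q=0.9 rewards forecasting high."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Illustrative values only: actuals vs. a forecast that runs low.
actuals  = [100.0, 80.0, 120.0]
forecast = [ 90.0, 85.0, 100.0]
print(pinball_loss(actuals, forecast, q=0.9))  # under-forecasts dominate the loss
```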

Cost-sensitive / Utility

  • Expected profit or cost using FP/FN costs.
  • Uplift/Qini-based metrics for treatment effect models.

Stability & Uncertainty

  • Cross-validation variance, bootstrapped confidence intervals.
  • Segment-level metrics and error bars.
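A minimal percentile-bootstrap sketch for attaching a confidence interval to any offline metric; the recall metric, resample count, and synthetic data below are illustrative assumptions.

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric_fn, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample rows with replacement, recompute the
    metric, and take the alpha/2 and 1 - alpha/2 quantiles of the results."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats.append(metric_fn(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def recall(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    return tp / max(np.sum(y_true == 1), 1)

# Synthetic labeled data (illustrative only): predictions agree ~80% of the time
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.8, y_true, 1 - y_true)
print(bootstrap_ci(y_true, y_pred, recall))
```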

Fairness & Safety (guardrails)

  • Group TPR/FPR parity, calibration parity, safety violation rate.
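A short sketch of per-group TPR/FPR as a parity guardrail; the group labels and toy predictions are made up for illustration, and in practice the maximum gap across groups would be compared against a threshold.

```python
import numpy as np

def group_rates(y_true, y_pred, groups):
    """Per-group true-positive and false-positive rates; the max gap
    across groups can serve as a parity guardrail."""
    rates = {}
    for g in np.unique(groups):
        m = groups == g
        pos, neg = (y_true[m] == 1), (y_true[m] == 0)
        tpr = np.mean(y_pred[m][pos] == 1) if pos.any() else float("nan")
        fpr = np.mean(y_pred[m][neg] == 1) if neg.any() else float("nan")
        rates[g] = {"tpr": tpr, "fpr": fpr}
    return rates

# Illustrative data only
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
groups = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
print(group_rates(y_true, y_pred, groups))
```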

Worked examples

Example 1: Fraud classification (rare positives)

Goal: catch fraudulent transactions with minimal good-user friction.

  • Data: 1% fraud rate (100 positives in 10,000).
  • At threshold 0.8: TP=60, FP=240, FN=40, TN=9,660.
  • Precision = 60/(60+240) = 0.20
  • Recall = 60/100 = 0.60
  • FPR = 240/9,900 ≈ 0.024

Metric choice

  • North-star: PR AUC (sensitive to rare positives).
  • Guardrails: FPR at threshold ≤ 3%; Calibration (ECE) ≤ 0.05.
  • Cost metric: assuming a unit cost of 2 per FP and 50 per FN, expected cost = 2×FP + 50×FN = 2×240 + 50×40 = 2,480. Optimize the threshold to minimize expected cost (the sketch below reproduces these numbers).
Why not accuracy?

With 1% positives, predicting all negatives gives 99% accuracy but zero recall. Accuracy hides failure on rare events.
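A minimal sketch that reproduces the numbers in this example directly from the confusion-matrix counts; the unit costs of 2 per false positive and 50 per false negative come from the cost-metric bullet above.

```python
# Confusion-matrix counts from the fraud example at threshold 0.8
TP, FP, FN, TN = 60, 240, 40, 9_660

precision = TP / (TP + FP)          # 60 / 300    = 0.20
recall    = TP / (TP + FN)          # 60 / 100    = 0.60
fpr       = FP / (FP + TN)          # 240 / 9,900 ≈ 0.024

# Expected cost with the assumed unit costs: 2 per FP, 50 per FN
cost_fp, cost_fn = 2, 50
expected_cost = cost_fp * FP + cost_fn * FN   # 2*240 + 50*40 = 2,480

print(precision, recall, round(fpr, 3), expected_cost)
```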

Example 2: Search ranking (graded relevance)

Goal: prioritize highly relevant results near the top.

  • Relevance at positions 1..5: [3,2,0,1,2]. Using DCG with 2^rel−1 gains and log2(position+1) discounts.
  • DCG@5 = 7/1 + 3/1.585 + 0/2 + 1/2.322 + 3/2.585 ≈ 10.48
  • IDCG@5 (sorted [3,2,2,1,0]) ≈ 10.82
  • NDCG@5 ≈ 10.48/10.82 ≈ 0.97
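A short sketch that recomputes these values with 2^rel − 1 gains and log2(position + 1) discounts; the helper names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """DCG with exponential gains and log2 position discounts."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG: DCG of the ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

rels = [3, 2, 0, 1, 2]                 # graded relevance at positions 1..5
print(round(dcg_at_k(rels, 5), 2))     # ≈ 10.48
print(round(ndcg_at_k(rels, 5), 2))    # ≈ 0.97
```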

Metric choice

  • North-star: NDCG@10 (position-aware, graded).
  • Guardrails: Content safety violation rate ≤ baseline; Long-tail coverage ≥ baseline.
  • Diagnostics: Recall@50 per query segment; diversity index.

Example 3: Demand forecasting (heterogeneous scale)

Goal: minimize inventory error across SKUs of very different volumes.

  • SKUs: A: actual 100, forecast 110 (|err|=10); B: 1000 vs 900 (|err|=100); C: 0 vs 20 (|err|=20).
  • MAE = (10+100+20)/3 ≈ 43.33
  • RMSE = sqrt((100 + 10,000 + 400)/3) ≈ 59.16
  • WAPE = (10+100+20)/(100+1000+0) ≈ 11.8%
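The same three SKUs run through MAE, RMSE, and WAPE in a short NumPy sketch, to show how totals-based weighting differs from per-SKU averaging.

```python
import numpy as np

actuals  = np.array([100.0, 1000.0,  0.0])   # SKUs A, B, C
forecast = np.array([110.0,  900.0, 20.0])

abs_err = np.abs(actuals - forecast)

mae  = abs_err.mean()                               # (10+100+20)/3 ≈ 43.33
rmse = np.sqrt(((actuals - forecast) ** 2).mean())  # sqrt(10500/3) ≈ 59.16
wape = abs_err.sum() / actuals.sum()                # 130/1100      ≈ 0.118

print(round(mae, 2), round(rmse, 2), round(wape, 3))
# Plain MAPE would divide by SKU C's actual of 0, which is undefined.
```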

Metric choice

  • North-star: WAPE (weights by actual demand).
  • Guardrails: P90 absolute error ≤ target; No negative inventory predictions.
  • Note: Avoid plain MAPE (division by zero for SKU C). Consider SMAPE if needed.

How to choose metrics (step-by-step)

  1. Define the outcome: What business KPI should move (revenue, safety, retention)? What is a good offline proxy?
  2. Pick one north-star: Choose an offline metric that aligns with the outcome (e.g., PR AUC for rare positives, NDCG@K for ranking).
  3. Add guardrails: Safety, fairness, calibration, latency, and any domain constraints.
  4. Design evaluation protocol: Time-based or stratified splits; segment-level reporting; bootstrap CIs.
  5. Cost/utility: If costs are asymmetric, define expected cost and choose thresholds that maximize expected utility.
  6. Document thresholds and tie-breaking: Fixed K, score thresholds, dedup rules, filtering logic.
  7. Define gate criteria: What must improve and by how much (with confidence) to ship.
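A sketch of step 5 under assumed unit costs and a synthetic scored validation set: sweep candidate thresholds and keep the one with the lowest expected cost. The cost values (2 per FP, 50 per FN, borrowed from the fraud example), the grid, and the data are illustrative assumptions.

```python
import numpy as np

def best_threshold_by_cost(y_true, scores, cost_fp, cost_fn, grid=None):
    """Sweep thresholds and return the one minimizing expected cost,
    where cost = cost_fp * (# false positives) + cost_fn * (# false negatives)."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best_t, best_cost = None, float("inf")
    for t in grid:
        y_pred = (scores >= t).astype(int)
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Synthetic scored validation set (illustrative only): rare positives, noisy scores
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)
scores = np.clip(rng.normal(0.3, 0.15, 10_000) + 0.5 * y_true, 0, 1)
print(best_threshold_by_cost(y_true, scores, cost_fp=2, cost_fn=50))
```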
Mini task: turn business goals into metrics
  • Goal: reduce harmful content views in recommendations.
    North-star: Recall@K for human-labeled safe items? Not quite. Prefer: reduce unsafe content rate (guardrail) while maintaining NDCG@K on relevance (north-star).
  • Goal: increase signup conversions.
    North-star: PR AUC on conversion propensity; Threshold chosen by expected profit given acquisition cost.

Common mistakes and self-checks

  • Using accuracy for imbalanced data. Self-check: Compare PR AUC and recall at target precision instead.
  • Optimizing ROC AUC when positives are rare. Self-check: Inspect PR curves and PR AUC.
  • Using MAPE with zeros. Self-check: Switch to SMAPE or WAPE.
  • No uncertainty estimates. Self-check: Report bootstrap 95% CIs and per-segment variability.
  • Mismatch between offline data and serving distribution. Self-check: Time-split, exclude leakage, align candidate sets.
  • No guardrails. Self-check: Add fairness, safety, calibration, latency thresholds.
  • Ambiguous definition (e.g., what is a "click"). Self-check: Write metric spec with inclusion/exclusion rules.
  • Comparing models at different thresholds. Self-check: Use threshold-free metrics or consistent threshold selection.
Quick self-audit checklist
  • North-star, guardrails, diagnostics are explicitly defined.
  • Evaluation split mirrors production.
  • Confidence intervals reported.
  • Segment metrics reviewed (e.g., geography, device, user tenure).
  • No leakage from future or labels.
  • Clear thresholding and tie-break rules documented.

Exercises

Do the exercise below and compare with the solution when done.

What to submit
  • Your chosen north-star metric and guardrails.
  • Threshold selection rule (if applicable).
  • One paragraph explaining trade-offs.

Practical projects

  • Build an evaluation report: For an imbalanced classifier, compute PR AUC, plot precision-recall at different thresholds, add bootstrap CIs, and propose a launch gate.
  • Ranking evaluation: Implement NDCG@K, MAP@K, Recall@K across query segments. Compare two candidate generation strategies.
  • Forecasting dashboard: Compute WAPE, RMSE, P90 error, and segment results by product tier. Add a simple threshold-based alert for drift.
  • Cost-sensitive thresholding: Estimate FP and FN costs from historical logs; choose a threshold that maximizes expected profit.

Learning path

  • Before this: Problem definition and target choice → Data splitting and leakage prevention → Baselines and sanity checks.
  • This subskill: Pick metrics that align with business outcomes; specify guardrails and diagnostics; define evaluation protocol.
  • Next: Online experiment design, variance reduction, sequential testing → Post-launch monitoring and model health → Model iteration and documentation.

Who this is for

  • Applied Scientists, ML Engineers, and Data Scientists preparing models for production.
  • Product-leaning scientists who must justify launch decisions.

Prerequisites

  • Comfort with classification, ranking, and regression basics.
  • Understanding of train/validation/test splits and cross-validation.
  • Basic probability and statistics (confidence intervals, bootstrapping).

Next steps

  • Instrument your code to output all chosen metrics and segment breakdowns.
  • Add bootstrap CIs and variance awareness to your reports.
  • Draft a one-page metric spec (north-star, guardrails, thresholds, gates).
Progress and test info

The quick test is available to everyone. Only logged-in users have their progress saved.

Mini challenge

You're ranking notifications to maximize daily active users (DAU). Labels are clicks; some notifications can annoy users if too frequent.

  • Pick a north-star offline metric and justify.
  • Add two guardrails related to user experience and safety.
  • Describe how you would ensure the offline evaluation matches the serving distribution.
Suggested direction

NDCG@K as the north-star (position-aware). Guardrails: a notification-fatigue proxy (e.g., hide/dismiss rate) must not increase; unsafe/spam content rate ≤ baseline. Use a time-based split and match the candidate generation and filters used in serving.

Practice Exercises

1 exercise to complete

Instructions

You are building a credit default classifier. Positives (defaults) are 2% of cases. Business costs: FN (missed default) costs 200 units; FP (incorrectly flagging) costs 5 units due to manual review and customer friction. Your validation at threshold 0.7 yields: TP=320, FP=1,200, FN=80, TN=18,400 (out of 20,000).

  • Choose a north-star offline metric and explain why.
  • Propose two guardrails.
  • Define a threshold selection rule using expected cost.
  • Compute expected cost at threshold 0.7 and state whether you would explore moving the threshold up or down.
Expected Output
A clear metric plan with: (1) north-star aligned to rare positives or cost, (2) guardrails for FPR/calibration/fairness, (3) an expected-cost formula and numeric cost at the given threshold, and (4) a recommendation on threshold direction.

Offline Evaluation Metrics Selection — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
