Why this matters
In real projects, you do not ship a probability; you ship a decision. Calibration makes your model's predicted probabilities match reality (e.g., 0.7 means ~70% chance). Thresholding turns those probabilities into actions while balancing precision, recall, and business costs. Together, they reduce false alarms, missed cases, and wasted spend.
- Fraud detection: choose a threshold that keeps chargebacks low without swamping investigators.
- Medical triage: meet a minimum sensitivity while controlling unnecessary follow-ups.
- Marketing: match daily outreach capacity by selecting the top-scoring customers.
Quick Test note: anyone can take the test below. Logged-in users get saved progress.
Concept explained simply
Calibration answers: "When my model says 0.8, how often is it actually true?" A well-calibrated model's predicted probabilities align with observed frequencies. Thresholding answers: "At what probability do we act?"
Mental model
Think of weather forecasts. If a forecaster says "30% rain" on 100 days and it rains on ~30 of those, they are calibrated. If you decide to carry an umbrella when rain probability ≥ 0.5, that is thresholding. In ML, we want both: trustworthy probabilities and a threshold that fits cost/benefit and constraints.
Key metrics and tools
- Brier score: average squared error between predicted probability and outcome (0/1). Lower is better.
- Log loss: penalizes confident wrong predictions. Lower is better.
- Calibration curve (reliability diagram): compare predicted vs observed rate across bins.
- ECE (Expected Calibration Error): weighted average gap between predicted and observed rates across bins (smaller is better).
- ROC/PR curves: for threshold selection under different class balances.
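A minimal sketch of computing these with NumPy and scikit-learn, assuming you already have 0/1 labels and predicted probabilities on a validation set (the toy arrays below are illustrative placeholders):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, log_loss, precision_recall_curve, roc_curve

# Illustrative inputs: replace with your validation labels and predicted probabilities.
rng = np.random.default_rng(0)
p_pred = rng.uniform(0, 1, 1000)
y_true = (rng.uniform(0, 1, 1000) < p_pred).astype(int)  # perfectly calibrated toy data

print("Brier score:", brier_score_loss(y_true, p_pred))  # lower is better
print("Log loss:", log_loss(y_true, p_pred))              # lower is better

# Reliability diagram points: observed positive rate vs mean predicted p per bin.
obs_rate, mean_pred = calibration_curve(y_true, p_pred, n_bins=10, strategy="uniform")
print("Reliability points (predicted, observed):",
      list(zip(mean_pred.round(2), obs_rate.round(2))))

# ECE: bin-size-weighted average |observed rate - mean predicted p| over 10 equal-width bins.
bin_ids = np.clip(np.digitize(p_pred, np.linspace(0, 1, 11)) - 1, 0, 9)
ece = sum((bin_ids == b).mean() * abs(y_true[bin_ids == b].mean() - p_pred[bin_ids == b].mean())
          for b in range(10) if (bin_ids == b).any())
print("ECE:", round(ece, 4))

# ROC / PR curves supply the candidate thresholds used in the thresholding strategies below.
fpr, tpr, roc_thresholds = roc_curve(y_true, p_pred)
precision, recall, pr_thresholds = precision_recall_curve(y_true, p_pred)
```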
Quick sanity checks you can do fast
- Bucket predictions into 5–10 bins and compute the observed positive rate in each. Does it track the average predicted probability?
- Plot precision/recall vs threshold to see trade-offs.
- Stress-test thresholds against business constraints (e.g., max 50 alerts/day).
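A quick way to run these checks with pandas and scikit-learn (again on illustrative arrays; the daily volume and alert limit are hypothetical numbers):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(1)
p_pred = rng.uniform(0, 1, 500)                       # stand-in for validation scores
y_true = (rng.uniform(0, 1, 500) < p_pred).astype(int)

# Check 1: bucket predictions into 5 bins; does the observed rate track the mean prediction?
df = pd.DataFrame({"p": p_pred, "y": y_true})
df["bin"] = pd.cut(df["p"], bins=np.linspace(0, 1, 6), include_lowest=True)
print(df.groupby("bin", observed=True).agg(n=("y", "size"),
                                           avg_pred=("p", "mean"),
                                           obs_rate=("y", "mean")))

# Check 2: precision/recall trade-off at a few candidate thresholds.
prec, rec, thr = precision_recall_curve(y_true, p_pred)
for t in (0.3, 0.5, 0.7):
    i = np.searchsorted(thr, t)  # nearest available threshold at or above t
    print(f"t={t:.1f}: precision={prec[i]:.2f}, recall={rec[i]:.2f}")

# Check 3: alert volume against a capacity constraint (hypothetical 1000 cases/day, max 50 alerts).
daily_cases, max_alerts = 1000, 50
for t in (0.3, 0.5, 0.7):
    alerts = (p_pred >= t).mean() * daily_cases
    print(f"t={t:.1f}: ~{alerts:.0f} alerts/day (limit {max_alerts})")
```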
Methods for calibration
- Platt scaling: fit a logistic regression on model scores (or logits) to map them to calibrated probabilities.
- Isotonic regression: flexible, non-parametric monotonic mapping; works well with enough validation data but can overfit on small datasets.
- Temperature scaling (for deep nets): scale logits by a single parameter; preserves ranking, adjusts confidence.
- Binning (histogram binning): average observed rate per bin; simple baseline.
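A compact sketch of three of these mappings fit on validation data: Platt and isotonic via scikit-learn, and temperature scaling via a small SciPy optimization over logits. All inputs below are synthetic stand-ins for your own scores and labels.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit  # logistic sigmoid
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Synthetic validation data: the model's logits are too spread out, i.e. it is overconfident.
rng = np.random.default_rng(2)
logits = rng.normal(0, 2, 2000)
y_val = (rng.uniform(0, 1, 2000) < expit(logits / 2)).astype(int)
scores = expit(logits)  # uncalibrated probabilities

# Platt scaling: logistic regression on the score (or logit) as a single feature.
platt = LogisticRegression().fit(scores.reshape(-1, 1), y_val)
p_platt = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic regression: monotonic, non-parametric mapping from score to probability.
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, y_val)
p_iso = iso.predict(scores)

# Temperature scaling: divide logits by one learned temperature T that minimizes log loss.
def nll(T):
    p = np.clip(expit(logits / T), 1e-12, 1 - 1e-12)
    return -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))

T = minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x
p_temp = expit(logits / T)
print("learned temperature:", round(T, 2))  # > 1 means the original model was overconfident
```

Because temperature scaling (like any monotonic map) preserves the ranking of scores, AUC is unchanged; only the confidence of each prediction is adjusted.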
How to choose a calibration method
- Small validation set: Platt scaling or temperature scaling (low-variance).
- Large validation set: isotonic regression can capture non-linear distortions.
- Keep a separate holdout set to verify improved ECE/Brier score after calibration.
When to calibrate
- When you will interpret probabilities directly (risk scores, ranking, cost-based thresholding).
- When class distribution has shifted or the model is overconfident (common in deep nets).
- After hyperparameter tuning, perform calibration on a fresh validation set to avoid leakage.
Threshold selection strategies
- Metric-driven: choose a threshold that maximizes F1, Youden's J (TPR − FPR), or meets a target precision/recall.
- Cost-driven: use business costs. If cost(false positive) = C_FP and cost(false negative) = C_FN, classify positive when p ≥ C_FP / (C_FP + C_FN) on well-calibrated probabilities.
- Capacity-driven: pick threshold to produce a fixed number of positives per day (quantile on scores).
- Risk-driven: set separate thresholds by segment (e.g., stricter for high-risk groups) but review fairness and drift.
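Sketches of the metric-, cost-, and capacity-driven strategies, assuming validation labels and (ideally calibrated) probabilities; the costs and daily volumes below are placeholders:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

rng = np.random.default_rng(3)
p_val = rng.uniform(0, 1, 2000)                       # stand-in for calibrated probabilities
y_val = (rng.uniform(0, 1, 2000) < p_val).astype(int)

# Metric-driven: maximize F1 over the PR curve's candidate thresholds.
prec, rec, thr = precision_recall_curve(y_val, p_val)
f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
t_f1 = thr[np.argmax(f1)]

# Metric-driven: maximize Youden's J = TPR - FPR over the ROC thresholds.
fpr, tpr, roc_thr = roc_curve(y_val, p_val)
t_youden = roc_thr[np.argmax(tpr - fpr)]

# Cost-driven: closed-form threshold from the cost ratio (requires calibrated probabilities).
C_FP, C_FN = 1.0, 5.0                                 # placeholder business costs
t_cost = C_FP / (C_FP + C_FN)

# Capacity-driven: allow at most 50 positives out of a hypothetical 1000 cases/day.
daily_cases, max_alerts = 1000, 50
t_capacity = np.quantile(p_val, 1 - max_alerts / daily_cases)

print({"f1": round(float(t_f1), 3), "youden": round(float(t_youden), 3),
       "cost": round(t_cost, 3), "capacity": round(float(t_capacity), 3)})
```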
Decision rule with costs (plain language)
Predict positive when the expected cost of not acting (p × C_FN) exceeds the cost of acting when unnecessary ((1 − p) × C_FP). Rearranging gives p ≥ C_FP / (C_FP + C_FN).
Worked examples
Example 1 – Cost-optimal threshold for fraud alerts
Costs: C_FP = 1 (investigate unnecessarily), C_FN = 5 (missed fraud). With well-calibrated probabilities, decision threshold p* = 1 / (1 + 5) ≈ 0.167. This is lower than 0.5 because missing fraud is costly.
Why this helps
Lowering the threshold increases recall, reducing expensive misses at the expense of more (cheaper) false alarms.
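As a quick check of the arithmetic (same costs as above):

```python
# Example 1 costs: investigate unnecessarily = 1, miss fraud = 5.
C_FP, C_FN = 1.0, 5.0
p_star = C_FP / (C_FP + C_FN)
print(round(p_star, 3))  # 0.167 -> send any case with p >= ~0.167 to review
```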
Example 2 – Calibration curve sanity check
Suppose for predictions around 0.5, the observed positive rate is 0.8. The model is under-confident in that region. Calibration (e.g., isotonic) can adjust the mapping so 0.5 scores better reflect reality.
Before vs after
- Before: ECE ~ 0.20, Brier score ~ 0.18
- After isotonic: ECE ~ 0.07, Brier score ~ 0.15
Numbers are illustrative; use a holdout set to confirm.
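A sketch of this before/after comparison on synthetic under-confident scores; the exact numbers will differ from the illustrative ones above, but the workflow is the point: fit the calibration map on validation, report on holdout.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

def ece(y, p, n_bins=10):
    """Expected Calibration Error with equal-width bins."""
    ids = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    return sum((ids == b).mean() * abs(y[ids == b].mean() - p[ids == b].mean())
               for b in range(n_bins) if (ids == b).any())

# Synthetic under-confident model: its scores understate the true positive rate,
# e.g. cases scored around 0.5 are positive roughly 75% of the time.
rng = np.random.default_rng(4)
def make_split(n):
    p_raw = rng.uniform(0.05, 0.95, n)
    true_rate = np.clip(p_raw + 0.25, 0, 1)
    y = (rng.uniform(0, 1, n) < true_rate).astype(int)
    return p_raw, y

p_val, y_val = make_split(5000)     # fit the calibration map here
p_hold, y_hold = make_split(5000)   # report before/after here

iso = IsotonicRegression(out_of_bounds="clip").fit(p_val, y_val)
p_cal = iso.predict(p_hold)

print("before: ECE", round(ece(y_hold, p_hold), 3), "Brier", round(brier_score_loss(y_hold, p_hold), 3))
print("after:  ECE", round(ece(y_hold, p_cal), 3), "Brier", round(brier_score_loss(y_hold, p_cal), 3))
```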
Example 3 – Meeting a clinical requirement
A triage model must reach sensitivity ≥ 0.95. Sweep thresholds and pick the highest threshold at which TPR ≥ 0.95 (lower thresholds also meet the constraint but add false positives), recording precision as you go. Report both, plus the expected daily positives, to ensure staffing can handle the volume.
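One way to run that sweep from scikit-learn's ROC output (synthetic scores; the 300 patients/day figure is a hypothetical volume):

```python
import numpy as np
from sklearn.metrics import precision_score, roc_curve

rng = np.random.default_rng(5)
p_val = rng.uniform(0, 1, 5000)                       # stand-in for validation probabilities
y_val = (rng.uniform(0, 1, 5000) < p_val).astype(int)

# roc_curve returns thresholds in decreasing order with the matching TPR at each one.
fpr, tpr, thr = roc_curve(y_val, p_val)
meets_target = tpr >= 0.95
t_clinical = thr[meets_target][0] if meets_target.any() else 0.0  # highest threshold with TPR >= 0.95

y_hat = (p_val >= t_clinical).astype(int)
daily_patients = 300                                  # hypothetical triage volume
print(f"threshold={t_clinical:.3f}",
      f"precision={precision_score(y_val, y_hat):.2f}",
      f"sensitivity={y_hat[y_val == 1].mean():.2f}",
      f"expected positives/day~{y_hat.mean() * daily_patients:.0f}")
```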
Step-by-step workflow
- Split data: train, validation (for threshold/calibration), and final holdout.
- Train model and compute predicted probabilities on validation set.
- Assess calibration: reliability diagram, ECE, Brier score.
- If needed, fit a calibration map (Platt, isotonic, or temperature) on validation.
- Re-evaluate ECE/Brier on a holdout to confirm improvement.
- Choose threshold: by metric target, cost rule, or capacity constraint.
- Stress-test: simulate volume, confusion matrix, and cost at the chosen threshold.
- Document: method, threshold, expected trade-offs, and monitoring plan.
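An end-to-end skeleton of this workflow on synthetic data; the classifier, calibration method, and costs are placeholders to swap for your own:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss, confusion_matrix
from sklearn.model_selection import train_test_split

# 1) Split: train / validation (calibration + threshold) / final holdout.
X, y = make_classification(n_samples=12000, weights=[0.9, 0.1], random_state=0)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_hold, y_val, y_hold = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# 2) Train and score the validation set.
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
p_val = model.predict_proba(X_val)[:, 1]

# 3-4) Fit a calibration map on the validation probabilities (after inspecting their reliability diagram).
iso = IsotonicRegression(out_of_bounds="clip").fit(p_val, y_val)

# 5) Re-evaluate on the holdout set.
p_hold_raw = model.predict_proba(X_hold)[:, 1]
p_hold = iso.predict(p_hold_raw)
print("Brier raw vs calibrated:",
      round(brier_score_loss(y_hold, p_hold_raw), 4),
      round(brier_score_loss(y_hold, p_hold), 4))

# 6) Choose a threshold, here by the cost rule (placeholder costs).
C_FP, C_FN = 1.0, 5.0
t = C_FP / (C_FP + C_FN)

# 7) Stress-test: confusion matrix, expected cost, and volume at the chosen threshold.
y_hat = (p_hold >= t).astype(int)
tn, fp, fn, tp = confusion_matrix(y_hold, y_hat).ravel()
print(f"threshold={t:.3f}, FP={fp}, FN={fn}, cost={fp * C_FP + fn * C_FN:.0f}, "
      f"positives flagged={y_hat.sum()}")
```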
Exercises
These exercises match the ones below the lesson. Do them here, then open solutions when stuck.
Exercise 1 – Cost-based threshold and cost comparison
Data (12 cases):

| id | p    | y |
|----|------|---|
| 1  | 0.05 | 0 |
| 2  | 0.12 | 0 |
| 3  | 0.18 | 1 |
| 4  | 0.22 | 0 |
| 5  | 0.28 | 1 |
| 6  | 0.31 | 0 |
| 7  | 0.41 | 1 |
| 8  | 0.55 | 1 |
| 9  | 0.62 | 0 |
| 10 | 0.74 | 1 |
| 11 | 0.85 | 1 |
| 12 | 0.93 | 1 |

Costs: C_FP = 1, C_FN = 5
- Compute p* = C_FP / (C_FP + C_FN).
- Classify at thresholds 0.5 and p*; compute FP, FN, and total cost = FP*C_FP + FN*C_FN.
- Which threshold is cheaper?
Hint
At p* = 1 / 6 ≈ 0.167, almost all cases with p ≥ 0.167 become positive; compare FP and FN counts.
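If you want to check your arithmetic in code, a minimal helper you can feed the table's values into (the list names are yours to define):

```python
def cost_at_threshold(p, y, t, c_fp=1.0, c_fn=5.0):
    """Count FP/FN and total cost when classifying positive at p >= t."""
    fp = sum(1 for pi, yi in zip(p, y) if pi >= t and yi == 0)
    fn = sum(1 for pi, yi in zip(p, y) if pi < t and yi == 1)
    return fp, fn, fp * c_fp + fn * c_fn

# Plug in the 12 p-values and labels from the table, e.g.:
# fp, fn, cost = cost_at_threshold(p_list, y_list, t=0.5)
# fp, fn, cost = cost_at_threshold(p_list, y_list, t=1 / 6)
```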
Exercise 2 – Build a reliability table (ECE)
Using the same 12 cases, create 5 bins: [0–0.2), [0.2–0.4), [0.4–0.6), [0.6–0.8), [0.8–1.0]. For each bin compute:
- n (count), avg predicted p, observed positive rate.
- Gap = observed − avg predicted.
- ECE = sum over bins of (n/N) × |gap|.
Hint
Compute averages per bin; N = 12. Keep 2–3 decimals.
Checklist
- [ ] You calculated p* correctly.
- [ ] You reported FP, FN, and total cost at both thresholds.
- [ ] Your ECE is a single number between 0 and 1.
- [ ] You noted at least one bin where the model was over- or under-confident.
Common mistakes
- Assuming 0.5 is the "right" threshold. It rarely is; use costs, targets, or capacity.
- Calibrating on the same data used for model training and threshold tuning. Keep a clean validation or holdout set.
- Optimizing AUC then ignoring precision/recall at the chosen threshold. Always report operating-point metrics.
- Overfitting with isotonic regression on tiny datasets. Prefer Platt or temperature scaling when data is scarce.
- Forgetting prevalence shifts. Monitor calibration drift and re-calibrate when base rates change.
Self-check
- Did your chosen threshold meet the business target (e.g., precision ≥ X or alerts/day ≤ Y)?
- Did calibration reduce ECE/Brier on a holdout set (not just on validation)?
- Are you logging predicted probabilities and outcomes to audit calibration over time?
Practical projects
- Alerting system: Build a simple classifier, calibrate it, then deploy a script that flags top-N cases per day based on a dynamic threshold.
- Cost simulator: Given C_FP and C_FN sliders, simulate expected cost across thresholds using holdout predictions (a minimal sketch follows this list).
- Drift monitor: Weekly reliability table and ECE trend; auto-warn when ECE exceeds a threshold.
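For the cost simulator idea, a minimal non-interactive sketch; replace the synthetic predictions with your holdout's probabilities and labels, and the hard-coded costs with sliders in whatever UI you use:

```python
import numpy as np

rng = np.random.default_rng(6)
p = rng.uniform(0, 1, 5000)                     # stand-in for holdout probabilities
y = (rng.uniform(0, 1, 5000) < p).astype(int)   # stand-in for holdout labels

def total_cost(p, y, t, c_fp, c_fn):
    fp = np.sum((p >= t) & (y == 0))
    fn = np.sum((p < t) & (y == 1))
    return fp * c_fp + fn * c_fn

c_fp, c_fn = 1.0, 5.0                           # the "slider" values
thresholds = np.linspace(0.05, 0.95, 19)
costs = [total_cost(p, y, t, c_fp, c_fn) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"cheapest threshold for C_FP={c_fp}, C_FN={c_fn}: {best:.2f}")
```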
Who this is for
Data Scientists and ML Engineers who need reliable probabilities and principled decisions from classification models.
Prerequisites
- Basics of classification metrics (precision, recall, ROC/PR curves).
- Understanding of train/validation/test splits and avoiding data leakage.
- Comfort with arrays, grouping, and simple statistics.
Learning path
- Before: Confusion matrix and ROC/PR fundamentals.
- Now: Calibration (curves, ECE) and thresholding strategies (metric-, cost-, capacity-driven).
- Next: Cost-sensitive learning, class imbalance handling, and post-deployment monitoring.
Next steps
- Take the Quick Test below. Anyone can take it; log in to save your progress.
- Apply calibration and threshold selection to your latest classification project and document the chosen operating point.
- Set up weekly monitoring for ECE and key metrics at the deployed threshold.
Mini challenge
Your model's validation metrics are: AUC = 0.89, ECE = 0.18, and at threshold 0.5, precision = 0.62 and recall = 0.71. The business requires precision ≥ 0.75 and can review at most 200 cases/day. Propose a plan in 4 steps to calibrate, re-select a threshold, and verify both the precision target and the 200/day limit. Keep it concise.