Why this matters
In real projects, you do not ship a probability; you ship a decision. Calibration makes your model's predicted probabilities match reality (e.g., 0.7 means ~70% chance). Thresholding turns those probabilities into actions while balancing precision, recall, and business costs. Together, they reduce false alarms, missed cases, and wasted spend.
- Fraud detection: choose a threshold that keeps chargebacks low without swamping investigators.
- Medical triage: meet a minimum sensitivity while controlling unnecessary follow-ups.
- Marketing: match daily outreach capacity by selecting the top-scoring customers.
Quick Test note: anyone can take the test below. Logged-in users get saved progress.
Concept explained simply
Calibration answers: "When my model says 0.8, how often is it actually true?" A well-calibrated model's predicted probabilities align with observed frequencies. Thresholding answers: "At what probability do we act?"
Mental model
Think of weather forecasts. If a forecaster says "30% rain" on 100 days and it rains on ~30 of those, they are calibrated. If you decide to carry an umbrella when rain probability ≥ 0.5, that is thresholding. In ML, we want both: trustworthy probabilities and a threshold that fits cost/benefit and constraints.
Key metrics and tools
- Brier score: average squared error between predicted probability and outcome (0/1). Lower is better.
- Log loss: penalizes confident wrong predictions. Lower is better.
- Calibration curve (reliability diagram): compare predicted vs observed rate across bins.
- ECE (Expected Calibration Error): weighted average gap between predicted and observed rates across bins (smaller is better).
- ROC/PR curves: for threshold selection under different class balances.
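A minimal sketch of computing these with NumPy and scikit-learn, assuming you already have 0/1 labels and predicted probabilities on a validation set (the toy arrays below are illustrative placeholders):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, log_loss, precision_recall_curve, roc_curve

# Illustrative inputs: replace with your validation labels and predicted probabilities.
rng = np.random.default_rng(0)
p_pred = rng.uniform(0, 1, 1000)
y_true = (rng.uniform(0, 1, 1000) < p_pred).astype(int)  # perfectly calibrated toy data

print("Brier score:", brier_score_loss(y_true, p_pred))  # lower is better
print("Log loss:", log_loss(y_true, p_pred))              # lower is better

# Reliability diagram points: observed positive rate vs mean predicted p per bin.
obs_rate, mean_pred = calibration_curve(y_true, p_pred, n_bins=10, strategy="uniform")
print("Reliability points (predicted, observed):",
      list(zip(mean_pred.round(2), obs_rate.round(2))))

# ECE: bin-size-weighted average |observed rate - mean predicted p| over 10 equal-width bins.
bin_ids = np.clip(np.digitize(p_pred, np.linspace(0, 1, 11)) - 1, 0, 9)
ece = sum((bin_ids == b).mean() * abs(y_true[bin_ids == b].mean() - p_pred[bin_ids == b].mean())
          for b in range(10) if (bin_ids == b).any())
print("ECE:", round(ece, 4))

# ROC / PR curves supply the candidate thresholds used in the thresholding strategies below.
fpr, tpr, roc_thresholds = roc_curve(y_true, p_pred)
precision, recall, pr_thresholds = precision_recall_curve(y_true, p_pred)
```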
Quick sanity checks you can do fast
- Bucket predictions into 5–10 bins and compute the observed positive rate in each. Does it track the average predicted probability?
- Plot precision/recall vs threshold to see trade-offs.
- Stress-test thresholds against business constraints (e.g., max 50 alerts/day).
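A quick way to run these checks with pandas and scikit-learn (again on illustrative arrays; the daily volume and alert limit are hypothetical numbers):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(1)
p_pred = rng.uniform(0, 1, 500)                       # stand-in for validation scores
y_true = (rng.uniform(0, 1, 500) < p_pred).astype(int)

# Check 1: bucket predictions into 5 bins; does the observed rate track the mean prediction?
df = pd.DataFrame({"p": p_pred, "y": y_true})
df["bin"] = pd.cut(df["p"], bins=np.linspace(0, 1, 6), include_lowest=True)
print(df.groupby("bin", observed=True).agg(n=("y", "size"),
                                           avg_pred=("p", "mean"),
                                           obs_rate=("y", "mean")))

# Check 2: precision/recall trade-off at a few candidate thresholds.
prec, rec, thr = precision_recall_curve(y_true, p_pred)
for t in (0.3, 0.5, 0.7):
    i = np.searchsorted(thr, t)  # nearest available threshold at or above t
    print(f"t={t:.1f}: precision={prec[i]:.2f}, recall={rec[i]:.2f}")

# Check 3: alert volume against a capacity constraint (hypothetical 1000 cases/day, max 50 alerts).
daily_cases, max_alerts = 1000, 50
for t in (0.3, 0.5, 0.7):
    alerts = (p_pred >= t).mean() * daily_cases
    print(f"t={t:.1f}: ~{alerts:.0f} alerts/day (limit {max_alerts})")
```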
Methods for calibration
- Platt scaling: fit a logistic regression on model scores (or logits) to map them to calibrated probabilities.
- Isotonic regression: flexible, non-parametric monotonic mapping; works well with enough validation data but can overfit on small datasets.
- Temperature scaling (for deep nets): scale logits by a single parameter; preserves ranking, adjusts confidence.
- Binning (histogram binning): average observed rate per bin; simple baseline.
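A compact sketch of three of these mappings fit on validation data: Platt and isotonic via scikit-learn, and temperature scaling via a small SciPy optimization over logits. All inputs below are synthetic stand-ins for your own scores and labels.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit  # logistic sigmoid
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Synthetic validation data: the model's logits are too spread out, i.e. it is overconfident.
rng = np.random.default_rng(2)
logits = rng.normal(0, 2, 2000)
y_val = (rng.uniform(0, 1, 2000) < expit(logits / 2)).astype(int)
scores = expit(logits)  # uncalibrated probabilities

# Platt scaling: logistic regression on the score (or logit) as a single feature.
platt = LogisticRegression().fit(scores.reshape(-1, 1), y_val)
p_platt = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic regression: monotonic, non-parametric mapping from score to probability.
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, y_val)
p_iso = iso.predict(scores)

# Temperature scaling: divide logits by one learned temperature T that minimizes log loss.
def nll(T):
    p = np.clip(expit(logits / T), 1e-12, 1 - 1e-12)
    return -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))

T = minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x
p_temp = expit(logits / T)
print("learned temperature:", round(T, 2))  # > 1 means the original model was overconfident
```

Because temperature scaling (like any monotonic map) preserves the ranking of scores, AUC is unchanged; only the confidence of each prediction is adjusted.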
How to choose a calibration method
- Small validation set: Platt scaling or temperature scaling (low-variance).
- Large validation set: isotonic regression can capture non-linear distortions.
- Keep a separate holdout set to verify improved ECE/Brier score after calibration.
When to calibrate
- When you will interpret probabilities directly (risk scores, ranking, cost-based thresholding).
- When class distribution has shifted or the model is overconfident (common in deep nets).
- After hyperparameter tuning, perform calibration on a fresh validation set to avoid leakage.
Threshold selection strategies
- Metric-driven: choose a threshold that maximizes F1, Youden's J (TPR − FPR), or meets a target precision/recall.
- Cost-driven: use business costs. If cost(false positive) = C_FP and cost(false negative) = C_FN, classify positive when p ≥ C_FP / (C_FP + C_FN) on well-calibrated probabilities.
- Capacity-driven: pick threshold to produce a fixed number of positives per day (quantile on scores).
- Risk-driven: set separate thresholds by segment (e.g., stricter for high-risk groups) but review fairness and drift.
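Sketches of the metric-, cost-, and capacity-driven strategies, assuming validation labels and (ideally calibrated) probabilities; the costs and daily volumes below are placeholders:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

rng = np.random.default_rng(3)
p_val = rng.uniform(0, 1, 2000)                       # stand-in for calibrated probabilities
y_val = (rng.uniform(0, 1, 2000) < p_val).astype(int)

# Metric-driven: maximize F1 over the PR curve's candidate thresholds.
prec, rec, thr = precision_recall_curve(y_val, p_val)
f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
t_f1 = thr[np.argmax(f1)]

# Metric-driven: maximize Youden's J = TPR - FPR over the ROC thresholds.
fpr, tpr, roc_thr = roc_curve(y_val, p_val)
t_youden = roc_thr[np.argmax(tpr - fpr)]

# Cost-driven: closed-form threshold from the cost ratio (requires calibrated probabilities).
C_FP, C_FN = 1.0, 5.0                                 # placeholder business costs
t_cost = C_FP / (C_FP + C_FN)

# Capacity-driven: allow at most 50 positives out of a hypothetical 1000 cases/day.
daily_cases, max_alerts = 1000, 50
t_capacity = np.quantile(p_val, 1 - max_alerts / daily_cases)

print({"f1": round(float(t_f1), 3), "youden": round(float(t_youden), 3),
       "cost": round(t_cost, 3), "capacity": round(float(t_capacity), 3)})
```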
Decision rule with costs (plain language)
Predict positive when the expected cost of not acting (p × C_FN) exceeds the cost of acting when unnecessary ((1 − p) × C_FP). Rearranging gives p ≥ C_FP / (C_FP + C_FN).
Worked examples
Example 1 – Cost-optimal threshold for fraud alerts
Costs: C_FP = 1 (investigate unnecessarily), C_FN = 5 (missed fraud). With well-calibrated probabilities, decision threshold p* = 1 / (1 + 5) ≈ 0.167. This is lower than 0.5 because missing fraud is costly.
Why this helps
Lowering the threshold increases recall, reducing expensive misses at the expense of more (cheaper) false alarms.
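As a quick check of the arithmetic (same costs as above):

```python
# Example 1 costs: investigate unnecessarily = 1, miss fraud = 5.
C_FP, C_FN = 1.0, 5.0
p_star = C_FP / (C_FP + C_FN)
print(round(p_star, 3))  # 0.167 -> send any case with p >= ~0.167 to review
```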
Example 2 – Calibration curve sanity check
Suppose for predictions around 0.5, the observed positive rate is 0.8. The model is under-confident in that region. Calibration (e.g., isotonic) can adjust the mapping so 0.5 scores better reflect reality.
Before vs after
- Before: ECE ~ 0.20, Brier score ~ 0.18
- After isotonic: ECE ~ 0.07, Brier score ~ 0.15
Numbers are illustrative; use a holdout set to confirm.
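A sketch of this before/after comparison on synthetic under-confident scores; the exact numbers will differ from the illustrative ones above, but the workflow is the point: fit the calibration map on validation, report on holdout.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

def ece(y, p, n_bins=10):
    """Expected Calibration Error with equal-width bins."""
    ids = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    return sum((ids == b).mean() * abs(y[ids == b].mean() - p[ids == b].mean())
               for b in range(n_bins) if (ids == b).any())

# Synthetic under-confident model: its scores understate the true positive rate,
# e.g. cases scored around 0.5 are positive roughly 75% of the time.
rng = np.random.default_rng(4)
def make_split(n):
    p_raw = rng.uniform(0.05, 0.95, n)
    true_rate = np.clip(p_raw + 0.25, 0, 1)
    y = (rng.uniform(0, 1, n) < true_rate).astype(int)
    return p_raw, y

p_val, y_val = make_split(5000)     # fit the calibration map here
p_hold, y_hold = make_split(5000)   # report before/after here

iso = IsotonicRegression(out_of_bounds="clip").fit(p_val, y_val)
p_cal = iso.predict(p_hold)

print("before: ECE", round(ece(y_hold, p_hold), 3), "Brier", round(brier_score_loss(y_hold, p_hold), 3))
print("after:  ECE", round(ece(y_hold, p_cal), 3), "Brier", round(brier_score_loss(y_hold, p_cal), 3))
```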
Example 3 – Meeting a clinical requirement
A triage model must reach sensitivity ≥ 0.95. Sweep thresholds and pick the highest threshold at which TPR ≥ 0.95 (lower thresholds also meet the constraint but add false positives), recording precision as you go. Report both, plus the expected daily positives, to ensure staffing can handle the volume.
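One way to run that sweep from scikit-learn's ROC output (synthetic scores; the 300 patients/day figure is a hypothetical volume):

```python
import numpy as np
from sklearn.metrics import precision_score, roc_curve

rng = np.random.default_rng(5)
p_val = rng.uniform(0, 1, 5000)                       # stand-in for validation probabilities
y_val = (rng.uniform(0, 1, 5000) < p_val).astype(int)

# roc_curve returns thresholds in decreasing order with the matching TPR at each one.
fpr, tpr, thr = roc_curve(y_val, p_val)
meets_target = tpr >= 0.95
t_clinical = thr[meets_target][0] if meets_target.any() else 0.0  # highest threshold with TPR >= 0.95

y_hat = (p_val >= t_clinical).astype(int)
daily_patients = 300                                  # hypothetical triage volume
print(f"threshold={t_clinical:.3f}",
      f"precision={precision_score(y_val, y_hat):.2f}",
      f"sensitivity={y_hat[y_val == 1].mean():.2f}",
      f"expected positives/day~{y_hat.mean() * daily_patients:.0f}")
```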
Step-by-step workflow
- Split data: train, validation (for threshold/calibration), and final holdout.
- Train model and compute predicted probabilities on validation set.
- Assess calibration: reliability diagram, ECE, Brier score.
- If needed, fit a calibration map (Platt, isotonic, or temperature) on validation.
- Re-evaluate ECE/Brier on a holdout to confirm improvement.
- Choose threshold: by metric target, cost rule, or capacity constraint.
- Stress-test: simulate volume, confusion matrix, and cost at the chosen threshold.
- Document: method, threshold, expected trade-offs, and monitoring plan.
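An end-to-end skeleton of this workflow on synthetic data; the classifier, calibration method, and costs are placeholders to swap for your own:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss, confusion_matrix
from sklearn.model_selection import train_test_split

# 1) Split: train / validation (calibration + threshold) / final holdout.
X, y = make_classification(n_samples=12000, weights=[0.9, 0.1], random_state=0)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_hold, y_val, y_hold = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# 2) Train and score the validation set.
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
p_val = model.predict_proba(X_val)[:, 1]

# 3-4) Fit a calibration map on the validation probabilities (after inspecting their reliability diagram).
iso = IsotonicRegression(out_of_bounds="clip").fit(p_val, y_val)

# 5) Re-evaluate on the holdout set.
p_hold_raw = model.predict_proba(X_hold)[:, 1]
p_hold = iso.predict(p_hold_raw)
print("Brier raw vs calibrated:",
      round(brier_score_loss(y_hold, p_hold_raw), 4),
      round(brier_score_loss(y_hold, p_hold), 4))

# 6) Choose a threshold, here by the cost rule (placeholder costs).
C_FP, C_FN = 1.0, 5.0
t = C_FP / (C_FP + C_FN)

# 7) Stress-test: confusion matrix, expected cost, and volume at the chosen threshold.
y_hat = (p_hold >= t).astype(int)
tn, fp, fn, tp = confusion_matrix(y_hold, y_hat).ravel()
print(f"threshold={t:.3f}, FP={fp}, FN={fn}, cost={fp * C_FP + fn * C_FN:.0f}, "
      f"positives flagged={y_hat.sum()}")
```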
Exercises
These exercises match the ones below the lesson. Do them here, then open solutions when stuck.
Exercise 1 – Cost-based threshold and cost comparison
Data (12 cases):

| id | p    | y |
|----|------|---|
| 1  | 0.05 | 0 |
| 2  | 0.12 | 0 |
| 3  | 0.18 | 1 |
| 4  | 0.22 | 0 |
| 5  | 0.28 | 1 |
| 6  | 0.31 | 0 |
| 7  | 0.41 | 1 |
| 8  | 0.55 | 1 |
| 9  | 0.62 | 0 |
| 10 | 0.74 | 1 |
| 11 | 0.85 | 1 |
| 12 | 0.93 | 1 |

Costs: C_FP = 1, C_FN = 5
- Compute p* = C_FP / (C_FP + C_FN).
- Classify at thresholds 0.5 and p*; compute FP, FN, and total cost = FP*C_FP + FN*C_FN.
- Which threshold is cheaper?
Hint
At p* = 1 / 6 ≈ 0.167, almost all cases with p ≥ 0.167 become positive; compare FP and FN counts.
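If you want to check your arithmetic in code, a minimal helper you can feed the table's values into (the list names are yours to define):

```python
def cost_at_threshold(p, y, t, c_fp=1.0, c_fn=5.0):
    """Count FP/FN and total cost when classifying positive at p >= t."""
    fp = sum(1 for pi, yi in zip(p, y) if pi >= t and yi == 0)
    fn = sum(1 for pi, yi in zip(p, y) if pi < t and yi == 1)
    return fp, fn, fp * c_fp + fn * c_fn

# Plug in the 12 p-values and labels from the table, e.g.:
# fp, fn, cost = cost_at_threshold(p_list, y_list, t=0.5)
# fp, fn, cost = cost_at_threshold(p_list, y_list, t=1 / 6)
```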
Exercise 2 – Build a reliability table (ECE)
Using the same 12 cases, create 5 bins: [0–0.2), [0.2–0.4), [0.4–0.6), [0.6–0.8), [0.8–1.0]. For each bin compute:
- n (count), avg predicted p, observed positive rate.
- Gap = observed − avg predicted.
- ECE = sum over bins of (n/N) × |gap|.
Hint
Compute averages per bin; N = 12. Keep 2–3 decimals.
Checklist
- [ ] You calculated p* correctly.
- [ ] You reported FP, FN, and total cost at both thresholds.
- [ ] Your ECE is a single number between 0 and 1.
- [ ] You noted at least one bin where the model was over- or under-confident.
Common mistakes
- Assuming 0.5 is the "right" threshold. It rarely is; use costs, targets, or capacity.
- Calibrating on the same data used for model training and threshold tuning. Keep a clean validation or holdout set.
- Optimizing AUC then ignoring precision/recall at the chosen threshold. Always report operating-point metrics.
- Overfitting with isotonic regression on tiny datasets. Prefer Platt or temperature scaling when data is scarce.
- Forgetting prevalence shifts. Monitor calibration drift and re-calibrate when base rates change.
Self-check
- Did your chosen threshold meet the business target (e.g., precision ≥ X or alerts/day ≤ Y)?
- Did calibration reduce ECE/Brier on a holdout set (not just on validation)?
- Are you logging predicted probabilities and outcomes to audit calibration over time?
Practical projects
- Alerting system: Build a simple classifier, calibrate it, then deploy a script that flags top-N cases per day based on a dynamic threshold.
- Cost simulator: Given C_FP and C_FN sliders, simulate expected cost across thresholds using holdout predictions (a minimal sketch follows this list).
- Drift monitor: Weekly reliability table and ECE trend; auto-warn when ECE exceeds a threshold.
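For the cost simulator idea, a minimal non-interactive sketch; replace the synthetic predictions with your holdout's probabilities and labels, and the hard-coded costs with sliders in whatever UI you use:

```python
import numpy as np

rng = np.random.default_rng(6)
p = rng.uniform(0, 1, 5000)                     # stand-in for holdout probabilities
y = (rng.uniform(0, 1, 5000) < p).astype(int)   # stand-in for holdout labels

def total_cost(p, y, t, c_fp, c_fn):
    fp = np.sum((p >= t) & (y == 0))
    fn = np.sum((p < t) & (y == 1))
    return fp * c_fp + fn * c_fn

c_fp, c_fn = 1.0, 5.0                           # the "slider" values
thresholds = np.linspace(0.05, 0.95, 19)
costs = [total_cost(p, y, t, c_fp, c_fn) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"cheapest threshold for C_FP={c_fp}, C_FN={c_fn}: {best:.2f}")
```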
Who this is for
Data Scientists and ML Engineers who need reliable probabilities and principled decisions from classification models.
Prerequisites
- Basics of classification metrics (precision, recall, ROC/PR curves).
- Understanding of train/validation/test splits and avoiding data leakage.
- Comfort with arrays, grouping, and simple statistics.
Learning path
- Before: Confusion matrix and ROC/PR fundamentals.
- Now: Calibration (curves, ECE) and thresholding strategies (metric-, cost-, capacity-driven).
- Next: Cost-sensitive learning, class imbalance handling, and post-deployment monitoring.
Next steps
- Take the Quick Test below. Anyone can take it; log in to save your progress.
- Apply calibration and threshold selection to your latest classification project and document the chosen operating point.
- Set up weekly monitoring for ECE and key metrics at the deployed threshold.
Mini challenge
Your model's validation metrics are: AUC = 0.89, ECE = 0.18, and at threshold 0.5, precision = 0.62 and recall = 0.71. The business requires precision ≥ 0.75 and can review at most 200 cases/day. Propose a plan in 4 steps to calibrate, re-select a threshold, and verify both the precision target and the 200/day limit. Keep it concise.