Why this matters
Good calibration means model probabilities match real-world frequencies. In production, this unlocks smarter thresholds, safer automation, and better user trust. You will use calibration to:
- Set decision thresholds (e.g., when to auto-approve vs. escalate).
- Estimate risk and expected costs reliably (fraud, medical, safety).
- Trigger human-in-the-loop review only when confidence is genuinely low (and automate only when it is genuinely high).
- Compare model versions beyond accuracy or AUC.
Who this is for
- Machine Learning Engineers deploying classification or probabilistic regression.
- Data Scientists responsible for model monitoring and alerting.
- Product/ML Ops folks defining model SLOs and dashboards.
Prerequisites
- Basic classification metrics (accuracy, precision/recall, AUC).
- Understanding of probabilities and confidence intervals.
- Know that post-hoc calibration with a monotonic map (e.g., temperature scaling) preserves ranking; it only rescales the predicted probabilities.
Concept explained simply
A model is well-calibrated if among predictions with probability p, about p of them are truly positive. Example: Of all cases with predicted 0.7, roughly 70% should be positive. Calibration is about truthful confidence, not about who is ranked above whom.
Mental model
Think of a weather app: if it forecasts a 30% chance of rain on 100 different days, it should rain on about 30 of them. If it rains on 60 of those days, the app was underconfident; if it rains on only 10, it was overconfident. Your classifier works the same way.
Key terms
- Reliability diagram: Plot confidence (x) vs. empirical accuracy (y). Perfect calibration lies on the 45° line.
- ECE (Expected Calibration Error): Weighted average gap between confidence and accuracy across bins (a computation sketch follows this list).
- MCE (Maximum Calibration Error): Largest per-bin gap.
- Brier score: Mean squared error between probabilities and outcomes (mixes discrimination and calibration).
- Temperature scaling: Post-processing that rescales logits to fix over/underconfidence without changing ranking.
- Classwise ECE: ECE computed per class (important for imbalanced tasks).
- PIT histogram (regression): Checks calibration of predictive distributions; uniform is ideal.
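To make these terms concrete, here is a minimal sketch of a reliability-diagram computation for a binary classifier, assuming you have the positive-class probabilities and true labels; the function name reliability_bins and the equal-width binning are illustrative choices, not a standard API.

```python
import numpy as np

def reliability_bins(probs, labels, n_bins=10):
    """Per-bin stats for a reliability diagram, plus ECE and MCE.

    probs  : predicted positive-class probabilities in [0, 1]
    labels : true binary labels (0/1)
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to an equal-width bin; p == 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)

    rows, ece, mce = [], 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        n = int(mask.sum())
        if n == 0:
            continue
        conf = probs[mask].mean()      # mean predicted probability in the bin
        acc = labels[mask].mean()      # empirical positive rate in the bin
        gap = abs(conf - acc)
        ece += (n / len(probs)) * gap  # sample-weighted average gap
        mce = max(mce, gap)            # worst single-bin gap
        rows.append((edges[b], edges[b + 1], n, conf, acc, gap))
    return rows, ece, mce
```

Plotting acc against conf for each row gives the reliability diagram; the returned ece and mce match the definitions above.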
Core metrics and tools
- ECE (fixed-width bins or adaptive bins). Track overall and by segment (device, locale, cohort).
- Classwise ECE for multiclass; one-vs-rest for each class.
- Reliability diagrams (overall and per-segment).
- Brier score and Log Loss as supporting signals (not pure calibration).
- Regression: PIT histogram and CRPS (continuous ranked probability score).
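For the regression case, a quick PIT check can be sketched as below, assuming Gaussian predictive distributions described by a per-example mean and standard deviation (the names are illustrative). A calibrated model produces PIT values roughly uniform on [0, 1]; a U-shaped histogram suggests overconfident (too narrow) predictive distributions, while a central hump suggests underconfident (too wide) ones.

```python
import numpy as np
from scipy.stats import norm

def pit_values(y_true, pred_mean, pred_std):
    """PIT value = predictive CDF evaluated at the observed outcome."""
    return norm.cdf(np.asarray(y_true), loc=np.asarray(pred_mean), scale=np.asarray(pred_std))

def pit_histogram(y_true, pred_mean, pred_std, n_bins=10):
    """Bin counts of PIT values; a roughly flat histogram suggests calibration."""
    counts, _ = np.histogram(pit_values(y_true, pred_mean, pred_std),
                             bins=n_bins, range=(0.0, 1.0))
    return counts
```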
Sampling and statistical care
- Don’t trust tiny bins. Enforce a minimum sample size per bin (e.g., ≥ 300) or use adaptive binning.
- Show uncertainty: Wilson score intervals for bin accuracy (see the sketch after this list); wide intervals imply “inconclusive.”
- Choose monitoring windows (daily/weekly) that reach stable sample sizes.
- Alert on deltas (change from baseline) plus absolute thresholds to avoid noisy alerts.
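A small sketch of the Wilson score interval for a bin's accuracy (standard formula; z = 1.96 gives roughly a 95% interval, and the example numbers are made up):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (e.g., bin accuracy)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p_hat = successes / n
    denom = 1.0 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

# Example: 42 positives out of 60 samples gives an interval of roughly (0.575, 0.801).
# A mean confidence of 0.78 falls inside it, so that bin is inconclusive, not miscalibrated.
lo, hi = wilson_interval(42, 60)
```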
Operational monitoring plan
- Define SLOs
- Overall ECE ≤ 0.05 weekly.
- Delta ECE vs. baseline ≤ 0.02.
- No segment ECE > 0.08 for two consecutive windows.
- Dashboards
- Reliability diagram (overall + critical segments).
- ECE time series (overall, per class).
- Volume per bin to sanity-check sample sizes.
- Guardrails
- Suppress alerts if any bin has n < threshold (e.g., 200); see the alerting sketch after this plan.
- Show confidence intervals; annotate “needs more data.”
- Actions
- Small miscalibration: temperature scaling; recalibrate on recent data.
- Segment-only issues: segment-specific calibration layers or feature fixes.
- Shifted base rates: recalibrate with recent priors; revisit thresholds.
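A sketch of how this plan could be encoded as a per-window alert check; the thresholds mirror the illustrative numbers above, and the names (BinStats, should_alert) are hypothetical rather than any standard monitoring API.

```python
from dataclasses import dataclass

@dataclass
class BinStats:
    n: int             # samples falling into the bin this window
    confidence: float  # mean predicted probability in the bin
    accuracy: float    # empirical positive rate in the bin

def window_ece(bins):
    total = sum(b.n for b in bins)
    return sum(b.n / total * abs(b.confidence - b.accuracy) for b in bins)

def should_alert(bins, baseline_ece, *, abs_slo=0.05, delta_slo=0.02, min_bin_n=200):
    """Alert only when SLOs are violated and the sample-size guardrail is satisfied."""
    if any(b.n < min_bin_n for b in bins):
        return False, "suppressed: a bin is below the minimum sample size"
    ece = window_ece(bins)
    if ece > abs_slo or (ece - baseline_ece) > delta_slo:
        return True, f"ECE={ece:.3f} (baseline {baseline_ece:.3f})"
    return False, f"within SLO: ECE={ece:.3f}"
```

Per-segment SLOs and the two-consecutive-window rule would wrap around this single-window check.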
Worked examples
Example 1: Compute ECE
Binary classifier with 12 predictions, 5 equal-width bins.
Data and calculation
Predicted p: [0.05, 0.12, 0.18, 0.27, 0.33, 0.41, 0.55, 0.62, 0.76, 0.83, 0.91, 0.97]
True y: [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
Bins: [0–0.2), [0.2–0.4), [0.4–0.6), [0.6–0.8), [0.8–1.0]. Each bin contributes (n_bin / 12) × |mean p − acc| to the ECE.
- Bin1: mean p ≈ 0.117, acc = 0/3 = 0.000, contrib ≈ 0.029
- Bin2: mean p = 0.300, acc = 1/2 = 0.500, contrib ≈ 0.033
- Bin3: mean p = 0.480, acc = 1/2 = 0.500, contrib ≈ 0.003
- Bin4: mean p = 0.690, acc = 2/2 = 1.000, contrib ≈ 0.052
- Bin5: mean p ≈ 0.903, acc = 3/3 = 1.000, contrib ≈ 0.024
ECE ≈ 0.142.
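The hand computation can be double-checked in a few lines (NumPy only, same data and equal-width bins as above):

```python
import numpy as np

p = np.array([0.05, 0.12, 0.18, 0.27, 0.33, 0.41, 0.55, 0.62, 0.76, 0.83, 0.91, 0.97])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1])

edges = np.linspace(0.0, 1.0, 6)                   # 5 equal-width bins
bins = np.clip(np.digitize(p, edges[1:-1]), 0, 4)  # bin index per prediction

ece = 0.0
for b in range(5):
    mask = bins == b
    if mask.any():
        # weight (n_bin / 12) times the gap between mean p and accuracy
        ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())

print(round(ece, 3))  # 0.142
```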
Example 2: Reading a reliability diagram
If the curve is mostly below the diagonal, predictions are overconfident (confidence > accuracy). Above the diagonal means underconfident.
Interpretation tips
- High-confidence bins (0.8–1.0) below the line: dangerous overconfidence; prioritize fixing this.
- Mid bins zig-zag: often data scarcity; check error bars before acting.
Example 3: Temperature scaling effect
You train a classifier whose validation ECE is 0.085. Applying temperature scaling (optimizing one scalar T on validation NLL) reduces ECE to 0.034 without changing AUC, so thresholds now better reflect actual risk. A fitting sketch follows the list below.
What changed?
- Ranking unchanged; probabilities moved closer to observed frequencies.
- Downstream policies (auto-approve if p>0.9) become safer.
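A minimal sketch of such a fit for the binary case, optimizing a single scalar T on validation logits by minimizing NLL with SciPy (in practice this is often done on multiclass logits inside a deep-learning framework; the names here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(val_logits, val_labels):
    """Find T > 0 that minimizes NLL of sigmoid(logit / T) on validation data."""
    z = np.asarray(val_logits, dtype=float)
    y = np.asarray(val_labels, dtype=float)

    def nll(t):
        p = 1.0 / (1.0 + np.exp(-z / t))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)  # avoid log(0)
        return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

# T > 1 means the model was overconfident; dividing logits by T softens the probabilities.
# Calibrated probability for a new logit z_new: 1 / (1 + exp(-z_new / T)).
```

Because the map is monotonic in the logit, AUC and ranking are unchanged, exactly as in the example above.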
Exercises
These mirror the graded exercises below. Do them here first, then submit in the Exercises section.
Exercise 1 (ECE by hand)
Use the 12 predictions above and 5 equal-width bins to compute ECE. Show each bin’s mean p, accuracy, and weighted contribution. Round to 3 decimals.
Exercise 2 (Alert or not?)
You have weekly monitoring with adequately sized bins. For each bin, you have the sample count, mean predicted confidence, and empirical accuracy:
Counts n: [600, 800, 1200, 2000, 1400] (total 6000)
Mean p: [0.14, 0.31, 0.49, 0.73, 0.92]
Acc: [0.08, 0.27, 0.52, 0.71, 0.84]
Baseline ECE last month: 0.040. Alert rule: trigger if delta ECE ≥ 0.030. Compute current ECE and decide alert.
- Checklist before submitting:
- Used correct bin boundaries and weights.
- Showed each bin’s contribution.
- Rounded consistently.
- If samples are small, noted uncertainty.
Common mistakes and self-check
- Confusing calibration with accuracy: A model can be accurate but overconfident. Self-check: inspect reliability diagram and ECE together.
- Ignoring sample size: Spiky curves from tiny bins are not evidence. Self-check: show error bars; enforce min n per bin.
- Relying only on overall ECE: Segment-specific miscalibration can hide. Self-check: track classwise and per-segment ECE.
- Forgetting base-rate shift: When priors change, recalibrate with recent data (a prior-correction sketch follows this list). Self-check: compare current label rate vs. training.
- Chasing noise with constant retrains: Act only when deltas exceed thresholds for consecutive windows.
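For the base-rate bullet above, one standard correction, assuming only the class prior shifted while the class-conditional distributions stayed fixed, multiplies the predicted odds by the ratio of new to old prior odds. It complements, rather than replaces, recalibration on recent labeled data; the example numbers are illustrative.

```python
def adjust_for_new_prior(p, prior_train, prior_now):
    """Rescale a calibrated probability p when the positive base rate shifts.

    Assumes P(x | y) is unchanged and only P(y = 1) moved from
    prior_train to prior_now (the classic prior-correction formula).
    """
    num = p * prior_now / prior_train
    den = num + (1.0 - p) * (1.0 - prior_now) / (1.0 - prior_train)
    return num / den

# Example: a 0.70 prediction trained at a 10% base rate, now served at a 5% base rate.
adjusted = adjust_for_new_prior(0.70, prior_train=0.10, prior_now=0.05)
# adjusted ≈ 0.525: the same raw score implies less risk under the lower base rate.
```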
Practical projects
- Build a calibration dashboard: reliability diagram, ECE time series, classwise ECE, and bin sample sizes.
- Implement temperature scaling service: re-fit T weekly on fresh validation batches and version-control the scaler.
- Segment drill-down: choose two critical segments (e.g., geography, device) and establish segment SLOs and alerts.
Learning path
- Understand calibration basics and ECE.
- Practice with reliability diagrams and per-segment analysis.
- Apply temperature scaling and isotonic regression; compare (see the sketch after this list).
- Define SLOs and implement monitoring with guardrails.
- Run A/B for recalibration strategies; adopt the safest stable one.
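When you reach the temperature-vs-isotonic comparison step, scikit-learn's IsotonicRegression can serve as the isotonic baseline; a minimal sketch, assuming held-out validation and test splits of probabilities and labels:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(p_val, y_val):
    """Fit a monotone map from raw validation probabilities to calibrated ones."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(np.asarray(p_val, dtype=float), np.asarray(y_val, dtype=float))
    return iso

# Usage with hypothetical splits: calibrated = fit_isotonic(p_val, y_val).predict(p_test)
# Compare test ECE for the isotonic map vs. temperature scaling; when they are close,
# prefer the one-parameter temperature map, which is harder to overfit on small data.
```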
Next steps
- Calibrate your current production model on a recent validation set.
- Add ECE and reliability diagrams to your monitoring dashboard.
- Set alert thresholds and minimum sample rules.
Mini challenge
You inherit a model with baseline ECE 0.025. Last two weeks ECE: 0.055 and 0.058. The 0.8–1.0 bin shows biggest gap and volume increased 40%. Propose a 3-step plan to fix and validate, including a rollback condition.