Why this matters
In computer vision, your models output scores: class probabilities for classifiers, pixel-wise probabilities for segmentation, and confidence scores for detectors. Two critical questions follow: Can you trust those scores (calibration)? And where should you cut off to make a decision (thresholding)? Getting this right improves on-call decisions, reduces false alarms in production, and makes business trade-offs explicit.
- Quality control: Trigger a stop-line alert only when the defect probability is truly above 95%.
- Medical imaging: Meet minimum sensitivity while controlling false positives per image.
- Autonomy and safety: Calibrate detection confidences so 0.7 means ~70% chance the object is real.
Concept explained simply
Calibration aligns model scores with real-world likelihoods. A calibrated 0.8 score means that, across many similar cases, about 80% are actually positive. Thresholding is choosing the cut-off on those scores to turn them into actions (positive/negative; keep/discard a detection).
Mental model
Think of your model as a weather forecaster. Calibration makes the forecast honest (when it says 30% rain, it rains on 30% of such days). Thresholding is packing an umbrella if forecast ≥ your personal cut-off (say 50%). Calibrate first, then set the umbrella rule.
Core workflow
- Hold out calibration data. Reserve a validation split only for calibration and threshold tuning (never the test set).
- Measure calibration. Use reliability diagrams, Expected Calibration Error (ECE), and proper scores like Brier score or NLL. (A short ECE sketch follows the checklist below.)
- Calibrate. Apply a monotonic mapping such as Temperature Scaling (logit/T), Platt scaling (logistic), or Isotonic Regression. For multi-class, temperature scaling is a simple and strong baseline.
- Choose thresholds. Optimize a business-relevant target: F1 on PR curve, Youden’s J on ROC, or expected cost using class priors and misclassification costs.
- Validate stability. Check performance across batches, lighting conditions, or camera types. Refit if distribution drifts.
- Checklist:
- Separate split for calibration/thresholding
- Reliability diagram/ECE computed
- Method picked: Temp/Platt/Isotonic
- Threshold optimized for target metric/cost
- Subgroup analysis done
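To make the "measure calibration" step concrete, here is a minimal NumPy sketch of ECE together with the per-bin statistics you would plot in a reliability diagram. The equal-width binning, the bin count, and the toy scores are illustrative choices, not requirements.

```python
import numpy as np

def ece_and_reliability(probs, labels, n_bins=10):
    """ECE plus per-bin stats for a reliability diagram (binary, positive-class probs)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, bins = 0.0, []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins, except the last one, which includes 1.0
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if not mask.any():
            continue
        avg_p = probs[mask].mean()      # mean predicted probability in the bin
        pos_rate = labels[mask].mean()  # observed fraction of positives in the bin
        weight = mask.mean()            # fraction of all samples falling in the bin
        ece += weight * abs(pos_rate - avg_p)
        bins.append((lo, hi, avg_p, pos_rate, weight))
    return ece, bins

# Toy usage: a model whose high scores are too confident
probs = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.30, 0.20]
labels = [1, 0, 1, 0, 1, 0, 0, 0]
ece, bins = ece_and_reliability(probs, labels, n_bins=4)
print(f"ECE = {ece:.3f}")
for lo, hi, avg_p, pos_rate, w in bins:
    print(f"[{lo:.2f}, {hi:.2f}): mean prob {avg_p:.2f} vs positive rate {pos_rate:.2f} (weight {w:.2f})")
```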
Worked examples
Example 1 — Binary defect detection (F1 target)
Scores on validation (8 parts): [0.95, 0.80, 0.60, 0.55, 0.52, 0.40, 0.35, 0.20]; labels: [1,1,0,1,0,0,1,0]. Sweep each unique score as a candidate threshold (≥ rule) and compute F1. Best F1 = 0.75 at threshold 0.55 (TP=3, FP=1, FN=1, TN=3). Deploy threshold = 0.55.
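A short script reproducing this sweep (plain NumPy; each unique score serves as a candidate cut-off under the ≥ rule):

```python
import numpy as np

scores = np.array([0.95, 0.80, 0.60, 0.55, 0.52, 0.40, 0.35, 0.20])
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])

best = None
for t in np.unique(scores):              # each observed score is a candidate cut-off
    pred = (scores >= t).astype(int)     # ">= threshold" decision rule
    tp = int(np.sum((pred == 1) & (labels == 1)))
    fp = int(np.sum((pred == 1) & (labels == 0)))
    fn = int(np.sum((pred == 0) & (labels == 1)))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    if best is None or f1 > best[0]:
        best = (f1, t, tp, fp, fn)

f1, t, tp, fp, fn = best
print(f"best F1 = {f1:.2f} at threshold {t:.2f} (TP={tp}, FP={fp}, FN={fn})")
# -> best F1 = 0.75 at threshold 0.55 (TP=3, FP=1, FN=1)
```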
Example 2 — Semantic segmentation (pixel masks)
Step 1: Calibrate pixel probabilities with isotonic regression on a held-out set.
Step 2: Tune pixel threshold to maximize Dice/F1 on validation. Add postprocessing: minimum region size and hole filling. Result: threshold 0.42 with Dice +3.1 points, fewer tiny false blobs.
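A minimal sketch of both steps, assuming scikit-learn's IsotonicRegression and using random toy arrays in place of real pixel probabilities and masks; the post-processing mentioned above (minimum region size, hole filling) is omitted here.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Toy stand-ins for held-out pixel probabilities and binary masks (flattened).
cal_probs = rng.random(5000)                      # raw model probabilities, calibration split
cal_masks = (rng.random(5000) < cal_probs**2)     # deliberately miscalibrated "ground truth"
val_probs = rng.random(5000)                      # validation split for threshold tuning
val_masks = (rng.random(5000) < val_probs**2)

# Step 1: fit a monotonic mapping from raw probability -> calibrated probability.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(cal_probs, cal_masks.astype(float))
val_cal = iso.predict(val_probs)

# Step 2: sweep a pixel threshold on the calibrated probabilities and keep the best Dice.
def dice(pred, target):
    inter = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * inter / denom if denom else 1.0

best_t, best_dice = max(
    ((t, dice(val_cal >= t, val_masks)) for t in np.arange(0.05, 0.96, 0.05)),
    key=lambda x: x[1],
)
print(f"best pixel threshold {best_t:.2f} with Dice {best_dice:.3f}")
```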
Example 3 — Object detection (precision target)
Goal: ≥95% precision for safety warnings. Tune two knobs jointly: score threshold and NMS IoU. Use validation PR curves per class; pick class-wise thresholds to hit precision ≥95% while keeping recall acceptable. Calibrate the classification head with temperature scaling so that a score of 0.95 actually reflects roughly a 95% chance the detection is correct.
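One way to pick a class-wise score threshold for a precision target is sketched below. It assumes detections have already been matched to ground truth at a fixed IoU (so each detection carries a TP/FP flag); the function name and toy arrays are illustrative.

```python
import numpy as np

def threshold_for_precision(scores, is_tp, target_precision=0.95):
    """Lowest score cut-off (per class) whose cumulative precision meets the target.

    scores: detection confidences for one class (after NMS), any order
    is_tp:  1 if that detection matched a ground-truth box at the chosen IoU, else 0
    Returns (threshold, precision, tp_kept) or None if the target is unreachable.
    """
    scores = np.asarray(scores, dtype=float)
    is_tp = np.asarray(is_tp, dtype=int)
    order = np.argsort(-scores)              # rank detections by descending confidence
    tp_cum = np.cumsum(is_tp[order])
    fp_cum = np.cumsum(1 - is_tp[order])
    precision = tp_cum / (tp_cum + fp_cum)   # precision if we kept the top-k detections
    ok = np.where(precision >= target_precision)[0]
    if ok.size == 0:
        return None
    k = ok[-1]                               # deepest cut that still meets the target (best recall)
    return float(scores[order][k]), float(precision[k]), int(tp_cum[k])

# Toy usage for one class; in practice is_tp comes from IoU matching on the validation set.
scores = [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.70, 0.60]
is_tp  = [1,    1,    1,    1,    0,    1,    0,    1]
print(threshold_for_precision(scores, is_tp))   # -> (0.90, 1.0, 4) for this toy data
```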
Who this is for and prerequisites
- Who this is for: Computer Vision Engineers shipping classifiers, segmenters, or detectors to production.
- Prerequisites:
- Basic probability, ROC/PR curves, confusion matrix
- Ability to run validation loops and save logits/scores
- Familiarity with your framework’s inference outputs (logits vs probabilities)
Learning path
- Refresh evaluation metrics: ROC, PR, F1, AUROC, mAP.
- Learn calibration metrics: reliability diagrams, ECE, Brier score, NLL.
- Apply calibration methods: Temperature Scaling, Platt, Isotonic.
- Tune thresholds for your objective: F1, constraint-based (precision ≥ X), or cost-sensitive.
- Validate across subgroups and monitor drift in production.
Methods in brief
- Temperature Scaling: divide logits by T; T>1 softens, T<1 sharpens; preserves ranking (a fitting sketch follows this list).
- Platt Scaling: logistic regression on the model score/logit.
- Isotonic Regression: non-parametric, monotonic mapping; great with enough data.
- Cost-based threshold (with calibrated probabilities): predict positive if p ≥ C_FP / (C_FP + C_FN), assuming correct decisions cost nothing.
- Imbalance tip: prefer PR-based optimization over accuracy or AUROC when positives are rare.
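A compact sketch of two of these methods for the binary case: temperature scaling fitted by a grid search over held-out NLL, and the cost-based decision rule. Plain NumPy; the grid range and toy data are assumptions, not prescriptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_nll(probs, labels, eps=1e-12):
    p = np.clip(probs, eps, 1 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

def fit_temperature(logits, labels, grid=np.arange(0.25, 5.01, 0.05)):
    """Pick T minimizing held-out NLL; dividing logits by T preserves ranking."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return min(grid, key=lambda T: mean_nll(sigmoid(logits / T), labels))

def cost_threshold(c_fp, c_fn):
    """Predict positive when p >= c_fp / (c_fp + c_fn) (correct decisions assumed cost-free)."""
    return c_fp / (c_fp + c_fn)

# Toy usage with overconfident logits on a held-out split.
logits = np.array([4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0])
labels = np.array([1,   0,   1,   1,   0,    1,    0])
T = fit_temperature(logits, labels)
print(f"T = {T:.2f}, NLL before = {mean_nll(sigmoid(logits), labels):.3f}, "
      f"after = {mean_nll(sigmoid(logits / T), labels):.3f}")
print("cost-based threshold (C_FP=1, C_FN=4):", cost_threshold(1, 4))  # 0.2
```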
Exercises you can do now
These mirror the Exercises section below. Do them on paper or in a notebook. Then open the solutions inside each exercise when ready.
- Exercise 1 — Pick an F1-optimal threshold (binary)
- Predicted probabilities: [0.95, 0.80, 0.60, 0.55, 0.52, 0.40, 0.35, 0.20]
- True labels: [1,1,0,1,0,0,1,0]
- Task: Find the threshold (using ≥ rule) that maximizes F1. Report threshold and confusion matrix.
- Exercise 2 — Temperature scaling with a tiny grid
- Logits: [2.2, 1.0, 0.2, -0.2, -1.0, -2.0], Labels: [1,1,1,0,0,0]
- Try T in {0.5, 1.0, 1.5, 2.0}. For each T: p = sigmoid(logit/T), compute mean NLL. Choose the best T.
- Using 3 ECE bins ([0–0.33), [0.33–0.66), [0.66–1]): compute ECE before (T=1) and after (best T). A checking script follows the self-check checklist below.
- Self-check checklist:
- Used ≥ threshold rule consistently
- Computed precision/recall/F1 correctly
- Mean NLL compared across T, not per-sample
- ECE weighted by bin frequency
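Once you have worked Exercise 2 by hand, a short script like this can check your numbers; it uses plain NumPy and the exercise's custom three-bin edges rather than the usual equal-width 10 bins.

```python
import numpy as np

logits = np.array([2.2, 1.0, 0.2, -0.2, -1.0, -2.0])
labels = np.array([1, 1, 1, 0, 0, 0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_nll(p, y, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def ece(p, y, edges=(0.0, 1/3, 2/3, 1.0)):
    """Frequency-weighted |positive rate - mean probability| over the given bins."""
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if mask.any():
            total += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return total

for T in (0.5, 1.0, 1.5, 2.0):
    p = sigmoid(logits / T)
    print(f"T={T:.1f}: mean NLL={mean_nll(p, labels):.3f}  ECE={ece(p, labels):.3f}")
```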
Common mistakes and how to self-check
- Tuning on the test set → data leakage. Always use a separate validation split.
- Assuming softmax outputs are calibrated. They often aren’t; measure ECE/Brier first.
- Optimizing accuracy on imbalanced data. Prefer PR-based targets (F1/Fβ) or cost-based thresholds.
- One-size-fits-all thresholds. Use per-class thresholds for multi-class or per-category detectors.
- Calibrating on too little data with isotonic. Use cross-validation or prefer temperature scaling.
- Ignoring shift. Re-check calibration when lighting/cameras/tasks change.
Practical projects
- Recalibrate a classifier: Save logits on a validation set, fit temperature scaling, and show ECE and reliability diagram before/after.
- Segmentation threshold tuner: Sweep pixel thresholds, pick the best Dice, and add minimum-area filtering. Report before/after metrics.
- Detection thresholding: Jointly tune score threshold and NMS IoU to meet a precision target; show the trade-off curve.
Mini challenge
You have a calibrated probability p for “defect.” The cost of a false positive is 1, and the cost of a false negative is 4. What threshold minimizes expected cost? Compute it and explain why.
Next steps
- Apply temperature scaling to your current CV model; record NLL/ECE deltas.
- Define a clear business rule (e.g., “precision ≥ 95%”) and tune thresholds to meet it.
- Set up a periodic calibration check on production data.
Quick Test
Take the short test below to check your understanding.