Why this matters
In computer vision, your models output scores: class probabilities for classifiers, pixel-wise probabilities for segmentation, and confidence scores for detectors. Two critical questions follow: Can you trust those scores (calibration)? And where should you cut off to make a decision (thresholding)? Getting this right improves on-call decisions, reduces false alarms in production, and makes business trade-offs explicit.
- Quality control: Trigger a stop-line alert only when the defect probability is truly above 95%.
- Medical imaging: Meet minimum sensitivity while controlling false positives per image.
- Autonomy and safety: Calibrate detection confidences so 0.7 means ~70% chance the object is real.
Concept explained simply
Calibration aligns model scores with real-world likelihoods. A calibrated 0.8 score means that, across many similar cases, about 80% are actually positive. Thresholding is choosing the cut-off on those scores to turn them into actions (positive/negative; keep/discard a detection).
Mental model
Think of your model as a weather forecaster. Calibration makes the forecast honest (when it says 30% rain, it rains on 30% of such days). Thresholding is packing an umbrella if forecast ≥ your personal cut-off (say 50%). Calibrate first, then set the umbrella rule.
Core workflow
- Hold out calibration data. Reserve a validation split only for calibration and threshold tuning (never the test set).
- Measure calibration. Use reliability diagrams, Expected Calibration Error (ECE), and proper scores like Brier score or NLL. (A short ECE sketch follows the checklist below.)
- Calibrate. Apply a monotonic mapping such as Temperature Scaling (logit/T), Platt scaling (logistic), or Isotonic Regression. For multi-class, temperature scaling is a simple and strong baseline.
- Choose thresholds. Optimize a business-relevant target: F1 on PR curve, Youden’s J on ROC, or expected cost using class priors and misclassification costs.
- Validate stability. Check performance across batches, lighting conditions, or camera types. Refit if distribution drifts.
- Checklist:
- Separate split for calibration/thresholding
- Reliability diagram/ECE computed
- Method picked: Temp/Platt/Isotonic
- Threshold optimized for target metric/cost
- Subgroup analysis done
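To make the "measure calibration" step concrete, here is a minimal NumPy sketch of ECE together with the per-bin statistics you would plot in a reliability diagram. The equal-width binning, the bin count, and the toy scores are illustrative choices, not requirements.

```python
import numpy as np

def ece_and_reliability(probs, labels, n_bins=10):
    """ECE plus per-bin stats for a reliability diagram (binary, positive-class probs)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, bins = 0.0, []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins, except the last one, which includes 1.0
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if not mask.any():
            continue
        avg_p = probs[mask].mean()      # mean predicted probability in the bin
        pos_rate = labels[mask].mean()  # observed fraction of positives in the bin
        weight = mask.mean()            # fraction of all samples falling in the bin
        ece += weight * abs(pos_rate - avg_p)
        bins.append((lo, hi, avg_p, pos_rate, weight))
    return ece, bins

# Toy usage: a model whose high scores are too confident
probs = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.30, 0.20]
labels = [1, 0, 1, 0, 1, 0, 0, 0]
ece, bins = ece_and_reliability(probs, labels, n_bins=4)
print(f"ECE = {ece:.3f}")
for lo, hi, avg_p, pos_rate, w in bins:
    print(f"[{lo:.2f}, {hi:.2f}): mean prob {avg_p:.2f} vs positive rate {pos_rate:.2f} (weight {w:.2f})")
```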
Worked examples
Example 1 — Binary defect detection (F1 target)
Scores on validation (8 parts): [0.95, 0.80, 0.60, 0.55, 0.52, 0.40, 0.35, 0.20]; labels: [1,1,0,1,0,0,1,0]. Sweep each unique score as a candidate threshold (≥ rule) and compute F1. Best F1 = 0.75 at threshold 0.55 (TP=3, FP=1, FN=1, TN=3). Deploy threshold = 0.55.
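A short script reproducing this sweep (plain NumPy; each unique score serves as a candidate cut-off under the ≥ rule):

```python
import numpy as np

scores = np.array([0.95, 0.80, 0.60, 0.55, 0.52, 0.40, 0.35, 0.20])
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])

best = None
for t in np.unique(scores):              # each observed score is a candidate cut-off
    pred = (scores >= t).astype(int)     # ">= threshold" decision rule
    tp = int(np.sum((pred == 1) & (labels == 1)))
    fp = int(np.sum((pred == 1) & (labels == 0)))
    fn = int(np.sum((pred == 0) & (labels == 1)))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    if best is None or f1 > best[0]:
        best = (f1, t, tp, fp, fn)

f1, t, tp, fp, fn = best
print(f"best F1 = {f1:.2f} at threshold {t:.2f} (TP={tp}, FP={fp}, FN={fn})")
# -> best F1 = 0.75 at threshold 0.55 (TP=3, FP=1, FN=1)
```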
Example 2 — Semantic segmentation (pixel masks)
Step 1: Calibrate pixel probabilities with isotonic regression on a held-out set.
Step 2: Tune pixel threshold to maximize Dice/F1 on validation. Add postprocessing: minimum region size and hole filling. Result: threshold 0.42 with Dice +3.1 points, fewer tiny false blobs.
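A minimal sketch of both steps, assuming scikit-learn's IsotonicRegression and using random toy arrays in place of real pixel probabilities and masks; the post-processing mentioned above (minimum region size, hole filling) is omitted here.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Toy stand-ins for held-out pixel probabilities and binary masks (flattened).
cal_probs = rng.random(5000)                      # raw model probabilities, calibration split
cal_masks = (rng.random(5000) < cal_probs**2)     # deliberately miscalibrated "ground truth"
val_probs = rng.random(5000)                      # validation split for threshold tuning
val_masks = (rng.random(5000) < val_probs**2)

# Step 1: fit a monotonic mapping from raw probability -> calibrated probability.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(cal_probs, cal_masks.astype(float))
val_cal = iso.predict(val_probs)

# Step 2: sweep a pixel threshold on the calibrated probabilities and keep the best Dice.
def dice(pred, target):
    inter = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * inter / denom if denom else 1.0

best_t, best_dice = max(
    ((t, dice(val_cal >= t, val_masks)) for t in np.arange(0.05, 0.96, 0.05)),
    key=lambda x: x[1],
)
print(f"best pixel threshold {best_t:.2f} with Dice {best_dice:.3f}")
```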
Example 3 — Object detection (precision target)
Goal: ≥95% precision for safety warnings. Tune two knobs jointly: score threshold and NMS IoU. Use validation PR curves per class; pick class-wise thresholds to hit precision ≥95% while keeping recall acceptable. Calibrate the classification head with temperature scaling so that a score of 0.95 actually reflects roughly a 95% chance the detection is correct.
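One way to pick a class-wise score threshold for a precision target is sketched below. It assumes detections have already been matched to ground truth at a fixed IoU (so each detection carries a TP/FP flag); the function name and toy arrays are illustrative.

```python
import numpy as np

def threshold_for_precision(scores, is_tp, target_precision=0.95):
    """Lowest score cut-off (per class) whose cumulative precision meets the target.

    scores: detection confidences for one class (after NMS), any order
    is_tp:  1 if that detection matched a ground-truth box at the chosen IoU, else 0
    Returns (threshold, precision, tp_kept) or None if the target is unreachable.
    """
    scores = np.asarray(scores, dtype=float)
    is_tp = np.asarray(is_tp, dtype=int)
    order = np.argsort(-scores)              # rank detections by descending confidence
    tp_cum = np.cumsum(is_tp[order])
    fp_cum = np.cumsum(1 - is_tp[order])
    precision = tp_cum / (tp_cum + fp_cum)   # precision if we kept the top-k detections
    ok = np.where(precision >= target_precision)[0]
    if ok.size == 0:
        return None
    k = ok[-1]                               # deepest cut that still meets the target (best recall)
    return float(scores[order][k]), float(precision[k]), int(tp_cum[k])

# Toy usage for one class; in practice is_tp comes from IoU matching on the validation set.
scores = [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.70, 0.60]
is_tp  = [1,    1,    1,    1,    0,    1,    0,    1]
print(threshold_for_precision(scores, is_tp))   # -> (0.90, 1.0, 4) for this toy data
```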
Who this is for and prerequisites
- Who this is for: Computer Vision Engineers shipping classifiers, segmenters, or detectors to production.
- Prerequisites:
- Basic probability, ROC/PR curves, confusion matrix
- Ability to run validation loops and save logits/scores
- Familiarity with your framework’s inference outputs (logits vs probabilities)
Learning path
- Refresh evaluation metrics: ROC, PR, F1, AUROC, mAP.
- Learn calibration metrics: reliability diagrams, ECE, Brier score, NLL.
- Apply calibration methods: Temperature Scaling, Platt, Isotonic.
- Tune thresholds for your objective: F1, constraint-based (precision ≥ X), or cost-sensitive.
- Validate across subgroups and monitor drift in production.
Methods in brief
- Temperature Scaling: divide logits by T; T>1 softens, T<1 sharpens; preserves ranking (a fitting sketch follows this list).
- Platt Scaling: logistic regression on the model score/logit.
- Isotonic Regression: non-parametric, monotonic mapping; great with enough data.
- Cost-based threshold (with calibrated probabilities): predict positive if p ≥ C_FP / (C_FP + C_FN), assuming correct decisions cost nothing.
- Imbalance tip: prefer PR-based optimization over accuracy or AUROC when positives are rare.
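A compact sketch of two of these methods for the binary case: temperature scaling fitted by a grid search over held-out NLL, and the cost-based decision rule. Plain NumPy; the grid range and toy data are assumptions, not prescriptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_nll(probs, labels, eps=1e-12):
    p = np.clip(probs, eps, 1 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

def fit_temperature(logits, labels, grid=np.arange(0.25, 5.01, 0.05)):
    """Pick T minimizing held-out NLL; dividing logits by T preserves ranking."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return min(grid, key=lambda T: mean_nll(sigmoid(logits / T), labels))

def cost_threshold(c_fp, c_fn):
    """Predict positive when p >= c_fp / (c_fp + c_fn) (correct decisions assumed cost-free)."""
    return c_fp / (c_fp + c_fn)

# Toy usage with overconfident logits on a held-out split.
logits = np.array([4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0])
labels = np.array([1,   0,   1,   1,   0,    1,    0])
T = fit_temperature(logits, labels)
print(f"T = {T:.2f}, NLL before = {mean_nll(sigmoid(logits), labels):.3f}, "
      f"after = {mean_nll(sigmoid(logits / T), labels):.3f}")
print("cost-based threshold (C_FP=1, C_FN=4):", cost_threshold(1, 4))  # 0.2
```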
Exercises you can do now
These mirror the Exercises section below. Do them on paper or in a notebook. Then open the solutions inside each exercise when ready.
- Exercise 1 — Pick an F1-optimal threshold (binary)
- Predicted probabilities: [0.95, 0.80, 0.60, 0.55, 0.52, 0.40, 0.35, 0.20]
- True labels: [1,1,0,1,0,0,1,0]
- Task: Find the threshold (using ≥ rule) that maximizes F1. Report threshold and confusion matrix.
- Exercise 2 — Temperature scaling with a tiny grid
- Logits: [2.2, 1.0, 0.2, -0.2, -1.0, -2.0], Labels: [1,1,1,0,0,0]
- Try T in {0.5, 1.0, 1.5, 2.0}. For each T: p = sigmoid(logit/T), compute mean NLL. Choose the best T.
- Using 3 ECE bins ([0–0.33), [0.33–0.66), [0.66–1]): compute ECE before (T=1) and after (best T). A checking script follows the self-check checklist below.
- Self-check checklist:
- Used ≥ threshold rule consistently
- Computed precision/recall/F1 correctly
- Mean NLL compared across T, not per-sample
- ECE weighted by bin frequency
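Once you have worked Exercise 2 by hand, a short script like this can check your numbers; it uses plain NumPy and the exercise's custom three-bin edges rather than the usual equal-width 10 bins.

```python
import numpy as np

logits = np.array([2.2, 1.0, 0.2, -0.2, -1.0, -2.0])
labels = np.array([1, 1, 1, 0, 0, 0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_nll(p, y, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def ece(p, y, edges=(0.0, 1/3, 2/3, 1.0)):
    """Frequency-weighted |positive rate - mean probability| over the given bins."""
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if mask.any():
            total += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return total

for T in (0.5, 1.0, 1.5, 2.0):
    p = sigmoid(logits / T)
    print(f"T={T:.1f}: mean NLL={mean_nll(p, labels):.3f}  ECE={ece(p, labels):.3f}")
```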
Common mistakes and how to self-check
- Tuning on the test set → data leakage. Always use a separate validation split.
- Assuming softmax outputs are calibrated. They often aren’t; measure ECE/Brier first.
- Optimizing accuracy on imbalanced data. Prefer PR-based targets (F1/Fβ) or cost-based thresholds.
- One-size-fits-all thresholds. Use per-class thresholds for multi-class or per-category detectors.
- Calibrating on too little data with isotonic. Use cross-validation or prefer temperature scaling.
- Ignoring shift. Re-check calibration when lighting/cameras/tasks change.
Practical projects
- Recalibrate a classifier: Save logits on a validation set, fit temperature scaling, and show ECE and reliability diagram before/after.
- Segmentation threshold tuner: Sweep pixel thresholds, pick the best Dice, and add minimum-area filtering. Report before/after metrics.
- Detection thresholding: Jointly tune score threshold and NMS IoU to meet a precision target; show the trade-off curve.
Mini challenge
You have a calibrated probability p for “defect.” The cost of a false positive is 1, and the cost of a false negative is 4. What threshold minimizes expected cost? Compute it and explain why.
Next steps
- Apply temperature scaling to your current CV model; record NLL/ECE deltas.
- Define a clear business rule (e.g., “precision ≥ 95%”) and tune thresholds to meet it.
- Set up a periodic calibration check on production data.
Quick Test
Take the short test below to check your understanding.