Who this is for
Computer vision engineers and ML practitioners who build binary, semantic, or instance segmentation systems and need dependable, comparable metrics to evaluate models.
Prerequisites
- Basic confusion-matrix terms: true positive (TP), false positive (FP), false negative (FN)
- Understanding of binary vs multi-class segmentation masks
- Ability to threshold model probabilities into binary masks
Why this matters
Real tasks you will face:
- Choosing a consistent metric to compare segmentation models across datasets and classes
- Deciding thresholding and averaging strategies for class-imbalanced data
- Handling edge cases (empty masks) without breaking dashboards or CI checks
- Diagnosing failure modes (over-segmentation vs under-segmentation) using TP/FP/FN patterns
Concept explained simply
Intersection over Union (IoU) and Dice coefficient measure overlap between prediction and ground-truth masks.
- IoU (Jaccard): IoU = TP / (TP + FP + FN)
- Dice (equivalent to the F1 score): Dice = 2TP / (2TP + FP + FN)
- Relation: Dice = 2 × IoU / (1 + IoU). Both range from 0 (no overlap) to 1 (perfect overlap).
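A minimal sketch of both formulas in Python (function names here are illustrative, not from any library); it also verifies the Dice-IoU relation using the counts from Example 1 below:

```python
# Minimal sketch: IoU and Dice from pixel counts (names are illustrative).

def iou_from_counts(tp: int, fp: int, fn: int) -> float:
    return tp / (tp + fp + fn)

def dice_from_counts(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn)

tp, fp, fn = 1200, 300, 500          # counts from Example 1 below
iou = iou_from_counts(tp, fp, fn)    # 1200 / 2000 = 0.60
dice = dice_from_counts(tp, fp, fn)  # 2400 / 3200 = 0.75

# Relation check: Dice = 2 * IoU / (1 + IoU)
assert abs(dice - 2 * iou / (1 + iou)) < 1e-12
print(f"IoU={iou:.2f}, Dice={dice:.2f}")  # IoU=0.60, Dice=0.75
```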
Mental model
Think of two shapes on a canvas. IoU is the overlap area divided by the total area covered by either shape (the union). Dice divides twice the overlap by the sum of the two areas, so agreement is counted once per mask; since Dice = 2 × IoU / (1 + IoU), Dice is never lower than IoU, and small boundary shifts cost it less.
When to favor each
- IoU: The standard benchmark metric; for any imperfect prediction it is the lower of the two scores, so it penalizes mismatches more strictly
- Dice: Often used as a training loss (in its soft form) or validation metric; it degrades more gently for small objects and fuzzy boundaries
Practical details you must get right
- Thresholding: Convert probabilities to binary masks (e.g., p >= 0.5). For fair comparison, report the chosen threshold or sweep over thresholds (see the sketch after this list).
- Averaging across classes:
- Macro: average the metric per class equally
- Weighted macro: weight by class frequency
- Micro: compute global TP/FP/FN across all classes first, then compute the metric
- Empty-mask cases:
- If ground truth and prediction are both empty: define IoU=1, Dice=1 (perfect agreement)
- If ground truth empty but prediction not: IoU=0, Dice=0 (false positive)
- Soft vs hard metrics: For training or calibration, you may use soft Dice with probabilities; for reporting, prefer thresholded hard masks for clarity.
- Smoothing epsilon: When denominators can be 0, add a tiny value (e.g., 1e-7) to avoid division by zero.
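A sketch that ties the thresholding, empty-mask, and epsilon conventions above together for a single binary mask pair, assuming NumPy arrays; the function name and defaults are illustrative:

```python
import numpy as np

EPS = 1e-7  # smoothing epsilon, as suggested above

def binary_iou_dice(prob: np.ndarray, gt: np.ndarray, thr: float = 0.5):
    """IoU and Dice for one binary mask pair, using the empty-mask
    convention above: both empty -> 1.0; only prediction non-empty -> 0.0."""
    pred = prob >= thr                  # hard threshold; report thr with results
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    if tp + fp + fn == 0:               # both masks empty: perfect agreement
        return 1.0, 1.0
    iou = tp / (tp + fp + fn + EPS)
    dice = 2 * tp / (2 * tp + fp + fn + EPS)
    return float(iou), float(dice)

empty = np.zeros((4, 4))
print(binary_iou_dice(empty, empty))            # (1.0, 1.0): both empty
print(binary_iou_dice(np.ones((4, 4)), empty))  # (0.0, 0.0): all pixels are FP
```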
Worked examples
Example 1 — Binary segmentation
Given TP=1200, FP=300, FN=500:
- IoU = 1200 / (1200 + 300 + 500) = 1200 / 2000 = 0.60
- Dice = 2×1200 / (2×1200 + 300 + 500) = 2400 / 3200 = 0.75
- Relation check: 2×0.60 / (1 + 0.60) = 1.20 / 1.60 = 0.75
Example 2 — Empty ground truth and prediction
- GT empty, Pred empty → IoU=1, Dice=1 (perfect agreement)
- GT empty, Pred not empty → IoU=0, Dice=0 (all predicted pixels are FP)
Example 3 — Multi-class (macro and micro)
Three classes (ignore background). Per-class TP, FP, FN:
- Class A: TP=50, FP=10, FN=20 → IoU=50/80=0.625; Dice=100/130=0.769
- Class B: TP=30, FP=15, FN=15 → IoU=30/60=0.500; Dice=60/90=0.667
- Class C: TP=40, FP=20, FN=40 → IoU=40/100=0.400; Dice=80/140=0.571
Macro mIoU = (0.625 + 0.500 + 0.400)/3 = 0.508
Macro mDice = (0.769 + 0.667 + 0.571)/3 = 0.669
Micro totals: TP=120, FP=45, FN=75 → IoU=120/240=0.500; Dice=240/360=0.667
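The same numbers in code, assuming per-class counts are already available (a sketch; the container and names are illustrative):

```python
# Reproducing Example 3: macro vs micro averaging from per-class counts.
counts = {  # class -> (TP, FP, FN)
    "A": (50, 10, 20),
    "B": (30, 15, 15),
    "C": (40, 20, 40),
}

def iou(tp, fp, fn):
    return tp / (tp + fp + fn)

def dice(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

# Macro: compute per class, then average with equal weight per class.
macro_iou = sum(iou(*c) for c in counts.values()) / len(counts)
macro_dice = sum(dice(*c) for c in counts.values()) / len(counts)

# Micro: pool TP/FP/FN over all classes first, then compute once.
tot_tp, tot_fp, tot_fn = (sum(col) for col in zip(*counts.values()))  # 120, 45, 75
micro_iou = iou(tot_tp, tot_fp, tot_fn)
micro_dice = dice(tot_tp, tot_fp, tot_fn)

print(f"macro mIoU={macro_iou:.3f}  macro mDice={macro_dice:.3f}")  # 0.508  0.669
print(f"micro IoU={micro_iou:.3f}  micro Dice={micro_dice:.3f}")    # 0.500  0.667
```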
How to compute IoU and Dice (step-by-step)
- Prepare binary masks (per class if multi-class): threshold probabilities consistently.
- Count TP, FP, FN pixel-wise for each class.
- Compute IoU and Dice from counts. Add a small epsilon in denominators if needed.
- Choose averaging: macro, weighted, or micro. State the choice in reports.
- Handle empty-mask cases with clear conventions.
- Optionally sweep thresholds to see stability and choose an operating point.
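One possible implementation of these steps, assuming hard integer label maps (one class index per pixel, produced by thresholding or argmax upstream); a sketch under the empty-mask and epsilon conventions above, with background handling left to your convention:

```python
import numpy as np

EPS = 1e-7

def per_class_iou_dice(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> dict:
    """Per-class IoU and Dice from integer label maps of equal shape.
    Classes absent from both masks score 1.0 by the convention above."""
    results = {}
    for c in range(num_classes):
        p, g = pred == c, gt == c
        tp = np.logical_and(p, g).sum()
        fp = np.logical_and(p, ~g).sum()
        fn = np.logical_and(~p, g).sum()
        if tp + fp + fn == 0:  # class absent from both masks
            results[c] = {"iou": 1.0, "dice": 1.0}
            continue
        results[c] = {
            "iou": float(tp / (tp + fp + fn + EPS)),
            "dice": float(2 * tp / (2 * tp + fp + fn + EPS)),
        }
    return results

# Tiny 2x3 example with classes {0: background, 1, 2}; drop class 0
# from the average if your convention excludes background.
gt = np.array([[0, 1, 1], [2, 2, 0]])
pred = np.array([[0, 1, 0], [2, 2, 2]])
per_class = per_class_iou_dice(pred, gt, num_classes=3)
macro_miou = sum(m["iou"] for m in per_class.values()) / len(per_class)
print(per_class, f"macro mIoU={macro_miou:.3f}")  # ~0.500
```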
Common mistakes and how to self-check
- Mixing background with foreground classes: Decide if background is a class; be consistent across training and evaluation.
- Reporting only a single number for a heavily imbalanced dataset: Also share per-class metrics or macro averages.
- Ignoring threshold sensitivity: Validate metrics at multiple thresholds or use PR/ROC analysis.
- Silent divide-by-zero: Always use epsilon; log cases where both masks are empty.
- Comparing soft Dice to hard IoU: Keep apples-to-apples; use the same mask type.
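The last point is easy to demonstrate; a small sketch (assuming probability maps, with illustrative function names) shows the same prediction scoring very differently soft vs hard:

```python
import numpy as np

def soft_dice(prob: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Soft Dice computed on probabilities (common as a training loss term)."""
    inter = (prob * gt).sum()
    return float((2 * inter + eps) / (prob.sum() + gt.sum() + eps))

def hard_dice(prob: np.ndarray, gt: np.ndarray, thr: float = 0.5,
              eps: float = 1e-7) -> float:
    """Dice after thresholding to a hard binary mask."""
    pred = (prob >= thr).astype(float)
    inter = (pred * gt).sum()
    return float((2 * inter + eps) / (pred.sum() + gt.sum() + eps))

gt = np.array([1.0, 1.0, 0.0, 0.0])
prob = np.array([0.6, 0.6, 0.4, 0.4])  # correct after thresholding, low confidence
print(f"soft Dice = {soft_dice(prob, gt):.2f}")  # ~0.60: penalized for low confidence
print(f"hard Dice = {hard_dice(prob, gt):.2f}")  # 1.00: perfect after thresholding
```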
Self-check
- Can you explain the difference between macro, weighted macro, and micro?
- Do you have a written rule for empty-mask handling?
- Are thresholds and class mappings documented?
Exercises (hands-on)
Do these now. Then compare with the solutions below or in the exercises card.
Exercise 1 — Binary IoU and Dice
Given a binary segmentation task with TP=1200, FP=300, FN=500, compute IoU and Dice to two decimals.
- Show both the formula and your intermediate denominator values.
Exercise 2 — Multi-class macro and micro
For three classes (ignore background) with per-class counts:
- Class A: TP=50, FP=10, FN=20
- Class B: TP=30, FP=15, FN=15
- Class C: TP=40, FP=20, FN=40
Compute macro mIoU and mDice. Then compute micro IoU and Dice using totals across classes. Round to three decimals.
Checklist before you check solutions
- Wrote the exact formulas used
- Showed denominators for IoU and Dice
- Stated rounding rules
- Explained whether background is included
Practical projects
- Build a segmentation evaluation script: Given GT and predicted masks, output per-class IoU/Dice, macro/micro averages, and threshold sweep results.
- Error analysis dashboard: Visualize FP hot spots by overlaying masks and sorting images by lowest IoU.
- Calibration study: Compare metrics at thresholds from 0.3 to 0.7 and pick an operating point that balances FP and FN for your use case.
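For the threshold sweep, a starting-point sketch on synthetic data (the data generation and function names are placeholders; swap in your own masks and plot the printed pairs):

```python
import numpy as np

def iou_at(prob: np.ndarray, gt: np.ndarray, thr: float, eps: float = 1e-7) -> float:
    """IoU of the thresholded prediction against a boolean ground-truth mask."""
    pred = prob >= thr
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return float(tp / (tp + fp + fn + eps))

rng = np.random.default_rng(0)
gt = rng.random((64, 64)) < 0.2                            # synthetic ground truth
prob = np.clip(gt + rng.normal(0.0, 0.3, gt.shape), 0, 1)  # noisy "model" output

# Sweep thresholds from 0.3 to 0.7 and inspect stability:
for thr in np.arange(0.30, 0.71, 0.05):
    print(f"thr={thr:.2f}  IoU={iou_at(prob, gt, thr):.3f}")
```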
Learning path
- Before this: Confusion matrix fundamentals; segmentation basics
- Now: IoU and Dice metrics, averaging choices, and edge cases
- Next: Calibration, precision/recall curves for segmentation, and panoptic metrics if needed
Next steps
- Compute both IoU and Dice for your current project and compare macro vs micro trends
- Decide and document your empty-mask policy
- Run a small threshold sweep and plot metric vs threshold
Mini challenge
You are evaluating a medical segmentation model with many tiny lesions and severe class imbalance. What metric and averaging would you report as the main number, and what two supporting plots would you share with stakeholders? Justify briefly.