Why this matters
As a Computer Vision Engineer, you will ship object detectors that must be reliable in the real world. Metrics like IoU, AP, and mAP tell you how good your detector is at finding and localizing objects. Product decisions (e.g., whether a model is ready to deploy) often hinge on these metrics. Typical tasks include choosing a confidence threshold for alerts, comparing model variants, and diagnosing localization errors when precision or recall falls.
Who this is for
- Engineers training or evaluating object detection models.
- Data scientists comparing model checkpoints and datasets.
- ML practitioners preparing results for stakeholders.
Prerequisites
- Basic understanding of supervised learning (classification concepts like precision and recall).
- Familiarity with bounding boxes in detection (x1, y1, x2, y2).
- Ability to read simple PR curves and sort predictions by confidence.
Learning path
- IoU and matching rules (TP/FP/FN).
- Precision, recall, and PR curves.
- AP (Average Precision) for one class and one IoU threshold.
- mAP across classes and across IoU thresholds (VOC vs COCO).
- Interpreting results and avoiding common pitfalls.
Concept explained simply
Intersection over Union (IoU) measures overlap between a predicted and a ground-truth bounding box: IoU = area of overlap divided by area of union. If the IoU is at or above a threshold (e.g., 0.50), the predicted class is correct, and that ground-truth box is not already matched, the prediction counts as a True Positive (TP). Otherwise, it is a False Positive (FP). Any ground-truth box left unmatched becomes a False Negative (FN).
Precision is the fraction of your positive predictions that are correct. Recall is the fraction of ground truths you found. As you vary the confidence threshold, you trace out a precision–recall (PR) curve. Average Precision (AP) summarizes that curve as one number (area under the PR curve). Mean Average Precision (mAP) averages AP across classes and sometimes across multiple IoU thresholds.
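For a concrete feel, the definitions reduce to a few lines of arithmetic. This is a minimal sketch; the counts are hypothetical and chosen only to illustrate the formulas:
```python
# Minimal sketch: precision and recall at one confidence threshold.
# The counts are hypothetical, chosen only to illustrate the formulas.
tp, fp, fn = 8, 2, 4            # matched detections, spurious detections, missed objects

precision = tp / (tp + fp)      # 0.80: fraction of predictions that are correct
recall = tp / (tp + fn)         # 0.67: fraction of ground-truth objects found

print(f"precision={precision:.2f}, recall={recall:.2f}")
# Lowering the confidence threshold adds predictions: recall can only rise or
# stay flat, while precision usually drops. Sweeping the threshold from high
# to low traces out the PR curve that AP summarizes.
```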
Mental model
- IoU is a ruler for overlap quality.
- AP is how cleanly your ranked list of predictions aligns with actual objects.
- mAP is the report card across all classes and, in COCO, across stricter IoU thresholds too.
Core definitions and matching rules
- IoU = area(intersection) / area(union).
- Match a prediction to the highest-IoU ground-truth of the same class if IoU ≥ threshold and that ground-truth is not already matched.
- One ground-truth can match at most one prediction. Extra predictions on the same object are FPs (duplicates).
- Compute PR by sorting predictions by confidence (highest first) over the full evaluation set, then sweeping down the list.
- AP: area under the PR curve. VOC 2007 used 11-point interpolation; modern practice uses all points after enforcing a monotonic precision envelope.
- mAP@0.5 (VOC-style): AP averaged across classes at IoU=0.50.
- COCO mAP: AP averaged across classes and IoU thresholds from 0.50 to 0.95 in steps of 0.05.
Worked examples
Example 1: Compute IoU between two boxes
GT: [10,10,60,60], Pred: [12,12,58,58].
- GT area = (60-10)*(60-10) = 50*50 = 2500
- Pred area = (58-12)*(58-12) = 46*46 = 2116
- Intersection = 2116 (the predicted box lies entirely inside the GT box, so the intersection equals the Pred area)
- Union = 2500 + 2116 - 2116 = 2500
- IoU = 2116 / 2500 = 0.8464
At IoU threshold 0.5, this is a match.
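A minimal IoU sketch (assuming boxes in [x1, y1, x2, y2] pixel coordinates, as above) reproduces this number:
```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width/height clamp to 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou([10, 10, 60, 60], [12, 12, 58, 58]))  # 0.8464
```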
Example 2: TP/FP and AP at IoU=0.5 for a toy dataset
Dataset (class: person). GT boxes (4 total):
- img1: A [10,10,60,60]
- img2: B [20,20,80,80]
- img3: C [5,5,40,40], D [50,50,90,90]
Predictions (confidence):
- img1: p1 [12,12,58,58] 0.9; p2 [70,70,90,90] 0.6
- img2: p3 [18,18,65,65] 0.8; p4 [20,20,80,80] 0.4
- img3: p5 [6,6,38,38] 0.85; p6 [48,48,92,92] 0.7; p7 [10,50,40,90] 0.3
Compute IoU@0.5 matches (greedy, in descending confidence): p1-A TP, p5-C TP, p3-B TP, p6-D TP; p2 FP, p4 FP (duplicate on B), p7 FP. Sorting all predictions by confidence gives the sequence T, T, T, T, F, F, F.
PR points (cumulative):
- 1: P=1.00, R=0.25
- 2: P=1.00, R=0.50
- 3: P=1.00, R=0.75
- 4: P=1.00, R=1.00
- 5: P=0.80, R=1.00
- 6: P=0.67, R=1.00
- 7: P=0.57, R=1.00
AP@0.5 = 1.00 (every TP is ranked above every FP).
Example 3: Effect of IoU threshold on AP (0.75 vs 0.5)
Using the same predictions but IoU=0.75: p1 TP, p5 TP, p3 FP (IoU ≈ 0.535 < 0.75), p6 TP, p2 FP, p4 TP (exact match to B, which is still unmatched because p3 failed the threshold), p7 FP. Sequence: T, T, F, T, F, T, F.
- Cumulative (TP, FP) after each prediction: (1,0), (2,0), (2,1), (3,1), (3,2), (4,2), (4,3)
- Precision/Recall: (1.00,0.25), (1.00,0.50), (0.67,0.50), (0.75,0.75), (0.60,0.75), (0.67,1.00), (0.57,1.00)
- AP@0.75 (11-pt interp) ≈ 0.86
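A small sketch that turns a confidence-sorted TP/FP sequence into AP (all-point integration with the monotonic precision envelope; the function name is just illustrative) reproduces both examples:
```python
def average_precision(tp_flags, num_gt):
    """All-point AP from TP/FP flags already sorted by descending confidence."""
    tp = fp = 0
    recalls, precisions = [], []
    for is_tp in tp_flags:
        tp += int(is_tp)
        fp += int(not is_tp)
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Monotonic envelope: precision at recall r = max precision at any recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Integrate the PR curve: sum precision over each recall increment.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

print(average_precision([True, True, True, True, False, False, False], 4))  # 1.00 (Example 2)
print(average_precision([True, True, False, True, False, True, False], 4))  # ~0.85 (Example 3)
```
All-point integration gives about 0.85 for the IoU=0.75 sequence; the 11-point estimate quoted above is slightly higher (≈0.86) because it samples precision at 11 fixed recall levels.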
COCO mAP averages AP over IoU thresholds from 0.50 to 0.95 (step 0.05). Expect lower numbers than AP@0.5 because higher IoU thresholds are stricter.
How to compute AP (practical)
- Gather all predictions with class, confidence, and box; gather ground-truth boxes with class.
- Filter to a class (for AP of that class).
- Sort predictions by confidence descending.
- Greedy matching: for each prediction in order, match to the unmatched GT with highest IoU if IoU ≥ threshold; otherwise FP.
- Track cumulative TP and FP; compute precision = TP/(TP+FP) and recall = TP/total_GT at each step.
- Make precision monotonically non-increasing by replacing each precision value with the maximum precision at any equal-or-higher recall.
- Integrate area under the PR curve (sum over recall steps or 11-point method) to get AP.
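Putting these steps together, a minimal single-class evaluator could look like the sketch below. It assumes [x1, y1, x2, y2] boxes and reuses the iou and average_precision functions sketched earlier; production evaluators (e.g., pycocotools) add details such as crowd regions and 101-point recall interpolation, so exact numbers can differ.
```python
def evaluate_class(predictions, ground_truths, iou_thr=0.5):
    """AP for a single class at one IoU threshold.

    predictions:   list of (image_id, confidence, box)
    ground_truths: dict image_id -> list of GT boxes of this class
    Reuses iou() and average_precision() from the sketches above.
    """
    num_gt = sum(len(boxes) for boxes in ground_truths.values())
    matched = {img: [False] * len(boxes) for img, boxes in ground_truths.items()}

    tp_flags = []
    # Global sort by confidence, highest first.
    for img, _, box in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truths.get(img, [])):
            if matched[img][idx]:
                continue                      # each GT can be matched at most once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_iou >= iou_thr:
            matched[img][best_idx] = True     # TP: this prediction claims the GT
            tp_flags.append(True)
        else:
            tp_flags.append(False)            # FP: low IoU, duplicate, or background
    return average_precision(tp_flags, num_gt)

# Toy dataset from the worked examples (class "person").
gts = {"img1": [[10, 10, 60, 60]],
       "img2": [[20, 20, 80, 80]],
       "img3": [[5, 5, 40, 40], [50, 50, 90, 90]]}
preds = [("img1", 0.9, [12, 12, 58, 58]), ("img1", 0.6, [70, 70, 90, 90]),
         ("img2", 0.8, [18, 18, 65, 65]), ("img2", 0.4, [20, 20, 80, 80]),
         ("img3", 0.85, [6, 6, 38, 38]), ("img3", 0.7, [48, 48, 92, 92]),
         ("img3", 0.3, [10, 50, 40, 90])]

print(evaluate_class(preds, gts, iou_thr=0.5))   # 1.00
print(evaluate_class(preds, gts, iou_thr=0.75))  # ~0.85

# COCO-style averaging over IoU thresholds 0.50, 0.55, ..., 0.95 (one class here).
thresholds = [0.5 + 0.05 * i for i in range(10)]
print(sum(evaluate_class(preds, gts, t) for t in thresholds) / len(thresholds))
```
Averaging over the 10 thresholds in the last lines mirrors the COCO convention for one class; a full COCO mAP also averages the result over all classes.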
Choosing metrics and thresholds
- VOC-style: report AP@0.5 for each class and mean across classes.
- COCO-style: report AP@[0.50:0.95], and optionally AP_small/medium/large to understand scale sensitivity.
- Set an operating confidence threshold for deployment using the PR curve: balance precision vs recall based on business cost of FPs vs FNs.
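If the deployment requirement is phrased as "at least X precision", one simple recipe is to take the lowest confidence threshold that still meets the target, which maximizes recall under that constraint. A minimal sketch (pick_threshold is a hypothetical helper; the flags reuse the Example 2 matches at IoU=0.5):
```python
def pick_threshold(scored_flags, num_gt, min_precision=0.9):
    """Lowest confidence threshold whose cumulative precision meets the target.

    scored_flags: list of (confidence, is_tp) pairs; num_gt: total GT count.
    Returns (threshold, precision, recall) for the chosen operating point,
    or None if no threshold reaches the target precision.
    """
    tp = fp = 0
    best = None
    for conf, is_tp in sorted(scored_flags, key=lambda x: x[0], reverse=True):
        tp += int(is_tp)
        fp += int(not is_tp)
        precision = tp / (tp + fp)
        recall = tp / num_gt
        if precision >= min_precision:
            best = (conf, precision, recall)  # later points = lower threshold, higher recall
    return best

# Example 2 matches at IoU=0.5 as (confidence, is_tp) pairs.
flags = [(0.9, True), (0.85, True), (0.8, True), (0.7, True),
         (0.6, False), (0.4, False), (0.3, False)]
print(pick_threshold(flags, num_gt=4, min_precision=0.9))  # (0.7, 1.0, 1.0)
```
For a recall-first use-case, invert the constraint: require a minimum recall and choose the threshold that maximizes precision subject to it.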
Common mistakes and self-check
- Using accuracy for detection. Self-check: Are you reporting AP/mAP instead of accuracy?
- Double-matching a GT to multiple predictions. Self-check: After one match, mark GT as matched.
- Per-image sorting by confidence. Self-check: Sort globally across the dataset for AP.
- Ignoring class labels in matching. Self-check: Only match when predicted class equals GT class.
- Not handling duplicates. Self-check: Extra predictions on a matched GT count as FP.
- Comparing AP@0.5 to COCO mAP directly. Self-check: Be explicit about IoU thresholds used.
Exercises
These mirror the practice tasks below. Try them here, then check the detailed solutions in each exercise card.
- Exercise 1: Compute IoU, TP/FP, precision, recall, and AP@0.5 for a small dataset.
- Exercise 2: Re-evaluate the same predictions at IoU=0.75 and compare AP and mAP.
- Checklist before you compute:
  - Have you sorted predictions by confidence globally?
  - Are you matching each GT at most once?
  - Are class labels required to match?
  - Are you using the correct IoU threshold?
  - Did you make precision monotonic before integrating for AP?
Practical projects
- Build a small evaluator: Given JSON of GT and predictions, compute AP@0.5 and COCO-style AP@[0.50:0.95] for one class.
- Threshold tuner: Plot PR curve and pick a deployment threshold for two different use-cases (high precision vs high recall).
- Error heatmap: Identify top FP types (background vs duplicate vs class confusion) and propose two data fixes.
Mini challenge
You improved localization but see AP@0.5 unchanged while AP@0.75 increased. Explain in one paragraph how this can happen and whether it matters for your use-case. Suggest a next change to training or post-processing that aligns with your goal.
Next steps
- Deepen error analysis: break down AP by object size and crowding.
- Investigate confidence calibration to better align scores with precision.
- Evaluate per-class AP to prioritize data collection for low-performing classes.
About the quick test and saving progress
The quick test is available to everyone for free. If you are logged in, your progress will be saved automatically so you can resume later.