Why this skill matters for Computer Vision Engineers
Great models fail in the real world when evaluation is shallow. As a Computer Vision Engineer, robust evaluation and error analysis help you ship models that work across lighting, weather, devices, geographies, and use cases. You will know what to improve, when to recalibrate, and how to prioritize data collection. This skill unlocks confident model launches, faster iteration, and clear communication with teammates and stakeholders.
What you will be able to do
- Measure classification (Top-1/Top-5), detection (mAP/IoU), and segmentation (IoU/Dice) correctly.
- Calibrate probabilities, pick thresholds, and balance precision vs recall for your product goals.
- Analyze performance by slices (e.g., night vs day) to find brittle spots.
- Test robustness under lighting and weather variations.
- Run confusion and failure mode analyses to drive fixes.
- Organize human review and label audits to reduce annotation bias and noise.
Who this is for
- Computer Vision Engineers building classification, detection, or segmentation systems.
- ML Engineers and Data Scientists deploying vision models on edge or cloud.
- Tech leads needing dependable evaluation to guide roadmap and releases.
Prerequisites
- Comfort with Python and NumPy.
- Basic understanding of supervised learning and loss functions.
- Familiarity with at least one CV task (classification, detection, or segmentation).
Learning path
- Foundations of metrics: Learn Top-1/Top-5, IoU, Dice, AP/mAP; compute them on small toy sets.
- Calibration and thresholds: Plot precision–recall trade-offs; select operating points; compute ECE.
- Slice-based analysis: Add metadata (lighting, weather); measure per-slice metrics; track the worst slice.
- Robustness testing: Apply controlled augmentations (brightness, blur, rain) and re-evaluate.
- Failure modes and label audits: Build a confusion matrix, tag errors, sample and review labels.
Quick guidance: When to use which metric?
- Image classification: Top-1/Top-5 accuracy, ROC/PR curves if probabilities matter.
- Object detection: IoU for matching boxes, AP/mAP across IoU thresholds.
- Segmentation: IoU and Dice; use Dice when the foreground is tiny.
Worked examples
1) Classification Top-1 and Top-5
Compute Top-1 and Top-5 accuracy on a toy batch.
import numpy as np
# Ground truth classes (0..9)
y_true = np.array([2, 0, 5, 7])
# Model predicted class probabilities (batch_size x num_classes)
probs = np.array([
    [0.01, 0.02, 0.60, 0.10, 0.05, 0.07, 0.03, 0.06, 0.04, 0.02],  # true class 2 is the top prediction
    [0.25, 0.30, 0.05, 0.05, 0.10, 0.02, 0.08, 0.03, 0.06, 0.06],  # true class 0 is only the second-highest
    [0.05, 0.07, 0.10, 0.10, 0.10, 0.35, 0.05, 0.06, 0.06, 0.06],  # true class 5 is the top prediction
    [0.05, 0.06, 0.05, 0.05, 0.05, 0.05, 0.10, 0.48, 0.05, 0.06],  # true class 7 is the top prediction
])
pred_top1 = probs.argmax(axis=1)
acc_top1 = (pred_top1 == y_true).mean()
# Top-5: check whether the true class is among the 5 highest-probability classes
sorted_idx = np.argsort(-probs, axis=1)
top5 = sorted_idx[:, :5]
acc_top5 = np.mean([y_true[i] in top5[i] for i in range(len(y_true))])
print(f"Top-1 accuracy: {acc_top1:.2f}")  # 0.75: the second sample misses at Top-1
print(f"Top-5 accuracy: {acc_top5:.2f}")  # 1.00: every true class appears in the top 5
2) IoU and AP for object detection (single class, IoU@0.5)
Match predictions to ground truth by IoU, then compute AP from the precision-recall curve obtained by sorting detections by confidence.
import numpy as np
def iou(boxA, boxB):
    # boxes: [x1, y1, x2, y2]
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])
    inter = max(0, xB - xA) * max(0, yB - yA)
    areaA = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    areaB = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    union = areaA + areaB - inter
    return inter / union if union > 0 else 0.0
# Ground truth boxes per image
GT = {
    0: [np.array([20, 20, 80, 80])],
    1: [np.array([15, 15, 60, 60])],
}
# Predictions: (img_id, box, score)
P = [
    (0, np.array([18, 18, 82, 82]), 0.9),  # good match
    (0, np.array([0, 0, 30, 30]), 0.6),    # false positive
    (1, np.array([10, 10, 58, 58]), 0.8),  # good match
]
P.sort(key=lambda x: -x[2])
TP, FP = [], []
assigned = {k: [False]*len(v) for k,v in GT.items()}
for img_id, box, score in P:
    gts = GT[img_id]
    ious = [iou(box, g) for g in gts]
    j = int(np.argmax(ious))
    if ious[j] >= 0.5 and not assigned[img_id][j]:
        TP.append(1); FP.append(0); assigned[img_id][j] = True
    else:
        TP.append(0); FP.append(1)
TP = np.array(TP); FP = np.array(FP)
cumTP = np.cumsum(TP)
cumFP = np.cumsum(FP)
prec = cumTP / np.maximum(cumTP + cumFP, 1)
rec = cumTP / sum(len(v) for v in GT.values())
# Interpolated AP (11-point or continuous). Here: continuous.
def average_precision(rec, prec):
    # Make precision non-increasing (right-to-left envelope)
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    # Integrate the precision-recall curve with rectangles
    r = np.concatenate(([0.0], rec, [1.0]))
    p = np.concatenate(([prec[0]], prec, [0.0]))
    return np.sum((r[1:] - r[:-1]) * p[1:])
ap = average_precision(rec.copy(), prec.copy())
print("Precision:", prec)
print("Recall:", rec)
print(f"AP@0.5: {ap:.3f}")
3) Segmentation metrics: IoU and Dice
Compute IoU and Dice for binary masks.
import numpy as np
# Ground truth and prediction masks (H x W) with values {0,1}
GT = np.array([[1,1,0],[0,1,0],[0,0,0]], dtype=np.uint8)
PR = np.array([[1,0,0],[0,1,1],[0,0,0]], dtype=np.uint8)
intersection = np.logical_and(GT, PR).sum()
union = np.logical_or(GT, PR).sum()
IoU = intersection / union if union else 1.0
sum_pixels = GT.sum() + PR.sum()
Dice = (2 * intersection) / sum_pixels if sum_pixels else 1.0
print(f"IoU: {IoU:.3f}")
print(f"Dice: {Dice:.3f}")
4) Calibration and threshold selection
Pick a threshold to maximize F1, then estimate Expected Calibration Error (ECE) with simple binning.
import numpy as np
# Binary classification scores and labels
scores = np.array([0.95, 0.85, 0.70, 0.55, 0.45, 0.40, 0.20, 0.10])
labels = np.array([1, 1, 1, 0, 0, 0, 0, 0])
ths = np.linspace(0.1, 0.9, 9)
best_f1, best_t = -1, None
for t in ths:
    pred = (scores >= t).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0
    if f1 > best_f1:
        best_f1, best_t = f1, t
print(f"Best threshold by F1: {best_t:.2f} (F1={best_f1:.2f})")
# Simple ECE with 5 bins (treats the raw score as confidence; a stricter binary ECE would bin on max(p, 1-p))
B = 5
bins = np.linspace(0, 1, B+1)
inds = np.digitize(scores, bins) - 1
N = len(scores)
ECE = 0.0
for b in range(B):
    idx = np.where(inds == b)[0]
    if len(idx) == 0:
        continue
    conf = scores[idx].mean()                    # average confidence in the bin
    preds = (scores[idx] >= best_t).astype(int)  # predictions at the chosen threshold
    acc = (preds == labels[idx]).mean()          # accuracy in the bin
    ECE += abs(acc - conf) * (len(idx) / N)
print(f"ECE (approx): {ECE:.3f}")
5) Slice-based analysis and robustness checks
Evaluate by condition slices and under controlled perturbations.
import numpy as np
# Pretend we have per-image metadata: lighting in {day, night}
lighting = np.array(["day","day","night","night","day","night"])
# Binary labels and predicted scores
labels = np.array([1,0,1,0,1,0])
scores = np.array([0.9,0.6,0.7,0.4,0.8,0.3])
# Per-slice accuracy at a fixed threshold (a simple stand-in for your real metric)
thr = 0.5
acc_day = np.mean(((scores[lighting=="day"]>=thr).astype(int) == labels[lighting=="day"]))
acc_night = np.mean(((scores[lighting=="night"]>=thr).astype(int) == labels[lighting=="night"]))
print(f"Accuracy day: {acc_day:.2f}")
print(f"Accuracy night: {acc_night:.2f}")
# Robustness: crudely simulate a brightness drop by lowering scores; in practice, re-run the model on darkened images
scores_dark = np.clip(scores - 0.1, 0, 1)
acc_night_dark = np.mean(((scores_dark[lighting=="night"]>=thr).astype(int) == labels[lighting=="night"]))
print(f"Night accuracy after darkening: {acc_night_dark:.2f}")
How to interpret the differences
- Large gaps between slices suggest data or model bias. Prioritize more data or augmentation for the weakest slice.
- Robustness deltas quantify sensitivity. Set targets (e.g., night accuracy within 5% of day).
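A tiny, hypothetical gate that turns such targets into a pass/fail check; the accuracies below are stand-ins for real measurements, and the 5-point target is only an example policy.
# Hypothetical per-slice accuracies and an example 5-point target.
acc_day, acc_night, acc_night_dark = 0.92, 0.81, 0.74
target = 0.05
slice_gap = acc_day - acc_night                 # gap between slices
robustness_delta = acc_night - acc_night_dark   # drop under the darkening perturbation
print(f"Slice gap: {slice_gap:.2f}  Robustness delta: {robustness_delta:.2f}")
if slice_gap > target or robustness_delta > target:
    print("Target violated: prioritize night data, augmentation, or recalibration.")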
Drills and exercises
- Compute Top-1 and Top-5 on a batch; verify with two different batch sizes.
- Write a vectorized IoU function for N x M boxes; unit test edge cases (a sketch appears after this list).
- Implement AP at IoU=0.5 and extend to mAP@[.5:.95].
- Plot precision–recall and choose thresholds for F1=balanced vs high-precision scenarios.
- Create 3 metadata slices (lighting, occlusion, camera) and report per-slice metrics.
- Apply brightness/contrast/rain augmentations; report metric deltas and worst-case performance.
- Build a confusion matrix and write a brief error taxonomy (3–5 failure modes).
- Run a 50-sample label audit; estimate noise rate and propose fixes.
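For the vectorized IoU drill, here is one possible sketch, assuming boxes in [x1, y1, x2, y2] format as in the worked detection example; it is a starting point, not the only valid layout.
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between box sets of shapes (N, 4) and (M, 4), format [x1, y1, x2, y2]."""
    a = boxes_a[:, None, :]   # (N, 1, 4)
    b = boxes_b[None, :, :]   # (1, M, 4)
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return np.where(union > 0, inter / np.maximum(union, 1e-12), 0.0)

# Quick checks: identical boxes give 1.0, disjoint boxes give 0.0.
A = np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=float)
B = np.array([[0, 0, 10, 10], [100, 100, 110, 110]], dtype=float)
print(iou_matrix(A, B))  # expect 1.0 for the identical pair, 0.0 elsewhere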
Common mistakes and debugging tips
Mistake: Counting multiple detections per object as multiple true positives
Tip: After sorting by confidence, match each ground-truth object at most once (greedy by IoU); any additional detections of an already-matched object count as false positives.
Mistake: Using accuracy for imbalanced problems
Tip: Prefer precision/recall, PR AUC, and operating-point metrics (F1, recall@precision≥X) aligned to product goals.
Mistake: Ignoring calibration
Tip: Compute ECE or reliability diagrams. Consider temperature scaling or isotonic regression; re-tune thresholds after calibration.
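A minimal sketch of temperature scaling, assuming you have pre-sigmoid logits on a held-out set (the values below are illustrative); it fits a single temperature by grid search over the negative log-likelihood, though in practice you might minimize it with an optimizer instead.
import numpy as np

def nll(T, logits, labels):
    # Negative log-likelihood of binary labels under sigmoid(logits / T).
    p = 1.0 / (1.0 + np.exp(-logits / T))
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

# Held-out logits and labels (illustrative values, not from a real model).
logits = np.array([3.0, 2.5, 1.0, 0.5, -0.2, -0.5, -2.0, -3.0])
labels = np.array([1, 1, 1, 0, 0, 0, 0, 0])

# Simple grid search over candidate temperatures.
temps = np.linspace(0.5, 5.0, 46)
T_best = temps[np.argmin([nll(T, logits, labels) for T in temps])]
calibrated = 1.0 / (1.0 + np.exp(-logits / T_best))
print(f"Best temperature: {T_best:.2f}")
print("Calibrated probabilities:", np.round(calibrated, 3))
Remember to re-tune operating thresholds after calibration, since the calibrated scores shift.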
Mistake: Aggregating metrics without slice visibility
Tip: Always report per-slice metrics and track the weakest slice over time.
Mistake: Comparing mAP numbers with different IoU thresholds
Tip: State the protocol (e.g., mAP@0.5 or mAP@[.5:.95]) and dataset version to make comparisons fair.
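One way to extend example 2 from AP@0.5 to mAP@[.5:.95] is to wrap the matching-and-AP logic in a helper that takes the IoU threshold as a parameter and average over thresholds. The sketch below reuses the toy boxes from example 2 and simplifies the full COCO protocol (no per-class averaging, no area ranges).
import numpy as np

def box_iou(a, b):
    xA, yA = max(a[0], b[0]), max(a[1], b[1])
    xB, yB = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, xB - xA) * max(0, yB - yA)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def ap_at_iou(P, GT, thr):
    # Same greedy matching as example 2, but with a configurable IoU threshold.
    dets = sorted(P, key=lambda x: -x[2])
    assigned = {k: [False] * len(v) for k, v in GT.items()}
    tp, fp = [], []
    for img_id, box, _ in dets:
        ious = [box_iou(box, g) for g in GT.get(img_id, [])]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= thr and not assigned[img_id][j]:
            tp.append(1); fp.append(0); assigned[img_id][j] = True
        else:
            tp.append(0); fp.append(1)
    ctp, cfp = np.cumsum(tp), np.cumsum(fp)
    prec = ctp / np.maximum(ctp + cfp, 1)
    rec = ctp / max(sum(len(v) for v in GT.values()), 1)
    prec = np.maximum.accumulate(prec[::-1])[::-1]  # non-increasing precision envelope
    r = np.concatenate(([0.0], rec))
    p = np.concatenate(([prec[0]], prec))
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# Toy data from example 2.
GT = {0: [np.array([20, 20, 80, 80])], 1: [np.array([15, 15, 60, 60])]}
P = [(0, np.array([18, 18, 82, 82]), 0.9),
     (0, np.array([0, 0, 30, 30]), 0.6),
     (1, np.array([10, 10, 58, 58]), 0.8)]
thresholds = np.linspace(0.5, 0.95, 10)  # 0.50, 0.55, ..., 0.95
mAP = float(np.mean([ap_at_iou(P, GT, t) for t in thresholds]))
print(f"mAP@[.5:.95]: {mAP:.3f}")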
Mini project: Robust evaluation of a detector
Goal: Evaluate a single-class object detector and make it robust to night and rain conditions.
- Compute IoU and AP@0.5 on a validation set; verify with 5–10 manual examples.
- Build per-slice metrics for day vs night and clear vs rainy images; record gaps.
- Calibrate detection scores (e.g., temperature scaling) and re-tune the score threshold to hit precision≥0.9 (a threshold-selection sketch follows this list).
- Apply synthetic darkening and rain overlays; re-evaluate and report worst-case AP.
- Create a confusion and failure log with 20 reviewed cases; define three actionable fixes (data, augmentation, post-processing).
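One minimal way to re-tune the score threshold for the precision target: after IoU matching, treat each detection as a true or false positive, then sweep thresholds from strict to lenient and keep the most lenient one that still meets precision≥0.9 (maximizing recall under the constraint). The scores, match outcomes, and ground-truth count below are illustrative stand-ins.
import numpy as np

# Illustrative validation detections: confidence scores and whether each detection
# matched a ground-truth object after IoU matching (1 = true positive, 0 = false positive).
scores = np.array([0.97, 0.94, 0.90, 0.85, 0.80, 0.72, 0.65, 0.55, 0.40, 0.30])
matched = np.array([1,    1,    1,    0,    1,    1,    0,    1,    0,    0])
n_gt = 8  # hypothetical total number of ground-truth objects

target_precision = 0.90
best = None
for t in np.unique(scores)[::-1]:        # sweep thresholds from strict to lenient
    keep = scores >= t
    tp = np.sum(matched[keep] == 1)
    fp = np.sum(matched[keep] == 0)
    prec = tp / max(tp + fp, 1)
    rec = tp / n_gt
    if prec >= target_precision:
        best = (t, prec, rec)            # most lenient threshold that still meets the target
if best:
    t, prec, rec = best
    print(f"Operate at threshold {t:.2f}: precision {prec:.2f}, recall {rec:.2f}")
else:
    print("No threshold reaches the precision target on this set.")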
Practical project ideas
- Small-scale segmentation benchmark: Compare IoU/Dice across three augmentations and two thresholds.
- Calibration clinic: Implement temperature scaling and isotonic regression; compare ECE before/after.
- Slice tracker: Build a script that outputs a table of per-slice metrics and flags the weakest slice.
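A possible starting point for the slice tracker (the slice names and per-image records below are hypothetical): group outcomes by slice, print a small table, and flag the weakest slice.
import numpy as np

# Hypothetical per-image records: (slice_name, label, score).
records = [
    ("day",   1, 0.9), ("day",   0, 0.2), ("day",   1, 0.8),
    ("night", 1, 0.6), ("night", 0, 0.7), ("night", 1, 0.4),
    ("rain",  0, 0.3), ("rain",  1, 0.9),
]
thr = 0.5
table = {}
for name, label, score in records:
    table.setdefault(name, []).append(int((score >= thr) == label))

print(f"{'slice':<8}{'n':>4}{'accuracy':>10}")
for name, hits in sorted(table.items()):
    print(f"{name:<8}{len(hits):>4}{np.mean(hits):>10.2f}")

worst_name, worst_hits = min(table.items(), key=lambda kv: np.mean(kv[1]))
print(f"Weakest slice: {worst_name} (accuracy {np.mean(worst_hits):.2f})")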
Subskills
- Metrics For Classification Top1 Top5 — Learn Top-1/Top-5 accuracy and when to use each. Estimated time: 45–90 min.
- Metrics For Detection Map Iou — Compute IoU, AP, and mAP with correct matching rules. Estimated time: 60–120 min.
- Metrics For Segmentation Iou Dice — Use IoU and Dice; handle class imbalance. Estimated time: 45–90 min.
- Calibration And Thresholding — Improve probability calibration and choose operating thresholds. Estimated time: 60–120 min.
- Slice Based Analysis By Conditions — Report metrics by metadata (lighting, weather, device). Estimated time: 45–90 min.
- Robustness Testing Lighting Weather — Test under controlled perturbations and report deltas. Estimated time: 45–90 min.
- Confusion And Failure Mode Analysis — Build confusion matrices and a failure taxonomy. Estimated time: 45–90 min.
- Human Review And Label Audits — Run sampling, double-blind checks, and estimate label noise. Estimated time: 45–90 min.
Next steps
- Automate your evaluation: one script that computes metrics, slice reports, robustness checks, and calibration in one run.
- Adopt “weakest-slice-first” dashboards to focus improvements.
- Document your evaluation protocol so future results are comparable.