Why this matters
Top-1 and Top-5 accuracy are the go-to metrics when evaluating image classification models with large label spaces (e.g., ImageNet-scale). As a Computer Vision Engineer, you will use them to:
- Benchmark new architectures and training runs quickly.
- Communicate performance to stakeholders with intuitive numbers.
- Compare against baselines and research papers that report Top-1/Top-5.
- Detect ranking behavior: a large gap between Top-1 and Top-5 means the model often places the correct class near the top but not first.
Concept explained simply
Think of the model producing a ranked list of class guesses for each image.
- Top-1 accuracy: the fraction of images where the highest-scoring (number-one) guess matches the ground-truth label.
- Top-5 accuracy: the fraction of images where the ground-truth label appears anywhere in the top five guesses.
Mental model
Imagine a podium of K spots. If the correct class stands on the podium, you score a hit for Top-K. Top-1 is a one-spot podium. Top-5 is a five-spot podium. As K grows, it becomes easier to hit.
When to use Top-5 vs Top-1
- Large label spaces (hundreds/thousands of classes): report both Top-1 and Top-5.
- Small label spaces (e.g., 5-10 classes): Top-5 may become trivial; prefer Top-1 and class-wise metrics.
- Multi-label tasks: Top-K accuracy is usually not appropriate; use metrics like mAP, F1, or per-label precision/recall.
How to compute Top-K
- For each sample, get scores for all classes (logits or probabilities).
- Sort classes by score descending or use a Top-K operator to get the K highest-scoring classes.
- Check if the ground-truth class is among those K classes.
- Count hits across the dataset and divide by total samples.
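A minimal sketch of these steps for a single sample; the scores and label here are made up for illustration:

import numpy as np

scores = np.array([0.20, 0.45, 0.25, 0.10])  # hypothetical scores for 4 classes
y_true = 2                                   # ground-truth class index
k = 3

# Rank classes by descending score and keep the K highest
topk_idx = np.argsort(-scores)[:k]           # array([1, 2, 0])

# It is a hit if the true class appears among those K indices
hit = y_true in topk_idx                     # True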
Edge cases to handle
- If K > number of classes: clamp K to the number of classes.
- Ties in scores: break ties with a stable rule (e.g., by index) to keep results deterministic.
- Unknown/other class: decide in advance how to handle; usually excluded or treated consistently across runs.
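The first two checks are cheap to build in. A short sketch, assuming NumPy arrays: clamp K, and use a stable sort so ties are broken by class index, which keeps results deterministic across runs.

import numpy as np

scores = np.array([0.4, 0.4, 0.1, 0.1])   # hypothetical tie between classes 0 and 1
k = min(10, scores.shape[0])              # clamp: K becomes 4, not 10

# A stable sort breaks ties by original index, so the result is deterministic
topk_idx = np.argsort(-scores, kind="stable")[:k]
print(topk_idx)                           # [0 1 2 3]: class 0 ranks ahead of class 1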
Worked examples
Example 1: Single sample, Top-1 vs Top-3
Classes: [cat, dog, car, bike]. Scores: [0.20, 0.45, 0.25, 0.10]. Ground truth: car.
- Top-1 guess: dog (0.45). Not correct.
- Top-3 guesses: [dog, car, cat]. Ground truth (car) is in Top-3: hit.
Top-1 = 0/1, Top-3 = 1/1.
Example 2: ImageNet-like Top-5
Classes: thousands. Suppose the Top-5 predictions are [golden_retriever, labrador, kuvasz, borzoi, beagle] and the ground truth is labrador. Even though the Top-1 guess (golden_retriever) is wrong, Top-5 counts the sample as correct because labrador appears among the top five.
Example 3: Batch computation
Batch of 3 samples, classes [0..4].
- Sample A: scores [0.1, 0.5, 0.15, 0.1, 0.15], y=1. Top-1 prediction is class 1: correct. Top-3 contains 1: correct.
- Sample B: scores [0.3, 0.25, 0.2, 0.15, 0.1], y=4. Top-1 prediction is class 0; Top-3 = [0, 1, 2] does not contain 4: incorrect for both.
- Sample C: scores [0.05, 0.08, 0.7, 0.1, 0.07], y=2. Top-1 prediction is class 2: correct. Top-3 contains 2: correct.
Top-1: 2/3 ≈ 66.7%. Top-3: Sample A (hit), B (miss), C (hit) → 2/3 ≈ 66.7%.
Implementation tips
Simple implementation (NumPy)
import numpy as np

def topk_accuracy(scores, y_true, k):
    # scores: array of shape (N, C), one score (logit or probability) per class
    # y_true: integer class indices, shape (N,)
    k = min(k, scores.shape[1])  # clamp K to the number of classes
    # argsort descending and take the top-k indices per sample
    topk_idx = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk_idx == y_true.reshape(-1, 1)).any(axis=1)
    return hits.mean()
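For example, feeding the batch from Example 3 into this function reproduces the numbers worked out above:

scores = np.array([
    [0.10, 0.50, 0.15, 0.10, 0.15],  # Sample A, y=1
    [0.30, 0.25, 0.20, 0.15, 0.10],  # Sample B, y=4
    [0.05, 0.08, 0.70, 0.10, 0.07],  # Sample C, y=2
])
y_true = np.array([1, 4, 2])
print(topk_accuracy(scores, y_true, 1))  # 0.666...
print(topk_accuracy(scores, y_true, 3))  # 0.666...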
PyTorch snippet
import torch

with torch.no_grad():
    # logits: (N, C), targets: (N,)
    max_k = min(5, logits.size(1))  # clamp in case there are fewer than 5 classes
    _, pred = logits.topk(max_k, dim=1, largest=True, sorted=True)
    correct = pred.eq(targets.view(-1, 1))  # (N, max_k) boolean matrix
    top1 = correct[:, :1].any(dim=1).float().mean().item()
    top5 = correct[:, :max_k].any(dim=1).float().mean().item()
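In a full evaluation you usually accumulate hit counts over all batches rather than averaging per-batch means, so that batches of different sizes are weighted correctly. A rough sketch, where model and val_loader are placeholder names for your own network and validation DataLoader:

import torch

def evaluate_topk(model, val_loader, device="cpu", k=5):
    model.eval()
    top1_hits, topk_hits, total = 0, 0, 0
    with torch.no_grad():
        for images, targets in val_loader:
            logits = model(images.to(device))
            targets = targets.to(device)
            max_k = min(k, logits.size(1))
            _, pred = logits.topk(max_k, dim=1, largest=True, sorted=True)
            correct = pred.eq(targets.view(-1, 1))
            top1_hits += correct[:, :1].any(dim=1).sum().item()
            topk_hits += correct.any(dim=1).sum().item()
            total += targets.size(0)
    return top1_hits / total, topk_hits / total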
Fast Top-K without full sort
Use partial selection (e.g., topk/partition) to avoid O(C log C) sorting when C is large. This matters for thousands of classes.
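In NumPy, one option is np.argpartition, which selects the K largest entries per row in roughly linear time without ordering the rest; ordering is unnecessary here because Top-K only needs a membership test. A sketch:

import numpy as np

def topk_accuracy_fast(scores, y_true, k):
    # scores: (N, C), y_true: (N,)
    k = min(k, scores.shape[1])
    # argpartition places the indices of the k largest scores (unordered) in the last k columns
    topk_idx = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (topk_idx == y_true.reshape(-1, 1)).any(axis=1)
    return hits.mean()

The torch.topk call in the PyTorch snippet above serves the same purpose on tensors.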
Exercises
Do the tasks below to solidify your understanding. Solutions are shown after each exercise.
Exercise 1
Classes: [cat, dog, car, bike, bird, boat]. For each sample, the list shows scores in the same class order. Compute Top-1 and Top-5 accuracy.
- S1: y=dog, scores=[0.10, 0.40, 0.20, 0.05, 0.15, 0.10]
- S2: y=boat, scores=[0.20, 0.25, 0.18, 0.03, 0.20, 0.14]
- S3: y=bird, scores=[0.05, 0.10, 0.15, 0.40, 0.20, 0.10]
- S4: y=car, scores=[0.30, 0.22, 0.18, 0.12, 0.10, 0.08]
- S5: y=bike, scores=[0.12, 0.08, 0.10, 0.01, 0.40, 0.29]
- S6: y=cat, scores=[0.33, 0.11, 0.30, 0.09, 0.10, 0.07]
Checklist before solving
- Sort each vector descending; note the top index.
- Top-1 counts if top index equals the true class index.
- Top-5 counts if the true class index appears among the top 5.
Solution
Top-1 hits: S1 (dog is top), S6 (cat is top) → 2/6 ≈ 33.3%.
Top-5 hits: All except S5 (bike has the lowest score, 6th) → 5/6 ≈ 83.3%.
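To double-check the solution with code, here is a quick NumPy verification using the class order cat, dog, car, bike, bird, boat (indices 0-5):

import numpy as np

scores = np.array([
    [0.10, 0.40, 0.20, 0.05, 0.15, 0.10],  # S1, y=dog (1)
    [0.20, 0.25, 0.18, 0.03, 0.20, 0.14],  # S2, y=boat (5)
    [0.05, 0.10, 0.15, 0.40, 0.20, 0.10],  # S3, y=bird (4)
    [0.30, 0.22, 0.18, 0.12, 0.10, 0.08],  # S4, y=car (2)
    [0.12, 0.08, 0.10, 0.01, 0.40, 0.29],  # S5, y=bike (3)
    [0.33, 0.11, 0.30, 0.09, 0.10, 0.07],  # S6, y=cat (0)
])
y_true = np.array([1, 5, 4, 2, 3, 0])
for k in (1, 5):
    topk_idx = np.argsort(-scores, axis=1)[:, :k]
    print(k, (topk_idx == y_true.reshape(-1, 1)).any(axis=1).mean())
# prints 1 0.333..., then 5 0.833...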
Exercise 2
Write a function signature to compute Top-K accuracy for multiple K values at once (e.g., K in [1,3,5]). Provide a brief plan or pseudocode.
Hint
Compute predictions up to max(K). Then reuse the same top indices to test membership for each K.
Solution
def multi_topk_accuracy(scores, y_true, ks):
    # Clamp each K to the number of classes, deduplicate, and sort ascending
    ks = sorted(set(min(k, scores.shape[1]) for k in ks))
    max_k = ks[-1]
    # Rank once up to the largest K, then reuse the same indices for every smaller K
    topk_idx = np.argsort(-scores, axis=1)[:, :max_k]
    results = {}
    for k in ks:
        hits = (topk_idx[:, :k] == y_true.reshape(-1, 1)).any(axis=1)
        results[k] = hits.mean()
    return results  # e.g., {1: 0.78, 3: 0.91, 5: 0.95}
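A quick sanity check of this function: with uniformly random scores, Top-K accuracy should land near chance level, i.e., roughly K divided by the number of classes.

import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((1000, 100))          # 1000 samples, 100 classes, random scores
y_true = rng.integers(0, 100, size=1000)
print(multi_topk_accuracy(scores, y_true, [1, 3, 5]))
# expect values near {1: 0.01, 3: 0.03, 5: 0.05}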
Common mistakes and self-check
- Using Top-K on multi-label tasks. Fix: use mAP/F1 instead; Top-K assumes a single correct label per sample.
- Forgetting to clamp K to number of classes. Fix: k = min(k, C).
- Sorting the wrong dimension. Fix: verify shape (N, C) and sort along classes.
- Assuming calibrated probabilities are needed. They are not; only the ranking matters for Top-K.
- Ignoring ties. Fix: adopt a consistent tie-break rule.
Self-check
- Top-5 should never be lower than Top-1 on the same split.
- If K ≥ number of classes and there is exactly one ground-truth class per sample, Top-K should be 100%.
- Changing the softmax temperature (a positive rescaling of the logits) does not change the ranking, so Top-K accuracy should stay the same.
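The temperature point is easy to verify: dividing logits by any positive temperature is a monotonic transformation, so the ranking, and therefore every Top-K metric, is unchanged. A small check:

import numpy as np

logits = np.array([[2.0, 0.5, 1.0], [0.1, 3.0, 0.2]])  # hypothetical logits
for T in (0.5, 1.0, 10.0):
    print(T, np.argsort(-logits / T, axis=1))           # identical ranking for every T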
Mini challenge
You have 4 classes and these predictions for two images:
- I1: scores=[2.1, 0.4, 1.9, 0.2], y=2
- I2: scores=[0.9, 1.2, 1.1, 1.3], y=3
Compute Top-1 and Top-3 accuracy. Verify monotonicity (Top-3 ≥ Top-1). Use scratch paper, then confirm:
Answer
I1: sorted indices by score → [0,2,1,3]; Top-1=0 ≠ 2 → miss; Top-3 contains 2 → hit.
I2: sorted indices → [3,1,2,0]; Top-1=3 = y → hit; Top-3 contains 3 → hit.
Top-1: 1/2 = 50%. Top-3: 2/2 = 100%.
Who this is for
- Computer Vision Engineers evaluating image classifiers.
- ML practitioners comparing models with large label spaces.
- Students preparing for benchmarks that report Top-1/Top-5.
Prerequisites
- Basic classification understanding (logits, softmax, argmax).
- Comfort with arrays/tensors and sorting/top-k operations.
- Single-label vs multi-label distinction.
Learning path
- Refresh single-label classification and accuracy.
- Learn Top-K metrics and why they matter for large label spaces.
- Implement Top-1/Top-5 efficiently (partial top-k).
- Add evaluation to your training loop and log both metrics.
- Analyze gaps between Top-1 and Top-5 to guide improvements.
Practical projects
- Add Top-1/Top-5 evaluation to a training script for a dataset with 100+ classes. Track metrics per epoch.
- Experiment with data augmentation or label smoothing and see how Top-1 vs Top-5 change.
- Diagnose a model with high Top-5 but low Top-1. Inspect misranked classes and adjust loss or architecture.
Next steps
- Complement Top-K with class-wise accuracy and confusion matrices.
- Consider calibration metrics (ECE) to understand confidence quality.
- If your task is multi-label, switch to mAP/F1 and per-label recall/precision.