
Reranking And Retrieval Evaluation

Learn Reranking And Retrieval Evaluation for free with explanations, exercises, and a quick test (for Computer Vision Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Who this is for

Computer Vision Engineers and ML practitioners who build image or multimodal search systems and need to improve ranking quality and measure it correctly.

Prerequisites

  • Comfort with Python and basic NumPy/PyTorch operations
  • Familiarity with embeddings (cosine/L2 distance) and kNN search
  • Basic understanding of evaluation (precision/recall)

Why this matters

In real products, fast approximate search finds candidates, but the final order decides user satisfaction. Reranking and solid evaluation are critical in tasks like:

  • Visual product search: rank visually closest items that also match color/material constraints
  • De-duplication and near-duplicate detection: ensure true duplicates are at the top
  • Person or vehicle re-identification: robustly reorder candidates across cameras
  • Image–text retrieval (e.g., CLIP): improve relevance beyond coarse embedding similarity

Concept explained simply

Think of retrieval as a two-step funnel:

  • Step 1 — Candidate generation: a fast, broad net (approximate nearest-neighbor search over embeddings) returns the top N.
  • Step 2 — Reranker: a slower, smarter sorter reorders those N using richer signals (cross-encoders, visual cues, metadata, reciprocal neighbors).

Mental model: "Find many, then choose well." If Step 1 misses relevant items, Step 2 cannot recover them. If Step 2 is weak, results feel off despite good candidates.

Key metrics to evaluate retrieval

  • Recall@K: average over queries of (relevant in top K / total relevant). Good for coverage.
  • Precision@K: average fraction of top K that are relevant. Good for purity at the top.
  • AP (Average Precision): area under the precision–recall curve for one query, commonly computed as the mean of the precision values at the ranks where each relevant item appears.
  • mAP: mean of AP over queries. The standard for retrieval quality.
  • MRR (Mean Reciprocal Rank): average of 1/rank of the first relevant hit; focuses on first-hit speed.
  • NDCG@K: discounts gains by rank position; useful when relevance is graded.
  • Hit@K: whether at least one relevant appears in top K (binary per query; often used in image–text tasks).
Mini self-check: choosing metrics
  • Need "how soon is the first good result?" Choose MRR or Hit@K.
  • Need "overall ranking quality across many relevant items?" Choose mAP or NDCG.
  • Stakeholder cares about top 5? Report Precision@5 and Recall@5.
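
A minimal NumPy sketch of these metrics, assuming ranked_ids maps each query to its ranked list of gallery ids and relevant maps each query to the set of truly relevant ids (both names are illustrative):

import numpy as np

def recall_at_k(ranked, rel, k):
    return len(set(ranked[:k]) & rel) / max(len(rel), 1)

def precision_at_k(ranked, rel, k):
    return len(set(ranked[:k]) & rel) / k

def average_precision(ranked, rel):
    # Mean of precision values at the ranks where each relevant item appears.
    hits, precisions = 0, []
    for rank, item in enumerate(ranked, start=1):
        if item in rel:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def reciprocal_rank(ranked, rel):
    for rank, item in enumerate(ranked, start=1):
        if item in rel:
            return 1.0 / rank
    return 0.0

def evaluate(ranked_ids, relevant, k=10):
    ap = [average_precision(r, relevant[q]) for q, r in ranked_ids.items()]
    rr = [reciprocal_rank(r, relevant[q]) for q, r in ranked_ids.items()]
    rec = [recall_at_k(r, relevant[q], k) for q, r in ranked_ids.items()]
    return {"mAP": float(np.mean(ap)),
            "MRR": float(np.mean(rr)),
            "Recall@%d" % k: float(np.mean(rec))}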

Worked examples

Example 1 — Product image search with two-stage reranking

  1. Candidate generation: Use normalized CLIP image embeddings with cosine similarity; retrieve top N=100 via ANN.
  2. Reranker: Combine embedding similarity with a color-histogram distance (subtracted, so closer colors score higher). Final score s = 1.0*cosine_sim - 0.2*chi2(color_hist).
  3. Evaluation: Compute Recall@1/5/10 and mAP@10 before vs after reranking.
What to expect

On small sets, reranking often improves Recall@5 by 5–15 percentage points and mAP@10 modestly. Variance is higher on tiny datasets.
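
A minimal sketch of this two-stage search, assuming emb holds L2-normalized gallery embeddings, hist holds L1-normalized color histograms, and the query is not part of the gallery (names, weights, and candidate count are illustrative):

import numpy as np

def chi2(h1, h2, eps=1e-10):
    # Chi-square distance between histograms; works row-wise for a batch.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps), axis=-1)

def two_stage_search(q_emb, q_hist, emb, hist, n_candidates=100, alpha=1.0, beta=0.2):
    # Stage 1: cosine similarity via inner product on L2-normalized vectors.
    sims = emb @ q_emb
    cand = np.argsort(-sims)[:n_candidates]
    # Stage 2: fuse embedding similarity with a color-histogram cue (distance is subtracted).
    scores = alpha * sims[cand] - beta * chi2(hist[cand], q_hist)
    return cand[np.argsort(-scores)]  # reranked gallery indices, best first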

Example 2 — Person re-identification with k-reciprocal re-ranking

  1. Start with distances between query and gallery from a re-id embedding.
  2. For each sample, build k-reciprocal neighbors (mutual top-k relationship).
  3. Compute Jaccard distance over these neighbor sets and blend with original distance: D' = (1-λ)*Jaccard + λ*D.
  4. Sort by D' and recompute mAP and Recall@K.
What to expect

This method exploits shared neighborhood structure. It frequently boosts mAP by 3–10 points on structured datasets.
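
A simplified sketch of the idea, working on a square pairwise distance matrix D over all query and gallery samples. It omits the local query expansion and rank-based soft weighting of the full published method, and the double loop is only practical for small sets:

import numpy as np

def k_reciprocal_rerank(D, k1=20, lam=0.3):
    n = D.shape[0]
    ranks = np.argsort(D, axis=1)
    # k-reciprocal sets: keep neighbor j only if i is also in j's top-k1 list.
    recip = []
    for i in range(n):
        topk = ranks[i, :k1 + 1]  # +1 because each sample is its own nearest neighbor
        recip.append({j for j in topk if i in ranks[j, :k1 + 1]})
    # Jaccard distance between neighbor sets, blended with the original distance.
    J = np.ones_like(D)
    for i in range(n):
        for j in range(n):
            union = len(recip[i] | recip[j])
            if union:
                J[i, j] = 1.0 - len(recip[i] & recip[j]) / union
    return (1 - lam) * J + lam * D  # D'; sort each query row ascending for the new ranking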

Example 3 — Landmark retrieval with query expansion

  1. Retrieve top T (e.g., 10) for a query.
  2. Average the query embedding with the top few confident positives (AQE).
  3. Search again with the expanded query and evaluate mAP/NDCG@10.
What to expect

Query expansion helps when multiple relevant items per query exist and domain is visually consistent (e.g., landmarks).
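
A minimal average query expansion (AQE) sketch, assuming L2-normalized embeddings; after renormalization, the sum of the query and its top-T neighbors points in the same direction as their average:

import numpy as np

def aqe_search(q_emb, emb, top_t=10):
    sims = emb @ q_emb
    top = np.argsort(-sims)[:top_t]
    expanded = q_emb + emb[top].sum(axis=0)   # query + its top-T retrieved neighbors
    expanded /= np.linalg.norm(expanded)      # renormalize before the second pass
    return np.argsort(-(emb @ expanded))      # new ranking with the expanded query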

How to build a reranking + evaluation pipeline

  1. Define relevance: exact match, class match, or graded relevance.
  2. Prepare data splits: queries vs gallery; exclude the query from the gallery when evaluating.
  3. Indexing: choose metric
    • Cosine on L2-normalized embeddings → use inner product search.
    • L2 distance on raw embeddings → use L2 search.
  4. Candidate generation: choose N (e.g., 100). Ensure Recall@N is high enough; if low, increase N or improve embeddings.
  5. Reranking choices:
    • Feature-based: add color/texture/spatial cues
    • k-reciprocal/Jaccard re-ranking (neighborhood structure)
    • Cross-encoder (e.g., image-text joint model) on candidates
    • Score fusion: weighted or learned calibration (e.g., z-score, logistic)
  6. Evaluate: compute Recall@K, Precision@K, mAP, and latency. Report before/after rerank and per-K breakdowns.
  7. Tune: adjust N, fusion weights, or λ in re-ranking; re-evaluate.
Distance and similarity tips
  • L2-normalize embeddings before cosine/inner product search.
  • Keep distance/similarity directions consistent when fusing (convert all to scores where higher is better).
  • Calibrate heterogeneous scores (z-score or min-max on validation) before fusion.
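
A small sketch of calibrated score fusion along these lines, assuming the means and standard deviations in stats were estimated on a validation split (all names and weights are illustrative):

import numpy as np

def zscore(x, mean, std, eps=1e-8):
    return (x - mean) / (std + eps)

def fuse(cos_sim, color_dist, stats, w_cos=1.0, w_color=0.5):
    color_sim = -color_dist  # flip the distance so that higher is better for both signals
    s1 = zscore(cos_sim, stats["cos_mean"], stats["cos_std"])
    s2 = zscore(color_sim, stats["color_mean"], stats["color_std"])
    return w_cos * s1 + w_color * s2  # fused score; sort candidates by it, descending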

Exercises you can do today

These mirror the exercises below. Use any small image dataset (e.g., 50–200 images across 5–10 categories).

Exercise 1 — Two-stage image retrieval with color-aware reranking

  • Compute image embeddings and build an ANN index.
  • Retrieve top 20 per query, then rerank by combining cosine similarity with a color histogram cue.
  • Report Recall@1/5/10 and mAP@10 before/after.
Tips
  • L2-normalize embeddings before cosine.
  • Exclude the query from its own candidates.
  • Start with α=1.0, β=0.2; sweep β.
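
If you use OpenCV, the color cue could look like the sketch below: a 32-bin joint HSV histogram (4x4x2 bins) and a chi-square distance. The binning is one reasonable choice among several:

import cv2
import numpy as np

def hsv_histogram(path, bins=(4, 4, 2)):
    img = cv2.imread(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    # Joint 3D histogram over H, S, V; 4*4*2 = 32 bins in total.
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-10)  # L1-normalize so images of different sizes compare

def chi2_distance(h1, h2, eps=1e-10):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))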

Exercise 2 — Implement k-reciprocal re-ranking

  • From a distance matrix, build k-reciprocal neighbor sets with k1=20, refine with k2=6.
  • Blend Jaccard distance with original using λ=0.3.
  • Evaluate improvements in mAP and Recall@K.
Tips
  • Use mutual neighbor check to build reciprocal sets.
  • Use soft weighting by rank for robustness.
Checklist before you measure
  • Queries and gallery are disjoint, and the query is not in the gallery
  • Metrics computed over all queries, not just a subset
  • Scores are comparable across candidates after fusion
  • Latency recorded for both stages
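
A small sketch of how these checks might look in code; retrieve_fn and rerank_fn are placeholders for your stage-1 and stage-2 implementations:

import time

def check_and_time(query_ids, gallery_ids, retrieve_fn, rerank_fn, query):
    assert not set(query_ids) & set(gallery_ids), "query and gallery must be disjoint"
    t0 = time.perf_counter()
    candidates = retrieve_fn(query)            # stage 1: candidate generation
    t1 = time.perf_counter()
    reranked = rerank_fn(query, candidates)    # stage 2: reranking
    t2 = time.perf_counter()
    latency = {"retrieve_ms": (t1 - t0) * 1e3, "rerank_ms": (t2 - t1) * 1e3}
    return reranked, latency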

Common mistakes and self-check

  • Too small candidate pool (N): reranker cannot surface missing relevant items. Self-check: measure Recall@N; aim for high coverage.
  • Mismatched scoring scales: mixing distances and similarities without calibration. Self-check: normalize and verify monotonicity.
  • Data leakage: query present in gallery. Self-check: assert IDs differ.
  • Only reporting one metric: hides trade-offs. Self-check: report mAP, Recall@K, and latency.
  • Ignoring class imbalance: some queries have many/few positives. Self-check: look at per-query AP distribution.

Practical projects

  • Build a fashion image search demo: ANN retrieval + color/texture rerank; target Recall@5 ≥ 0.7 on your set.
  • Person re-id small benchmark: implement k-reciprocal re-ranking and compare mAP with/without it.
  • Multimodal retrieval: image-to-text with a cross-encoder reranker; evaluate Hit@1 and NDCG@10.

Learning path

  1. Master embedding similarity and ANN indexing.
  2. Implement baseline metrics (Recall@K, mAP) correctly.
  3. Add simple reranking (feature cues or score fusion).
  4. Try k-reciprocal re-ranking for structure-aware gains.
  5. Introduce cross-encoders for precision at top ranks.
  6. Optimize latency: choose N, cache features, precompute neighbors.
  7. Harden evaluation: per-query analysis, statistical significance, error review.

Mini challenge

On your dataset, fix candidate N=100 and tune only the reranker. Can you increase mAP@10 by at least 5 points without increasing latency by more than 20%? Document your changes and results.

Next steps

  • Scale up to a larger dataset to stress-test N and latency.
  • Add a second reranking signal (e.g., text tags or object counts) and calibrate scores.
  • Perform an error analysis session: sample 20 failures and categorize causes; fix the top pattern.

Quick test

Take the quick test to reinforce your understanding.

Practice Exercises

2 exercises to complete

Instructions

  1. Collect ~60 images across 6 categories (10 per category). Split each category into 5 queries and 5 gallery images.
  2. Compute image embeddings with any pretrained encoder (e.g., CLIP ViT-B/32 or ResNet50 pooled). L2-normalize embeddings.
  3. Build an ANN index. For cosine similarity, use inner-product search on the L2-normalized embeddings; for L2 distance, use an L2 index.
  4. For each query, retrieve top 20 gallery items.
  5. Compute a 32-bin HSV color histogram for each image. Define chi-square distance between histograms.
  6. Create a fused score s = α*cosine_sim - β*chi2_hist, with α=1.0 and β in {0.1, 0.2, 0.3}. Higher s is better.
  7. Rerank the 20 candidates by s. Compute Recall@1/5/10 and mAP@10 before vs after reranking.
  8. Report the metrics table and the best β. Include latency for both stages.
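
Steps 2-4 could look roughly like the sketch below using faiss, assuming gallery_emb and query_emb are float32 NumPy arrays (one row per image) from your encoder. For a set this small, an exact inner-product index stands in for ANN search:

import faiss
import numpy as np

def build_index(gallery_emb):
    gallery_emb = np.ascontiguousarray(gallery_emb, dtype="float32")
    faiss.normalize_L2(gallery_emb)            # after L2-norm, inner product == cosine
    index = faiss.IndexFlatIP(gallery_emb.shape[1])
    index.add(gallery_emb)
    return index

def retrieve(index, query_emb, k=20):
    query_emb = np.ascontiguousarray(query_emb, dtype="float32")
    faiss.normalize_L2(query_emb)
    sims, ids = index.search(query_emb, k)     # top-k gallery rows per query
    return sims, ids
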
Expected Output
A small report showing before/after metrics. Typical outcome: Recall@5 improves by 5–15 points; mAP@10 improves modestly. Exact numbers vary by data.

Reranking And Retrieval Evaluation — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
