Who this is for
Computer Vision Engineers and ML practitioners who build image or multimodal search systems and need to improve ranking quality and measure it correctly.
Prerequisites
- Comfort with Python and basic NumPy/PyTorch operations
- Familiarity with embeddings (cosine/L2 distance) and kNN search
- Basic understanding of evaluation (precision/recall)
Why this matters
In real products, fast approximate search finds candidates, but the final order decides user satisfaction. Reranking and solid evaluation are critical in tasks like:
- Visual product search: rank visually closest items that also match color/material constraints
- De-duplication and near-duplicate detection: ensure true duplicates are at the top
- Person or vehicle re-identification: robustly reorder candidates across cameras
- Image–text retrieval (e.g., CLIP): improve relevance beyond coarse embedding similarity
Concept explained simply
Think of retrieval as a two-step funnel:
- Step 1 — Candidate generation: a fast, broad net (ANN kNN on embeddings) returns top N.
- Step 2 — Reranker: a slower, smarter sorter reorders those N using richer signals (cross-encoders, visual cues, metadata, reciprocal neighbors).
Mental model: "Find many, then choose well." If Step 1 misses relevant items, Step 2 cannot recover them. If Step 2 is weak, results feel off despite good candidates.
Key metrics to evaluate retrieval
- Recall@K: average over queries of (relevant in top K / total relevant). Good for coverage.
- Precision@K: average fraction of top K that are relevant. Good for purity at the top.
- AP (Average Precision): area under the precision–recall curve for a single query, commonly approximated by averaging precision at each rank where a relevant item appears.
- mAP: mean of AP over queries. The standard for retrieval quality.
- MRR (Mean Reciprocal Rank): average of 1/rank of the first relevant hit; focuses on first-hit speed.
- NDCG@K: discounts gains by rank position; useful when relevance is graded.
- Hit@K: whether at least one relevant appears in top K (binary per query; often used in image–text tasks).
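These metrics take only a few lines of NumPy. Below is a minimal sketch assuming binary relevance; the function names and the toy relevance lists are illustrative, not from any specific library.

```python
import numpy as np

def recall_at_k(ranked_rel, total_rel, k):
    """ranked_rel: 0/1 relevance of results in ranked order; total_rel: relevant items for this query."""
    return float(np.sum(ranked_rel[:k])) / total_rel if total_rel else 0.0

def precision_at_k(ranked_rel, k):
    return float(np.sum(ranked_rel[:k])) / k

def average_precision(ranked_rel, total_rel):
    """Mean of precision at each rank where a relevant item appears."""
    if total_rel == 0:
        return 0.0
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / total_rel

def reciprocal_rank(ranked_rel):
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

# Toy example: two queries with their ranked 0/1 relevance lists and total relevant counts.
queries = [([1, 0, 1, 0, 0], 2), ([0, 0, 1, 1, 1], 4)]
print("mAP :", np.mean([average_precision(r, n) for r, n in queries]))
print("MRR :", np.mean([reciprocal_rank(r) for r, _ in queries]))
print("R@5 :", np.mean([recall_at_k(np.array(r), n, 5) for r, n in queries]))
print("P@5 :", np.mean([precision_at_k(np.array(r), 5) for r, _ in queries]))
```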
Mini self-check: choosing metrics
- Need "how soon is the first good result?" Choose MRR or Hit@K.
- Need "overall ranking quality across many relevant items?" Choose mAP or NDCG.
- Stakeholder cares about top 5? Report Precision@5 and Recall@5.
Worked examples
Example 1 — Product image search with two-stage reranking
- Candidate generation: Use normalized CLIP image embeddings with cosine similarity; retrieve top N=100 via ANN.
- Reranker: Combine embedding similarity with a color histogram cue. Final score s = 1.0*cosine_sim - 0.2*chi2_dist, where chi2_dist is the chi-square distance between the query and candidate color histograms (a distance, so it is subtracted); see the sketch after this example.
- Evaluation: Compute Recall@1/5/10 and mAP@10 before vs after reranking.
What to expect
On small sets, reranking often improves Recall@5 by 5–15 percentage points and mAP@10 modestly. Variance is higher on tiny datasets.
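A sketch of the fusion step is shown below, assuming the ANN stage already returned candidate ids with cosine similarities and that per-image color histograms are precomputed; the weights 1.0 and 0.2 mirror the formula above.

```python
import numpy as np

def chi2_dist(h1, h2, eps=1e-10):
    # Chi-square distance between two L1-normalized color histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def rerank_with_color(cand_ids, cosine_sims, query_hist, hists, alpha=1.0, beta=0.2):
    """cand_ids: top-N ids from the ANN stage; cosine_sims: their similarities;
    hists: mapping from image id to its color histogram."""
    scores = [alpha * sim - beta * chi2_dist(query_hist, hists[i])  # distance term is subtracted
              for i, sim in zip(cand_ids, cosine_sims)]
    order = np.argsort(scores)[::-1]  # higher fused score = better
    return [cand_ids[i] for i in order]
```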
Example 2 — Person re-identification with k-reciprocal re-ranking
- Start with distances between query and gallery from a re-id embedding.
- For each sample, build k-reciprocal neighbors (mutual top-k relationship).
- Compute Jaccard distance over these neighbor sets and blend with original distance: D' = (1-λ)*Jaccard + λ*D.
- Sort by D' and recompute mAP and Recall@K.
What to expect
This method exploits shared neighborhood structure. It frequently boosts mAP by 3–10 points on structured datasets.
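A simplified sketch of the idea (mutual top-k sets plus Jaccard blending) follows, assuming a full pairwise distance matrix over queries and gallery with zero self-distances; the published method adds neighbor-set expansion and soft weighting, which are omitted here.

```python
import numpy as np

def k_reciprocal_rerank(D, k1=20, lam=0.3):
    """D: (n, n) pairwise distances over all samples. Returns D' = (1 - lam) * Jaccard + lam * D."""
    n = D.shape[0]
    ranks = np.argsort(D, axis=1)                      # neighbors by increasing distance
    topk = [set(ranks[i, :k1 + 1]) for i in range(n)]  # +1 because each row ranks itself first
    # Keep j only if i is also among j's top-k: the mutual (k-reciprocal) check.
    recip = [set(j for j in topk[i] if i in topk[j]) for i in range(n)]
    jaccard = np.ones_like(D, dtype=np.float32)
    for i in range(n):
        for j in range(n):
            union = len(recip[i] | recip[j])
            if union:
                jaccard[i, j] = 1.0 - len(recip[i] & recip[j]) / union
    return (1.0 - lam) * jaccard + lam * D
```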
Example 3 — Landmark retrieval with query expansion
- Retrieve top T (e.g., 10) for a query.
- Average the query embedding with the embeddings of the top few confident positives (average query expansion, AQE), then L2-normalize the result.
- Search again with the expanded query and evaluate mAP/NDCG@10.
What to expect
Query expansion helps when each query has multiple relevant items and the domain is visually consistent (e.g., landmarks).
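A minimal average query expansion sketch, assuming L2-normalized embeddings and brute-force cosine search standing in for the ANN index; T=10 and the array layout are illustrative.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def aqe_search(query_emb, gallery_embs, top_t=10):
    """query_emb: (d,), gallery_embs: (n, d); both L2-normalized."""
    sims = gallery_embs @ query_emb                 # cosine similarity via inner product
    top = np.argsort(-sims)[:top_t]                 # initial top-T candidates
    expanded = l2_normalize(np.mean(np.vstack([query_emb, gallery_embs[top]]), axis=0))
    return np.argsort(-(gallery_embs @ expanded))   # search again with the expanded query
```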
How to build a reranking + evaluation pipeline
- Define relevance: exact match, class match, or graded relevance.
- Prepare data splits: queries vs gallery; exclude the query from the gallery when evaluating.
- Indexing: choose metric
  - Cosine on L2-normalized embeddings → use inner product search.
  - L2 distance on raw embeddings → use L2 search.
- Candidate generation: choose N (e.g., 100). Ensure Recall@N is high enough; if low, increase N or improve embeddings.
- Reranking choices:
  - Feature-based: add color/texture/spatial cues
  - k-reciprocal/Jaccard re-ranking (neighborhood structure)
  - Cross-encoder (e.g., image-text joint model) on candidates
  - Score fusion: weighted or learned calibration (e.g., z-score, logistic)
- Evaluate: compute Recall@K, Precision@K, mAP, and latency. Report before/after rerank and per-K breakdowns.
- Tune: adjust N, fusion weights, or λ in re-ranking; re-evaluate.
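A compact sketch of candidate generation plus the Recall@N coverage check, using brute-force inner-product search in NumPy as a stand-in for a real ANN index; array shapes and names are illustrative.

```python
import numpy as np

def candidate_generation(query_embs, gallery_embs, n_candidates=100):
    """Both inputs must be L2-normalized so inner product equals cosine similarity.
    If queries also live in the gallery, mask self-matches before taking the top-N."""
    sims = query_embs @ gallery_embs.T                   # (num_queries, gallery_size)
    return np.argsort(-sims, axis=1)[:, :n_candidates]   # top-N candidate ids per query

def recall_at_n(candidates, relevant_sets):
    """Coverage check: fraction of each query's relevant items that survive candidate generation.
    If this is low, the reranker cannot fix it; increase N or improve the embeddings."""
    recalls = [len(rel & set(cands.tolist())) / len(rel)
               for cands, rel in zip(candidates, relevant_sets) if rel]
    return float(np.mean(recalls))
```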
Distance and similarity tips
- L2-normalize embeddings before cosine/inner product search.
- Keep distance/similarity directions consistent when fusing (convert all to scores where higher is better).
- Calibrate heterogeneous scores (z-score or min-max on validation) before fusion.
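A sketch of score calibration before fusion, assuming one similarity signal and one distance signal; the statistics in the stats dict would normally be estimated on a validation set, and the weights are illustrative.

```python
import numpy as np

def zscore(x, mean, std, eps=1e-8):
    return (np.asarray(x) - mean) / (std + eps)

def fuse_scores(cosine_sims, color_dists, stats, w_sim=1.0, w_color=0.5):
    """Convert both signals to comparable, higher-is-better z-scores, then combine."""
    s_sim = zscore(cosine_sims, stats["sim_mean"], stats["sim_std"])
    s_col = -zscore(color_dists, stats["dist_mean"], stats["dist_std"])  # negate: distance -> similarity
    return w_sim * s_sim + w_color * s_col
```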
Exercises you can do today
These mirror the exercises below. Use any small image dataset (e.g., 50–200 images across 5–10 categories).
Exercise 1 — Two-stage image retrieval with color-aware reranking
- Compute image embeddings and build an ANN index.
- Retrieve top 20 per query, then rerank by combining cosine similarity with a color histogram cue.
- Report Recall@1/5/10 and mAP@10 before/after.
Tips
- L2-normalize embeddings before cosine.
- Exclude the query from its own candidates.
- Start with α=1.0, β=0.2; sweep β.
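If you need a starting point for the color cue, here is a minimal histogram sketch assuming RGB images as uint8 NumPy arrays; 8 bins per channel is an illustrative choice, and the β sweep simply reruns reranking and mAP@10 for each β value.

```python
import numpy as np

def color_histogram(image, bins=8):
    """image: (H, W, 3) uint8 RGB array -> flattened, L1-normalized 3D color histogram."""
    hist, _ = np.histogramdd(
        image.reshape(-1, 3).astype(np.float32),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-10)

# Quick check on a random image standing in for a real dataset image.
img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
print(color_histogram(img).shape)  # (512,) for 8 bins per channel
```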
Exercise 2 — Implement k-reciprocal re-ranking
- From a distance matrix, build k-reciprocal neighbor sets with k1=20, refine with k2=6.
- Blend Jaccard distance with original using λ=0.3.
- Evaluate improvements in mAP and Recall@K.
Tips
- Use mutual neighbor check to build reciprocal sets.
- Use soft weighting by rank for robustness.
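One way to read the soft-weighting tip: replace hard set membership with weights that decay with the original distance, so closer reciprocal neighbors count more in the Jaccard step. A sketch under that assumption (variable names are illustrative):

```python
import numpy as np

def soft_neighbor_vector(i, recip_set, D):
    """Encode sample i's reciprocal neighbors as a distance-weighted vector instead of a hard set."""
    v = np.zeros(D.shape[0], dtype=np.float32)
    idx = np.fromiter(recip_set, dtype=int)
    if idx.size:
        v[idx] = np.exp(-D[i, idx])      # closer neighbors get larger weights
    return v / (v.sum() + 1e-12)

# The Jaccard step then uses elementwise min/max of these vectors instead of set intersection/union.
```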
Checklist before you measure
- Queries and gallery are disjoint, and the query is not in the gallery
- Metrics computed over all queries, not just a subset
- Scores are comparable across candidates after fusion
- Latency recorded for both stages
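A few lightweight assertions covering this checklist, assuming plain id lists for queries and gallery; names and fields are illustrative.

```python
def check_eval_setup(query_ids, gallery_ids, per_query_latency_ms=None):
    q, g = set(query_ids), set(gallery_ids)
    assert not (q & g), f"data leakage: {len(q & g)} query ids also appear in the gallery"
    assert len(q) == len(query_ids), "duplicate query ids: metrics would not cover all queries"
    if per_query_latency_ms is not None:
        assert len(per_query_latency_ms) == len(query_ids), "latency not recorded for every query"
```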
Common mistakes and self-check
- Too small candidate pool (N): reranker cannot surface missing relevant items. Self-check: measure Recall@N; aim for high coverage.
- Mismatched scoring scales: mixing distances and similarities without calibration. Self-check: normalize and verify monotonicity.
- Data leakage: query present in gallery. Self-check: assert IDs differ.
- Only reporting one metric: hides trade-offs. Self-check: report mAP, Recall@K, and latency.
- Ignoring class imbalance: some queries have many/few positives. Self-check: look at per-query AP distribution.
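For the last self-check, a quick per-query diagnostic sketch, assuming you already have a list of per-query AP values; the threshold is illustrative.

```python
import numpy as np

def summarize_per_query_ap(per_query_ap, fail_threshold=0.1):
    """Look at the distribution, not just the mean: a decent mAP can hide a cluster of zero-AP queries."""
    ap = np.asarray(per_query_ap, dtype=np.float32)
    print(f"mAP: {ap.mean():.3f}  median AP: {np.median(ap):.3f}")
    print(f"queries with AP < {fail_threshold}: {int((ap < fail_threshold).sum())} / {len(ap)}")
    return np.argsort(ap)[:20]  # indices of the worst queries, for manual error review
```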
Practical projects
- Build a fashion image search demo: ANN retrieval + color/texture rerank; target Recall@5 ≥ 0.7 on your set.
- Person re-id small benchmark: implement k-reciprocal re-ranking and compare mAP with/without it.
- Multimodal retrieval: image-to-text with a cross-encoder reranker; evaluate Hit@1 and NDCG@10.
Learning path
- Master embedding similarity and ANN indexing.
- Implement baseline metrics (Recall@K, mAP) correctly.
- Add simple reranking (feature cues or score fusion).
- Try k-reciprocal re-ranking for structure-aware gains.
- Introduce cross-encoders for precision at top ranks.
- Optimize latency: choose N, cache features, precompute neighbors.
- Harden evaluation: per-query analysis, statistical significance, error review.
Mini challenge
On your dataset, fix candidate N=100 and tune only the reranker. Can you increase mAP@10 by at least 5 points without increasing latency by more than 20%? Document your changes and results.
Next steps
- Scale up to a larger dataset to stress-test N and latency.
- Add a second reranking signal (e.g., text tags or object counts) and calibrate scores.
- Perform an error analysis session: sample 20 failures and categorize causes; fix the top pattern.
Quick test
Take the quick test to reinforce your understanding.