Why this matters
Similarity search for images powers real features: visually similar product recommendations, near-duplicate detection, content moderation, and photo dedup in galleries. As a Computer Vision Engineer, you will extract embeddings from images and quickly find the closest matches in a large collection.
- Recommend similar items in e-commerce when a user views a product.
- Find and remove near-duplicates to clean training datasets.
- Retrieve similar images to speed up labeling and triage.
- Cluster images by visual style or subject.
Concept explained simply
We convert each image into a vector (embedding). Similar images have vectors that point in a similar direction. We then compare vectors to find the closest ones.
Mental model
Imagine every image is a point on a high-dimensional unit sphere. Similarity corresponds to the angle between two points: the smaller the angle, the more similar the images. Cosine similarity captures exactly that angle.
Key terms
- Embedding: a numeric vector representing an image.
- Cosine similarity: dot product of L2-normalized vectors (range -1 to 1; higher is more similar).
- Distance: the inverse notion of similarity (e.g., cosine distance = 1 - cosine similarity); see the snippet after this list.
- ANN index: Approximate Nearest Neighbor data structure that speeds up search with small accuracy trade-offs.
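These definitions in a few lines of NumPy, using toy vectors:

```python
import numpy as np

u = np.array([0.6, 0.8, 0.0])
v = np.array([1.0, 0.0, 0.0])

u = u / np.linalg.norm(u)  # L2-normalize: now a dot product is cosine similarity
v = v / np.linalg.norm(v)

cos_sim = float(u @ v)
print(f"cosine similarity: {cos_sim:.2f}")   # 0.60
print(f"cosine distance:   {1 - cos_sim:.2f}")  # 0.40
```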
Core workflow
- Embed every gallery image into a fixed-length vector.
- L2-normalize all embeddings.
- Index them (brute force for small collections, an ANN index for large ones).
- At query time: embed the query, normalize it, and retrieve the nearest candidates.
- Re-rank candidates with exact cosine similarity and apply any metadata filters.
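A minimal exact-search sketch of this workflow; random vectors stand in for real encoder outputs:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale rows to unit length so a dot product equals cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
gallery = l2_normalize(rng.normal(size=(10_000, 512)).astype(np.float32))
query = l2_normalize(rng.normal(size=512).astype(np.float32))

sims = gallery @ query          # all 10,000 cosine similarities in one product
top10 = np.argsort(-sims)[:10]  # indices of the 10 nearest gallery images

for idx in top10:
    print(f"image {idx}: cosine similarity {sims[idx]:.3f}")
```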
Cosine vs Euclidean
- If vectors are L2-normalized, cosine similarity and inner product produce identical rankings.
- Euclidean distance on unit vectors is monotonic with cosine: ||u - v||^2 = 2 - 2 (u · v), so all three metrics return the same top-k (see the check below).
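A quick NumPy check of this equivalence on random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
gallery = rng.normal(size=(100, 64))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
q = rng.normal(size=64)
q /= np.linalg.norm(q)

cosine = gallery @ q                          # == inner product on unit vectors
euclid = np.linalg.norm(gallery - q, axis=1)  # == sqrt(2 - 2 * cosine)

# Largest similarity and smallest distance produce the same ordering.
assert np.array_equal(np.argsort(-cosine), np.argsort(euclid))
print("top-5 either way:", np.argsort(-cosine)[:5])
```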
Worked examples
Example 1: Near-duplicate detection
Goal: flag images that are essentially the same (e.g., re-uploads or resized copies).
- Embed and L2-normalize all images.
- For each image, find its top-1 neighbor (excluding itself) by cosine similarity.
- If the similarity is ≥ 0.97, mark the pair as near-duplicates.
Why it works: duplicates have almost identical vectors, producing very high cosine similarity. Start with 0.97 and adjust after reviewing a sample of flagged pairs.
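A brute-force sketch of this procedure; it builds the full O(n^2) similarity matrix, so it suits small collections (swap in an ANN index for large ones):

```python
import numpy as np

def find_near_duplicates(embeddings, threshold=0.97):
    """Flag (i, j, sim) pairs whose top-1 cosine similarity clears the threshold."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -1.0)       # exclude each image's match with itself
    flagged = []
    for i in range(len(x)):
        j = int(np.argmax(sims[i]))    # top-1 neighbor of image i
        if sims[i, j] >= threshold:
            flagged.append((i, j, float(sims[i, j])))
    return flagged

# Toy demo: the second vector is a slightly perturbed copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 128))
base[1] = base[0] + rng.normal(scale=0.01, size=128)
print(find_near_duplicates(base))  # flags the (0, 1) pair with similarity ~1.0
```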
Example 2: Visual search for recommendations
Goal: when a user views a product image, show 10 visually similar products.
- Precompute gallery embeddings and build an ANN index.
- Query: embed image → normalize → ANN top-100 → re-rank with exact cosine → return top-10.
- Add metadata filters (e.g., same category, in-stock) after retrieval.
Tip: maintain recall ≥ 0.95 at top-100 so re-ranking has quality candidates.
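A sketch of this pipeline with the FAISS library (installable as faiss-cpu); nlist, nprobe, and the candidate count are illustrative settings, not recommendations:

```python
import numpy as np
import faiss  # assumes the faiss-cpu package is installed

d = 512
rng = np.random.default_rng(0)

# Stand-in gallery; in practice these are L2-normalized image embeddings.
gallery = rng.normal(size=(100_000, d)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# IVF index over inner product: on unit vectors, inner product == cosine.
nlist = 1024                              # number of clusters; illustrative
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(gallery)
index.add(gallery)
index.nprobe = 16                         # clusters probed per query; more -> higher recall, slower

def visual_search(query_vec, k=10, n_candidates=100):
    q = query_vec.astype(np.float32).reshape(1, -1)
    q /= np.linalg.norm(q)
    _, ids = index.search(q, n_candidates)  # ANN top-100 candidates
    cand = ids[0][ids[0] >= 0]              # drop -1 padding if fewer hits
    # Exact cosine re-rank; this matters most for compressed indexes (e.g., IVF-PQ).
    exact = gallery[cand] @ q[0]
    order = np.argsort(-exact)[:k]
    return cand[order], exact[order]

ids, scores = visual_search(rng.normal(size=d))
print(list(zip(ids.tolist(), np.round(scores, 3).tolist())))
```

Metadata filters (category, stock status) would then be applied to the returned IDs, as described above.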
Example 3: Class-specific retrieval (faces, logos)
Goal: given a face image, retrieve the same person from a gallery.
- Use a domain-specific embedding model (face or logo encoder).
- Normalize embeddings and search with cosine similarity.
- Select threshold per class/task using a validation set (e.g., TPR at fixed FPR).
Threshold tuning mini-protocol
- Collect labeled pairs (same vs different).
- Compute similarities and plot distributions.
- Pick threshold that meets precision or FPR constraints, then verify recall.
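This protocol is easy to script. A sketch on synthetic scores (the two distributions are invented for illustration):

```python
import numpy as np

def pick_threshold(scores, labels, min_precision=0.90):
    """Highest-recall threshold whose precision meets the constraint.

    scores: similarity of each labeled pair; labels: 1 = same, 0 = different.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best = None
    for t in np.unique(scores):          # every observed score is a candidate
        pred = scores >= t
        tp = int(np.sum(pred & (labels == 1)))
        fp = int(np.sum(pred & (labels == 0)))
        precision = tp / max(tp + fp, 1)
        recall = tp / max(int(np.sum(labels == 1)), 1)
        if precision >= min_precision and (best is None or recall > best[1]):
            best = (float(t), recall, precision)
    return best  # (threshold, recall, precision), or None if unattainable

rng = np.random.default_rng(0)
same = rng.normal(0.9, 0.05, 200).clip(-1, 1)   # same-identity pairs
diff = rng.normal(0.4, 0.15, 800).clip(-1, 1)   # different-identity pairs
scores = np.concatenate([same, diff])
labels = np.concatenate([np.ones(200, int), np.zeros(800, int)])
print(pick_threshold(scores, labels, min_precision=0.90))
```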
Key formulas and choices
- Cosine similarity: sim(u, v) = (u · v) / (||u|| ||v||). With L2-normalization, sim(u, v) = u · v.
- Cosine distance: 1 - cosine similarity.
- When to normalize: almost always before indexing; improves stability and comparability.
- Index choice: small data (≤ 100k) → brute force may suffice; larger → ANN (e.g., IVF-PQ, HNSW). Trade off memory, speed, and recall.
- Quantization: float16 or product quantization reduces memory; validate impact on recall.
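Back-of-envelope memory for raw vectors is n * d * bytes per value; the product-quantization line below assumes 64 bytes per compressed vector, one common setting:

```python
n, d = 1_000_000, 512

print(f"float32:      {n * d * 4 / 1e9:.2f} GB")  # 2.05 GB
print(f"float16:      {n * d * 2 / 1e9:.2f} GB")  # 1.02 GB
print(f"PQ, 64 B/vec: {n * 64 / 1e9:.2f} GB")     # 0.06 GB (codebooks add a little more)
```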
Evaluating your system
- Precision@k: fraction of relevant images in the top-k results.
- Recall@k: fraction of all relevant images that appear in the top-k.
- Latency: end-to-end time per query (embedding + search + re-ranking).
- Throughput: queries per second under target hardware.
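Minimal implementations of the two retrieval metrics, with a toy query:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = list(retrieved_ids)[:k]
    relevant = set(relevant_ids)
    return sum(1 for r in top_k if r in relevant) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k."""
    top_k = set(list(retrieved_ids)[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / max(len(relevant), 1)

# Toy query: 3 relevant images, 2 of them retrieved in the top-5.
retrieved = [7, 3, 9, 1, 4]
relevant = [3, 4, 8]
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 0.666...
```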
Practical evaluation steps
- Hold out a labeled set of queries and relevant galleries.
- Measure precision@k and recall@k for multiple k (e.g., 1, 5, 10, 50, 100).
- Test multiple thresholds and choose one that meets business constraints (e.g., precision ≥ 0.9).
- Benchmark both exact and ANN search to quantify accuracy/speed trade-offs.
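To quantify the exact-vs-ANN gap, run brute-force search once on a sample of queries to get ground-truth top-k lists, then measure how much of each list the ANN index recovers. A hypothetical helper:

```python
import numpy as np

def ann_recall_at_k(exact_ids, ann_ids, k):
    """Mean fraction of each query's exact top-k that the ANN top-k also returns."""
    overlaps = [len(set(e[:k]) & set(a[:k])) / k for e, a in zip(exact_ids, ann_ids)]
    return float(np.mean(overlaps))

# Toy check: ANN recovered 2 of 3 exact neighbors for query 1, 3 of 3 for query 2.
exact = [[5, 9, 2], [1, 4, 7]]
ann = [[5, 2, 8], [4, 1, 7]]
print(ann_recall_at_k(exact, ann, k=3))  # 0.833...
```

Sweep the index's accuracy knob (e.g., nprobe for IVF) until this recall meets your target, then record the latency at that setting.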
Common mistakes and self-checks
- Forgetting L2-normalization: self-check by inspecting norms; they should be ~1.0 (see the snippet after this list).
- Using cosine thresholds on non-normalized vectors: normalize first.
- Comparing scores across different models or training runs: not comparable; keep the model fixed.
- Skipping re-ranking after ANN: can degrade top-10 precision; re-rank the candidate pool with exact similarity.
- Too-low thresholds for duplicates: leads to many false positives; review score histograms.
- No metadata filtering: irrelevant but visually similar items sneak in; apply filters post-retrieval.
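The normalization self-check from the first bullet, as a one-line assertion:

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1_000, 512))  # stand-in for your real embeddings
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# After L2-normalization, every row norm should be ~1.0.
norms = np.linalg.norm(embeddings, axis=1)
assert np.allclose(norms, 1.0, atol=1e-5), f"norms in [{norms.min():.4f}, {norms.max():.4f}]"
```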
Exercises you can try
These mirror the graded exercises below. Do them here, then submit in the exercise section to check your answers.
- Top-2 cosine neighbors (by hand): Given a query q = [0.6, 0.8, 0] and gallery vectors g1=[1,0,0], g2=[0,1,0], g3=[0.5,0.5,0], g4=[0,0,1], g5=[0.2,0.1,0]: L2-normalize all and compute cosine similarities. List Top-2 IDs and scores.
- Threshold selection: Similarity scores [0.95, 0.92, 0.88, 0.83, 0.78, 0.65, 0.55, 0.40] with labels [1,1,1,0,1,0,0,0] (1=relevant). Find the highest recall threshold such that precision β₯ 0.90.
- Index design: You have 1,000,000 images with 512-d float32 embeddings. Propose an ANN index type and configuration to target sub-100 ms queries with ≥ 0.95 recall@100 on a single machine. Estimate memory and justify trade-offs.
- Checklist before running your system:
  - Embeddings are L2-normalized.
  - Index chosen based on data size and latency goals.
  - ANN candidates re-ranked exactly.
  - Threshold validated on labeled pairs.
  - Latency profiled end-to-end.
Practical projects
- Build a mini visual search: index 10k images, implement top-10 retrieval with cosine similarity, and add a category filter.
- Duplicate cleaner: find near-duplicates in a mixed photo collection and auto-suggest deletions with a human review step.
- Style-based retrieval: retrieve outfits or artworks with similar color palettes and textures; evaluate precision@10 by manual labeling.
Who this is for
- Computer Vision Engineers implementing retrieval, deduplication, or recommendation features.
- ML Engineers adding embedding-based search to products.
- Data Scientists evaluating embedding quality and thresholds.
Prerequisites
- Basic linear algebra (vectors, norms, dot product).
- Familiarity with CNN-based embeddings.
- Python/NumPy experience helps for prototyping.
Learning path
- Refresh vector similarity (cosine, Euclidean) and L2-normalization.
- Generate and validate embeddings for your domain.
- Implement exact search; verify quality on a small set.
- Scale with ANN; tune recall vs latency.
- Set thresholds; evaluate precision/recall on labeled pairs.
- Integrate metadata filtering and re-ranking.
Next steps
- Experiment with different embedding dimensions and pooling strategies.
- Try quantization (float16, PQ) and measure impact on recall and latency.
- Add online monitoring for drift in score distributions.
Mini challenge
Take a set of 5,000 images from two categories (e.g., shoes and bags). Build a visual search that returns top-10 similar items with a same-category filter, and choose a threshold for near-duplicates. Report precision@10, recall@10, and average latency.