
Similarity Search For Images

Learn Similarity Search For Images for free with explanations, exercises, and a quick test (for Computer Vision Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

Similarity search for images powers real features: visually similar product recommendations, near-duplicate detection, content moderation, and photo dedup in galleries. As a Computer Vision Engineer, you will extract embeddings from images and quickly find the closest matches in a large collection.

  • Recommend similar items in e-commerce when a user views a product.
  • Find and remove near-duplicates to clean training datasets.
  • Retrieve similar images to speed up labeling and triage.
  • Cluster images by visual style or subject.

Concept explained simply

We convert each image into a vector (embedding). Similar images have vectors that point in a similar direction. We then compare vectors to find the closest ones.

Mental model

Imagine every image as a point on a high-dimensional unit sphere. The smaller the angle between two points, the more similar the images. Cosine similarity captures that angle.

Key terms
  • Embedding: a numeric vector representing an image.
  • Cosine similarity: dot product of L2-normalized vectors (range -1 to 1; higher is more similar).
  • Distance: a measure that grows as similarity shrinks (e.g., cosine distance = 1 - cosine similarity).
  • ANN index: Approximate Nearest Neighbor data structure that speeds up search with small accuracy trade-offs.
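
To make these terms concrete, here is a minimal NumPy sketch (toy vectors, not real image embeddings) that L2-normalizes two embeddings and computes cosine similarity and cosine distance:

    import numpy as np

    def l2_normalize(v: np.ndarray) -> np.ndarray:
        # Divide by the L2 norm so the vector lies on the unit sphere.
        return v / np.linalg.norm(v)

    u = l2_normalize(np.array([0.6, 0.8, 0.0]))
    v = l2_normalize(np.array([0.5, 0.5, 0.0]))

    cosine_similarity = float(np.dot(u, v))  # dot product of unit vectors
    cosine_distance = 1.0 - cosine_similarity

    print(round(cosine_similarity, 3))  # 0.99
    print(round(cosine_distance, 3))    # 0.01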

Core workflow

Step 1: Choose an embedding model (e.g., a CNN encoder). Keep the embedding size manageable (e.g., 128–1024).
Step 2: Preprocess images consistently (resize, crop, normalize channels).
Step 3: Compute embeddings and L2-normalize them.
Step 4: Build an index: start with brute force for small sets; move to ANN (e.g., IVF, HNSW, PQ) for large sets.
Step 5: Query: embed the query image, normalize, search top-k nearest, then optionally re-rank exactly.
Step 6: Threshold and evaluate using precision@k, recall, and latency targets.
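
The sketch below implements steps 3 through 5 with brute force in NumPy; the embeddings are random stand-ins for the output of a real encoder:

    import numpy as np

    # Random stand-ins for encoder output: in practice, compute these with a CNN/ViT model.
    rng = np.random.default_rng(0)
    gallery = rng.normal(size=(1000, 128)).astype(np.float32)  # 1,000 images, 128-d embeddings
    query = rng.normal(size=(128,)).astype(np.float32)

    # Step 3: L2-normalize so inner product equals cosine similarity.
    gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
    query /= np.linalg.norm(query)

    # Steps 4-5: a brute-force "index" is just the matrix; search is one product plus top-k.
    scores = gallery @ query         # cosine similarity to every gallery image
    k = 10
    top_k = np.argsort(-scores)[:k]  # indices of the k most similar images
    print(top_k, scores[top_k])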

Cosine vs Euclidean
  • If vectors are L2-normalized, cosine similarity equals the inner product, so both produce identical rankings.
  • Euclidean distance on unit vectors is monotonic with cosine, since ||u - v||^2 = 2 - 2(u · v); all three give the same top-k.
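
A quick numerical check of this equivalence on random unit vectors:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 64))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors
    q = X[0]

    cos = X @ q                          # cosine similarity (= inner product here)
    euc = np.linalg.norm(X - q, axis=1)  # Euclidean distance

    assert np.allclose(euc**2, 2 - 2 * cos)           # the identity above
    print(np.argsort(-cos)[:5], np.argsort(euc)[:5])  # same top-5 either way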

Worked examples

Example 1: Near-duplicate detection

Goal: flag images that are essentially the same (e.g., re-uploads or resized copies).

  • Embed and L2-normalize all images.
  • For each image, find top-1 neighbor (excluding itself) with cosine similarity.
  • If similarity ≥ 0.97, mark as near-duplicate.

Why it works: duplicates have almost identical vectors, producing very high cosine similarity. Start at 0.97 and adjust after reviewing a sample of flagged pairs.
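
A minimal sketch of this procedure, assuming an embeddings array whose rows are already L2-normalized (the 0.97 threshold is the illustrative starting point above):

    import numpy as np

    def find_near_duplicates(embeddings: np.ndarray, threshold: float = 0.97):
        # Return (i, j, similarity) for images whose top-1 neighbor clears the threshold.
        # Brute force, O(n^2) memory: fine for small sets; use an ANN index at scale.
        sims = embeddings @ embeddings.T  # all pairwise cosine similarities
        np.fill_diagonal(sims, -1.0)      # exclude self-matches
        pairs = []
        for i in range(len(embeddings)):
            j = int(np.argmax(sims[i]))   # top-1 neighbor of image i
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
        return pairs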

Example 2: Visual search for recommendations

Goal: when a user views a product image, show 10 visually similar products.

  • Precompute gallery embeddings and build an ANN index.
  • Query: embed image → normalize → ANN top-100 → re-rank with exact cosine → return top-10.
  • Add metadata filters (e.g., same category, in-stock) after retrieval.

Tip: maintain recall ≥ 0.95 at top-100 so re-ranking has quality candidates.
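
One way to implement this pipeline is sketched below with the faiss library (assuming faiss is installed; the file name and parameters such as nlist and nprobe are illustrative and need tuning on your data):

    import faiss
    import numpy as np

    d = 512  # embedding dimension (example value)
    gallery = np.load("gallery_embeddings.npy").astype(np.float32)  # hypothetical precomputed file
    faiss.normalize_L2(gallery)  # in-place L2 normalization

    # IVF index over inner product; on normalized vectors this is cosine similarity.
    nlist = 1024  # number of coarse clusters (tune per dataset)
    quantizer = faiss.IndexFlatIP(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(gallery)
    index.add(gallery)
    index.nprobe = 16  # clusters visited per query (recall vs speed knob)

    query = gallery[:1]  # stand-in for one embedded, normalized query image
    scores, ids = index.search(query, 100)  # ANN top-100 candidates

    # Exact re-rank of the candidate pool; this matters most with compressed
    # indexes such as IVF-PQ, where ANN scores are approximate.
    exact = gallery[ids[0]] @ query[0]
    top10 = ids[0][np.argsort(-exact)][:10]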

Example 3: Class-specific retrieval (faces, logos)

Goal: given a face image, retrieve the same person from a gallery.

  • Use a domain-specific embedding model (face or logo encoder).
  • Normalize embeddings and search with cosine similarity.
  • Select threshold per class/task using a validation set (e.g., TPR at fixed FPR).

Threshold tuning mini-protocol
  1. Collect labeled pairs (same vs different).
  2. Compute similarities and plot distributions.
  3. Pick threshold that meets precision or FPR constraints, then verify recall.
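
A small sketch of steps 2 and 3, assuming arrays of pairwise similarity scores and binary same/different labels:

    import numpy as np

    def pick_threshold(scores: np.ndarray, labels: np.ndarray, min_precision: float = 0.9):
        # Return the lowest threshold whose precision meets min_precision.
        # Recall only grows as the threshold drops, so this maximizes recall
        # under the precision constraint.
        best = None
        for t in np.unique(scores)[::-1]:  # sweep candidate thresholds high to low
            pred = scores >= t
            tp = int(np.sum(pred & (labels == 1)))
            fp = int(np.sum(pred & (labels == 0)))
            fn = int(np.sum(~pred & (labels == 1)))
            precision = tp / max(tp + fp, 1)
            recall = tp / max(tp + fn, 1)
            if precision >= min_precision:
                best = (float(t), precision, recall)
        return best

    # Toy usage:
    print(pick_threshold(np.array([0.95, 0.9, 0.7, 0.6]), np.array([1, 1, 0, 1])))
    # (0.9, 1.0, ~0.667)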

Key formulas and choices

  • Cosine similarity: sim(u, v) = (u · v) / (||u|| ||v||). With L2-normalization, sim(u, v) = u · v.
  • Cosine distance: 1 - cosine_similarity.
  • When to normalize: almost always before indexing; improves stability and comparability.
  • Index choice: small data (≤ 100k) → brute force may suffice; larger → ANN (e.g., IVF-PQ, HNSW). Trade memory, speed, and recall.
  • Quantization: float16 or product quantization reduces memory; validate impact on recall.
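
Back-of-the-envelope memory arithmetic for these choices (plain Python; the PQ configuration is one common example):

    n, d = 1_000_000, 512

    float32_bytes = n * d * 4  # raw float32 vectors: ~2.05 GB
    float16_bytes = n * d * 2  # float16: ~1.02 GB
    pq_bytes = n * 64          # PQ with 64 one-byte codes per vector: ~64 MB

    print(float32_bytes / 1e9, float16_bytes / 1e9, pq_bytes / 1e6)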

Evaluating your system

  • Precision@k: fraction of relevant images in the top-k results.
  • Recall@k: fraction of all relevant images that appear in the top-k.
  • Latency: end-to-end time per query (embedding + search + re-ranking).
  • Throughput: queries per second under target hardware.

Practical evaluation steps
  • Hold out a labeled set of queries and relevant galleries.
  • Measure precision@k and recall@k for multiple k (e.g., 1, 5, 10, 50, 100).
  • Test multiple thresholds and choose one that meets business constraints (e.g., precision ≥ 0.9).
  • Benchmark both exact and ANN search to quantify accuracy/speed trade-offs.
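
A minimal sketch of these two metrics, given retrieved IDs and the set of ground-truth relevant IDs for one query:

    def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
        # Fraction of the top-k retrieved items that are relevant.
        return sum(1 for r in retrieved[:k] if r in relevant) / k

    def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
        # Fraction of all relevant items that appear in the top-k.
        return sum(1 for r in retrieved[:k] if r in relevant) / max(len(relevant), 1)

    # Example: 3 of the top-5 results are relevant, out of 4 relevant items total.
    print(precision_at_k([3, 7, 1, 9, 4], {3, 1, 4, 8}, k=5))  # 0.6
    print(recall_at_k([3, 7, 1, 9, 4], {3, 1, 4, 8}, k=5))     # 0.75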

Common mistakes and self-checks

  • Forgetting L2-normalization: self-check by inspecting norms; they should be ~1.0.
  • Using cosine thresholds on non-normalized vectors: normalize first.
  • Comparing scores across different models or training runs: scores are not comparable; keep the model fixed.
  • Skipping re-ranking after ANN: can degrade top-10 precision; re-rank the candidate pool with exact similarity.
  • Too-low thresholds for duplicates: leads to many false positives; review score histograms.
  • No metadata filtering: irrelevant but visually similar items sneak in; apply filters post-retrieval.

Exercises you can try

These mirror the graded exercises below. Do them here, then submit in the exercise section to check your answers.

  1. Top-2 cosine neighbors (by hand): Given a query q = [0.6, 0.8, 0] and gallery vectors g1=[1,0,0], g2=[0,1,0], g3=[0.5,0.5,0], g4=[0,0,1], g5=[0.2,0.1,0]: L2-normalize all and compute cosine similarities. List Top-2 IDs and scores.
  2. Threshold selection: Similarity scores [0.95, 0.92, 0.88, 0.83, 0.78, 0.65, 0.55, 0.40] with labels [1,1,1,0,1,0,0,0] (1=relevant). Find the highest recall threshold such that precision ≥ 0.90.
  3. Index design: You have 1,000,000 images with 512-d float32 embeddings. Propose an ANN index type and configuration to target sub-100 ms queries with ≥ 0.95 recall@100 on a single machine. Estimate memory and justify trade-offs.
  • Checklist before running your system:
    • Embeddings are L2-normalized.
    • Index chosen based on data size and latency goals.
    • ANN candidates re-ranked exactly.
    • Threshold validated on labeled pairs.
    • Latency profiled end-to-end.

Practical projects

  • Build a mini visual search: index 10k images, implement top-10 retrieval with cosine similarity, and add a category filter.
  • Duplicate cleaner: find near-duplicates in a mixed photo collection and auto-suggest deletions with a human review step.
  • Style-based retrieval: retrieve outfits or artworks with similar color palettes and textures; evaluate precision@10 by manual labeling.

Who this is for

  • Computer Vision Engineers implementing retrieval, deduplication, or recommendation features.
  • ML Engineers adding embedding-based search to products.
  • Data Scientists evaluating embedding quality and thresholds.

Prerequisites

  • Basic linear algebra (vectors, norms, dot product).
  • Familiarity with CNN-based embeddings.
  • Python/NumPy experience helps for prototyping.

Learning path

  1. Refresh vector similarity (cosine, Euclidean) and L2-normalization.
  2. Generate and validate embeddings for your domain.
  3. Implement exact search; verify quality on a small set.
  4. Scale with ANN; tune recall vs latency.
  5. Set thresholds; evaluate precision/recall on labeled pairs.
  6. Integrate metadata filtering and re-ranking.

Next steps

  • Experiment with different embedding dimensions and pooling strategies.
  • Try quantization (float16, PQ) and measure impact on recall and latency.
  • Add online monitoring for drift in score distributions.

Mini challenge

Take a set of 5,000 images from two categories (e.g., shoes and bags). Build a visual search that returns top-10 similar items with a same-category filter, and choose a threshold for near-duplicates. Report precision@10, recall@10, and average latency.

Practice Exercises

3 exercises to complete

Instructions

Given query vector q = [0.6, 0.8, 0] and gallery vectors:

  • g1 = [1, 0, 0]
  • g2 = [0, 1, 0]
  • g3 = [0.5, 0.5, 0]
  • g4 = [0, 0, 1]
  • g5 = [0.2, 0.1, 0]

Steps:

  1. L2-normalize all vectors.
  2. Compute cosine similarities between q and each gi.
  3. Return the Top-2 IDs and their similarity scores (rounded to 3 decimals).

Expected Output

Top-2: g3 (≈0.990), g5 (≈0.894). Order: g3 > g5 > g2 > g1 > g4.
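
You can verify this with a few lines of NumPy:

    import numpy as np

    q = np.array([0.6, 0.8, 0.0])  # already unit length
    gallery = {"g1": [1, 0, 0], "g2": [0, 1, 0], "g3": [0.5, 0.5, 0],
               "g4": [0, 0, 1], "g5": [0.2, 0.1, 0]}

    sims = {}
    for name, g in gallery.items():
        v = np.array(g, dtype=float)
        sims[name] = float(np.dot(q, v / np.linalg.norm(v)))

    for name, s in sorted(sims.items(), key=lambda kv: -kv[1]):
        print(name, round(s, 3))  # g3 0.99, g5 0.894, g2 0.8, g1 0.6, g4 0.0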

Similarity Search For Images: Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
