Why this matters
Computer Vision Engineers rely on vector indexes to search, compare, and organize image embeddings at scale. Typical tasks:
- Image similarity search: find visually similar products, scenes, or faces from millions of images.
- Duplicate/near-duplicate detection to clean datasets and catalogs.
- Content moderation: fast lookups against known risky content embeddings.
- Recommendation: retrieve visually similar items to power related products.
Choosing the right index can cut latency from seconds to milliseconds and reduce memory costs, while maintaining recall and quality.
Concept explained simply
An embedding turns an image into a vector (a list of numbers). A vector index is a data structure that makes finding the nearest vectors fast. Different index types trade memory, speed, and accuracy.
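To make that concrete, here is brute-force search in a few lines of NumPy (all shapes and numbers are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 128)).astype("float32")  # 10k fake image embeddings
db /= np.linalg.norm(db, axis=1, keepdims=True)            # unit length: dot product == cosine

query = db[42]                    # pretend this is a fresh image's embedding
scores = db @ query               # compare against every vector: exact but O(N)
top5 = np.argsort(-scores)[:5]    # indices of the 5 most similar images
print(top5, scores[top5])
```

A vector index exists to replace this exhaustive scan with a structure that inspects only a small fraction of the vectors.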
Key terms
- Distance metric: how we measure similarity (e.g., cosine, L2, inner product).
- Recall@k: fraction of the true nearest neighbors found among the top-k results (a small measurement sketch follows this list).
- Latency: time to answer a query (watch P95/P99, not just average).
- ANN: Approximate Nearest Neighbor search; faster than brute force but may miss some neighbors.
- Filtering: restricting search by metadata (e.g., category="shoes").
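Recall@k is easy to measure offline if you keep a brute-force index around as ground truth; a minimal sketch:

```python
import numpy as np

def recall_at_k(ann_ids: np.ndarray, exact_ids: np.ndarray, k: int) -> float:
    """Fraction of the true top-k neighbors that the ANN index actually returned.

    ann_ids / exact_ids: (num_queries, k) id arrays, e.g. the second value
    returned by searching the ANN index and a brute-force index respectively.
    """
    hits = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(ann_ids, exact_ids))
    return hits / (len(ann_ids) * k)
```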
Mental model
Imagine a huge library of image descriptors. A flat scan reads every card to find the closest ones: accurate but slow. An inverted file (IVF) first jumps to a shelf (coarse cluster) where good matches likely live, then looks only there. HNSW builds a multi-level map to hop quickly toward close neighbors. PQ (Product Quantization) makes each card smaller by compressing it, speeding memory-bound searches at the cost of some precision.
The building blocks
- Flat (brute-force): exact results; scales poorly; good as a baseline and for small collections.
- IVF (coarse quantizer): partitions vectors into nlist clusters; a search visits only nprobe of them.
- HNSW: graph-based; excellent latency/recall; supports dynamic inserts; memory overhead per vector.
- PQ/OPQ: compress vectors to compact codes; pair with IVF to reduce RAM and speed up memory-bound search.
- Metrics: cosine (often normalize vectors first), L2, inner product. For normalized vectors, L2 and cosine rankings align.
- Dimensionality reduction: PCA to reduce dimension (e.g., 512 → 256) with slight recall loss but big memory wins. (A code sketch of these building blocks follows this list.)
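A minimal sketch of these building blocks, assuming FAISS (one common open-source library; the factory string, nlist=256, and the other values here are illustrative starting points, not recommendations):

```python
import faiss
import numpy as np

d = 512
xb = np.random.rand(100_000, d).astype("float32")   # stand-in embeddings
faiss.normalize_L2(xb)   # unit vectors: inner product == cosine, and L2 ranks the same

# Flat: exact brute force -- the accuracy baseline
flat = faiss.IndexFlatIP(d)
flat.add(xb)

# HNSW: graph-based ANN; M = connectivity, efSearch = recall/latency knob
hnsw = faiss.IndexHNSWFlat(d, 16)   # L2 metric by default; fine for normalized vectors
hnsw.hnsw.efConstruction = 200
hnsw.hnsw.efSearch = 64
hnsw.add(xb)

# OPQ + IVF + PQ: rotate, cluster into 256 lists, compress to 16-byte codes
ivfpq = faiss.index_factory(d, "OPQ16,IVF256,PQ16")
ivfpq.train(xb)                     # quantizers must be trained on representative data
ivfpq.add(xb)
faiss.extract_index_ivf(ivfpq).nprobe = 16   # clusters visited per query

D, I = hnsw.search(xb[:5], 10)      # top-10 distances and ids for 5 queries
```

The same concepts (flat scan, graph, inverted lists, product quantization) appear under similar names in other vector databases.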
Worked examples
Example 1: Near-duplicate product images
Goal: mark near-duplicates in a 2M image catalog with 384-d embeddings.
- Normalize embeddings to unit length.
- Use cosine similarity; threshold ~0.9 for near-dup (adjust after sampling).
- Index choice: HNSW for high recall and dynamic updates. Start with M=16, efConstruction=200, efSearch=64.
- Process: query each image; if a neighbor scores above the threshold, link the pair as duplicates.
Why not IVF-PQ?
IVF-PQ is great for huge scales and memory savings but may slightly lower recall around tight thresholds. If dedup precision matters, start with HNSW. Revisit once metrics are stable and memory pressure grows.
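A sketch of this dedup pass, again assuming FAISS; the embeddings file name is hypothetical, and M, efConstruction, and efSearch are the starting values from above:

```python
import faiss
import numpy as np

d, threshold = 384, 0.9
emb = np.load("catalog_embeddings.npy").astype("float32")  # hypothetical (N, 384) array
faiss.normalize_L2(emb)                                    # cosine via inner product

index = faiss.IndexHNSWFlat(d, 16, faiss.METRIC_INNER_PRODUCT)
index.hnsw.efConstruction = 200
index.hnsw.efSearch = 64
index.add(emb)

# Top-2 per image: the best hit is almost always the image itself,
# so the second hit is the nearest distinct neighbor.
D, I = index.search(emb, 2)
pairs = [(i, int(I[i, 1])) for i in range(len(emb)) if D[i, 1] >= threshold]
print(f"{len(pairs)} candidate near-duplicate pairs at similarity >= {threshold}")
```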
Example 2: 1M images, 512-d, fast search API
- Constraints: P95 latency < 50 ms; recall@10 ≥ 0.95.
- Candidate: HNSW (good latency and recall). Try M=16–32; efSearch=64–128.
- Memory estimate: embeddings 1,000,000 × 512 × 4 B ≈ 2.05 GB; plus HNSW overhead (~64–128 B/vec) ≈ 64–128 MB; total ~2.1–2.2 GB.
- Outcome: likely meets latency on a single machine; tune efSearch upward if recall is low (see the sketch below).
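A sketch for validating these targets empirically, assuming FAISS and synthetic stand-in data (real latency and recall depend on hardware and the actual embeddings):

```python
import time
import faiss
import numpy as np

d, n, k = 512, 1_000_000, 10
xb = np.random.rand(n, d).astype("float32")
faiss.normalize_L2(xb)

index = faiss.IndexHNSWFlat(d, 16)   # try M up to 32 if recall stalls
index.hnsw.efConstruction = 200
index.add(xb)                        # building is the slow part; do it offline
index.hnsw.efSearch = 64             # first knob to raise if recall@10 < 0.95

latencies_ms = []
for q in xb[:1000]:                  # single-query timing approximates serving
    t0 = time.perf_counter()
    index.search(q.reshape(1, -1), k)
    latencies_ms.append((time.perf_counter() - t0) * 1000)
print(f"P95: {np.percentile(latencies_ms, 95):.1f} ms")
```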
Example 3: 10M images, 256-d, memory constrained
- Constraints: 32 GB RAM; P95 < 120 ms; recall@10 ≥ 0.9.
- Candidate: IVF-PQ with OPQ.
- Setup: nlist ≈ sqrt(10M) ≈ 3162 (round to 4096). PQ: m=16 sub-quantizers at 8 bits each → 16 B per vector code. Coarse centroids and codebooks add overhead.
- Memory estimate: 10M × 16 B ≈ 160 MB of codes, plus IVF lists and centroids (hundreds of MB). Fits in RAM with headroom.
- Tune: nprobe 32–256; increase nprobe to raise recall; re-rank top candidates with the original vectors if needed (build sketch below).
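A build sketch for this configuration, assuming FAISS's index_factory; the factory string encodes the numbers above, while the training-sample size is an assumption to adjust:

```python
import faiss
import numpy as np

d = 256
# OPQ rotation, 4096 coarse clusters, 16 sub-quantizers x 8 bits = 16-byte codes
index = faiss.index_factory(d, "OPQ16,IVF4096,PQ16")

train = np.random.rand(500_000, d).astype("float32")  # use a real representative sample
faiss.normalize_L2(train)
index.train(train)
# index.add(batch) for each embedding batch; 10M x 16 B of codes = ~160 MB

faiss.extract_index_ivf(index).nprobe = 64  # sweep 32-256: recall up, latency up
# If recall@10 hovers near the 0.9 floor, re-rank the top ~100 hits
# with the original float vectors (see Step 5 below).
```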
How to choose an index (quick playbook)
Step 1 – Metric and normalization
- If using CLIP-like embeddings: L2-normalize and use cosine similarity (or inner product with normalized vectors).
- Keep metric consistent across training, evaluation, and serving.
Step 2 – Scale and update pattern
- < 1M vectors: HNSW or Flat (if latency budget allows).
- 1M–50M: HNSW (if RAM ok) or IVF-PQ for memory efficiency.
- Frequent inserts/deletes: HNSW handles dynamic updates better than IVF-PQ.
Step 3 – Memory and compression
- Estimate raw: N × dim × 4 bytes (float32).
- Need big savings? Use PQ/OPQ, possibly with on-disk storage + caching.
- Optional PCA down to 256 or 192 dims if the recall impact is acceptable (see the estimate helper below).
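The Step 3 arithmetic as a small helper (float32 assumed; real index overhead comes on top and varies by structure):

```python
def raw_gb(n: int, dim: int) -> float:
    """Uncompressed storage: N x dim x 4 bytes (float32)."""
    return n * dim * 4 / 1e9

def pq_mb(n: int, m: int = 16, bits: int = 8) -> float:
    """PQ code storage: m sub-quantizers x bits each, per vector."""
    return n * (m * bits / 8) / 1e6

print(raw_gb(10_000_000, 256))  # 10.24 GB raw ...
print(pq_mb(10_000_000))        # ... vs 160.0 MB of PQ codes
```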
Step 4 – Tuning knobs
- HNSW: increase efSearch for higher recall; M tunes graph connectivity.
- IVF: increase nprobe to visit more clusters; increase nlist for finer partitions.
- PQ: increase code size (e.g., m=16 vs 8) for better recall at higher memory cost (a sweep sketch follows this list).
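A minimal sweep of the main HNSW knob, assuming FAISS, with a Flat index as ground truth; the same loop shape works for IVF by sweeping nprobe instead:

```python
import faiss
import numpy as np

d, k = 512, 10
xb = np.random.rand(200_000, d).astype("float32")
xq = np.random.rand(1_000, d).astype("float32")
faiss.normalize_L2(xb); faiss.normalize_L2(xq)

exact = faiss.IndexFlatL2(d)   # brute-force ground truth for recall
exact.add(xb)
_, gt = exact.search(xq, k)

hnsw = faiss.IndexHNSWFlat(d, 16)
hnsw.add(xb)

for ef in (32, 64, 128, 256):  # higher efSearch: more recall, more latency
    hnsw.hnsw.efSearch = ef
    _, ids = hnsw.search(xq, k)
    recall = np.mean([len(set(a) & set(b)) / k for a, b in zip(ids, gt)])
    print(f"efSearch={ef:>3}  recall@{k}={recall:.3f}")
```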
Step 5 – Filtering and re-ranking
- If metadata filters are common: pre-filter candidate partitions, or run the vector search first and filter the results, broadening the probe if too few candidates survive.
- Re-rank top candidates using the original (non-quantized) vectors to recover accuracy (sketch below).
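A re-ranking sketch in NumPy: over-fetch candidates from the compressed index, then re-score them exactly against the original float vectors (the function and argument names are illustrative):

```python
import numpy as np

def rerank(query: np.ndarray, candidate_ids: np.ndarray,
           originals: np.ndarray, k: int = 10):
    """Exact re-score of ANN candidates with uncompressed embeddings.

    query: (d,) unit-normalized vector; candidate_ids: ids from the PQ
    index (fetch 5-10x k of them); originals: (N, d) float32 embeddings.
    """
    scores = originals[candidate_ids] @ query   # exact cosine on unit vectors
    order = np.argsort(-scores)[:k]
    return candidate_ids[order], scores[order]
```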
Common mistakes and self-check
- Mistake: Mixing metrics (e.g., cosine in training, L2 in serving). Self-check: verify normalization and metric match end-to-end.
- Mistake: Underestimating memory overhead. Self-check: include index structure overhead (graph links, centroids, codes).
- Mistake: Tuning only for average latency. Self-check: verify P95/P99 under realistic concurrency.
- Mistake: Ignoring cold-start. Self-check: warm caches or pre-load index segments; measure first-query latency.
- Mistake: Skipping re-ranking after PQ. Self-check: compare recall/precision with and without exact re-score on top hits.
- Mistake: Too few evaluation queries. Self-check: use diverse queries, categories, and edge cases; report recall@k and nDCG.
Exercises
These mirror the tasks in the Exercises panel below.
Exercise 1 – Choose the right index for a 5M-image search
Scenario: 5,000,000 images, 384-d float32 embeddings; target P95 < 100 ms; recall@10 ≥ 0.9; RAM 64 GB; frequent filters by region and category; daily inserts of 50k images.
- Estimate raw memory for embeddings.
- Pick metric and normalization.
- Propose an index type and parameters.
- Account for filtering and updates.
- Provide a brief justification and memory estimate including index overhead.
Deliverable: a short plan (bullets) with numbers.
Exercise 2 – Tune recall vs latency
Choose either HNSW or IVF-PQ and propose a tuning plan to reach recall@10 ≥ 0.92 with P95 < 80 ms on 2M vectors (512-d). Include 3 parameter trials and your expected trend for recall and latency. Specify when you would add re-ranking.
- List parameters for each trial (e.g., HNSW efSearch; IVF nprobe; PQ code size).
- Predict which trial hits the target first.
Practical projects
- Visual product search: build an HNSW index for a 1M catalog; measure recall@10 vs efSearch; add category filter and re-ranking.
- Dataset deduplication: use cosine similarity thresholds to cluster near-duplicates; compare HNSW vs Flat on a 200k sample.
- Landmark retrieval: compress with IVF-OPQ; test nlist and nprobe settings; evaluate memory savings and P95 latency.
Who this is for
- Computer Vision Engineers building retrieval, deduplication, or recommendation features.
- ML practitioners deploying embedding-based search services.
Prerequisites
- Basics of embeddings and similarity metrics (cosine, L2).
- Comfort with batch processing and evaluating recall/latency.
Learning path
- Start: Understand metrics and normalization.
- Then: Compare Flat, HNSW, IVF, and PQ trade-offs.
- Next: Practice tuning efSearch/nprobe and adding re-ranking.
- Finally: Evaluate with recall@k and P95/P99 latency under load.
Next steps
- Instrument your service for recall@k and tail latency.
- Design a small offline grid search for index parameters.
- Plan an A/B test to validate quality improvements.
Mini challenge
Design a migration plan from HNSW to IVF-PQ for a 20M-image system under 32 GB RAM. Specify how you will maintain recall during the switch, which parameters you will sweep first, and how you will roll back if quality drops.
Quick test
Take the quick test below to check your understanding.