Why this matters
As an NLP Engineer, you will find that retrieval quality shapes your entire RAG or semantic search pipeline. Vector indexes determine whether your system returns relevant passages within tight latency budgets. You will routinely choose an index type, set parameters, and validate recall/latency trade-offs for datasets ranging from thousands to hundreds of millions of text embeddings.
- Powering RAG: fetch top-k passages fast and accurately.
- Semantic search: scale to millions of docs while keeping latency low.
- Personalization and recommendations: nearest neighbor queries at scale.
Concept explained simply
A vector index is a data structure that lets you find the nearest vectors (neighbors) to a query vector without scanning everything. It trades off speed, memory, and accuracy. Most NLP systems use Approximate Nearest Neighbor (ANN) indexes to get big speedups with small, controlled accuracy loss.
Mental model
Imagine your documents as points in space. A query is another point. The job is to quickly find the closest points. A brute-force approach measures distance to every point (accurate but slow at scale). An ANN index organizes space so you only check a smart subset, then optionally re-check the top few precisely (re-ranking).
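To make this concrete, here is a minimal brute-force search sketch in NumPy (toy random vectors stand in for real embeddings): it scores every document, which is exact but O(N) per query; an ANN index exists precisely to avoid this full scan.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 768)).astype("float32")   # toy corpus embeddings
query = rng.normal(size=(768,)).astype("float32")         # toy query embedding

# L2-normalize so that a dot product equals cosine similarity
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Brute force: score every document, then keep the 10 nearest
scores = docs @ query                      # one dot product per document
top_k = np.argsort(-scores)[:10]           # indices of the 10 most similar docs
print(top_k, scores[top_k])
```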
Core building blocks
Distance metrics
- Cosine similarity (scale-invariant). Common for text embeddings. Often used via cosine distance or dot product on normalized vectors.
- Dot product / MIPS (Maximum Inner Product Search). If vectors are normalized, dot product ordering equals cosine.
- L2 (Euclidean). Sometimes used when magnitude carries meaning.
Which to choose?
- Text embeddings typically: cosine, or dot product if you normalize vectors.
- If your model docs recommend a metric, follow that.
- Be consistent: train/evaluate with the same metric and normalization.
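A quick NumPy check of the points above (the vectors are arbitrary examples): cosine on raw vectors and dot product on L2-normalized vectors give the same ranking, and for unit vectors L2 distance ranks identically as well, while the raw dot product without normalization favors longer vectors.

```python
import numpy as np

q = np.array([1.0, 0.0])
docs = np.array([[1.0, 0.0], [0.8, 0.6], [2.0, 1.0], [0.0, 1.0]])

cosine = (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
raw_dot = docs @ q                            # without normalization, the long vector [2, 1] wins

docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
q_n = q / np.linalg.norm(q)
dot_normalized = docs_n @ q_n                 # equals cosine once vectors are normalized
l2 = np.linalg.norm(docs_n - q_n, axis=1)     # for unit vectors: ||q - d||^2 = 2 - 2*cosine

print(np.allclose(cosine, dot_normalized))    # True
print(np.argsort(-cosine), np.argsort(l2))    # same ranking (descending cosine vs ascending L2)
```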
Index types (common choices)
- Flat (brute force): exact, simple, great for small sets (<=50k–100k), slow at large scale.
- HNSW: graph-based ANN. High recall with low latency, good for millions. Key params: M (graph connectivity), efConstruction (build-time quality), efSearch (query-time recall/latency).
- IVF (Inverted File): clusters vectors into nlist centroids; search only nprobe clusters per query. Faster than exhaustive search on large datasets because only a fraction of the corpus is scanned per query.
- PQ / OPQ (Product Quantization): compress vectors into short codes; big memory savings with some accuracy loss. Often used with IVF: IVF-PQ.
- Disk-backed indexes: keep most data on disk, read minimal portions at query time (useful for 100M+ scale when RAM is tight).
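As one concrete way to instantiate these families, here is a sketch using FAISS (assuming `faiss-cpu` and NumPy are installed; the data is random and the parameter values are illustrative starting points, not tuned).

```python
import faiss
import numpy as np

d = 768
xb = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(xb)                       # in-place L2 normalization; with unit vectors,
                                             # both L2 and inner-product rankings match cosine

# Flat: exact brute-force search (inner product == cosine on normalized vectors)
flat = faiss.IndexFlatIP(d)
flat.add(xb)

# HNSW: graph-based ANN; M controls connectivity, ef* control build/search effort
hnsw = faiss.IndexHNSWFlat(d, 16)            # M = 16, L2 metric by default
hnsw.hnsw.efConstruction = 200
hnsw.hnsw.efSearch = 64
hnsw.add(xb)

# IVF-PQ: cluster into nlist cells, compress each vector to m 8-bit codes
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 1024, 96, 8)   # nlist=1024, m=96, 8 bits/code
ivfpq.train(xb)                              # IVF and PQ both need a training pass
ivfpq.add(xb)
ivfpq.nprobe = 32                            # clusters scanned per query

D, I = hnsw.search(xb[:5], 10)               # distances and ids of the top-10 neighbors
```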
Filtering and hybrid search
- Filtering: apply metadata constraints (e.g., language="en"). Efficient systems use pre-filters (inverted/bitmap indices) or post-filter top ANN candidates.
- Hybrid: combine dense similarity with sparse lexical scores (e.g., BM25 + ANN). Often improves relevance.
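A minimal sketch of the post-filter strategy (plain NumPy; the candidate ids, scores, and metadata are toy placeholders): retrieve a larger candidate set from the ANN index, drop candidates whose metadata fails the filter, and widen the search if too few survive.

```python
import numpy as np

# Pretend the ANN index already returned these candidates for one query
candidate_ids = np.array([42, 7, 13, 99, 3, 58])
candidate_scores = np.array([0.91, 0.88, 0.85, 0.84, 0.80, 0.79])
doc_language = {42: "en", 7: "de", 13: "en", 99: "fr", 3: "en", 58: "en"}

def post_filter(ids, scores, k, predicate):
    """Keep candidates passing the metadata predicate, preserving score order."""
    kept = [(int(i), float(s)) for i, s in zip(ids, scores) if predicate(int(i))]
    return kept[:k]          # if fewer than k remain, re-query the ANN with a larger top-K

top = post_filter(candidate_ids, candidate_scores, k=3,
                  predicate=lambda doc_id: doc_language[doc_id] == "en")
print(top)                   # [(42, 0.91), (13, 0.85), (3, 0.8)]
```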
Key trade-offs
- Recall vs. latency: higher recall usually means slower queries. Tune efSearch (HNSW) or nprobe (IVF) to balance.
- Memory vs. accuracy: PQ reduces memory but can lower recall; re-ranking helps.
- Build/update cost: HNSW supports dynamic inserts well; IVF may require periodic re-training if distribution shifts.
Worked examples
Example 1: Cosine vs dot product
Let q=[1,0], a=[1,0], b=[0.8,0.6].
- Cosine(q,a)=1.0. Cosine(q,b)=0.8.
- If all vectors are normalized, dot(q,a)=1.0; dot(q,b)=0.8. Ordering matches cosine.
Takeaway: normalize when using dot product to mimic cosine.
Example 2: Choose an index by scale
- Dataset: 40k docs (768-dim). Latency budget: 30 ms. Choice: Flat with batch BLAS or small HNSW. Flat is simplest and exact.
- Dataset: 8M docs. Budget: 50 ms P95. Choice: HNSW or IVF-PQ. Start with HNSW for high recall; move to IVF-PQ if RAM is a constraint.
Example 3: Tuning HNSW
- Start: M=16, efConstruction=200, efSearch=64.
- Measure recall@10 and latency. If recall is low, raise efSearch to 128; if latency spikes, try M=12 and efConstruction=150.
- Re-check recall/latency; keep a change log.
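The tuning loop above, sketched with FAISS and NumPy (random data stands in for real embeddings): ground truth comes from an exact Flat index, then efSearch is swept while recall@10 and average latency are logged.

```python
import time
import faiss
import numpy as np

d, n, nq, k = 128, 100_000, 500, 10
xb = np.random.rand(n, d).astype("float32")
xq = np.random.rand(nq, d).astype("float32")

flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)                    # exact ground truth for recall@10

hnsw = faiss.IndexHNSWFlat(d, 16)             # M = 16
hnsw.hnsw.efConstruction = 200
hnsw.add(xb)

for ef in (32, 64, 128, 256):
    hnsw.hnsw.efSearch = ef
    t0 = time.perf_counter()
    _, ids = hnsw.search(xq, k)
    ms_per_query = (time.perf_counter() - t0) / nq * 1000
    recall = np.mean([len(set(a) & set(b)) / k for a, b in zip(ids, gt)])
    print(f"efSearch={ef:3d}  recall@10={recall:.3f}  avg latency={ms_per_query:.2f} ms")
```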
Example 4: IVF basics
- Pick nlist on the order of sqrt(N) to start; for N=10M, sqrt(N)≈3,200, so try nlist≈3,000–4,000 as a first pass.
- Tune nprobe: higher nprobe increases recall but costs latency. Try 32, 64, 128.
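The same sweep idea for IVF, again sketched with FAISS on toy data (recall would be measured against a Flat ground truth exactly as in the HNSW sweep above):

```python
import faiss
import numpy as np

d, n = 128, 100_000
xb = np.random.rand(n, d).astype("float32")
xq = np.random.rand(200, d).astype("float32")

nlist = int(np.sqrt(n))                       # ~316 clusters as a starting point
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)                                 # k-means over the corpus (or a sample of it)
ivf.add(xb)

for nprobe in (32, 64, 128):
    ivf.nprobe = nprobe                       # clusters scanned per query
    D, I = ivf.search(xq, 10)
    # log recall@10 and latency for each nprobe, as in the HNSW sweep
```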
Example 5: Memory sizing with PQ
768-dim float32 ≈ 3 KB/vector, so 10M vectors ≈ 30 GB raw. With PQ (e.g., m=96 subquantizers at 8 bits each) you store 96 bytes/vector, i.e., roughly 1 GB of codes for 10M vectors plus index overhead, trading some accuracy for large memory savings.
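The same arithmetic as a tiny script (raw vectors and PQ codes only; centroids, id lists, and other index overhead come on top):

```python
n, d = 10_000_000, 768
bytes_per_float32 = 4

raw_gb = n * d * bytes_per_float32 / 1e9       # ~30.7 GB of raw float32 vectors
m, bits = 96, 8                                # PQ: 96 subquantizers, 8 bits each
code_bytes = m * bits // 8                     # 96 bytes per compressed vector
pq_gb = n * code_bytes / 1e9                   # ~0.96 GB of PQ codes

print(f"raw: {raw_gb:.1f} GB   PQ codes: {pq_gb:.2f} GB")
```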
How to choose an index
- Estimate scale: N, dimension d, QPS, latency SLOs.
- Check memory budget and update/insert needs.
- Start simple: Flat for small sets; HNSW for roughly 100k–20M; IVF(+PQ) for 10M+ with tight RAM.
- Tune query-time knobs: efSearch (HNSW) or nprobe (IVF).
- Add re-ranking for quality: re-score top-K with the exact metric or a cross-encoder.
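A sketch of that last re-ranking step (NumPy only; `full_vectors` is assumed to hold the original float32 embeddings, and a cross-encoder could replace the exact dot product): fetch more ANN candidates than you need, re-score them exactly, and keep the best k.

```python
import numpy as np

def rerank_exact(query, ann_ids, full_vectors, k=10):
    """Re-score ANN candidates with the exact dot product on full-precision vectors."""
    ann_ids = np.asarray(ann_ids)
    candidates = full_vectors[ann_ids]             # original float32 vectors of the candidates
    exact_scores = candidates @ query              # exact similarity (cosine if normalized)
    order = np.argsort(-exact_scores)[:k]
    return ann_ids[order], exact_scores[order]

# usage: ids, scores = rerank_exact(q, ann_top200_ids, doc_embeddings, k=10)
```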
Quick decision guide
- Small (<=100k): Flat or small HNSW.
- Medium (100k–20M): HNSW first. Use efSearch to hit recall SLOs.
- Very large (20M+): IVF-PQ or disk-backed HNSW/IVF. Add re-ranking.
Implementation notes (practical)
- Normalization: If using cosine, L2-normalize vectors at ingest and query.
- Dim reduction: PCA to 256–384 dims can speed up and save memory with modest quality loss. Re-normalize after PCA if using cosine.
- Updates: HNSW handles inserts; for IVF, consider periodic re-clustering if distribution shifts.
- Filters: Pre-filter with metadata when the match set is sparse; post-filter top-K if filters are broad.
- Evaluation: Track recall@k, latency (P50/P95), memory, build time. Fix a validation set and keep it consistent.
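Two small helpers along these lines (NumPy only; approximate and exact id lists come from your ANN index and a Flat baseline, latencies from the query loop):

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids, k):
    """Average fraction of the exact top-k that the ANN index also returned."""
    hits = [len(set(a[:k]) & set(e[:k])) / k for a, e in zip(approx_ids, exact_ids)]
    return float(np.mean(hits))

def latency_p50_p95(latencies_ms):
    """P50/P95 over per-query latencies in milliseconds."""
    return np.percentile(latencies_ms, 50), np.percentile(latencies_ms, 95)
```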
Common mistakes and self-check
- Mixing metrics: Training/evaluation with cosine but serving with L2. Self-check: verify one consistent metric end-to-end.
- No normalization for cosine-like use: Self-check: confirm vectors and queries are normalized.
- Over-compression with PQ causing relevance drop. Self-check: A/B recall before/after PQ and add re-ranking.
- Untuned efSearch/nprobe. Self-check: sweep values and log recall/latency curves.
- Ignoring filters until late. Self-check: test with real filters early; measure impact.
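The normalization self-check is worth automating; a minimal sketch (assumes your stored and query embeddings are NumPy arrays):

```python
import numpy as np

def assert_normalized(vectors, atol=1e-3):
    """Fail fast if embeddings are not L2-normalized (needed when serving cosine as a dot product)."""
    norms = np.linalg.norm(vectors, axis=1)
    assert np.allclose(norms, 1.0, atol=atol), (
        f"non-unit norms found: min={norms.min():.3f}, max={norms.max():.3f}"
    )

# assert_normalized(doc_embeddings); assert_normalized(query_embeddings)
```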
Exercises
Exercise 1: Cosine vs. dot product and a tiny retrieval
- Given q=[0.6,0.8], d1=[0.6,0.8], d2=[0.9,0.1], d3=[-0.6,-0.8]. Compute cosine similarities to q.
- Rank d1,d2,d3 by similarity. Which metric would you use for text embeddings and why?
Expected output
Cosine(q,d1)≈1.0; Cosine(q,d2)≈0.68 (dot product 0.62 divided by |d2|≈0.91); Cosine(q,d3)≈-1.0. Ranking: d1 > d2 > d3. Choose cosine (or dot product on normalized vectors) for text embeddings.
Exercise 2: Choose an index for 5M documents
Scenario: 5,000,000 documents, 768-dim embeddings, latency P95 ≤ 60 ms, QPS=80, RAM budget ~16 GB, moderate inserts.
- Pick an index type and key parameters.
- Estimate memory roughly.
- Describe how you would tune for recall without breaking latency.
Expected output
Choice: IVF-PQ or HNSW depending on RAM. Example plan: IVF-PQ with nlist starting near sqrt(5M)≈2,200 (up to ~4,000), nprobe tuned over 32–128, PQ m≈96 at 8 bits. Memory: roughly 0.5 GB of PQ codes (5M × 96 bytes) plus centroids and overhead, comfortably inside 16 GB, whereas full-precision vectors alone would need ~15 GB. Tune nprobe upward to raise recall; add an exact top-K re-rank.
Checklist:
- I computed cosine similarities correctly.
- I stated a clear index choice with parameters tied to constraints.
- I included a recall/latency tuning plan.
Mini challenge
Design a retrieval plan for 30M multilingual docs with per-language filters and P95 ≤ 80 ms. Write a short plan: index type, filter strategy (pre vs post), re-ranking approach, and the first three parameters you would sweep.
Who this is for
- NLP Engineers building RAG or semantic search.
- Data/ML Engineers deploying vector search at scale.
- Researchers prototyping retrieval systems.
Prerequisites
- Basics of embeddings and similarity.
- Comfort with linear algebra at the level of dot products and norms.
- Python familiarity for experimentation.
Learning path
- Review embeddings and similarity metrics (cosine, dot, L2).
- Practice small-scale retrieval with Flat.
- Learn and tune HNSW parameters (M, efConstruction, efSearch).
- Scale to IVF and IVF-PQ; add re-ranking.
- Add metadata filtering and hybrid dense+lexical scoring.
- Build evaluation: recall@k, P95 latency, memory.
Practical projects
- Build a mini semantic search over 200k docs. Compare Flat vs HNSW latency and recall.
- Create a 5M-doc synthetic corpus. Train IVF-PQ, tune nprobe, and add exact re-ranking on top-50.
- Implement filters: language and date range. Compare pre-filter vs post-filter performance.
Next steps
- Introduce cross-encoder or reranker models for top-K re-scoring.
- Add monitoring: drift in recall, latency percentiles, and memory trends.
- Experiment with dimensionality reduction (PCA) and re-normalization.