
Similarity Search And Reranking Basics

Learn Similarity Search And Reranking Basics for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

Similarity search and reranking are the backbone of modern NLP retrieval systems: semantic search, retrieval-augmented generation (RAG), FAQ/chatbots, deduplication, and recommendation. As an NLP Engineer you will:

  • Build vector indexes to retrieve the most relevant documents for a user query.
  • Choose similarity metrics (cosine, dot, Euclidean) and tune top-k to balance recall and latency.
  • Combine lexical (BM25) and semantic retrieval and rerank results to boost precision.
  • Evaluate and iterate using metrics like Recall@k, MRR, and nDCG.

Concept explained simply

Embeddings map text to points in a high-dimensional space. Similar texts land close together. Similarity search finds the nearest points (documents) to your query point. But the first pass can be rough—fast, not perfect. Reranking then reorders those candidates using a more accurate (often slower) model or rule, improving final relevance.

Mental model

  • Think of a funnel: Retrieve many candidates quickly → Rerank the shortlist accurately → Select the final few.
  • Retrieve = breadth (don’t miss anything important). Rerank = precision (get the best at the top).

Core components

Similarity metrics

  • Cosine similarity (angle): robust when magnitudes vary; common for normalized embeddings.
  • Dot product: equivalent to cosine when vectors are unit-normalized; used by models trained for dot-product scoring.
  • Euclidean (L2) distance: smaller is closer; sensitive to vector magnitude.

Tip: If you unit-normalize vectors, cosine, dot product, and negative L2 become closely related for ranking.
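
A minimal numpy sketch of the three metrics on toy 2-D vectors (placeholders, not real embeddings), showing why unit-normalization makes them agree for ranking:

```python
import numpy as np

q = np.array([0.6, 0.8])
d = np.array([1.0, 0.0])

# Cosine similarity: dot product divided by both norms (angle-based).
cosine = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

# Dot product: identical to cosine once both vectors are unit-normalized.
qn, dn = q / np.linalg.norm(q), d / np.linalg.norm(d)
dot = qn @ dn

# L2 distance on unit vectors: ||a - b||^2 = 2 - 2*cos(a, b),
# so smaller distance means higher cosine -- the rankings coincide.
l2 = np.linalg.norm(qn - dn)

print(cosine, dot, l2)  # 0.6, 0.6, ~0.894
```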

Index choices

  • Brute force: exact, simple, but slow for large corpora.
  • Approximate Nearest Neighbor (ANN) indexes (e.g., graph-based methods like HNSW, trees, quantization): sublinear lookups with a tunable recall-latency trade-off.

Typical pattern: ANN retrieve top-50 to top-200 → rerank to top-5 or top-10.
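
A sketch of that first pass with FAISS's HNSW index, assuming faiss-cpu and numpy are installed; the random vectors stand in for real embeddings, and a reranker would then refine the returned shortlist:

```python
import numpy as np
import faiss  # pip install faiss-cpu (assumed available)

dim, n_docs = 384, 10_000
rng = np.random.default_rng(0)

# Placeholder embeddings; unit-normalized so L2 ranking matches cosine ranking.
docs = rng.normal(size=(n_docs, dim)).astype("float32")
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# HNSW graph index (exact storage, approximate search); 32 neighbors per node.
index = faiss.IndexHNSWFlat(dim, 32)
index.hnsw.efSearch = 64                 # higher = better recall, slower lookups
index.add(docs)

query = rng.normal(size=(1, dim)).astype("float32")
query /= np.linalg.norm(query)

# First pass: pull a generous shortlist (top-100) to hand to the reranker.
distances, ids = index.search(query, 100)
shortlist = ids[0]                       # candidate doc ids, closest first
```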

Reranking methods

  • Cross-encoder reranker: encodes query and each candidate together; accurate but slower.
  • Heuristic rerankers: MMR (Maximal Marginal Relevance) for diversity; Reciprocal Rank Fusion (RRF) to combine multiple rankers; hybrid BM25 + embeddings (see the RRF sketch below).
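
A minimal pure-Python sketch of RRF, assuming each input ranking is a list of document IDs ordered best-first; the constant 60 is the commonly used default and the IDs below are hypothetical:

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse several best-first ranked lists with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_rank      = ["d3", "d1", "d7", "d2"]   # lexical ranking (hypothetical IDs)
embedding_rank = ["d1", "d2", "d3", "d9"]   # semantic ranking (hypothetical IDs)
print(rrf([bm25_rank, embedding_rank]))     # ['d1', 'd3', 'd2', 'd7', 'd9']
```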

Choosing k

  • For RAG: retrieve 30–100, rerank to 5–20.
  • For search UI: retrieve 50–200, rerank to 10–20 results shown.
  • Start with a larger k for safety, then dial it down based on latency and Recall@k.

Worked examples

Example 1: Cosine similarity by hand

Query q = [0.6, 0.8], Docs: d1 = [1, 0], d2 = [0.6, 0.8], d3 = [-0.6, 0.8].

  • cos(q, d1) = (0.6*1 + 0.8*0) / (|q||d1|) = 0.6 / (1*1) = 0.6
  • cos(q, d2) = (0.6*0.6 + 0.8*0.8) / (1*1) = 1.0
  • cos(q, d3) = (0.6*(-0.6) + 0.8*0.8) / (1*1) = (-0.36 + 0.64) / 1 = 0.28

Ranking: d2 (1.0) > d1 (0.6) > d3 (0.28).
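
The same computation in numpy, as a quick check on the arithmetic above:

```python
import numpy as np

q = np.array([0.6, 0.8])
docs = {"d1": np.array([1.0, 0.0]),
        "d2": np.array([0.6, 0.8]),
        "d3": np.array([-0.6, 0.8])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sort documents by cosine similarity to the query, highest first.
ranked = sorted(docs, key=lambda name: cosine(q, docs[name]), reverse=True)
print(ranked)                                                     # ['d2', 'd1', 'd3']
print({name: round(cosine(q, docs[name]), 2) for name in docs})   # {'d1': 0.6, 'd2': 1.0, 'd3': 0.28}
```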

Example 2: Top-k retrieval

Suppose similarities to the query are: dA=0.72, dB=0.63, dC=0.81, dD=0.59. For k=2, return dC and dA.

For k=3, return dC, dA, dB. Increasing k boosts recall but adds latency.
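
The same top-k selection with numpy (for very large score arrays, np.argpartition avoids the full sort):

```python
import numpy as np

doc_ids = np.array(["dA", "dB", "dC", "dD"])
sims    = np.array([0.72, 0.63, 0.81, 0.59])

k = 2
topk = doc_ids[np.argsort(-sims)[:k]]   # sort descending, keep the first k
print(topk)                             # ['dC' 'dA']
```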

Example 3: MMR reranking for diversity

Initial similarities to query: A=0.90, B=0.86, C=0.80, D=0.75. Pairwise cosine between docs: sim(A,B)=0.85, sim(A,C)=0.40, sim(A,D)=0.20, sim(B,C)=0.45, sim(B,D)=0.30, sim(C,D)=0.10. Let λ=0.7, choose 3.

  1. Select A first (highest similarity).
  2. Score B: λ*0.86 - (1-λ)*max(sim(B,A)) = 0.7*0.86 - 0.3*0.85 = 0.602 - 0.255 = 0.347
  3. Score C: 0.7*0.80 - 0.3*0.40 = 0.56 - 0.12 = 0.44
  4. Score D: 0.7*0.75 - 0.3*0.20 = 0.525 - 0.06 = 0.465 → pick D second.
  5. Now with S={A,D}, recompute for B and C using max similarity to S:
    • B: max(sim(B,A), sim(B,D)) = max(0.85, 0.30)=0.85 → 0.7*0.86 - 0.3*0.85 = 0.347
    • C: max(0.40, 0.10)=0.40 → 0.7*0.80 - 0.3*0.40 = 0.44 → pick C third.

Final reranked order: A, D, C (B is dropped due to redundancy with A).
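
A small greedy MMR implementation that reproduces this worked example, assuming the query similarities and pairwise document similarities are already computed:

```python
def mmr(query_sim, pair_sim, lam=0.7, top_n=3):
    """Greedy Maximal Marginal Relevance over precomputed similarities."""
    selected, remaining = [], set(query_sim)
    while remaining and len(selected) < top_n:
        def score(doc):
            # Redundancy = max similarity to anything already selected.
            redundancy = max((pair_sim[frozenset((doc, s))] for s in selected), default=0.0)
            return lam * query_sim[doc] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query_sim = {"A": 0.90, "B": 0.86, "C": 0.80, "D": 0.75}
pair_sim = {frozenset(p): s for p, s in [
    (("A", "B"), 0.85), (("A", "C"), 0.40), (("A", "D"), 0.20),
    (("B", "C"), 0.45), (("B", "D"), 0.30), (("C", "D"), 0.10)]}

print(mmr(query_sim, pair_sim))  # ['A', 'D', 'C']
```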

Step-by-step: Build a basic retrieval + rerank pipeline

  1. Prepare data: clean text, split into passages (e.g., 100–300 tokens), store IDs and text.
  2. Embed: generate a vector per passage and normalize if using cosine similarity.
  3. Index: start with brute-force for small sets; move to ANN when latency grows.
  4. Query: embed the query, search top-100 candidates.
  5. Rerank:
    • Option A: Cross-encoder score each (query, passage) pair.
    • Option B: MMR with λ≈0.5–0.8 to balance relevance vs diversity.
    • Option C: RRF combine BM25 and embedding ranks.
  6. Select: keep top-5 to top-10 for display or RAG context.
  7. Evaluate: measure Recall@k and MRR on a small labeled set; adjust k and the reranker (an end-to-end sketch of these steps follows).
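
A hedged end-to-end sketch of steps 2-6 using the sentence-transformers library (assumed installed); the model names are illustrative choices, `passages` is a placeholder corpus, and brute-force cosine search stands in for the ANN index you would use at scale:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

passages = ["Reset the router by holding the button for 10 seconds.",
            "Update the firmware from the admin panel.",
            "Contact support if the LED keeps blinking red."]      # placeholder corpus

# Step 2: embed passages and unit-normalize so dot product == cosine.
embedder = SentenceTransformer("all-MiniLM-L6-v2")                  # illustrative model
doc_vecs = embedder.encode(passages, normalize_embeddings=True)

# Step 4: embed the query and take a generous first-pass shortlist.
query = "How do I reset my router?"
q_vec = embedder.encode([query], normalize_embeddings=True)
sims = (q_vec @ doc_vecs.T)[0]
shortlist = np.argsort(-sims)[:100]                                 # top-100 (all 3 here)

# Step 5, Option A: cross-encoder scores each (query, passage) pair jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")     # illustrative model
pairs = [(query, passages[i]) for i in shortlist]
rerank_scores = reranker.predict(pairs)

# Step 6: keep the best few for display or as RAG context.
final = [passages[shortlist[i]] for i in np.argsort(-rerank_scores)[:2]]
print(final)
```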

Exercises

Open the exercise cards below and complete them. Then use this checklist:

  • Computed cosine similarity correctly and ranked documents.
  • Applied MMR formula and tracked max similarity to the selected set.
  • Explained when to increase k and when to rely on reranking.

Common mistakes and self-check

  • Not normalizing vectors when using cosine similarity. Self-check: are norms close to 1.0?
  • Too small k causing low recall. Self-check: does Recall@k increase significantly when doubling k?
  • Skipping evaluation. Self-check: do you have at least 20–50 labeled queries to estimate Recall@k and MRR? (A small metrics helper is sketched after this list.)
  • Redundant results at the top. Self-check: does MMR or cross-encoder reranking improve diversity and clicks?
  • Using dot product with a model that was not trained for it. Self-check: confirm the embedding model was trained for dot-product scoring, or normalize the vectors and switch to cosine.
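
For the evaluation self-checks, a minimal sketch of Recall@k and MRR over a small labeled set, assuming each query has exactly one known relevant document ID (the data below is hypothetical):

```python
def recall_at_k(results, relevant, k=10):
    """Fraction of queries whose relevant doc appears in the top-k results."""
    hits = sum(1 for q in relevant if relevant[q] in results[q][:k])
    return hits / len(relevant)

def mrr(results, relevant):
    """Mean reciprocal rank of the relevant doc (0 contribution if not retrieved)."""
    total = 0.0
    for q, rel in relevant.items():
        ranked = results[q]
        total += 1.0 / (ranked.index(rel) + 1) if rel in ranked else 0.0
    return total / len(relevant)

# Hypothetical labeled set: query id -> relevant doc id, and system rankings.
relevant = {"q1": "d3", "q2": "d7"}
results  = {"q1": ["d3", "d1", "d2"], "q2": ["d4", "d7", "d9"]}
print(recall_at_k(results, relevant, k=2), mrr(results, relevant))  # 1.0 0.75
```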

Practical projects

  • Policy QA search: index company policies, retrieve top-100, cross-encode to top-10, measure Recall@10 and MRR.
  • Similar question finder: given a new FAQ, find duplicates via cosine; apply MMR to surface diverse related questions.
  • Hybrid product search: combine BM25 and embeddings with RRF; tune weights to improve nDCG@10.

Who this is for

  • NLP Engineers building search, RAG, or recommendation components.
  • Data Scientists improving relevance and ranking quality.

Prerequisites

  • Basic linear algebra (vectors, dot product, norms).
  • Familiarity with word/sentence embeddings and text preprocessing.
  • Understanding of precision/recall and ranking metrics.

Learning path

  • Embeddings basics → Vector similarity → ANN indexing → Reranking (cross-encoder, MMR, RRF) → Evaluation and tuning.

Next steps

  • Experiment with different k values and rerankers on a small validation set.
  • Introduce hybrid retrieval (BM25 + embeddings) and compare against semantic-only.
  • Automate evaluation with Recall@k and MRR dashboards.

Mini challenge

Scenario: Users search for troubleshooting steps. The first-pass top-100 are accurate but repetitive. You have 200 ms budget for reranking. What would you try first and why?

Suggested approach
  • Apply MMR (λ≈0.6–0.8) to reduce redundancy within budget.
  • If budget allows, cross-encode the top-40 and keep top-10; otherwise RRF of BM25 and embeddings.
  • Validate with Recall@10 and user click-through.

Ready for a quick test?

The quick test below is available to everyone.

Practice Exercises


Instructions

Given query q = [0.6, 0.8] and documents d1 = [1, 0], d2 = [0.6, 0.8], d3 = [-0.6, 0.8]:

  • Compute cosine similarity between q and each document.
  • Rank documents from most to least similar.
  • Explain in one sentence why normalization matters here.

Expected Output

Ranking: d2 first, d1 second, d3 third. Cosine values approximately 1.00, 0.60, 0.28.

Similarity Search And Reranking Basics — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.
