Why this matters
Reranking and context selection determine whether your RAG system feeds the LLM the most useful evidence. Strong reranking reduces hallucinations, cuts token costs, and improves answer accuracy.
- Support QA: pick the few most relevant policy snippets from hundreds of pages.
- Code assistant: surface the exact function docstring instead of a whole file.
- Analytics copilot: prioritize the freshest dashboard notes over outdated ones.
Who this is for
- NLP engineers building RAG chatbots, assistants, and search interfaces.
- Data scientists evaluating retrieval pipelines and precision/recall trade-offs.
- Backend engineers integrating retrieval, ranking, and LLM orchestration.
Prerequisites
- Basic RAG pipeline knowledge: chunking, embeddings, vector and keyword retrieval.
- Comfort with cosine similarity, BM25 (or similar), and simple scoring formulas.
- Familiarity with latency/cost constraints and token budgeting.
Concept explained simply
Think of reranking as a second opinion. Retrieval gives you a rough top-k. Reranking reorders those candidates using better signals (like a cross-encoder) and picks the final few chunks to send to the LLM.
Mental model: a funnel.
- Wide: Retriever returns many candidates (high recall, lower precision).
- Narrow: Reranker scores and reorders (higher precision).
- Final: Context selector builds a compact, diverse, non-redundant window.
Key terms
- Top-k: number of candidates taken from the retriever.
- Reranker: model or rule that reorders candidates (e.g., cross-encoder).
- MMR: Maximal Marginal Relevance; increases diversity and reduces redundancy.
- Hit@k / Recall@k: whether a relevant item appears in top-k.
- NDCG/MRR: ranking quality metrics emphasizing order.
What is a cross-encoder reranker?
A cross-encoder takes the query and a single candidate together and outputs a relevance score. It is slower than a bi-encoder but usually more accurate for small k (e.g., k ≤ 50).
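If you go the cross-encoder route, a minimal sketch with the sentence-transformers CrossEncoder class looks like this. The model name is one publicly available option, not a recommendation, and the load-once pattern is an assumption; adapt to your own stack.

```python
# Minimal cross-encoder reranking sketch (assumes the sentence-transformers package).
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # load once, reuse per query

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[tuple[str, float]]:
    """Score each (query, candidate) pair jointly and return the top_n by score."""
    scores = model.predict([(query, c) for c in candidates])  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [(text, float(score)) for text, score in ranked[:top_n]]
```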
Workflow: from retrieval to final context
- Ingest & chunk: Split documents into coherent, small chunks (e.g., 200–400 tokens) with overlap.
- Retrieve candidates: Use dense, sparse, or hybrid retrieval to get top-k (e.g., 50).
- Normalize & filter: Normalize scores per source; drop exact duplicates and near-duplicates.
- Score features: Combine signals: BM25, dense similarity, source priority, recency, section type.
- Rerank: Apply a cross-encoder or a weighted-rule to reorder the top-k.
- Diversify: Use MMR or clustering to avoid redundant chunks.
- Select context: Pack the highest-value, diverse chunks into the token budget; keep citations/IDs.
- Validate: Dry-run a few queries and check whether the selected evidence actually supports correct answers.
- Log: Store scores, selections, and outcomes for iteration.
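The normalize-and-filter step is where mixed-source pipelines most often go wrong. Here is a minimal sketch of per-source min-max normalization plus simple duplicate filtering; the candidate dict fields and the near-duplicate threshold are assumptions to adapt to your own schema.

```python
# Per-source score normalization and duplicate filtering (sketch).
# Candidates are assumed to look like {"id": ..., "text": ..., "source": ..., "score": ...}.
from difflib import SequenceMatcher

def normalize_per_source(candidates: list[dict]) -> list[dict]:
    """Rescale raw scores to [0, 1] within each source so sources are comparable."""
    by_source: dict[str, list[dict]] = {}
    for c in candidates:
        by_source.setdefault(c["source"], []).append(c)
    for group in by_source.values():
        lo = min(c["score"] for c in group)
        hi = max(c["score"] for c in group)
        for c in group:
            c["norm_score"] = 0.0 if hi == lo else (c["score"] - lo) / (hi - lo)
    return candidates

def drop_duplicates(candidates: list[dict], near_dup_threshold: float = 0.9) -> list[dict]:
    """Drop exact duplicates and near-duplicates by character-level similarity.

    Quadratic in the number of candidates, which is fine for a modest top-k.
    """
    kept: list[dict] = []
    for c in candidates:
        is_dup = any(
            SequenceMatcher(None, c["text"], k["text"]).ratio() >= near_dup_threshold
            for k in kept
        )
        if not is_dup:
            kept.append(c)
    return kept
```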
Worked examples
Example 1 — FAQ bot for HR policies
Query: "How many vacation days can I carry over?" Retriever returns 20 chunks. Signals available: BM25, dense sim, source (policy handbook > forum), recency.
- Rule: score = 0.35*bm25 + 0.45*dense + 0.15*source_priority + 0.05*recency_boost (implemented in the sketch after this example).
- After scoring, pick top 8, then apply MMR to select final 5, ensuring at least two distinct sections (policy index + detailed clause).
- Result: concise context with exact clause and examples; the LLM returns a precise, cited answer.
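A minimal sketch of the weighted rule above; the candidate field names are illustrative, and the recency boost matches the formula used in Exercise 1 later in this lesson.

```python
# Weighted-rule scoring for Example 1 (sketch; field names are illustrative).
def recency_boost(recency_days: int, window: int = 30) -> float:
    """Linearly decay from 1.0 (updated today) to 0.0 at `window` days old."""
    return max(0.0, (window - recency_days) / window)

def rule_score(c: dict) -> float:
    """score = 0.35*bm25 + 0.45*dense + 0.15*source_priority + 0.05*recency_boost"""
    return (
        0.35 * c["bm25"]
        + 0.45 * c["dense"]
        + 0.15 * c["source_priority"]
        + 0.05 * recency_boost(c["recency_days"])
    )

# Usage: score all retrieved chunks, keep the top 8, then diversify with MMR.
# top8 = sorted(chunks, key=rule_score, reverse=True)[:8]
```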
Example 2 — Code assistant for Python repo
Query: "How do I initialize the Client with OAuth?" Retriever returns 30 snippets mixing README, code, and tests.
- Boost docstrings and README sections; downweight tests (see the source-type weighting sketch after this example).
- Run a cross-encoder reranker over the 30 candidates; then exclude near-duplicate snippets of the same function.
- Final pack: docstring + usage snippet + config section. The LLM answers with correct parameters and a short example.
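One simple way to express those boosts is a multiplier per snippet type; the multipliers and the `kind` field below are illustrative assumptions, not tuned values.

```python
# Source-type weighting for code search (sketch; multipliers are arbitrary starting points).
TYPE_WEIGHTS = {
    "docstring": 1.3,  # boost API docstrings
    "readme": 1.15,    # boost README/usage sections
    "code": 1.0,       # neutral
    "test": 0.6,       # downweight test files
}

def adjust_for_type(snippet: dict) -> float:
    """Scale the snippet's base relevance score by its source type."""
    return snippet["score"] * TYPE_WEIGHTS.get(snippet["kind"], 1.0)
```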
Example 3 — Multi-source search with freshness
Query: "Latest refund policy for digital purchases" across knowledge base and release notes.
- Normalize per-source scores; add a recency boost for documents updated in the last 30 days (see the sketch after this example).
- Weighted sum pre-rank → cross-encoder rerank → MMR to keep one policy and one release note.
- Outcome: includes updated clause; avoids including the obsolete version.
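If your sources carry real timestamps rather than precomputed ages, the 30-day boost can be derived from the update date directly; a small sketch, assuming an ISO-format `updated_at` field.

```python
# Recency boost from an ISO timestamp (sketch; assumes a field like "updated_at": "2024-05-01").
from datetime import date, datetime

def recency_boost_from_date(updated_at: str, window_days: int = 30,
                            today: date | None = None) -> float:
    """Return 1.0 for documents updated today, decaying linearly to 0.0 at window_days."""
    today = today or date.today()
    age_days = (today - datetime.fromisoformat(updated_at).date()).days
    return max(0.0, (window_days - age_days) / window_days)
```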
Choosing a reranker and budget tuning
- Cross-encoder: best quality for small k; adds 10–50 ms per pair on GPUs depending on model size.
- Lightweight rule-based: fastest; good when signals are strong and clean.
- Hybrid: rule pre-filter to top-20, then cross-encode to top-5.
- Latency tips: cache frequent query–candidate pairs, batch reranker calls, and keep k modest (e.g., ≤ 50); a small caching-and-batching sketch follows.
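A small caching-and-batching wrapper around a cross-encoder, as a sketch; the cache key (query text plus full candidate text) is an assumption, and in production you would likely key on stable chunk IDs instead.

```python
# Caching and batching cross-encoder calls (sketch; assumes sentence-transformers).
from sentence_transformers import CrossEncoder

_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
_cache: dict[tuple[str, str], float] = {}  # (query, candidate text) -> score

def cached_rerank_scores(query: str, docs: list[str], batch_size: int = 32) -> list[float]:
    """Score docs against the query, reusing cached pairs and batching the rest."""
    missing = [d for d in docs if (query, d) not in _cache]
    if missing:
        scores = _model.predict([(query, d) for d in missing], batch_size=batch_size)
        for d, s in zip(missing, scores):
            _cache[(query, d)] = float(s)
    return [_cache[(query, d)] for d in docs]
```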
Metrics and evaluation
- Offline: Recall@k, Hit@k, MRR, NDCG@k on labeled query-chunk pairs (simple implementations are sketched below).
- Online: Answer accuracy with human or LLM grading, citation correctness, latency, token cost.
- Data: include hard negatives (near but wrong) to stress-test reranker and MMR.
- Ablate: compare "retriever-only" vs "+ reranker" vs "+ reranker + MMR".
Common mistakes and self-check
- Using very large chunks, causing noise and token waste.
- Mixing scores across sources without normalization.
- No deduplication → repeated content crowds out diverse evidence.
- Ignoring intent and section types (e.g., examples vs definitions).
- Over-tuning for a few queries; poor generalization.
- Not measuring latency and cost alongside accuracy.
Self-check:
- [ ] Are scores normalized per source before combining?
- [ ] Do top results come from varied sections, not duplicates?
- [ ] Does each selected chunk directly support an answer sentence?
- [ ] Is the final pack within token budget with room for the prompt?
- [ ] Do metrics improve vs a retriever-only baseline?
Practical projects
- Company policy QA: Implement hybrid reranking + MMR; target +10% Hit@5 vs baseline.
- Code search assistant: Cross-encode top-30 snippets; penalize test files; measure citation correctness.
- Reviews search: Add recency and rating signals; ensure final pack includes both summary and a representative review.
Exercises
Do these in a notebook or spreadsheet. They mirror the exercises section below.
Exercise 1 — Weighted reranking and selection (ex1)
For candidates A–F with features below, compute a score and pick the top-4 for context.
- Score = 0.35*bm25 + 0.45*dense + 0.15*source_priority + 0.05*recency_boost
- recency_boost = max(0, (30 - recency_days)/30)
Data:
- A: bm25=0.82, dense=0.60, source_priority=1, recency_days=14
- B: bm25=0.75, dense=0.66, source_priority=0, recency_days=5
- C: bm25=0.70, dense=0.88, source_priority=1, recency_days=20
- D: bm25=0.90, dense=0.40, source_priority=0, recency_days=2
- E: bm25=0.68, dense=0.79, source_priority=1, recency_days=9
- F: bm25=0.60, dense=0.62, source_priority=0, recency_days=35
Output: list the top-4 IDs in order.
Exercise 2 — MMR diversification (ex2)
Given query-to-candidate similarities and pairwise candidate similarities, select k=3 with MMR (lambda=0.7):
- sim(q, d1..d5) = [0.82, 0.80, 0.78, 0.62, 0.60]
- Pairwise sim (symmetric): s12=0.85, s13=0.40, s14=0.20, s15=0.10, s23=0.45, s24=0.15, s25=0.05, s34=0.30, s35=0.25, s45=0.50
At each step, pick the argmax of: 0.7*sim(q, di) - 0.3*max_j sim(di, dj) over the already-selected dj; treat the redundancy term as 0 for the first pick. A generic helper is sketched after this exercise.
Output: chosen IDs in order.
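A generic greedy MMR helper you can adapt in your notebook; the dict-based inputs are an assumption chosen to match the exercise data, and the redundancy term is treated as 0 for the first pick.

```python
# Greedy MMR selection (sketch). query_sim maps candidate id -> sim(q, d);
# pair_sim maps frozenset({id_i, id_j}) -> sim(d_i, d_j).
def mmr_select(query_sim: dict[str, float],
               pair_sim: dict[frozenset, float],
               k: int = 3,
               lam: float = 0.7) -> list[str]:
    selected: list[str] = []
    remaining = set(query_sim)
    while remaining and len(selected) < k:
        best_id, best_score = None, float("-inf")
        for cand in remaining:
            redundancy = max(
                (pair_sim.get(frozenset({cand, s}), 0.0) for s in selected),
                default=0.0,  # first pick: nothing selected yet, so no penalty
            )
            score = lam * query_sim[cand] - (1 - lam) * redundancy
            if score > best_score:
                best_id, best_score = cand, score
        selected.append(best_id)
        remaining.remove(best_id)
    return selected

# Example call with the exercise's query similarities (fill in pair_sim from the data above):
# mmr_select({"d1": 0.82, "d2": 0.80, "d3": 0.78, "d4": 0.62, "d5": 0.60}, pair_sim, k=3)
```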
Mini challenge
You have a 1,500-token budget. Retriever returns 40 chunks from mixed sources (product docs, blog posts, community answers). Design a selection rule combining: normalized dense score, source priority (docs > blog > community), and MMR. Describe your weights, expected top-5 composition, and how you would validate impact in 1–2 days.
Learning path
- Before: Text chunking and hybrid retrieval; embeddings quality and indexing.
- Now: Reranking and context selection (this lesson).
- Next: Prompting with citations, guardrails, and evaluation pipelines.
Next steps
- Implement a simple weighted reranker; log metrics against a retriever-only baseline.
- Introduce a cross-encoder for the final top-20; measure impact on Hit@5 and latency.
- Iterate on the MMR lambda and chunk sizes; keep an eye on token spend.
When you are ready, take the Quick Test below. It is available to everyone; sign in to save your progress.
Quick Test
Answer a few questions to check your understanding. Your score is saved if you are signed in.