
Building Retrieval Evaluation Sets

Learn how to build retrieval evaluation sets for free, with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

As an NLP Engineer, you will ship retrieval systems for chatbots, question answering, search, and RAG pipelines. A clear, trustworthy evaluation set lets you:

  • Measure if users can find what they need within the top-k results.
  • Compare BM25 vs. embeddings vs. rerankers with stable metrics.
  • Debug failure modes with hard negatives and stratified queries.
  • Track progress over time without relying on subjective demos.

Who this is for

  • NLP Engineers and Data Scientists building search, QA, or RAG systems.
  • ML Engineers integrating embedding-based retrieval in products.
  • Analysts validating knowledge base coverage and search quality.

Prerequisites

  • Basic Python and familiarity with vectors, cosine similarity, and BM25.
  • Understanding of documents, queries, relevance labels (binary or graded).
  • Comfort with simple metrics: precision/recall, top-k, ranking positions.

Learning path

  • Before this: Text normalization, embeddings basics, BM25, top-k retrieval.
  • This lesson: Build reliable retrieval evaluation sets and protocols.
  • After this: Reranking evaluation, RAG end-to-end evaluation, error analysis automation.

Concept explained simply

A retrieval evaluation set is a compact, labeled dataset of queries and documents that tells you if your retriever brings the right items to the top. Each query has one or more relevant documents. You compute metrics like Recall@k or MRR to see how well the system ranks them.

Mental model

Imagine a well-curated quiz. Each quiz question (query) has one or more correct answers (relevant docs). Your retriever is the student placing answers in order. The scorecard (metrics) reads the top of the stack and awards points when correct answers appear early.

Designing an evaluation set (step-by-step)

  1. Define use cases and users
    Examples: internal wiki search, FAQ retrieval, product troubleshooting, policy lookup.
  2. Enumerate query types
    Head queries (popular), tail queries (rare), synonyms/misspellings, entity-heavy, multi-intent, temporal queries. Track counts per type so you can balance coverage later.
  3. Collect a corpus slice
    Stable snapshot of documents with IDs, titles, bodies, and metadata. Freeze versions.
  4. Create seed queries
    Source from real logs (de-identified), user interviews, or rewrites of document titles/intents. Avoid copying entire sentences verbatim from a document.
  5. Label relevance
    Start with binary (0/1). If order among multiple relevant docs matters, use graded labels (e.g., 2 = highly relevant, 1 = partially relevant, 0 = not relevant).
  6. Mine negatives
    Include easy negatives (random from corpus) and hard negatives (top-ranked by a baseline but judged not relevant). Hard negatives reveal ranking weaknesses; a mining sketch appears below, after the labeling guide template.
  7. Split carefully
    Make dev/test splits with leakage checks (no near-duplicates across splits; consider entity- or time-based grouping). Keep test set untouched during tuning.
  8. Size guidance
    Aim for 50–200 queries for a first useful set, with at least 1–3 relevant docs per query. Stratify by query type so each slice has coverage.
  9. Quality controls
    Write a labeling guide; include blind duplicates (5–10% of queries) to estimate consistency; resolve disagreements.
Labeling guide template
  • Relevance=2: Direct answer or exact match to the task.
  • Relevance=1: Helpful but incomplete or tangential.
  • Relevance=0: Not helpful.
  • Edge cases: prefer precision when safety matters; prefer recall when discovery matters.
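
Hard negative mining sketch (Python)
For step 6, here is a minimal sketch of mining easy and hard negative candidates against a baseline retriever. The baseline_search callable, the dictionary shapes, and the parameter defaults are illustrative assumptions rather than a fixed API; candidates mined this way still need a human judgment before they are stored as 0-labeled hard negatives.

import random

def mine_negative_candidates(queries, corpus_ids, positives, baseline_search,
                             n_random=10, n_hard=10, depth=20, seed=42):
    # queries:         dict query_id -> query text
    # corpus_ids:      list of all document IDs in the frozen corpus slice
    # positives:       dict query_id -> set of relevant doc IDs
    # baseline_search: callable(query_text, k) -> ranked list of doc IDs (e.g., BM25 or embeddings)
    rng = random.Random(seed)
    candidates = {}
    for qid, text in queries.items():
        pos = positives.get(qid, set())
        # Easy negatives: random documents that are not labeled relevant.
        pool = [d for d in corpus_ids if d not in pos]
        easy = rng.sample(pool, min(n_random, len(pool)))
        # Hard negatives: top-ranked baseline results that are not labeled relevant.
        ranked = baseline_search(text, depth)
        hard = [d for d in ranked if d not in pos][:n_hard]
        candidates[qid] = {"easy": easy, "hard": hard}
    return candidates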

Metrics and protocols

  • Recall@k: Fraction of queries with at least one relevant doc in the top-k (often called hit rate or success@k). Great for QA/assistants.
  • MRR@k: Emphasizes how early the first relevant result appears.
  • nDCG@k: Supports graded labels; rewards placing highly relevant items earlier.
  • Precision@k: Useful when result slots are scarce and must be correct.
Evaluation protocol checklist
  • Freeze corpus and queries; record versions and random seeds.
  • Deduplicate near-identical docs across splits.
  • Use the same tie-breaking and preprocessing across systems.
  • Report stratified metrics by query type (e.g., entity, tail, misspellings).
  • Report confidence intervals via bootstrap resampling over queries (see the sketch below).
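
Bootstrap confidence interval sketch (Python)
A minimal sketch for the last checklist item, assuming you already have one score per query (for example, 0/1 hits for Recall@5 or reciprocal ranks for MRR@10). The resample count and the 95% interval are illustrative defaults.

import random

def bootstrap_ci(per_query_scores, n_resamples=1000, alpha=0.05, seed=0):
    # per_query_scores: one score per query, e.g., [1, 0, 1, ...] for Recall@5 hits
    rng = random.Random(seed)
    n = len(per_query_scores)
    means = []
    for _ in range(n_resamples):
        resample = [per_query_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(per_query_scores) / n, (lower, upper)

# point_estimate, (low, high) = bootstrap_ci(hits)  # hits collected per query on the dev set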

Worked examples

Example 1 — FAQ assistant
  • Corpus: 1,200 FAQ entries with titles and answers.
  • Queries: 80 (40 head, 40 tail). Labels: binary.
  • Negatives: 10 random per query; 10 hard negatives from BM25 top 20 minus positives.
  • Metrics: Recall@5, MRR@10, nDCG@10.
  • Outcome pattern: BM25 strong on head queries; embeddings improve tail and paraphrases; hard negatives reveal term-matching traps.
Example 2 — Product troubleshooting search (graded)
  • Corpus: 5,000 troubleshooting guides, repeated patterns.
  • Queries: 100 categorized by symptom, device, and error code.
  • Labels: 2 (exact fix), 1 (partial), 0 (irrelevant).
  • Metrics: nDCG@10 primary, Recall@5 secondary.
  • Outcome pattern: Reranking boosts nDCG by prioritizing exact fixes above similar-but-wrong models.
Example 3 — Policy and compliance lookup
  • Corpus: 2,800 policy docs; similar sections across versions.
  • Queries: 60 with time-sensitive phrasing.
  • Leakage control: time-based split; no doc from the same policy version appears in both dev and test.
  • Metrics: Recall@10 (must find a valid policy), MRR@10.
  • Outcome pattern: Chunking + metadata filters raise Recall; hard negatives expose outdated policy confusion.

Collect judgments quickly (and reliably)

  • Draft a one-page labeling guide with 5–10 examples.
  • Pilot with 10 queries, calibrate, then scale labeling.
  • Insert 5–10% duplicates to check consistency.
  • Have a tie-breaker reviewer for disagreements.
  • Track per-annotator agreement on the blind duplicates (see the sketch below) and give feedback.
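
Annotator agreement sketch (Python)
A minimal sketch of measuring agreement on the blind-duplicate items with Cohen's kappa; the parallel-list input format is an assumption for illustration, and simple percent agreement also works for a first pass.

def cohen_kappa(labels_a, labels_b):
    # labels_a, labels_b: parallel lists of labels two annotators gave to the same items
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Expected agreement by chance, from each annotator's label distribution.
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Example: kappa = cohen_kappa([2, 1, 0, 2], [2, 1, 1, 2])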

Prevent leakage and bias

  • Split by entity/time to prevent cross-contamination.
  • Remove near-duplicates across splits (minhash or simple hashing of normalized text).
  • Keep test set frozen; don’t tune on it.
  • Balance query types; report stratified results.
Simple split hygiene steps (a code sketch follows the list)
  1. Normalize text (lowercase, strip punctuation).
  2. Compute hashes; flag duplicates.
  3. Group by entity/project/version; split at group level.
  4. Recount distribution of query types per split.
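
Split hygiene sketch (Python)
A minimal sketch of steps 1–3: normalize and hash document text to flag verbatim duplicates, then split queries at the group level. The query_groups mapping (e.g., an entity, project, or policy-version key from your metadata) is an assumption; exact hashing only catches verbatim copies, so use MinHash or shingling for fuzzier near-duplicates.

import hashlib
import random
import re

def normalize(text):
    # Lowercase, strip punctuation, collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def flag_duplicates(docs):
    # docs: dict doc_id -> text. Returns groups of doc IDs whose normalized text is identical.
    by_hash = {}
    for doc_id, text in docs.items():
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        by_hash.setdefault(digest, []).append(doc_id)
    return [ids for ids in by_hash.values() if len(ids) > 1]

def group_split(query_groups, dev_fraction=0.7, seed=42):
    # query_groups: dict query_id -> group key (entity/project/version); the split happens per group.
    groups = sorted(set(query_groups.values()))
    rng = random.Random(seed)
    rng.shuffle(groups)
    dev_groups = set(groups[: int(len(groups) * dev_fraction)])
    return {qid: ("dev" if g in dev_groups else "test") for qid, g in query_groups.items()}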

Exercises you can do now

Everyone can take the test and do exercises for free. If you log in, your progress will be saved automatically.

Exercise 1 — Build a balanced FAQ retrieval eval set (50 queries)

Mirror of the exercise below (ID: ex1). Create a 50-query set from an FAQ corpus with binary labels, easy and hard negatives, and a clean dev/test split; then compute Recall@5 and MRR@10.

  • Deliverables: a CSV of labeled pairs and a metrics report.
  • Tip: keep 25 head and 25 tail queries; add 10–20 hard negatives per query.

Exercise 2 — Mine hard negatives and check split hygiene

Mirror of the exercise below (ID: ex2). Generate hard negatives from a baseline, ensure no leakage across splits, and compare metrics before vs. after adding hard negatives.

  • Deliverables: counts summary and nDCG@10/Recall@5 comparison.
  • Tip: time- or entity-based splitting prevents subtle leakage.
Metric calculation snippets (Python)
from math import log2

def recall_at_k(ranks, k=5):
    # ranks: one list per query with the 1-based positions of its relevant docs ([] if none retrieved)
    hits = sum(1 for r in ranks if any(pos <= k for pos in r))
    return hits / len(ranks)

def mrr_at_k(ranks, k=10):
    total = 0.0
    for r in ranks:
        first = min([pos for pos in r if pos <= k], default=None)
        total += 1.0/first if first else 0.0
    return total / len(ranks)

def ndcg_at_k(gains, k=10):
    # gains: list of per-query lists of graded gains at the top-k positions (e.g., [2, 0, 1, ...])
    # Note: the ideal DCG below uses only the retrieved gains (a common simplification; a full IDCG would use all judged relevant docs).
    def dcg(gs):
        return sum(g / log2(i+2) for i, g in enumerate(gs))
    scores = []
    for g in gains:
        gk = g[:k]
        ideal = sorted(gk, reverse=True)
        scores.append(dcg(gk) / (dcg(ideal) or 1.0))
    return sum(scores)/len(scores)
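
A short usage example with made-up rankings for three queries, using the functions above, to show the expected input shapes:

# Three queries: first relevant doc at rank 2; relevant docs at ranks 1 and 7; nothing relevant retrieved.
ranks = [[2], [1, 7], []]
print(recall_at_k(ranks, k=5))  # 2 of 3 queries have a relevant doc in the top 5 -> ~0.667
print(mrr_at_k(ranks, k=10))    # (1/2 + 1/1 + 0) / 3 -> 0.5

# Graded gains observed at the top-5 positions for the same three queries.
gains = [[0, 2, 0, 1, 0], [2, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
print(ndcg_at_k(gains, k=5))    # averages the per-query nDCG values (the third query scores 0)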

Common mistakes and how to self-check

  • Leakage: Same or near-identical docs across splits. Self-check: hash normalized text and scan overlaps.
  • Unbalanced queries: Only head queries. Self-check: count per query type and rebalance.
  • No hard negatives: Metrics look inflated. Self-check: ensure 10–20 strong confusers per query.
  • Vague labels: Annotators disagree. Self-check: include a labeling guide and blind duplicates.
  • Single metric: Only Recall@k. Self-check: add MRR and nDCG when graded labels exist.

Practical projects

  • Company wiki search: 100-query set with entity, acronym, and paraphrase slices; report stratified metrics.
  • Product catalog: multi-attribute queries (brand, model, feature); graded labels; compare BM25 vs. embeddings + reranker.
  • Policy lookup: time-based splits; evaluate chunking strategies and metadata filters.

Mini challenge

Create 10 queries that stress-test your system (tail terms, synonyms, misspellings, temporal). Label them, compute Recall@5 and MRR@10, and write 3 bullet points on what you’d change to improve performance.

Next steps

  • Automate metric computation and reporting for your dev set.
  • Add reranking and compare deltas on hard negatives.
  • Extend with graded labels and move to nDCG@k as a primary metric.

Ready to check your understanding? Open the Quick Test below. Everyone can take it for free; log in if you want your score saved.

Practice Exercises

2 exercises to complete

Instructions

  1. Pick or mock a small FAQ corpus (e.g., 1,000–2,000 entries). Assign stable document IDs.
  2. Design 50 queries: 25 head (common) and 25 tail (rare/paraphrased). Avoid copying FAQ titles verbatim.
  3. For each query, identify 1–3 relevant FAQs. Use binary labels (1 = relevant, 0 = not).
  4. Add negatives: 10 random negatives and 10 hard negatives per query (hard = from a baseline top-20 but judged 0).
  5. Split: 35 queries for dev, 15 for test. Ensure no near-duplicate documents cross splits (normalize text and hash).
  6. Compute metrics on dev and test: Recall@5, MRR@10 (and nDCG@10 if you choose graded labels).
  7. Document your labeling guide, distribution of query types, and any disagreements resolved.
Python metric helper
# See lesson for functions; example usage:
# ranks: list of lists with 1-based positions of any relevant docs per query
# gains: list of per-query graded gains at top-k positions if using graded labels
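
A minimal sketch of turning the labeled file and a retrieval run into the ranks structure; the file name, the run dictionary, and the reuse of the lesson's recall_at_k/mrr_at_k functions are illustrative assumptions.

import csv

def load_positives(path, split="dev"):
    # Labeled pairs with columns: query_id, query_text, doc_id, relevance, split
    positives = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["split"] == split and int(row["relevance"]) > 0:
                positives.setdefault(row["query_id"], set()).add(row["doc_id"])
    return positives

def to_ranks(positives, run):
    # run: dict query_id -> ranked list of doc IDs returned by your retriever
    ranks = []
    for qid, pos in positives.items():
        retrieved = run.get(qid, [])
        ranks.append([i + 1 for i, doc_id in enumerate(retrieved) if doc_id in pos])
    return ranks

# positives = load_positives("eval_set.csv", split="dev")  # hypothetical file name
# ranks = to_ranks(positives, run)                         # then recall_at_k(ranks, 5), mrr_at_k(ranks, 10)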
Expected Output
A CSV or Parquet with columns: query_id, query_text, doc_id, relevance (0/1), split (dev/test). A short report showing Recall@5 and MRR@10 on dev and test, plus notes on hard negatives and any labeling disagreements.

Building Retrieval Evaluation Sets — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

