
Retrieval Metrics: Recall, MRR, nDCG

Learn retrieval metrics (Recall, MRR, and nDCG) for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

As an NLP Engineer building search, RAG, or Q&A systems, you need to know whether users will actually see relevant results near the top. Recall@k, MRR, and nDCG are the core metrics for ranking quality: you will use them to compare retrievers, pick k for top-k retrieval, and tune embeddings or re-rankers. Typical situations include:

  • Choosing between two embedding models
  • Setting top-k in a RAG pipeline so answers have enough context
  • Measuring the impact of a re-ranking stage
  • Reporting clear improvements to stakeholders

Who this is for

  • NLP Engineers and Data Scientists working on search, RAG, chatbots with knowledge retrieval, or recommendation-like ranking.
  • Engineers integrating vector databases and wanting objective evaluation.

Prerequisites

  • Basic understanding of embeddings and nearest-neighbor retrieval
  • Familiarity with relevance labels: binary (relevant/not) and graded (0,1,2,3)
  • Comfort with simple arithmetic and logarithms

Concept explained simply

Retrieval is about ordering documents so the best ones appear first. Metrics answer: how many relevant items did we find (Recall), how early did we find the first relevant item (MRR), and how well did we order items overall, especially with graded relevance (nDCG).

Mental model

Imagine a to-do list sorted by importance. Recall says how many important tasks you included. MRR rewards you for having the first important task near the top. nDCG rewards you for placing the most important tasks higher than the less important ones, with a diminishing bonus the lower they appear.

Key definitions and quick formulas

  • Recall@k: For a query with R relevant items in total and r relevant items found in the top k, Recall@k = r / R. If R = 0, either skip the query in averages or define the metric as 0; be consistent.
  • MRR (Mean Reciprocal Rank): For each query, find the rank r of its first relevant result; its reciprocal rank is 1/r. If no relevant item is retrieved, use 0. MRR is the mean of reciprocal ranks across queries.
  • nDCG@k (Normalized Discounted Cumulative Gain): DCG@k = sum over positions i = 1..k of rel_i / log2(i + 1). IDCG@k is the DCG@k of the ideal (best possible) ordering, and nDCG@k = DCG@k / IDCG@k. If IDCG@k = 0 (no relevant items), skip the query or set the metric to 0. With binary relevance, rel_i is 0 or 1; with graded relevance, use levels like 0, 1, 2, 3. (A minimal implementation sketch of all three formulas follows.)
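
The sketch below is one minimal Python implementation of these formulas. The function names, and the convention of passing binary relevance as a set of IDs and graded relevance as an ID-to-grade dict, are my own choices rather than a standard library API.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Recall@k = (# relevant items in the top k) / (total # relevant items)."""
    if not relevant:
        return 0.0  # or skip this query upstream; be consistent
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result; 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def dcg_at_k(grades, k):
    """DCG@k = sum of rel_i / log2(i + 1) over positions i = 1..k."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades[:k], start=1))

def ndcg_at_k(retrieved, grade_by_doc, k):
    """nDCG@k = DCG@k / IDCG@k, where IDCG@k comes from the ideal ordering."""
    grades = [grade_by_doc.get(doc, 0) for doc in retrieved]
    idcg = dcg_at_k(sorted(grade_by_doc.values(), reverse=True), k)
    if idcg == 0:
        return 0.0  # no relevant items: skip or count as 0, but document the choice
    return dcg_at_k(grades, k) / idcg
```
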
Edge cases to decide upfront
  • Queries with zero relevant items: skip in averages or set metrics to 0. Document the choice.
  • Ties in ranking: break ties consistently (e.g., by score, then by document ID); a minimal sketch follows this list.
  • k selection: pick a k aligned to your UI (e.g., top-5 cards shown) or downstream needs (e.g., RAG context size).
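
A sketch of the tie-break rule above, sorting by score descending and then by document ID (the scores are made-up placeholders):

```python
scores = {"D2": 0.91, "D7": 0.91, "D4": 0.83}  # hypothetical retrieval scores with a tie
ranked = sorted(scores, key=lambda doc: (-scores[doc], doc))
print(ranked)  # ['D2', 'D7', 'D4']: the D2/D7 score tie is broken alphabetically by ID
```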

Worked examples

Example 1: Recall@5 for one query

Relevant set: {D2, D4, D7}. Retrieved top-5: [D3, D4, D8, D2, D9]. Relevant in top-5: D4 and D2 → r = 2. Total relevant R = 3. Recall@5 = 2/3 ≈ 0.667.

Example 2: MRR across three queries

  • Q1: first relevant at rank 2 → 1/2 = 0.5
  • Q2: first relevant at rank 1 → 1/1 = 1.0
  • Q3: first relevant at rank 7 → 1/7 ≈ 0.143

MRR = mean(0.5, 1.0, 0.143) ≈ 0.548.

Example 3: nDCG@5 with graded relevance

Grades: A(3), B(2), C(1), others 0. Retrieved top-5: [C, D, A, E, B] → grades: [1, 0, 3, 0, 2].

  • DCG@5 = 1/log2(2) + 0/log2(3) + 3/log2(4) + 0/log2(5) + 2/log2(6) ≈ 1 + 0 + 1.5 + 0 + 0.773 = 3.273
  • IDCG@5 (ideal [A, B, C]) = 3/1 + 2/1.585 + 1/2 ≈ 3 + 1.262 + 0.5 = 4.762

nDCG@5 ≈ 3.273 / 4.762 ≈ 0.687.
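
As a self-check, the snippet below reproduces Examples 1–3 with the helper functions sketched under the definitions:

```python
# Example 1: Recall@5
print(recall_at_k(["D3", "D4", "D8", "D2", "D9"], {"D2", "D4", "D7"}, k=5))  # ≈ 0.667

# Example 2: MRR with first relevant items at ranks 2, 1, and 7
print((1 / 2 + 1 / 1 + 1 / 7) / 3)  # ≈ 0.548

# Example 3: nDCG@5 with grades A=3, B=2, C=1
print(ndcg_at_k(["C", "D", "A", "E", "B"], {"A": 3, "B": 2, "C": 1}, k=5))  # ≈ 0.687
```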

Example 4: Aggregating across queries

Compute metrics per query, then average (macro-average). For Recall and nDCG, either skip queries with no relevant items or set them to 0, and apply the same choice everywhere. For MRR, queries where no relevant item is retrieved contribute 0.
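
A small illustration of the two conventions, with made-up per-query scores and None marking a query that had no relevant items:

```python
scores = [0.68, 0.42, None, 0.91]  # hypothetical per-query nDCG values

kept = [s for s in scores if s is not None]
macro_skip = sum(kept) / len(kept)                                            # skip convention: 0.67
macro_zero = sum(s if s is not None else 0.0 for s in scores) / len(scores)   # zero convention: ≈ 0.50

print(macro_skip, macro_zero)  # report which convention you used
```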

Self-check questions for the examples
  • Why does MRR focus only on the first relevant item? Because users often stop after they find the first good result.
  • When might Recall@k be high but MRR low? When relevant items exist but are mostly late in the ranking.
  • Why prefer nDCG over recall when relevance is graded? nDCG rewards ordering by importance, not just presence.

How to compute in practice

  1. Define relevance labels (binary or graded) and the evaluation set of queries.
  2. Choose k to match product needs (e.g., top-5 shown, top-20 for RAG retrieval pool).
  3. For each query, record the ranked list and corresponding relevance labels.
  4. Compute per-query metrics: Recall@k, RR (for MRR), DCG@k and nDCG@k.
  5. Aggregate across queries (macro average). Report the mean, and also the median or the full distribution if possible. A minimal end-to-end sketch follows this list.
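
Putting steps 1–5 together, a minimal evaluation loop could look like the sketch below. It reuses the metric functions sketched earlier in this lesson, and the two queries are made-up placeholders:

```python
# Each query: a ranked list of doc IDs plus graded labels (anything missing counts as 0)
queries = [
    {"ranked": ["D3", "D4", "D8"], "grades": {"D2": 1, "D4": 2}},
    {"ranked": ["D5", "D1", "D9"], "grades": {"D1": 3}},
]
k = 3

per_query = []
for q in queries:
    relevant = {doc for doc, grade in q["grades"].items() if grade > 0}
    per_query.append({
        "recall": recall_at_k(q["ranked"], relevant, k),
        "rr": reciprocal_rank(q["ranked"], relevant),
        "ndcg": ndcg_at_k(q["ranked"], q["grades"], k),
    })

# Macro-average (queries with no relevant items would contribute 0 here)
for metric in ("recall", "rr", "ndcg"):
    print(metric, sum(pq[metric] for pq in per_query) / len(per_query))
```
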
Implementation tips
  • Use stable sorting and fixed tie-break rules.
  • For nDCG, precompute IDCG per query for speed.
  • Log queries with zero relevant items separately; consider removing them from averages or reporting both variants.

Common mistakes and how to self-check

  • Mixing up precision and recall: Recall counts how many of the relevant items you found, not how many of the returned items were relevant.
  • Including queries with zero relevant items in averages without noting it. Self-check: recompute the metrics after excluding them and compare.
  • Using MRR with graded labels incorrectly. Reminder: MRR ignores grades; it only uses the rank of the first relevant item.
  • Computing nDCG without normalization. Self-check: make sure scoring the ideal ordering gives nDCG = 1.0 (see the quick check after this list).
  • Comparing systems at different k. Self-check: keep k fixed across systems.
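
A quick way to run that normalization self-check, assuming the ndcg_at_k sketch from earlier:

```python
# Feeding the ideal ordering back in must give exactly 1.0
grades = {"A": 3, "B": 2, "C": 1}
ideal_order = sorted(grades, key=grades.get, reverse=True)  # ["A", "B", "C"]
assert abs(ndcg_at_k(ideal_order, grades, k=3) - 1.0) < 1e-9
```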

Exercises

Try this dataset and calculate by hand or in a notebook.

Exercise 1 (mirrors the interactive exercise below)
- Q1 relevant: {A:3, C:2, F:1}. Retrieved@3: [B, C, A].
- Q2 relevant: {K:2}. Retrieved@3: [L, M, N].
Compute: Recall@3 (macro), MRR, nDCG@3 (macro).
  • [ ] I computed per-query metrics before averaging.
  • [ ] I handled the query with no retrieved relevant item as 0 in MRR.
  • [ ] I computed IDCG per query correctly for nDCG.
  • [ ] I reported values to 3 decimal places.

Need a nudge?

For Q1: first relevant rank is 2. For nDCG, use logs base 2: log2(2)=1, log2(3)≈1.585, log2(4)=2.

Mini challenge

You have two retrievers. At k=5, System A has Recall@5=0.70, MRR=0.40, nDCG@5=0.62. System B has Recall@5=0.65, MRR=0.52, nDCG@5=0.60. Which would you pick for a chatbot that must show a good citation at the top? Explain briefly using the metrics.

Example reasoning

Favor B because higher MRR means users see a relevant citation earlier, which matters for trust in a single top answer, even if overall recall is slightly lower.

Practical projects

  • Evaluate your current RAG retriever at k ∈ {3,5,10}. Report Recall@k, MRR, nDCG@k. Pick k based on the best trade-off.
  • Add a lightweight re-ranker. Measure changes in MRR and nDCG while holding k constant.
  • Create a small labeled set (20–50 queries with graded relevance). Compare two embedding models and summarize the metrics with bootstrap confidence intervals (a minimal sketch follows).
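
A minimal percentile-bootstrap sketch for that last project (the function name and the per-query scores are hypothetical):

```python
import random

def bootstrap_ci(per_query_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a macro-averaged metric."""
    rng = random.Random(seed)
    n = len(per_query_scores)
    means = sorted(
        sum(rng.choices(per_query_scores, k=n)) / n for _ in range(n_resamples)
    )
    return means[int(alpha / 2 * n_resamples)], means[int((1 - alpha / 2) * n_resamples) - 1]

# Made-up per-query nDCG scores for one retriever
print(bootstrap_ci([0.62, 0.71, 0.55, 0.80, 0.66]))
```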

Learning path

  • Before this: Embedding models and similarity search basics
  • This lesson: Retrieval metrics (Recall, MRR, nDCG)
  • Next: Re-ranking techniques and hybrid search evaluation

Next steps

  • Run the exercises on your own data.
  • Document your evaluation protocol: labels, k, edge-case handling.
  • Take the quick test to confirm understanding.

Practice Exercises

1 exercise to complete

Instructions

You have two queries with the following relevance and retrieved results at k=3.

  • Q1 relevant (graded): {A:3, C:2, F:1}. Retrieved@3: [B, C, A].
  • Q2 relevant (graded): {K:2}. Retrieved@3: [L, M, N].

Tasks:

  • Compute per-query: Recall@3, Reciprocal Rank, nDCG@3.
  • Then compute macro averages across Q1 and Q2.

Use log base 2: log2(2)=1, log2(3)≈1.585, log2(4)=2.

Expected Output
Macro Recall@3 ≈ 0.333; MRR ≈ 0.250; Macro nDCG@3 ≈ 0.290
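
To check your hand calculation, the snippet below reproduces the expected output with the helper functions sketched earlier in the lesson:

```python
q1_grades, q1_ranked = {"A": 3, "C": 2, "F": 1}, ["B", "C", "A"]
q2_grades, q2_ranked = {"K": 2}, ["L", "M", "N"]

recall = (recall_at_k(q1_ranked, set(q1_grades), 3) + recall_at_k(q2_ranked, set(q2_grades), 3)) / 2
mrr = (reciprocal_rank(q1_ranked, set(q1_grades)) + reciprocal_rank(q2_ranked, set(q2_grades))) / 2
ndcg = (ndcg_at_k(q1_ranked, q1_grades, 3) + ndcg_at_k(q2_ranked, q2_grades, 3)) / 2
print(round(recall, 3), round(mrr, 3), round(ndcg, 3))  # 0.333 0.25 0.29
```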
