
Retrieval Metrics: Recall, MRR, nDCG

Learn retrieval metrics (Recall, MRR, and nDCG) for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

As an NLP Engineer building search, RAG, or Q&A systems, you need to know whether users will actually see relevant results near the top. Recall@k, MRR, and nDCG are the core metrics for ranking quality: you will use them to compare retrievers, pick k for top-k retrieval, and tune embeddings or re-rankers. Typical situations include:

  • Choosing between two embedding models
  • Setting top-k in a RAG pipeline so answers have enough context
  • Measuring the impact of a re-ranking stage
  • Reporting clear improvements to stakeholders

Who this is for

  • NLP Engineers and Data Scientists working on search, RAG, chatbots with knowledge retrieval, or recommendation-like ranking.
  • Engineers integrating vector databases and wanting objective evaluation.

Prerequisites

  • Basic understanding of embeddings and nearest-neighbor retrieval
  • Familiarity with relevance labels: binary (relevant/not) and graded (0,1,2,3)
  • Comfort with simple arithmetic and logarithms

Concept explained simply

Retrieval is about ordering documents so the best ones appear first. Metrics answer: how many relevant items did we find (Recall), how early did we find the first relevant item (MRR), and how well did we order items overall, especially with graded relevance (nDCG).

Mental model

Imagine a to-do list sorted by importance. Recall says how many important tasks you included. MRR rewards you for having the first important task near the top. nDCG rewards you for placing the most important tasks higher than the less important ones, with a diminishing bonus the lower they appear.

Key definitions and quick formulas

  • Recall@k: For a query with R relevant items in total and r relevant items found in the top k, Recall@k = r / R. If R = 0, either skip the query in averages or define the metric as 0; be consistent.
  • MRR (Mean Reciprocal Rank): For each query, find the rank r of its first relevant result; its reciprocal rank is 1/r. If no relevant item is retrieved, use 0. MRR is the mean of reciprocal ranks across queries.
  • nDCG@k (Normalized Discounted Cumulative Gain): DCG@k = sum over positions i = 1..k of rel_i / log2(i + 1). IDCG@k is the DCG@k of the ideal (best possible) ordering, and nDCG@k = DCG@k / IDCG@k. If IDCG@k = 0 (no relevant items), skip the query or set the metric to 0. With binary relevance, rel_i is 0 or 1; with graded relevance, use levels like 0, 1, 2, 3. (A minimal implementation sketch of all three formulas follows.)
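
The sketch below is one minimal Python implementation of these formulas. The function names, and the convention of passing binary relevance as a set of IDs and graded relevance as an ID-to-grade dict, are my own choices rather than a standard library API.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Recall@k = (# relevant items in the top k) / (total # relevant items)."""
    if not relevant:
        return 0.0  # or skip this query upstream; be consistent
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result; 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def dcg_at_k(grades, k):
    """DCG@k = sum of rel_i / log2(i + 1) over positions i = 1..k."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades[:k], start=1))

def ndcg_at_k(retrieved, grade_by_doc, k):
    """nDCG@k = DCG@k / IDCG@k, where IDCG@k comes from the ideal ordering."""
    grades = [grade_by_doc.get(doc, 0) for doc in retrieved]
    idcg = dcg_at_k(sorted(grade_by_doc.values(), reverse=True), k)
    if idcg == 0:
        return 0.0  # no relevant items: skip or count as 0, but document the choice
    return dcg_at_k(grades, k) / idcg
```
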
Edge cases to decide upfront
  • Queries with zero relevant items: skip in averages or set metrics to 0. Document the choice.
  • Ties in ranking: break ties consistently (e.g., by score, then by document ID); a minimal sketch follows this list.
  • k selection: pick a k aligned to your UI (e.g., top-5 cards shown) or downstream needs (e.g., RAG context size).
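
A sketch of the tie-break rule above, sorting by score descending and then by document ID (the scores are made-up placeholders):

```python
scores = {"D2": 0.91, "D7": 0.91, "D4": 0.83}  # hypothetical retrieval scores with a tie
ranked = sorted(scores, key=lambda doc: (-scores[doc], doc))
print(ranked)  # ['D2', 'D7', 'D4']: the D2/D7 score tie is broken alphabetically by ID
```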

Worked examples

Example 1: Recall@5 for one query

Relevant set: {D2, D4, D7}. Retrieved top-5: [D3, D4, D8, D2, D9]. Relevant in top-5: D4 and D2 → r = 2. Total relevant R = 3. Recall@5 = 2/3 ≈ 0.667.

Example 2: MRR across three queries

  • Q1: first relevant at rank 2 → 1/2 = 0.5
  • Q2: first relevant at rank 1 → 1/1 = 1.0
  • Q3: first relevant at rank 7 → 1/7 ≈ 0.143

MRR = mean(0.5, 1.0, 0.143) ≈ 0.548.

Example 3: nDCG@5 with graded relevance

Grades: A(3), B(2), C(1), others 0. Retrieved top-5: [C, D, A, E, B] → grades: [1, 0, 3, 0, 2].

  • DCG@5 = 1/log2(2) + 0/log2(3) + 3/log2(4) + 0/log2(5) + 2/log2(6) ≈ 1 + 0 + 1.5 + 0 + 0.773 = 3.273
  • IDCG@5 (ideal [A, B, C]) = 3/1 + 2/1.585 + 1/2 ≈ 3 + 1.262 + 0.5 = 4.762

nDCG@5 ≈ 3.273 / 4.762 ≈ 0.687.
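
As a self-check, the snippet below reproduces Examples 1–3 with the helper functions sketched under the definitions:

```python
# Example 1: Recall@5
print(recall_at_k(["D3", "D4", "D8", "D2", "D9"], {"D2", "D4", "D7"}, k=5))  # ≈ 0.667

# Example 2: MRR with first relevant items at ranks 2, 1, and 7
print((1 / 2 + 1 / 1 + 1 / 7) / 3)  # ≈ 0.548

# Example 3: nDCG@5 with grades A=3, B=2, C=1
print(ndcg_at_k(["C", "D", "A", "E", "B"], {"A": 3, "B": 2, "C": 1}, k=5))  # ≈ 0.687
```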

Example 4: Aggregating across queries

Compute metrics per query, then average (macro-average). For Recall and nDCG, either skip queries with no relevant items or set them to 0, and apply the same choice everywhere. For MRR, queries where no relevant item is retrieved contribute 0.
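
A small illustration of the two conventions, with made-up per-query scores and None marking a query that had no relevant items:

```python
scores = [0.68, 0.42, None, 0.91]  # hypothetical per-query nDCG values

kept = [s for s in scores if s is not None]
macro_skip = sum(kept) / len(kept)                                            # skip convention: 0.67
macro_zero = sum(s if s is not None else 0.0 for s in scores) / len(scores)   # zero convention: ≈ 0.50

print(macro_skip, macro_zero)  # report which convention you used
```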

Self-check questions for the examples
  • Why does MRR focus only on the first relevant item? Because users often stop after they find the first good result.
  • When might Recall@k be high but MRR low? When relevant items exist but are mostly late in the ranking.
  • Why prefer nDCG over recall when relevance is graded? nDCG rewards ordering by importance, not just presence.

How to compute in practice

  1. Define relevance labels (binary or graded) and the evaluation set of queries.
  2. Choose k to match product needs (e.g., top-5 shown, top-20 for RAG retrieval pool).
  3. For each query, record the ranked list and corresponding relevance labels.
  4. Compute per-query metrics: Recall@k, RR (for MRR), DCG@k and nDCG@k.
  5. Aggregate across queries (macro average). Report the mean, and also the median or the full distribution if possible. A minimal end-to-end sketch follows this list.
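
Putting steps 1–5 together, a minimal evaluation loop could look like the sketch below. It reuses the metric functions sketched earlier in this lesson, and the two queries are made-up placeholders:

```python
# Each query: a ranked list of doc IDs plus graded labels (anything missing counts as 0)
queries = [
    {"ranked": ["D3", "D4", "D8"], "grades": {"D2": 1, "D4": 2}},
    {"ranked": ["D5", "D1", "D9"], "grades": {"D1": 3}},
]
k = 3

per_query = []
for q in queries:
    relevant = {doc for doc, grade in q["grades"].items() if grade > 0}
    per_query.append({
        "recall": recall_at_k(q["ranked"], relevant, k),
        "rr": reciprocal_rank(q["ranked"], relevant),
        "ndcg": ndcg_at_k(q["ranked"], q["grades"], k),
    })

# Macro-average (queries with no relevant items would contribute 0 here)
for metric in ("recall", "rr", "ndcg"):
    print(metric, sum(pq[metric] for pq in per_query) / len(per_query))
```
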
Implementation tips
  • Use stable sorting and fixed tie-break rules.
  • For nDCG, precompute IDCG per query for speed.
  • Log queries with zero relevant items separately; consider removing them from averages or reporting both variants.

Common mistakes and how to self-check

  • Mixing up precision and recall: Recall counts how many of the relevant items you found, not how many of the returned items were relevant.
  • Including queries with zero relevant items in averages without noting it. Self-check: recompute the metrics after excluding them and compare.
  • Using MRR with graded labels incorrectly. Reminder: MRR ignores grades; it only uses the rank of the first relevant item.
  • Computing nDCG without normalization. Self-check: make sure scoring the ideal ordering gives nDCG = 1.0 (see the quick check after this list).
  • Comparing systems at different k. Self-check: keep k fixed across systems.
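
A quick way to run that normalization self-check, assuming the ndcg_at_k sketch from earlier:

```python
# Feeding the ideal ordering back in must give exactly 1.0
grades = {"A": 3, "B": 2, "C": 1}
ideal_order = sorted(grades, key=grades.get, reverse=True)  # ["A", "B", "C"]
assert abs(ndcg_at_k(ideal_order, grades, k=3) - 1.0) < 1e-9
```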

Exercises

Try this dataset and calculate by hand or in a notebook.

Exercise 1 (mirrors the interactive exercise below)
- Q1 relevant: {A:3, C:2, F:1}. Retrieved@3: [B, C, A].
- Q2 relevant: {K:2}. Retrieved@3: [L, M, N].
Compute: Recall@3 (macro), MRR, nDCG@3 (macro).
  • [ ] I computed per-query metrics before averaging.
  • [ ] I handled the query with no retrieved relevant item as 0 in MRR.
  • [ ] I computed IDCG per query correctly for nDCG.
  • [ ] I reported values to 3 decimal places.

Need a nudge?

For Q1: first relevant rank is 2. For nDCG, use logs base 2: log2(2)=1, log2(3)≈1.585, log2(4)=2.

Mini challenge

You have two retrievers. At k=5, System A has Recall@5=0.70, MRR=0.40, nDCG@5=0.62. System B has Recall@5=0.65, MRR=0.52, nDCG@5=0.60. Which would you pick for a chatbot that must show a good citation at the top? Explain briefly using the metrics.

Example reasoning

Favor B because higher MRR means users see a relevant citation earlier, which matters for trust in a single top answer, even if overall recall is slightly lower.

Practical projects

  • Evaluate your current RAG retriever at k ∈ {3,5,10}. Report Recall@k, MRR, nDCG@k. Pick k based on the best trade-off.
  • Add a lightweight re-ranker. Measure changes in MRR and nDCG while holding k constant.
  • Create a small labeled set (20–50 queries with graded relevance). Compare two embedding models and summarize the metrics with bootstrap confidence intervals (a minimal sketch follows).
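
A minimal percentile-bootstrap sketch for that last project (the function name and the per-query scores are hypothetical):

```python
import random

def bootstrap_ci(per_query_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a macro-averaged metric."""
    rng = random.Random(seed)
    n = len(per_query_scores)
    means = sorted(
        sum(rng.choices(per_query_scores, k=n)) / n for _ in range(n_resamples)
    )
    return means[int(alpha / 2 * n_resamples)], means[int((1 - alpha / 2) * n_resamples) - 1]

# Made-up per-query nDCG scores for one retriever
print(bootstrap_ci([0.62, 0.71, 0.55, 0.80, 0.66]))
```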

Learning path

  • Before this: Embedding models and similarity search basics
  • This lesson: Retrieval metrics (Recall, MRR, nDCG)
  • Next: Re-ranking techniques and hybrid search evaluation

Next steps

  • Run the exercises on your own data.
  • Document your evaluation protocol: labels, k, edge-case handling.
  • Take the quick test to confirm understanding.

Practice Exercises

1 exercise to complete

Instructions

You have two queries with the following relevance and retrieved results at k=3.

  • Q1 relevant (graded): {A:3, C:2, F:1}. Retrieved@3: [B, C, A].
  • Q2 relevant (graded): {K:2}. Retrieved@3: [L, M, N].

Tasks:

  • Compute per-query: Recall@3, Reciprocal Rank, nDCG@3.
  • Then compute macro averages across Q1 and Q2.

Use log base 2: log2(2)=1, log2(3)≈1.585, log2(4)=2.

Expected Output
Macro Recall@3 ≈ 0.333; MRR ≈ 0.250; Macro nDCG@3 ≈ 0.290
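
To check your hand calculation, the snippet below reproduces the expected output with the helper functions sketched earlier in the lesson:

```python
q1_grades, q1_ranked = {"A": 3, "C": 2, "F": 1}, ["B", "C", "A"]
q2_grades, q2_ranked = {"K": 2}, ["L", "M", "N"]

recall = (recall_at_k(q1_ranked, set(q1_grades), 3) + recall_at_k(q2_ranked, set(q2_grades), 3)) / 2
mrr = (reciprocal_rank(q1_ranked, set(q1_grades)) + reciprocal_rank(q2_ranked, set(q2_grades))) / 2
ndcg = (ndcg_at_k(q1_ranked, q1_grades, 3) + ndcg_at_k(q2_ranked, q2_grades, 3)) / 2
print(round(recall, 3), round(mrr, 3), round(ndcg, 3))  # 0.333 0.25 0.29
```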
