
Offline Evaluation For RAG

Learn Offline Evaluation For RAG for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

As an NLP Engineer, you will ship Retrieval-Augmented Generation (RAG) systems that answer questions from internal docs, product catalogs, or knowledge bases. Offline evaluation lets you compare variants safely before exposing users. It helps you answer: Is my retriever finding the right passages? Are answers grounded in the retrieved context? Did my change improve quality without breaking something else?

  • Real tasks: Select top retriever for a domain, tune chunking and reranking, enforce citation correctness, and set quality gates for releases.
  • Outcomes: Repeatable metrics, faster iteration cycles, fewer regressions, and clearer communication with stakeholders.

Concept explained simply

RAG is a two-stage pipeline: retrieve relevant passages, then generate an answer using those passages. Offline evaluation is like a lab test: feed a fixed set of questions with known references, run your system, and score the outputs against a rubric.

Mental model

  • Inputs: A set of queries (Q), the source corpus, and ground-truth signals (gold passages and/or gold answers).
  • Process: Run your RAG variant on Q, capture retrieved passages and generated answers, then compute metrics.
  • Outputs: Summary metrics (retrieval, grounding, answer quality), error slices, and recommendations.
Key terms
  • Gold passage: A document chunk that truly contains the answer.
  • Gold answer: A vetted reference answer (may be short or long).
  • Groundedness: The degree to which the answer’s claims are supported by retrieved text.
  • Hallucination: A claim not supported by the provided context.

What to measure in RAG offline evaluation

Retrieval metrics

  • Recall@k: Proportion of queries for which at least one gold passage appears in the top k.
  • Precision@k: Fraction of the top k that are relevant.
  • MRR (Mean Reciprocal Rank): Average of 1/rank of the first relevant result.
  • nDCG@k: Ranking quality that rewards relevant items near the top.
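To make these definitions concrete, here is a minimal Python sketch of the four metrics. The function names are illustrative, and nDCG uses the common binary-gain, log2-discount form; adapt the signatures to however your pipeline stores retrieved IDs and gold labels.

```python
import math

def recall_at_k(retrieved, gold, k):
    """Per-query hit: 1.0 if any gold passage is in the top k, else 0.0.
    Averaging this over queries gives Recall@k as defined above."""
    return 1.0 if any(doc in gold for doc in retrieved[:k]) else 0.0

def precision_at_k(retrieved, gold, k):
    """Fraction of the top k that are gold passages."""
    return sum(doc in gold for doc in retrieved[:k]) / k

def reciprocal_rank(retrieved, gold):
    """1/rank of the first gold passage; 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, gold, k):
    """Binary-gain nDCG: DCG over the top k with a log2 discount,
    normalized by the ideal DCG for min(len(gold), k) relevant items."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in gold)
    idcg = sum(1.0 / math.log2(rank + 1)
               for rank in range(1, min(len(gold), k) + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one query, gold passage "d2" retrieved at rank 2.
print(recall_at_k(["d9", "d2", "d5"], {"d2"}, 3))          # 1.0
print(reciprocal_rank(["d9", "d2", "d5"], {"d2"}))         # 0.5
print(round(ndcg_at_k(["d9", "d2", "d5"], {"d2"}, 3), 3))  # 0.631
```

Corpus-level numbers are the means of these per-query values; that averaging convention is what the worked examples later on this page use.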

Context quality checks

  • Diversity: Are retrieved passages unique and complementary (not duplicates)?
  • Redundancy: Excessive overlap lowers utility.
  • Coverage: Do retrieved passages collectively contain the answer spans?
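If you want a cheap automated proxy for these checks, a token-overlap heuristic is often enough to flag problems. The sketch below is illustrative only: it assumes you have the retrieved passage texts and, for coverage, gold answer spans, and any threshold you act on is your call.

```python
def _tokens(text):
    return set(text.lower().split())

def redundancy(passages):
    """Mean pairwise Jaccard overlap between retrieved passages:
    0 = all distinct, 1 = identical. High values waste context budget."""
    pairs = [(a, b) for i, a in enumerate(passages) for b in passages[i + 1:]]
    if not pairs:
        return 0.0
    return sum(len(_tokens(a) & _tokens(b)) / max(1, len(_tokens(a) | _tokens(b)))
               for a, b in pairs) / len(pairs)

def coverage(passages, answer_spans):
    """Fraction of gold answer spans found verbatim in any retrieved passage:
    a crude upper bound on whether the context can support the answer."""
    joined = " ".join(p.lower() for p in passages)
    return (sum(span.lower() in joined for span in answer_spans) / len(answer_spans)
            if answer_spans else 0.0)
```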

Answer-level metrics

  • Groundedness/Faithfulness: Claims are supported by citations from retrieved text.
  • Citation correctness: Citations point to passages that contain the cited facts.
  • Exactness/Accuracy: Matches a gold answer (for factoid Q&A) or meets rubric criteria (for long-form).
  • Helpfulness: Clear, concise, and complete for the user’s intent.
Simple scoring rubrics
  • Groundedness (0–2): 0=Not supported; 1=Partially supported; 2=Fully supported.
  • Helpfulness (1–5): Is the answer correct, complete, and concise?
  • Citation correctness: For each claim with a citation, does cited text contain that claim? (Yes/No)
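Once answers are labeled with these rubrics, turning the labels into report numbers takes only a few lines; the record fields below are hypothetical but mirror the three rubrics above.

```python
# Each judged record mirrors the rubrics above: groundedness 0-2,
# helpfulness 1-5, and a yes/no flag per cited claim.
judged = [
    {"groundedness": 2, "helpfulness": 5, "citations_ok": [True, True]},
    {"groundedness": 1, "helpfulness": 4, "citations_ok": [True, False]},
    {"groundedness": 2, "helpfulness": 4, "citations_ok": [True]},
]

mean_groundedness = sum(r["groundedness"] for r in judged) / len(judged)
mean_helpfulness = sum(r["helpfulness"] for r in judged) / len(judged)
all_citations = [ok for r in judged for ok in r["citations_ok"]]
citation_correctness = sum(all_citations) / len(all_citations)

print(f"Groundedness {mean_groundedness:.2f}/2, "
      f"Helpfulness {mean_helpfulness:.2f}/5, "
      f"Citations {citation_correctness:.0%}")  # 1.67/2, 4.33/5, 80%
```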

Designing a reliable evaluation set

  • Balance: Include easy, medium, hard queries; include edge cases and ambiguities.
  • Coverage: Reflect real user intents and important business areas.
  • Golds: Create gold passages and/or gold answers via subject-matter experts or careful reviewer agreement.
  • Synthetic augmentation: Generate queries from documents, then human-review a sample to calibrate.
  • Negative controls: Add queries not answerable from your corpus to detect hallucination behavior.
  • Agreement: Use clear rubrics and measure inter-annotator agreement on a subset.
LLM-as-a-judge tips
  • Give the judge the query, candidate answer, and retrieved passages.
  • Ask for a structured verdict (e.g., groundedness 0–2) and short justification with quoted spans.
  • Calibrate with a few gold-labeled examples and check consistency on a holdout.
  • Avoid revealing system prompts or irrelevant info; keep inputs minimal and comparable.
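Here is a sketch of a groundedness judge that follows these tips. The JSON field names match the rubric above; call_llm is a placeholder for whatever model client you use, and in practice you would validate the parsed fields and retry on malformed output.

```python
import json

JUDGE_TEMPLATE = """You are grading whether an answer is supported by the given passages.

Question:
{question}

Retrieved passages:
{passages}

Candidate answer:
{answer}

Return JSON only, with these fields:
- "groundedness": 0 (not supported), 1 (partially supported), or 2 (fully supported)
- "evidence_spans": short quotes from the passages that support your verdict
- "justification": one sentence
"""

def judge_groundedness(question, passages, answer, call_llm):
    """call_llm is a placeholder: any function that takes a prompt string
    and returns the model's text response (wire it to your provider)."""
    prompt = JUDGE_TEMPLATE.format(
        question=question,
        passages="\n\n".join(passages),
        answer=answer,
    )
    raw = call_llm(prompt)
    verdict = json.loads(raw)  # in practice, validate fields and retry on parse errors
    return verdict
```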

Offline evaluation workflow (step-by-step)

  1. Freeze components: Fix retriever, reranker, chunking, and generator for each variant.
  2. Assemble evaluation set: 100–300 queries usually give stable trends; include adversarial cases.
  3. Run retrieval: Store top-k doc IDs and scores per query.
  4. Run generation: Produce answers with citations; log prompts/configs.
  5. Score retrieval: Compute Recall@k, MRR, nDCG@k, and coverage.
  6. Score answers: Use rubric and/or LLM-as-judge for groundedness, helpfulness, citation correctness.
  7. Error analysis: Slice by topic, difficulty, query length, and failure modes.
  8. Decide gates: Define thresholds (e.g., Recall@5 ≥ 0.85; Groundedness ≥ 1.6/2).
  9. Compare variants: Report deltas with confidence intervals or bootstrap estimates.
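For step 9, a paired bootstrap is usually enough. Below is a minimal sketch, assuming you already have one score per query for each variant (for example, Recall@5 as 0/1, or groundedness on the 0-2 scale).

```python
import random

def bootstrap_delta_ci(per_query_a, per_query_b, n_boot=10_000, alpha=0.05, seed=0):
    """Paired bootstrap over queries: resample query indices with replacement
    and recompute mean(B) - mean(A) each time. Returns (observed_delta, lo, hi)."""
    assert len(per_query_a) == len(per_query_b)
    rng = random.Random(seed)
    n = len(per_query_a)
    observed = sum(per_query_b) / n - sum(per_query_a) / n
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(per_query_b[i] for i in idx) / n -
                      sum(per_query_a[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return observed, lo, hi

# If the interval excludes 0, the delta is unlikely to be resampling noise.
```
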
Reporting template
  • Setup: Variant name, components, k, prompt version.
  • Dataset: Query count, domains, gold creation notes.
  • Metrics: Retrieval (Recall@k, MRR, nDCG@k), Answer (Groundedness, Helpfulness, Citations).
  • Findings: Top regressions, representative failures with IDs and snippets.
  • Next actions: Hypotheses, changes to try, new edge cases to add.
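If you report often, it can pay to render this template mechanically from your run artifacts. A minimal sketch, with every field name invented for illustration:

```python
def render_report(report):
    """Render the reporting template above as plain text from a results dict.
    Every key used here is an illustrative name, not a required schema."""
    lines = [
        f"Variant: {report['variant']} (k={report['k']}, prompt {report['prompt_version']})",
        f"Dataset: {report['n_queries']} queries",
    ]
    lines += [f"  {name}: {value:.3f}" for name, value in report["metrics"].items()]
    lines.append("Top regressions: " + "; ".join(report["regressions"]))
    lines.append("Next actions: " + "; ".join(report["next_actions"]))
    return "\n".join(lines)
```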

Worked examples

Example 1: Internal knowledge base QA

Gold passages per query: Q1={A2,A7}, Q2={B4}, Q3={C3,C6}. Retrieved top-3:

  • Q1: [A5, A2, A9] → first relevant at rank 2
  • Q2: [B4, B1, B7] → first relevant at rank 1
  • Q3: [C8, C5, C3] → first relevant at rank 3
  • Recall@3: 3/3 = 1.00
  • MRR: (1/2 + 1/1 + 1/3)/3 ≈ 0.611
  • nDCG@3 (binary gains): Q1≈0.387, Q2=1.000, Q3≈0.307 → avg ≈ 0.564

Interpretation: Retrieval brings at least one relevant passage for all queries, but rank quality can improve (MRR ~0.61).
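You can reproduce these numbers in a few lines. The standalone snippet below hard-codes Example 1's rankings and gold sets and uses the same binary-gain nDCG convention as the sketch in the Retrieval metrics section.

```python
import math

data = {  # query -> (retrieved top-3, gold passages)
    "Q1": (["A5", "A2", "A9"], {"A2", "A7"}),
    "Q2": (["B4", "B1", "B7"], {"B4"}),
    "Q3": (["C8", "C5", "C3"], {"C3", "C6"}),
}

recalls, rrs, ndcgs = [], [], []
for retrieved, gold in data.values():
    recalls.append(1.0 if any(d in gold for d in retrieved) else 0.0)
    ranks = [i for i, d in enumerate(retrieved, start=1) if d in gold]
    rrs.append(1.0 / ranks[0] if ranks else 0.0)
    dcg = sum(1.0 / math.log2(r + 1) for r in ranks)
    idcg = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(gold), 3) + 1))
    ndcgs.append(dcg / idcg)

print(sum(recalls) / 3, round(sum(rrs) / 3, 3), round(sum(ndcgs) / 3, 3))
# -> 1.0 0.611 0.564
```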

Example 2: Product Q&A with citations

  • Precision@3: 0.67 (2 of top 3 are relevant on average)
  • Groundedness (0–2): mean 1.7
  • Citation correctness: 92% of cited claims are supported in the cited passages

Interpretation: Grounding is good, but the remaining 8% of incorrect citations are high-risk. Prioritize fixing citation mapping or reranking.

Example 3: Policy chatbot (long-form)

  • Helpfulness (1–5): mean 4.1
  • Hallucination rate on unanswerable controls: 28%
  • Latency and token budget: similar across variants (useful secondary checks offline)

Interpretation: Answers are helpful but hallucinate on controls. Add explicit refusal behavior and strengthen retrieval fallback before deployment.

Exercises

Do these hands-on tasks to cement the workflow. A short checklist follows.

Exercise 1 — Compute retrieval and grounding metrics

Given 3 queries, gold passages, and retrieved results, compute Recall@3, MRR, and nDCG@3. Then compute hallucination rate from 3 judged answers.

Data
  • Gold: Q1={A2,A7}, Q2={B4}, Q3={C3,C6}
  • Retrieved top-3: Q1=[A5,A2,A9], Q2=[B4,B1,B7], Q3=[C8,C5,C3]
  • Answer grounding labels (Supported/Unsupported): [Supported, Supported, Unsupported]

Record: Recall@3, MRR, nDCG@3 (binary gains), Hallucination rate.

Exercise 2 — Draft an LLM-judge rubric and prompt

Create a concise rubric for Groundedness (0–2) and Citation correctness (Yes/No). Write a judge prompt that returns JSON with fields: groundedness, citation_correct, evidence_spans, and justification (short).

Tips
  • Ask the judge to quote minimal evidence spans.
  • Keep outputs structured and deterministic.

Exercise checklist

  • Computed retrieval metrics correctly
  • Computed hallucination rate
  • Wrote a clear, compact judge rubric
  • Produced a structured judge prompt

Common mistakes and self-check

  • Only measuring answer quality: Always measure retrieval too; poor retrieval caps answer quality.
  • Unclear rubrics: Leads to noisy labels. Write short scales with examples.
  • Changing multiple components at once: Hard to attribute improvements. Change one thing per variant.
  • No negative controls: You might miss hallucination behavior. Include unanswerable queries.
  • Tiny eval sets: Results swing wildly. Aim for 100–300 queries for stable trends.
Self-check
  • Can you explain each metric choice and threshold?
  • Can two reviewers replicate your labels with similar results?
  • Do failure examples map to concrete next steps?

Practical projects

  • Build a 150-query eval set for your domain with at least 20 adversarial cases.
  • Compare two retrievers and a reranker (BM25 vs dense vs dense+reranker) using Recall@5 and nDCG@10 (see the harness sketch after this list).
  • Add a judge for groundedness and citation correctness; target ≥95% citation correctness.
  • Create an error dashboard with slices by topic, query length, and difficulty.
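For the retriever-comparison project, a minimal harness might look like the sketch below. It assumes the rank_bm25 and sentence-transformers packages; the model name, example corpus, and query are placeholders, and the reranking stage is left out.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# Tiny illustrative corpus; in practice these are your chunked passages.
corpus_texts = [
    "To reset your password, open Settings and choose Security.",
    "Invoices are generated on the first day of each month.",
]

# BM25 index over whitespace-lowercased tokens.
tokenized = [t.lower().split() for t in corpus_texts]
bm25 = BM25Okapi(tokenized)

# Dense index: encode the corpus once with a small sentence-embedding model.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_emb = model.encode(corpus_texts, normalize_embeddings=True)

def bm25_topk(query, k=5):
    scores = bm25.get_scores(query.lower().split())
    return np.argsort(scores)[::-1][:k].tolist()

def dense_topk(query, k=5):
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    return np.argsort(doc_emb @ q_emb)[::-1][:k].tolist()

# Feed each ranking (passage indices) into the Recall@k / nDCG@k helpers from
# the Retrieval metrics sketch above, averaged over your full eval query set.
print(bm25_topk("how do I reset my password"))
print(dense_topk("how do I reset my password"))
```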

Who this is for

  • NLP/ML engineers building RAG systems
  • Data scientists evaluating AI assistants
  • MLOps engineers defining quality gates

Prerequisites

  • Basic IR concepts (documents, passages, ranking)
  • Familiarity with LLM prompting and RAG architectures
  • Comfort with Python or similar for metrics

Learning path

  • Start: Retrieval metrics and dataset design
  • Then: Answer-level rubrics and LLM-as-judge
  • Next: Error slicing, calibration, and thresholds
  • Finally: Automate evaluation in CI and track over time

Next steps

  • Harden your eval set with new edge cases from user feedback
  • Automate nightly runs; track trends and alert on regressions
  • Plan an online A/B only after offline gates are met

Mini challenge

Your variant improves Recall@5 from 0.78 to 0.86 but groundedness drops from 1.8 to 1.6/2. Propose two hypotheses and one fast experiment to recover groundedness without losing recall.

Quick Test

Take the quick test below to check your understanding.

Practice Exercises

2 exercises to complete

Instructions

Use the provided small dataset to compute retrieval metrics and hallucination rate.

Data
  • Gold: Q1={A2,A7}, Q2={B4}, Q3={C3,C6}
  • Retrieved top-3: Q1=[A5,A2,A9], Q2=[B4,B1,B7], Q3=[C8,C5,C3]
  • Answer grounding labels (Supported/Unsupported): [Supported, Supported, Unsupported]
Tasks
  • Compute Recall@3, MRR, and nDCG@3 (binary gains).
  • Compute hallucination rate: fraction of Unsupported answers.
Expected Output
Recall@3 = 1.00; MRR ≈ 0.611; nDCG@3 ≈ 0.564; Hallucination rate ≈ 0.333

Offline Evaluation For RAG — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

