Why this matters
As an NLP Engineer, you will ship Retrieval-Augmented Generation (RAG) systems that answer questions from internal docs, product catalogs, or knowledge bases. Offline evaluation lets you compare variants safely before exposing changes to users. It helps you answer: Is my retriever finding the right passages? Are answers grounded in the retrieved context? Did my change improve quality without breaking something else?
- Real tasks: Select top retriever for a domain, tune chunking and reranking, enforce citation correctness, and set quality gates for releases.
- Outcomes: Repeatable metrics, faster iteration cycles, fewer regressions, and clearer communication with stakeholders.
Concept explained simply
RAG is a two-stage pipeline: retrieve relevant passages, then generate an answer using those passages. Offline evaluation is like a lab test: feed a fixed set of questions with known references, run your system, and score the outputs against a rubric.
Mental model
- Inputs: A set of queries (Q), the source corpus, and ground-truth signals (gold passages and/or gold answers).
- Process: Run your RAG variant on Q, capture retrieved passages and generated answers, then compute metrics.
- Outputs: Summary metrics (retrieval, grounding, answer quality), error slices, and recommendations.
Key terms
- Gold passage: A document chunk that truly contains the answer.
- Gold answer: A vetted reference answer (may be short or long).
- Groundedness: The degree to which the answer’s claims are supported by retrieved text.
- Hallucination: A claim not supported by the provided context.
What to measure in RAG offline evaluation
Retrieval metrics
- Recall@k: Proportion of queries for which at least one gold passage appears in the top k.
- Precision@k: Fraction of the top k that are relevant.
- MRR (Mean Reciprocal Rank): Average of 1/rank of the first relevant result.
- nDCG@k: Ranking quality that rewards relevant items near the top.
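A minimal sketch of these metrics in Python, assuming binary relevance and gold passages stored as sets of passage IDs (function names are illustrative, not a library API):

```python
import math

def recall_at_k(retrieved, gold, k):
    """1.0 if at least one gold passage appears in the top k, else 0.0."""
    return 1.0 if any(doc in gold for doc in retrieved[:k]) else 0.0

def precision_at_k(retrieved, gold, k):
    """Fraction of the top k that are gold passages."""
    return sum(doc in gold for doc in retrieved[:k]) / k

def reciprocal_rank(retrieved, gold):
    """1/rank of the first gold passage, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, gold, k):
    """nDCG@k with binary gains; ideal DCG uses all gold passages, capped at k."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1) if doc in gold)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Note that the ideal DCG here covers all gold passages capped at k; whichever convention you pick, state it in your report, since it changes the absolute numbers.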
Context quality checks
- Diversity: Are retrieved passages unique and complementary (not duplicates)?
- Redundancy: Excessive overlap lowers utility.
- Coverage: Do retrieved passages collectively contain the answer spans?
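A rough way to flag redundancy offline is a token-overlap check between retrieved passages. The sketch below uses Jaccard similarity as a crude proxy; the 0.8 threshold is an assumption to tune on your own corpus.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two passages (lowercased, whitespace-split)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def redundancy_pairs(passages, threshold=0.8):
    """Return index pairs of retrieved passages that look near-duplicate."""
    return [(i, j)
            for i in range(len(passages))
            for j in range(i + 1, len(passages))
            if jaccard(passages[i], passages[j]) >= threshold]
```

Embedding-based similarity catches paraphrased duplicates that token overlap misses, at higher cost.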
Answer-level metrics
- Groundedness/Faithfulness: Claims are supported by citations from retrieved text.
- Citation correctness: Citations point to passages that contain the cited facts.
- Exactness/Accuracy: Matches a gold answer (for factoid Q&A) or meets rubric criteria (for long-form).
- Helpfulness: Clear, concise, and complete for the user’s intent.
Simple scoring rubrics
- Groundedness (0–2): 0=Not supported; 1=Partially supported; 2=Fully supported.
- Helpfulness (1–5): Is the answer correct, complete, and concise?
- Citation correctness: For each claim with a citation, does cited text contain that claim? (Yes/No)
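Once labels exist, aggregation is straightforward. A minimal sketch with toy labels (field names and values are illustrative) that turns per-answer rubric judgments into the summary numbers reported later:

```python
judged = [
    {"groundedness": 2, "citation_correct": True},
    {"groundedness": 1, "citation_correct": True},
    {"groundedness": 2, "citation_correct": False},
]

mean_groundedness = sum(j["groundedness"] for j in judged) / len(judged)          # ~1.67 on the 0-2 scale
citation_correct_rate = sum(j["citation_correct"] for j in judged) / len(judged)  # ~0.67
```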
Designing a reliable evaluation set
- Balance: Include easy, medium, hard queries; include edge cases and ambiguities.
- Coverage: Reflect real user intents and important business areas.
- Golds: Create gold passages and/or gold answers via subject-matter experts or careful reviewer agreement.
- Synthetic augmentation: Generate queries from documents, then human-review a sample to calibrate.
- Negative controls: Add queries not answerable from your corpus to detect hallucination behavior.
- Agreement: Use clear rubrics and measure inter-annotator agreement on a subset.
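One convenient way to store such a set is one record per query in JSON Lines. The sketch below shows a possible record shape; the field names and the example content are illustrative assumptions, not a required schema.

```python
# Hypothetical eval-set record; store one of these per line in a .jsonl file.
example_record = {
    "query_id": "Q1",
    "query": "How do I reset my VPN token?",        # illustrative query text
    "gold_passage_ids": ["A2", "A7"],               # passages that truly contain the answer
    "gold_answer": "Use the self-service portal.",  # illustrative vetted answer
    "answerable": True,                             # False marks a negative control
    "difficulty": "medium",                         # easy / medium / hard
    "topic": "it-support",                          # used later for error slicing
}
```

Keeping answerable: False records in the same file makes the negative-control slice easy to pull out during scoring.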
LLM-as-a-judge tips
- Give the judge the query, candidate answer, and retrieved passages.
- Ask for a structured verdict (e.g., groundedness 0–2) and short justification with quoted spans.
- Calibrate with a few gold-labeled examples and check consistency on a holdout.
- Avoid revealing system prompts or irrelevant info; keep inputs minimal and comparable.
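A sketch of a judge prompt that follows these tips, assuming the judge is asked for strict JSON; the wording and field names are illustrative, not a fixed schema.

```python
# Hypothetical LLM-as-judge prompt: the judge sees only the query, the
# retrieved passages, and the candidate answer, and returns structured output.
JUDGE_PROMPT = """You are grading an answer produced by a retrieval-augmented system.

Query:
{query}

Retrieved passages:
{passages}

Candidate answer:
{answer}

Rubric:
- groundedness: 0 = not supported by the passages, 1 = partially supported, 2 = fully supported

Return strict JSON with exactly these fields:
{{"groundedness": 0, "evidence_quotes": ["shortest passage spans that support your verdict"], "justification": "one or two sentences"}}
"""

def build_judge_input(query: str, passages: list[str], answer: str) -> str:
    # Separate passages clearly so the judge can quote minimal evidence spans.
    return JUDGE_PROMPT.format(query=query, passages="\n---\n".join(passages), answer=answer)
```

Run the judge at the lowest available temperature and validate the JSON before aggregating, so malformed outputs do not silently skew scores.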
Offline evaluation workflow (step-by-step)
- Freeze components: Fix retriever, reranker, chunking, and generator for each variant.
- Assemble evaluation set: 100–300 queries usually give stable trends; include adversarial cases.
- Run retrieval: Store top-k doc IDs and scores per query.
- Run generation: Produce answers with citations; log prompts/configs.
- Score retrieval: Compute Recall@k, MRR, nDCG@k, and coverage.
- Score answers: Use rubric and/or LLM-as-judge for groundedness, helpfulness, citation correctness.
- Error analysis: Slice by topic, difficulty, query length, and failure modes.
- Decide gates: Define thresholds (e.g., Recall@5 ≥ 0.85; Groundedness ≥ 1.6/2).
- Compare variants: Report deltas with confidence intervals or bootstrap estimates.
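A minimal harness sketch tying these steps together, assuming retrieve, generate, and judge are your own wrappers around the variant under test (not a specific library API) and reusing recall_at_k from the retrieval-metrics sketch above; the gate thresholds are the example values from the list.

```python
import random
import statistics

def evaluate_variant(eval_set, retrieve, generate, judge, k=5):
    """Run one frozen variant over the eval set and collect per-query scores."""
    rows = []
    for ex in eval_set:
        passages = retrieve(ex["query"], k=k)            # list of (passage_id, text)
        answer = generate(ex["query"], passages)         # answer with citations
        verdict = judge(ex["query"], passages, answer)   # e.g., {"groundedness": 2, ...}
        rows.append({
            "query_id": ex["query_id"],
            "recall_at_k": recall_at_k([pid for pid, _ in passages],
                                       set(ex["gold_passage_ids"]), k),
            "groundedness": verdict["groundedness"],
        })
    return rows

def passes_gates(rows, min_recall=0.85, min_groundedness=1.6):
    """Release gate using the example thresholds above; both must hold."""
    recall = statistics.mean(r["recall_at_k"] for r in rows)
    grounded = statistics.mean(r["groundedness"] for r in rows)
    return recall >= min_recall and grounded >= min_groundedness

def bootstrap_delta(rows_a, rows_b, metric, n=2000, seed=0):
    """95% bootstrap CI for mean(B) - mean(A), paired by query position."""
    rng = random.Random(seed)
    paired = list(zip([r[metric] for r in rows_a], [r[metric] for r in rows_b]))
    deltas = []
    for _ in range(n):
        sample = [paired[rng.randrange(len(paired))] for _ in paired]
        deltas.append(statistics.mean(b - a for a, b in sample))
    deltas.sort()
    return deltas[int(0.025 * n)], deltas[int(0.975 * n)]
```

Pairing the bootstrap by query (same eval set for both variants) gives tighter intervals than resampling each variant independently.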
Reporting template
- Setup: Variant name, components, k, prompt version.
- Dataset: Query count, domains, gold creation notes.
- Metrics: Retrieval (Recall@k, MRR, nDCG@k), Answer (Groundedness, Helpfulness, Citations).
- Findings: Top regressions, representative failures with IDs and snippets.
- Next actions: Hypotheses, changes to try, new edge cases to add.
Worked examples
Example 1: Internal knowledge base QA
Gold passages per query: Q1={A2,A7}, Q2={B4}, Q3={C3,C6}. Retrieved top-3:
- Q1: [A5, A2, A9] → first relevant at rank 2
- Q2: [B4, B1, B7] → first relevant at rank 1
- Q3: [C8, C5, C3] → first relevant at rank 3
- Recall@3: 3/3 = 1.00
- MRR: (1/2 + 1/1 + 1/3)/3 ≈ 0.611
- nDCG@3 (binary gains, ideal DCG over all gold passages): Q1≈0.387, Q2=1.000, Q3≈0.307 → avg ≈ 0.564
Interpretation: Retrieval brings at least one relevant passage for all queries, but rank quality can improve (MRR ~0.61).
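Reproducing these numbers with the metric functions sketched under "Retrieval metrics" (reusing recall_at_k, reciprocal_rank, and ndcg_at_k):

```python
gold = {"Q1": {"A2", "A7"}, "Q2": {"B4"}, "Q3": {"C3", "C6"}}
retrieved = {"Q1": ["A5", "A2", "A9"], "Q2": ["B4", "B1", "B7"], "Q3": ["C8", "C5", "C3"]}

recall_3 = sum(recall_at_k(retrieved[q], gold[q], 3) for q in gold) / len(gold)  # 1.000
mrr = sum(reciprocal_rank(retrieved[q], gold[q]) for q in gold) / len(gold)      # ~0.611
ndcg_3 = sum(ndcg_at_k(retrieved[q], gold[q], 3) for q in gold) / len(gold)      # ~0.564
```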
Example 2: Product Q&A with citations
- Precision@3: 0.67 (2 of top 3 are relevant on average)
- Groundedness (0–2): mean 1.7
- Citation correctness: 92% of cited claims are supported in the cited passages
Interpretation: Grounding is good, but the remaining 8% of incorrect citations is high-risk. Prioritize fixing citation mapping or reranking.
Example 3: Policy chatbot (long-form)
- Helpfulness (1–5): mean 4.1
- Hallucination rate on unanswerable controls: 28%
- Latency and token budget: similar across variants (useful secondary checks offline)
Interpretation: Answers are helpful but hallucinate on controls. Add explicit refusal behavior and strengthen retrieval fallback before deployment.
Exercises
Do these hands-on tasks to cement the workflow. A short checklist follows.
Exercise 1 — Compute retrieval and grounding metrics
Given 3 queries, gold passages, and retrieved results, compute Recall@3, MRR, and nDCG@3. Then compute hallucination rate from 3 judged answers.
Data
- Gold: Q1={A2,A7}, Q2={B4}, Q3={C3,C6}
- Retrieved top-3: Q1=[A5,A2,A9], Q2=[B4,B1,B7], Q3=[C8,C5,C3]
- Answer grounding labels (Supported/Unsupported): [Supported, Supported, Unsupported]
Record: Recall@3, MRR, nDCG@3 (binary gains), Hallucination rate.
Exercise 2 — Draft an LLM-judge rubric and prompt
Create a concise rubric for Groundedness (0–2) and Citation correctness (Yes/No). Write a judge prompt that returns JSON with fields: groundedness, citation_correct, evidence_spans, and justification (short).
Tips
- Ask the judge to quote minimal evidence spans.
- Keep outputs structured and deterministic.
Exercise checklist
- Computed retrieval metrics correctly
- Computed hallucination rate
- Wrote a clear, compact judge rubric
- Produced a structured judge prompt
Common mistakes and self-check
- Only measuring answer quality: Always measure retrieval too; poor retrieval caps answer quality.
- Unclear rubrics: Leads to noisy labels. Write short scales with examples.
- Changing multiple components at once: Hard to attribute improvements. Change one thing per variant.
- No negative controls: You might miss hallucination behavior. Include unanswerable queries.
- Tiny eval sets: Results swing wildly. Aim for 100–300 queries for stable trends.
Self-check
- Can you explain each metric choice and threshold?
- Can two reviewers replicate your labels with similar results?
- Do failure examples map to concrete next steps?
Practical projects
- Build a 150-query eval set for your domain with at least 20 adversarial cases.
- Compare two retrievers and a reranker (BM25 vs dense vs dense+reranker) using Recall@5 and nDCG@10.
- Add a judge for groundedness and citation correctness; target ≥95% citation correctness.
- Create an error dashboard with slices by topic, query length, and difficulty.
Who this is for
- NLP/ML engineers building RAG systems
- Data scientists evaluating AI assistants
- MLOps engineers defining quality gates
Prerequisites
- Basic IR concepts (documents, passages, ranking)
- Familiarity with LLM prompting and RAG architectures
- Comfort with Python or similar for metrics
Learning path
- Start: Retrieval metrics and dataset design
- Then: Answer-level rubrics and LLM-as-judge
- Next: Error slicing, calibration, and thresholds
- Finally: Automate evaluation in CI and track over time
Next steps
- Harden your eval set with new edge cases from user feedback
- Automate nightly runs; track trends and alert on regressions
- Plan an online A/B only after offline gates are met
Mini challenge
Your variant improves Recall@5 from 0.78 to 0.86 but groundedness drops from 1.8 to 1.6/2. Propose two hypotheses and one fast experiment to recover groundedness without losing recall.
Quick Test
Take the quick test below to check your understanding.