Why this matters
As a Prompt Engineer, you often need the model to answer with facts from private or changing data (policies, docs, code). Retrieval-Augmented Generation (RAG) plugs your prompts into a retrieval layer so answers are grounded in the right context. Real tasks include:
- Building helpdesk bots that cite the company handbook.
- Drafting summaries with references to specific documents.
- Creating code assistants that search repos and explain functions.
- Ensuring compliance answers come only from approved sources.
Concept explained simply
RAG adds a step before prompting: find the most relevant chunks of text, then give both the user question and those chunks to the model. The model stays grounded instead of guessing.
Mental model
Think of RAG as a librarian + writer:
- Librarian (retriever): finds the right book pages (chunks) using embeddings, keyword search, or hybrid.
- Writer (LLM): answers using only those pages, with citations.
Your job: make sure the librarian brings the right pages (good chunking, indexing, filters) and the writer follows the rules (good prompts, guardrails, evaluation).
Core components of a RAG pipeline
- Ingestion: split content into chunks, add metadata (source, date, tags), embed, and index.
- Retrieval: given a query, return k best chunks. Options: dense (embeddings), keyword/BM25, or hybrid.
- Ranking/Compression (optional): re-rank results and trim to fit the model context window.
- Prompting: system/user templates that enforce grounding and formatting.
- Generation controls: temperature, max tokens, citation instructions.
- Evaluation: check relevance hit-rate, groundedness, faithfulness, and latency/cost.
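To make the data flow concrete, here is a minimal sketch of how these components fit together. The keyword-overlap retriever and the call_llm stub are placeholder assumptions; a real pipeline would swap in an embedding model, a vector or BM25 index, and your LLM provider's client.

```python
# Minimal RAG skeleton: ingest -> retrieve -> prompt -> generate.
# The retriever is a toy keyword-overlap scorer; replace it with dense,
# BM25, or hybrid search. `call_llm` is a placeholder for your model call.

from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def ingest(documents: list[dict]) -> list[Chunk]:
    """Split documents into chunks and attach metadata (source, date, tags)."""
    chunks = []
    for doc in documents:
        for paragraph in doc["text"].split("\n\n"):
            chunks.append(Chunk(text=paragraph, metadata={"source": doc["title"]}))
    return chunks

def retrieve(query: str, index: list[Chunk], top_k: int = 4) -> list[Chunk]:
    """Return the top_k chunks by naive keyword overlap (stand-in for dense/hybrid search)."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(c.text.lower().split())), c) for c in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]

def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Assemble a grounded prompt: the question plus labeled context blocks."""
    context = "\n\n".join(f"[source:{c.metadata['source']}]\n{c.text}" for c in chunks)
    return (
        "Use ONLY the provided context. If the context is missing or unclear, say you don't know.\n"
        f"Question: {question}\n\nContext:\n{context}\n\nAnswer with citations:"
    )

def answer(question: str, index: list[Chunk]) -> str:
    chunks = retrieve(question, index)
    if not chunks:
        return "I don't have that information in the provided documents."
    return call_llm(build_prompt(question, chunks))

def call_llm(prompt: str) -> str:  # replace with your provider's SDK call
    return f"(LLM response to a {len(prompt)}-character prompt)"
```

The point is the shape of the pipeline: ingestion produces chunks with metadata, retrieval narrows them to top_k, and the prompt carries both the question and the labeled context.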
Implementation steps (from zero to answer)
- Define the task and constraints: what must be cited, which sources are allowed, and what latency/cost limits apply.
- Choose chunking: windowed chunks (e.g., 400–800 tokens with 10–20% overlap) for narratives; smaller or code-aware chunks for code (see the chunking sketch after this list).
- Index + metadata: store source name, section, date, access level; enable filters (e.g., product=Pro).
- Pick retriever: start hybrid (BM25 + embeddings). Tune top_k (e.g., 4–8) and add filters.
- Re-rank/compress: optional LLM or cross-encoder re-ranker; keep only highly relevant sentences.
- Prompt template: force citations, no guessing, refusal policy, and structured output.
- Evaluate: run a test set, measure hit-rate and groundedness, iterate.
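The chunking step is the easiest place to start experimenting. Below is a small sketch of windowed chunking with overlap; it counts whitespace-separated words as a rough stand-in for tokens, so swap in the tokenizer of your embedding or generation model for accurate sizes.

```python
# Windowed chunking with overlap. Words approximate tokens here;
# use your model's tokenizer for accurate counts.

def chunk_text(text: str, chunk_size: int = 600, overlap: int = 90) -> list[str]:
    """Split text into windows of ~chunk_size words with `overlap` words shared between neighbors."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: 600-word chunks with 15% overlap (90 words).
doc = "lorem ipsum " * 2000
print(len(chunk_text(doc)))  # number of chunks produced (8 for this toy document)
```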
Prompt patterns for RAG
Grounded Q&A template
System: You are a precise assistant. Use ONLY the provided context. If unsure or if context is missing, say you don't know.
Include citations like [source:TITLE] after each supported claim.
User question: {{question}}
Context:
{{context_blocks}}
Answer (short, factual, with citations):
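One way the {{context_blocks}} placeholder might be filled is sketched below: each retrieved chunk is rendered with its source title so the model can emit [source:TITLE] citations. The chunk schema (a dict with text and title keys) is an assumption; adapt it to however your retriever returns results.

```python
# Fill the grounded Q&A template from retrieved chunks.
# `chunks` is assumed to be a list of dicts with "text" and "title" keys.

SYSTEM = (
    "You are a precise assistant. Use ONLY the provided context. "
    "If unsure or if context is missing, say you don't know. "
    "Include citations like [source:TITLE] after each supported claim."
)

def render_context(chunks: list[dict]) -> str:
    """Render each chunk with its source title so the model can cite it."""
    return "\n\n".join(f"[source:{c['title']}]\n{c['text']}" for c in chunks)

def grounded_qa_prompt(question: str, chunks: list[dict]) -> list[dict]:
    """Return chat messages in the common system/user format."""
    user = (
        f"User question: {question}\n\n"
        f"Context:\n{render_context(chunks)}\n\n"
        "Answer (short, factual, with citations):"
    )
    return [{"role": "system", "content": SYSTEM}, {"role": "user", "content": user}]
```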
Summarization with citations
System: Create a concise summary using only the context. Provide 3–5 bullet points.
Add citations like [source:TITLE] per bullet.
Topic: {{topic}}
Context:
{{context_blocks}}
Refusal and escalation
If the answer is not supported by context or the user asks beyond scope, reply: "I don't have that information in the provided documents. Please provide more details or approved sources."
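One way to enforce this refusal policy before the model is even called is to gate on retrieval quality, as in the sketch below. The score scale and the 0.35 cutoff are illustrative assumptions; tune them against your own retriever's score distribution.

```python
# Pre-generation guardrail: refuse when retrieval is empty or weak.
# Scores are assumed to be similarity scores in [0, 1]; the cutoff is illustrative.

REFUSAL = ("I don't have that information in the provided documents. "
           "Please provide more details or approved sources.")

def answer_or_refuse(question: str, retrieved: list[tuple[float, str]], min_score: float = 0.35) -> str:
    """Return a grounded answer only if at least one chunk clears the relevance threshold."""
    strong = [text for score, text in retrieved if score >= min_score]
    if not strong:
        return REFUSAL
    return generate_answer(question, strong)

def generate_answer(question: str, context: list[str]) -> str:  # placeholder for your grounded Q&A call
    return f"(grounded answer using {len(context)} chunks)"
```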
Worked examples
Example 1: FAQ chatbot for HR handbook
- Chunking: 600-token chunks, 60-token overlap. Metadata: section, policy_date, version.
- Retriever: hybrid; filter policy_date >= latest_version_date (see the filter sketch after this example).
- Prompt: grounded Q&A template with mandatory citations.
- Evaluation: 50 queries; target hit-rate ≥ 90%, groundedness ≥ 85%.
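For this example, the version filter could be as simple as the sketch below, which keeps only chunks carrying the newest policy_date among the candidates. The metadata layout is an assumption; if you store an explicit latest_version_date, compare against that instead.

```python
# Metadata filter for Example 1: keep only chunks from the latest handbook version.
# Chunks are assumed to carry "policy_date" (an ISO date string) in their metadata.

from datetime import date

def filter_latest(chunks: list[dict]) -> list[dict]:
    """Drop chunks older than the newest policy_date present in the candidate set."""
    latest = max(date.fromisoformat(c["metadata"]["policy_date"]) for c in chunks)
    return [c for c in chunks if date.fromisoformat(c["metadata"]["policy_date"]) == latest]

chunks = [
    {"text": "Pro plan includes SSO.", "metadata": {"policy_date": "2024-03-01"}},
    {"text": "SSO is Enterprise only.", "metadata": {"policy_date": "2023-01-15"}},
]
print(filter_latest(chunks))  # keeps only the 2024 chunk
```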
Example 2: Compliance summarizer with citations
- Chunking: 300–500 tokens, overlap 15% for dense regulatory text.
- Compression: sentence-level extractive compression to fit context.
- Prompt: summarization with citations; refusal if context is insufficient.
- Check: every bullet contains at least one citation.
Example 3: Codebase assistant
- Chunking: function-level for code; filename and language in metadata.
- Retriever: hybrid with strong keyword weighting for identifiers (see the scoring sketch after this example).
- Prompt: explain function behavior; require code references [file:line-range].
- Guardrail: do not speculate about files not shown in context.
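"Strong keyword weighting for identifiers" can start as a simple boost for exact symbol matches when fusing keyword and embedding scores, as sketched below. The weights, the boost value, and the pre-normalized score fields are all illustrative assumptions.

```python
# Hybrid scoring with a boost for exact identifier matches (Example 3).
# dense_score and keyword_score are assumed to be pre-normalized to [0, 1].

import re

def hybrid_score(query: str, chunk: dict, w_dense: float = 0.5, w_keyword: float = 0.5,
                 identifier_boost: float = 0.3) -> float:
    score = w_dense * chunk["dense_score"] + w_keyword * chunk["keyword_score"]
    # Boost chunks whose code literally contains an identifier-like token from the query.
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]{2,}", query)
    if any(ident in chunk["text"] for ident in identifiers):
        score += identifier_boost
    return score

chunk = {"text": "def parse_config(path): ...", "dense_score": 0.62, "keyword_score": 0.80}
print(round(hybrid_score("where is parse_config defined?", chunk), 2))  # 1.01 with the boost
```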
Evaluation and guardrails
- Retrieval hit-rate: percent of test questions where a gold chunk appears in the top_k results (see the sketch after this list).
- Groundedness/Faithfulness: percent of claims supported by provided chunks.
- Latency: p95 time; optimize top_k, re-ranker, and context size.
- Cost: tokens retrieved + generated; reduce with compression and concise prompts.
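A back-of-the-envelope version of the first two metrics is sketched below over a labeled test set. The test-set schema (query, gold_chunk_id) and the chunk "id" field are assumptions about your data, and the groundedness check is a crude lexical-overlap proxy; in practice you would use an LLM-as-judge or an entailment model.

```python
# Toy evaluation: retrieval hit-rate and a crude groundedness proxy.
# `retrieve` is your pipeline's retrieval function; it is assumed to return chunks with an "id".

def hit_rate(test_set: list[dict], retrieve) -> float:
    """Fraction of queries whose gold chunk id appears in the retrieved top_k."""
    hits = sum(
        1 for case in test_set
        if case["gold_chunk_id"] in {c["id"] for c in retrieve(case["query"])}
    )
    return hits / len(test_set)

def groundedness(claims: list[str], context: str) -> float:
    """Fraction of claims whose words mostly appear in the retrieved context (lexical proxy)."""
    ctx_words = set(context.lower().split())
    supported = 0
    for claim in claims:
        claim_words = set(claim.lower().split())
        if len(claim_words & ctx_words) / max(len(claim_words), 1) >= 0.6:
            supported += 1
    return supported / max(len(claims), 1)
```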
Self-check checklist
- [ ] The model refuses when context is missing.
- [ ] Every claim has a citation.
- [ ] Changing the retrieval corpus changes the answer (indicates grounding).
- [ ] Reasonable latency (e.g., < 2s for chat).
Performance tuning tips
- Start with top_k 4–8; increasing beyond that may add noise and cost.
- Use hybrid retrieval when queries contain rare keywords or identifiers.
- Compress context to only the sentences relevant to the query (see the sketch after these tips).
- Lower temperature (0.0–0.3) to reduce creative but ungrounded output.
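Context compression can start as plain extractive filtering: keep only the sentences that share content words with the query, as in the sketch below. It is a baseline, not a substitute for a cross-encoder re-ranker or an LLM-based compressor.

```python
# Simple extractive compression: keep only sentences that overlap with the query.
# The min_overlap threshold is illustrative; tune it on your own data.

import re

def compress(query: str, chunk_text: str, min_overlap: int = 2) -> str:
    """Return only the sentences of a chunk that share enough content words with the query."""
    q_terms = {w for w in re.findall(r"\w+", query.lower()) if len(w) > 3}
    sentences = re.split(r"(?<=[.!?])\s+", chunk_text)
    kept = [
        s for s in sentences
        if len(q_terms & set(re.findall(r"\w+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept) if kept else chunk_text  # fall back to the full chunk
```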
Common mistakes and how to self-check
- Mistake: Oversized chunks. Symptom: irrelevant context; low groundedness. Fix: smaller chunks with slight overlap.
- Mistake: No metadata filters. Symptom: outdated answers. Fix: filter by version/date/access.
- Mistake: Missing refusal policy. Symptom: hallucinations. Fix: explicit refusal + test cases.
- Mistake: Assuming higher top_k always helps. Symptom: slower, noisier answers. Fix: tune and measure.
Exercises
- Exercise 1 — Design a grounded Q&A prompt
Draft a RAG prompt for a company handbook bot. Specify chunking, metadata filters, and the exact prompt text. Add 3 self-check criteria.
Need a nudge?
- Try 400–700 token chunks with small overlap.
- Filter by latest version or product tier.
- Citations can use [source:TITLE].
- Exercise 2 — Evaluate and iterate
Given 10 test queries, compute retrieval hit-rate and groundedness. Propose two concrete pipeline changes based on the results.
Need a nudge?
- Hit-rate: gold chunk present in top_k?
- Groundedness: are claims supported by retrieved text?
- Consider re-ranking and filters.
Practical projects
- Build a policy Q&A bot that cites answers and refuses when unsure.
- Create a meeting-note summarizer that links each bullet to a transcript snippet.
- Make a code explainer that references file names and line ranges.
Who this is for
- Prompt Engineers integrating LLMs with internal knowledge bases.
- Data/ML engineers adding retrieval to production assistants.
- Product folks validating grounded LLM prototypes.
Prerequisites
- Basic prompt design patterns (system/user templates, constraints).
- Familiarity with embeddings and vector search concepts.
- Understanding of tokens and context windows.
Learning path
- Review core RAG components and patterns (this lesson).
- Implement a small RAG demo with 20–50 documents.
- Add evaluation: hit-rate, groundedness, latency.
- Scale: better chunking, re-ranking, compression, and guardrails.
Mini challenge
Given a user asks, "Does our Pro plan include SSO?" and your top_k chunks include conflicting answers from two versions of the pricing page, design a prompt snippet and a retrieval filter to ensure the model prefers the newest information and cites it.
Next steps
- Complete Exercises 1–2 and check your outcomes against the self-check list.
- Run the Quick Test below to confirm understanding.
- Apply these patterns to one of the Practical projects.