Who this is for
- NLP engineers and ML practitioners building LLM apps that must answer from private or dynamic data.
- Data engineers supporting vector search infrastructure.
- Product-minded technologists validating LLM features quickly and safely.
Prerequisites
- Basic Python and familiarity with HTTP APIs.
- Understanding of embeddings (vector representations) and cosine similarity at a high level.
- Comfort with prompt engineering basics (system/user messages, few-shot examples).
Why this matters
RAG (Retrieval-Augmented Generation) lets your LLM answer using the freshest, company-specific knowledge without retraining. In real projects you will:
- Build chat assistants grounded in policy docs, runbooks, and knowledge bases.
- Reduce hallucinations by surfacing relevant sources and citations.
- Ship updates fast by re-indexing documents instead of fine-tuning models.
Concept explained simply
RAG connects two worlds: a search engine and a language model. First, you search your knowledge base for the most relevant snippets. Then, you give those snippets to the LLM as context so it can answer accurately.
Mental model
Think of RAG as a librarian plus a writer:
- The librarian (retriever) quickly finds the best book paragraphs.
- The writer (LLM) reads them and drafts a clear, grounded answer, ideally with citations.
Core components of a RAG system
- Ingestion & preprocessing: Collect data (docs, HTML, tickets). Clean text, remove boilerplate.
- Chunking: Split text into overlapping chunks (e.g., 200–400 tokens, 10–20% overlap) to balance recall and precision.
- Embedding: Convert chunks to vectors with an embeddings model. Store vectors with metadata (title, URL, section).
- Indexing: Save in a vector store supporting similarity search (cosine/inner product) and filters.
- Retrieval: For a user query, embed the query and fetch top-k chunks. Optionally add hybrid keyword+vector search.
- Reranking (optional): Use a cross-encoder reranker to improve top results.
- Prompt assembly: Build a prompt that includes instructions, the user question, and selected chunks as context.
- Generation: Call the LLM to produce an answer grounded in the context (see the end-to-end sketch after this list).
- Post-processing: Add citations, extract structured fields, or run safety checks.
- Feedback loop: Log queries, clicks, ratings; use them to improve chunking, retrieval, and prompts.
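A minimal end-to-end sketch of the retrieval, prompt assembly, and generation stages. The `embed`, `vector_store.search`, and `llm.complete` names are placeholders, not a specific library; swap in whatever embeddings model, vector index, and LLM client you actually use.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_title: str
    section: str

def answer_question(question: str, embed, vector_store, llm, top_k: int = 6) -> str:
    """Retrieve, assemble the prompt, and generate a grounded answer."""
    # Retrieval: embed the query and fetch the top-k most similar chunks.
    query_vector = embed(question)
    chunks: list[Chunk] = vector_store.search(query_vector, k=top_k)

    # Prompt assembly: grounding instructions + numbered context + question.
    context = "\n\n".join(
        f"[{i + 1}] {c.doc_title} / {c.section}\n{c.text}" for i, c in enumerate(chunks)
    )
    prompt = (
        "Answer only from the provided context. If the answer is not in the "
        "context, say you don't know. Cite sources by their [number].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Generation: the LLM drafts the answer with citations.
    return llm.complete(prompt)
```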
Design tips and defaults
- Start with chunk size ~300 tokens and roughly 15% overlap (40–60 tokens); tune later with small experiments.
- Use top-k = 5–8 for retrieval as a baseline; adjust for longer/shorter contexts and domain complexity.
- Turn on reranking when your corpus is noisy or long-tail; it often helps more than increasing k.
- Always store source IDs and titles; you will want citations and filtering.
- Prompt: include clear grounding instructions like "Answer only from the provided context; if missing, say you don't know." A template sketch follows this list.
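One way to write that last default down as a reusable template. The exact wording is illustrative, not canonical; tune it for your domain and model.

```python
# Illustrative grounding template; tune the wording for your domain.
GROUNDED_QA_TEMPLATE = """\
You are a support assistant. Answer only from the provided context.
If the context does not contain the answer, reply: "I don't know based on the available documents."
Cite the title and section of every source you use.
Keep the answer under 150 words.

Context:
{context}

Question: {question}
"""

def build_prompt(context: str, question: str) -> str:
    # str.format fills the {context} and {question} placeholders.
    return GROUNDED_QA_TEMPLATE.format(context=context, question=question)
```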
Worked examples
Example 1 – Internal policy QA assistant
- Goal: Answer HR policy questions with citations.
- Data: PDFs and docs from HR drive.
- Ingestion: Extract text; keep document, section, and page numbers.
- Chunking: 300 tokens, 15% overlap; separate by headings when possible.
- Embeddings: General-purpose multilingual embedding if policies vary by locale.
- Index: Vector store with metadata: {department: HR, country, doc_title, page}.
- Retrieval: top-k=6; filter by country if user specifies location.
- Rerank: Cross-encoder narrows top-20 → top-6 (sketched after this example).
- Prompt: "Use only the context. If uncertain, say you don't know. Cite titles and pages."
- Generation: Model produces answer + citations.
- Post: Show expandable citation cards; log user rating.
- Metrics: Answer accuracy, citation click-through, abstention rate (when answer unknown).
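A sketch of the filter-then-rerank retrieval step for this assistant. It assumes a hypothetical `vector_store.search` that accepts metadata filters and a `cross_encoder.score(query, text)` that returns a relevance score; adapt the calls to your actual stack.

```python
def retrieve_policy_chunks(question, country, embed, vector_store, cross_encoder,
                           candidates=20, final_k=6):
    """Filtered vector search followed by a cross-encoder rerank (top-20 -> top-6)."""
    # Restrict the search to HR documents, and to the user's country if known.
    filters = {"department": "HR"}
    if country:
        filters["country"] = country
    hits = vector_store.search(embed(question), k=candidates, filters=filters)

    # Score each (question, chunk) pair with the cross-encoder; keep the best final_k.
    scored = [(cross_encoder.score(question, hit.text), hit) for hit in hits]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in scored[:final_k]]
```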
Example 2 – Developer troubleshooting helper
- Goal: Help devs resolve common errors fast.
- Data: Runbooks, incident notes, code snippets.
- Chunking: Smaller chunks (~200 tokens) because code and errors need precision.
- Embeddings: Code-aware embedding if available.
- Retrieval: Hybrid BM25 + vector search to capture exact error strings and semantics (sketched after this example).
- Rerank: Yes; cross-encoder helps when many similar errors exist.
- Prompt: "Summarize likely root cause and steps. Provide commands verbatim from context."
- Post: Extract step list; display risk notes if present in context.
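Hybrid results can be merged in several ways; the sketch below uses reciprocal rank fusion (RRF) and assumes hypothetical `bm25_search` and `vector_search` callables that each return ranked chunk IDs.

```python
def hybrid_search(query, bm25_search, vector_search, k=8, rrf_k=60):
    """Merge keyword and vector results with reciprocal rank fusion (RRF)."""
    keyword_hits = bm25_search(query, k=50)   # catches exact error strings
    vector_hits = vector_search(query, k=50)  # catches paraphrases and semantics

    scores: dict[str, float] = {}
    for ranked_ids in (keyword_hits, vector_hits):
        for rank, chunk_id in enumerate(ranked_ids):
            # Earlier ranks earn more credit; rrf_k dampens the difference.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank + 1)

    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

RRF is used here because it sidesteps normalizing BM25 and cosine scores onto a common scale; a weighted sum of normalized scores is a common alternative.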
Example 3 – Product FAQs chatbot
- Goal: Answer user questions about features and pricing.
- Data: Public docs, release notes, pricing pages.
- Chunking: 250–350 tokens, 10% overlap; ensure each chunk is self-contained.
- Retrieval: top-k=5; boost recent release notes via metadata recency (scoring sketch after this example).
- Guardrails: "If pricing is not found, ask clarifying questions or say you don't know."
- Evaluation: Weekly spot checks with synthetic queries from top support tickets.
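One simple way to implement the recency boost, assuming each hit stores a timezone-aware `updated_at` timestamp and a base vector `similarity` score. The half-life and weight are arbitrary starting points to tune.

```python
import math
from datetime import datetime, timezone

def recency_boosted_score(similarity: float, updated_at: datetime,
                          half_life_days: float = 90.0, weight: float = 0.2) -> float:
    """Blend vector similarity with an exponential recency decay."""
    # `updated_at` is assumed to be timezone-aware (UTC).
    age_days = (datetime.now(timezone.utc) - updated_at).days
    recency = math.exp(-math.log(2) * age_days / half_life_days)  # 1.0 today, 0.5 at 90 days
    return (1 - weight) * similarity + weight * recency
```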
Design choices and trade-offs
- Chunk size: Larger chunks improve coverage but may add noise; smaller chunks increase precision but risk missing context.
- Overlap: Avoid cutting concepts mid-sentence; modest overlap reduces fragmentation without bloating the index (see the chunker sketch after this list).
- Embeddings model: Domain-specific embeddings can boost retrieval; test on your queries to confirm.
- Hybrid search: Combine keyword and vector search to catch exact terms and semantics.
- Reranking: Adds latency but often increases answer quality substantially in noisy corpora.
- Prompt strictness: Strong grounding instructions reduce hallucinations; add an explicit refusal pattern.
- k value: Too low k → missing facts; too high k → context dilution. Tune with offline benchmarks.
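A minimal chunker illustrating the size/overlap trade-off. It splits on whitespace as a stand-in for real tokens; a production version would use the model's tokenizer and respect headings.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 45) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` tokens.

    Whitespace tokens stand in for model tokens; ~15% overlap keeps sentences
    that straddle a chunk boundary visible in both neighboring chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]
```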
Common mistakes and self-check
- No metadata: Without document titles or sections, you cannot show citations or filter. Self-check: Does every chunk store source and section?
- Overlong context: Stuffing in many chunks lowers relevance. Self-check: Does your prompt often approach or exceed the model's context limit?
- Vague prompts: The LLM invents details. Self-check: Does the prompt explicitly say "answer only from context"?
- No evaluation: Hard to know if changes help. Self-check: Do you track correctness on a small, stable query set?
- Ignoring latency: Rerankers and high k can slow UX. Self-check: Measure p95 latency before and after changes (a small harness is sketched below).
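A small harness covering the last two self-checks, assuming you maintain a stable list of (question, expected keyword) pairs and a `rag_answer` callable. Keyword matching is only a crude correctness proxy, not a full evaluation.

```python
import math
import statistics
import time

def evaluate(rag_answer, test_set: list[tuple[str, str]]) -> dict:
    """Track rough correctness and p95 latency over a fixed query set."""
    latencies, correct = [], 0
    for question, expected_keyword in test_set:
        start = time.perf_counter()
        answer = rag_answer(question)
        latencies.append(time.perf_counter() - start)
        # Crude proxy: a correct answer should mention the expected keyword.
        correct += int(expected_keyword.lower() in answer.lower())

    latencies.sort()
    p95 = latencies[max(0, math.ceil(0.95 * len(latencies)) - 1)]  # nearest-rank p95
    return {
        "accuracy": correct / len(test_set),
        "p95_latency_s": p95,
        "mean_latency_s": statistics.mean(latencies),
    }
```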
Practical projects
- Build a policy QA bot that cites sources and abstains when unsure.
- Create a troubleshooting assistant for a small codebase with hybrid retrieval and command extraction.
- Implement a release-notes explorer that boosts the most recent content.
Exercises
Do these now. They mirror the graded exercises below.
Exercise 1 – Design a minimal RAG pipeline (mirrors Ex1)
Scenario: You need a support QA bot for your product docs (about 500 pages, updated monthly). Draft a minimal RAG design.
- Choose chunk size and overlap and justify briefly.
- Pick an embeddings model type and list metadata to store.
- Decide on retrieval: vector only or hybrid? Choose k.
- Decide if you'll use a reranker; when and why.
- Write a 4–6 line prompt template with grounding instructions and a citation requirement.
- Define two evaluation metrics you will track weekly.
Example solution:
Chunking: 300 tokens, 15% overlap. Embeddings: general-purpose; metadata: doc_title, section, url_id, updated_at. Retrieval: hybrid, k=6. Reranker: cross-encoder on top-20 → top-6 due to similar topics. Prompt: "Use only the context; if missing, say you don't know. Cite titles and sections. Keep answers under 150 words." Metrics: accuracy on 30 curated queries; abstention correctness rate.
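The same design, captured as a config sketch you could keep alongside your pipeline code (all names here are illustrative, not from any specific library).

```python
# Illustrative config mirroring the example solution above.
RAG_CONFIG = {
    "chunking": {"size_tokens": 300, "overlap_pct": 0.15},
    "embeddings": {"model": "general-purpose",
                   "metadata": ["doc_title", "section", "url_id", "updated_at"]},
    "retrieval": {"mode": "hybrid", "k": 6},
    "rerank": {"enabled": True, "candidates": 20, "final_k": 6},
    "prompt": "grounded_qa_v1",  # template with refusal and citation instructions
    "eval": {"weekly_queries": 30, "metrics": ["accuracy", "abstention_correctness"]},
}
```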
Exercise 2 – Chunking thought experiment (mirrors Ex2)
Text: "Refunds are available within 30 days of purchase. To request a refund, contact support with your order ID. Refunds are not available for custom services."
- Create 2–3 chunks with overlap so that each chunk remains meaningful.
- Explain how your overlap reduces boundary issues for queries like "How do I request a refund?" and "Are custom services refundable?"
Example solution:
Chunk A: "Refunds are available within 30 days of purchase. To request a refund, contact support with your order ID." Chunk B (overlap): "To request a refund, contact support with your order ID. Refunds are not available for custom services." Reasoning: Overlap keeps the request steps visible while also capturing the exception. Both queries match at least one chunk fully.
Exercise checklist
- I justified chunk size and overlap in terms of recall vs precision.
- I included metadata for citations and filters.
- My prompt prohibits guessing and requests citations.
- I selected k based on corpus size and context limit.
- I defined simple, trackable evaluation metrics.
Mini challenge
Pick one of your current documents and write a 5-line prompt template plus a retrieval plan (k, filters, reranking). Keep total context under 1,200 tokens. Test on 3 real questions and note one improvement for tomorrow.
Learning path
- Next: Advanced retrieval strategies (hybrid search, multi-query, query rewriting).
- Then: Reranking, context compression, and response validation.
- Finally: Evaluation frameworks and offline/online metrics.
Next steps
- Complete the exercises below and take the Quick Test.
- Instrument your RAG prototype with latency and correctness logging.
- Schedule a weekly review to tune chunking, k, and prompts based on logs.