Why this matters
Retrieval-Augmented Generation (RAG) lets AI products answer questions using your organization's documents and data, reducing hallucinations and enabling up-to-date responses without retraining models. As an AI Product Manager, you will:
- Define when to use RAG vs. fine-tuning for a feature.
- Set acceptance criteria for grounded answers and citations.
- Decide retrieval sources, chunking, top-k, and reranking strategies.
- Balance latency, cost, and quality; set SLAs (e.g., p95 latency, answer support rate).
- Plan evaluations, safety filters, and monitoring.
Concept explained simply
RAG is like an "open-book exam" for an AI model. Instead of the model guessing from memory, it looks up relevant passages from your data and uses them to answer.
- User question: What the user asks.
- Retriever: Finds relevant text chunks from indexed documents (e.g., policies, knowledge base). Retrieval can use keyword search (BM25), vector search (embeddings), or both (hybrid).
- Context assembly: Top documents/chunks are added to the prompt with instructions.
- LLM generation: The model writes an answer grounded in the retrieved context, often with citations.
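The sketch below strings these four steps together in miniature. It assumes an in-memory list of chunks, a crude keyword-overlap retriever, and a placeholder generate() call standing in for the LLM API; a production system would swap in BM25 or embeddings and a real model.

```python
# Minimal sketch of the RAG loop: retrieve -> assemble -> generate.
# Assumptions: an in-memory chunk list and a crude keyword-overlap scorer;
# a real system would use BM25/embeddings and call an actual LLM API.

CHUNKS = [
    {"id": "leave-policy-v3#2", "text": "Employees accrue 20 days of paid leave per year."},
    {"id": "holiday-list-2024#1", "text": "Public holidays are listed per country in Appendix A."},
]

def retrieve(question: str, top_k: int = 2) -> list[dict]:
    """Score chunks by word overlap with the question and return the best top_k."""
    q_words = set(question.lower().split())
    scored = sorted(
        CHUNKS,
        key=lambda c: len(q_words & set(c["text"].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def assemble_prompt(question: str, chunks: list[dict]) -> str:
    """Build a prompt that instructs the model to answer only from the sources."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer only from the sources below and cite their ids. "
        "If the sources do not answer the question, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def generate(prompt: str) -> str:
    """Placeholder for the LLM call (e.g., a hosted chat-completion API)."""
    return "<LLM answer grounded in the cited sources>"

question = "How many paid leave days do employees get?"
print(generate(assemble_prompt(question, retrieve(question))))
```

The product-relevant piece is the prompt: it instructs the model to stay within the retrieved sources and to cite them, which is what later acceptance criteria can test.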
When to prefer RAG vs. fine-tuning
- Use RAG when facts change often, data is proprietary or access-controlled, or you need citations.
- Use fine-tuning when you need style/format consistency or domain language, not dynamic factual recall.
Mental model
Imagine a pipeline with adjustable knobs:
- Ingest: Split documents into chunks with overlaps; index with embeddings + metadata.
- Retrieve: Convert the question to a vector; search; optionally rerank the top results.
- Assemble: Build a prompt with the question, instructions, and the best supporting chunks.
- Generate: The LLM answers, ideally citing sources.
- Evaluate and monitor: Check groundedness, usefulness, and latency; iterate.
Quality levers: chunk size, overlap, top-k, hybrid retrieval, rerankers, query rewriting, prompt instructions, and output format (e.g., JSON + citations).
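To make the knobs concrete, here is a small, hypothetical config object that collects them in one place so product decisions are explicit and testable; the field names and defaults are illustrative, not any library's API.

```python
# Illustrative config capturing the quality levers named above.
from dataclasses import dataclass

@dataclass
class RAGConfig:
    chunk_size_tokens: int = 500      # target chunk length
    chunk_overlap_pct: float = 0.15   # overlap between adjacent chunks
    retrieval_mode: str = "hybrid"    # "keyword", "vector", or "hybrid"
    top_k: int = 12                   # candidates fetched before reranking
    rerank_to: int = 5                # candidates kept after reranking
    rewrite_query: bool = True        # expand synonyms / rephrase the question
    output_format: str = "json_with_citations"

baseline = RAGConfig()
experiment = RAGConfig(top_k=20)      # a variant to compare against the baseline
```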
Core building blocks
- Document ingestion: Normalize, deduplicate, chunk (e.g., 300–800 tokens) with overlap (10–20%). Attach metadata (source, date, access); see the chunking sketch after this list.
- Embeddings: Numeric vectors representing semantics for vector search. Choose a model that balances cost, speed, and multilingual needs.
- Storage: Vector database or search engine supporting filters on metadata (e.g., access control).
- Retrieval: Keyword (BM25), vector, or hybrid. Tune top-k (e.g., 5–20). Consider domain synonyms and query rewriting.
- Reranking: A lightweight model or scoring that improves precision on the top candidates.
- Prompting: Clear instructions to only use provided sources; require citations.
- Response formatting: Structured output (answer + cited chunks) for UI rendering.
- Evaluation: Groundedness, citation accuracy, answer relevance, coverage/recall, latency p95, cost per answer.
- Safety: PII redaction, permission-aware retrieval, abuse filters, rate limits.
- Monitoring: Track failures (no hits, low similarity), top queries, drift in docs, timeout rates.
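A minimal sketch of the chunking step from the Document ingestion item, assuming word counts approximate tokens and that document metadata is copied onto every chunk; a real pipeline would use the embedding model's tokenizer and section-aware splitting.

```python
# Word-based approximation of chunking with overlap; each chunk carries the
# document's metadata so filters and citations work downstream.
# Field names and the sample text are illustrative.

def chunk_document(text: str, metadata: dict, chunk_size: int = 500, overlap: float = 0.15) -> list[dict]:
    """Split text into overlapping word windows, copying metadata onto each chunk."""
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap)))  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if not piece:
            break
        chunks.append({"text": " ".join(piece), "start_word": start, **metadata})
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the document
    return chunks

doc_meta = {"source": "leave-policy.pdf", "version": "v3", "department": "HR"}
sample_text = "leave policy details " * 400   # stand-in for extracted PDF text
pieces = chunk_document(sample_text, doc_meta)
print(len(pieces), pieces[0]["source"])
```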
Worked examples
Example 1: HR policy assistant
Goal: Employees ask about leave, benefits, and holidays; answers must cite the policy page and version.
- Ingestion: Chunk policy PDFs to ~500 tokens with 15% overlap; metadata: department, effective_date, version.
- Retrieval: Hybrid (BM25 + embeddings), top-k=12, then rerank top-12 to top-5.
- Prompt: "Answer only from the context. Include 1–3 citations with policy titles and sections. If unsure, say you don't know."
- Acceptance criteria: ≥90% of answers have a supporting citation; p95 latency ≤2.5s; no unsupported claims.
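Below is a sketch of the retrieve-then-rerank step in this example, using reciprocal rank fusion (RRF) to merge the BM25 and vector candidate lists before keeping the top 5. The ranked lists are illustrative inputs, and a cross-encoder reranker could reorder the fused candidates before the cut.

```python
# Merge two ranked lists of chunk ids with reciprocal rank fusion (RRF),
# then keep the top 5 for the prompt. Inputs are illustrative.

def rrf_merge(keyword_ranked: list[str], vector_ranked: list[str], k: int = 60) -> list[str]:
    """Score each chunk id by 1/(k + rank) summed across both ranked lists."""
    scores: dict[str, float] = {}
    for ranked in (keyword_ranked, vector_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top12 = ["leave#4", "leave#2", "holiday#1", "benefits#7"]    # truncated for brevity
vector_top12 = ["leave#2", "benefits#7", "leave#4", "parental#3"]

candidates = rrf_merge(bm25_top12, vector_top12)
top_5_for_prompt = candidates[:5]
print(top_5_for_prompt)
```

RRF is one common fusion choice because it only needs ranks, not comparable scores, from the two retrievers.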
What if it hallucinates?
- Increase top-k or switch to hybrid retrieval if recall is low.
- Add domain synonyms (e.g., PTO = paid time off).
- Strengthen prompt to refuse unsupported answers.
Example 2: E-commerce product Q&A
Goal: "Does this laptop support two external monitors?"
- Ingestion: Product specs and manuals; chunk by sections (ports, graphics, OS).
- Retrieval: Filter by product_id; vector search top-k=8; rerank to top-3.
- Query rewriting: Expand synonyms (dual monitors = two external displays; MST).
- Output: Answer + citation to manual section; if missing, suggest compatible docking stations with citations.
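A sketch of this example's retrieval path, assuming a hand-built synonym map and a product_id field on every chunk; the word-overlap scorer stands in for real keyword or vector search.

```python
# Expand the query with domain synonyms, restrict candidates to the current
# product, then score. Synonym map and chunk fields are illustrative.

SYNONYMS = {
    "two external monitors": ["dual monitors", "two external displays", "MST"],
}

def rewrite_query(query: str) -> str:
    """Append known synonyms so both keyword and vector search see the variants."""
    extras = [alt for phrase, alts in SYNONYMS.items() if phrase in query.lower() for alt in alts]
    return query + " " + " ".join(extras) if extras else query

def retrieve_for_product(query: str, chunks: list[dict], product_id: str, top_k: int = 8) -> list[dict]:
    """Filter by product_id first, then rank the remaining chunks by word overlap."""
    q_words = set(rewrite_query(query).lower().split())
    in_scope = [c for c in chunks if c["product_id"] == product_id]
    ranked = sorted(in_scope, key=lambda c: len(q_words & set(c["text"].lower().split())), reverse=True)
    return ranked[:top_k]

chunks = [
    {"product_id": "laptop-15x", "section": "ports", "text": "Two USB-C ports with DisplayPort MST support two external displays."},
    {"product_id": "laptop-13a", "section": "ports", "text": "One HDMI port supports a single external display."},
]
print(retrieve_for_product("Does this laptop support two external monitors?", chunks, "laptop-15x"))
```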
Example 3: Financial filing assistant
Goal: Summarize revenue recognition notes and cite where numbers come from.
- Ingestion: 10-Ks by company and year; chunk by headings; metadata: ticker, year, section.
- Retrieval: Hybrid; top-k=20 then rerank to 5; require at least one chunk from the "Revenue" section.
- Prompt: Return JSON: {"summary":..., "citations":[{ticker,year,section,page}]}.
- Evaluation: Human spot-check on 30 samples; target ≥95% citation accuracy.
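A sketch of validating the structured output this prompt asks for, assuming the model returns a JSON string shaped like the example; the raw_response value is made up for illustration.

```python
# Parse the model's JSON, check that every citation carries the fields the UI
# needs, and require at least one citation from the "Revenue" section.
import json

REQUIRED_CITATION_FIELDS = {"ticker", "year", "section", "page"}

def validate_response(raw: str) -> dict:
    data = json.loads(raw)  # raises if the model did not return valid JSON
    assert "summary" in data and data.get("citations"), "missing summary or citations"
    for cite in data["citations"]:
        missing = REQUIRED_CITATION_FIELDS - set(cite)
        assert not missing, f"citation missing fields: {missing}"
    assert any(c["section"] == "Revenue" for c in data["citations"]), "no Revenue citation"
    return data

raw_response = '{"summary": "Revenue is recognized at delivery.", "citations": [{"ticker": "ACME", "year": 2023, "section": "Revenue", "page": 74}]}'
print(validate_response(raw_response)["summary"])
```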
Decision checklist
- Documents are chunked with sensible size and overlap.
- Retrieval uses hybrid or the best single approach for the domain.
- Reranker is enabled if precision of top results is low.
- Prompt enforces use of sources and refusal when missing.
- Output includes citations and metadata needed by the UI.
- Metrics defined: groundedness, citation accuracy, latency p95, cost.
- Safety: permission filters, PII protection, rate limits.
Exercises
Exercise 1: Design a retrieval strategy
Scenario: Build an HR policy assistant for a 3,000-person company with policies in PDF and HTML. Draft your RAG strategy.
- Pick chunk size and overlap, with a short rationale.
- Choose retrieval (BM25, vector, or hybrid), top-k, and reranking.
- Define acceptance criteria (groundedness, citations, latency).
- Outline an offline evaluation plan (sample size, metrics).
Expected output
- Chosen chunking, retrieval, and tuning numbers.
- Acceptance criteria with target thresholds.
- Evaluation plan steps and metrics.
Hints
- Policies have headings; consider section-aware chunking.
- Hybrid retrieval often improves recall; reranking then improves precision.
- Track both correctness and refusal behavior.
Exercise 2: Debug a failing answer
Issue: Users ask, "What is the return window for refurbished items?" The system replies with the new items policy.
- Propose a root cause based on retrieval and metadata.
- Suggest changes to fix recall/precision.
- Define a quick A/B check to verify the fix.
Expected output
- Root cause hypothesis (e.g., synonym mismatch, missing filter for item_condition).
- Changes (hybrid retrieval, synonyms, metadata filter, boosted reranker).
- A/B plan with success criteria (e.g., citation to refurbished section ≥90%).
Hints
- Check if "refurbished" appears in chunks; add condition filters.
- Try query rewriting: refurbished = renewed.
- Increase top-k then rerank to focus on condition-specific chunks.
Common mistakes and self-check
- Too-large chunks: Irrelevant text dilutes relevance. Self-check: Are top chunks laser-focused on the query?
- Low recall: Relying on keyword-only or vector-only search. Self-check: Try hybrid and examine missed hits.
- No citations: Hard to trust answers. Self-check: Enforce citations in the prompt and UI.
- Ignoring permissions: Leaks sensitive info. Self-check: Filter retrieval by user access.
- Oversized top-k: Higher k increases latency and cost. Self-check: Measure quality gain vs. p95 latency.
- No evaluation plan: Shipping blind. Self-check: Define groundedness and citation accuracy targets before launch.
Practical projects
- Policy Q&A MVP: Load 20 policy docs, ship a small RAG chatbot with citations and refusal rules.
- Product manual helper: Answer 50 common questions from 5 manuals; measure citation accuracy.
- Support macro generator: Retrieve relevant KB snippets to draft support replies with links to cited sections.
Learning path
- Learn retrieval basics: keyword vs. vector; try both on sample docs.
- Build a minimal RAG: chunking, embeddings, top-k=5, simple prompt with citations.
- Add precision: hybrid retrieval + reranker; add query rewriting and metadata filters.
- Define metrics: groundedness, citation accuracy, latency p95, cost per answer.
- Run offline eval; then a small online A/B with guardrails and monitoring.
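A sketch of scoring an offline eval run against these metrics, assuming each record carries a latency plus human judgments of groundedness and citation correctness; the field names and sample records are illustrative.

```python
# Compute groundedness, citation accuracy, and p95 latency over an eval run.

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g., pct=0.95 for p95 latency."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(round(pct * len(ordered))) - 1)
    return ordered[max(0, index)]

eval_run = [
    {"latency_s": 1.8, "grounded": True, "citation_correct": True},
    {"latency_s": 2.9, "grounded": True, "citation_correct": False},
    {"latency_s": 1.2, "grounded": False, "citation_correct": False},
]

groundedness = sum(r["grounded"] for r in eval_run) / len(eval_run)
citation_accuracy = sum(r["citation_correct"] for r in eval_run) / len(eval_run)
p95_latency = percentile([r["latency_s"] for r in eval_run], 0.95)

print(f"groundedness={groundedness:.0%} citation_accuracy={citation_accuracy:.0%} p95={p95_latency}s")
```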
Mini challenge
You own an internal compliance assistant. Users ask about region-specific data retention rules. Propose:
- Metadata schema to enforce region access (e.g., country, department, effective_date).
- Retrieval settings (hybrid? top-k? rerank?).
- Two acceptance criteria and how you will measure them.
Sample approach
- Metadata: region, jurisdiction, policy_owner, effective_date, version.
- Retrieval: hybrid top-k=15, rerank to 5; filter by user.region.
- Targets: ≥92% groundedness; p95 latency ≤3s.
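A sketch of the region filter from the sample approach, applied before any ranking. The chunk and user shapes are assumptions, and in practice the filter would be a metadata predicate pushed down into the vector store or search engine rather than application code.

```python
# Only chunks whose region matches the requesting user are eligible for retrieval.
# Chunk and user shapes are illustrative.

def allowed_chunks(chunks: list[dict], user: dict) -> list[dict]:
    """Apply the region filter before any ranking happens."""
    return [c for c in chunks if c["region"] == user["region"]]

chunks = [
    {"region": "EU", "jurisdiction": "GDPR", "text": "Retain customer records for 6 years.", "version": "v2"},
    {"region": "US", "jurisdiction": "CCPA", "text": "Retention schedule differs by state.", "version": "v1"},
]
user = {"id": "u-42", "region": "EU"}

candidates = allowed_chunks(chunks, user)   # hybrid retrieval + rerank would run over these
print([c["jurisdiction"] for c in candidates])
```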
Next steps
- Explore hybrid strategies and rerankers for difficult queries.
- Design an evaluation set that mirrors real user intents and edge cases.
- Add caching for frequent queries and track quality vs. latency trade-offs.
- Plan ongoing monitoring: top failed queries, drift, and citation accuracy trends.
Who this is for
- AI Product Managers shipping search, Q&A, or document-grounded features.
- PMs coordinating with data, ML, and platform teams.
Prerequisites
- Basic understanding of LLM prompting and context windows.
- High-level familiarity with search concepts (keywords, relevance).
- Comfort defining product metrics and running small experiments.