Why this skill matters for NLP Engineers
LLM Applications and RAG (Retrieval-Augmented Generation) let you build reliable, business-ready systems that answer questions with up-to-date and proprietary knowledge. As an NLP Engineer, this unlocks tasks like enterprise Q&A, analytics copilots, support assistants, research agents, and compliance-aware generation. You will combine prompts, retrieval, reranking, tool use, and guardrails to deliver grounded, auditable answers rather than generic text.
When to use RAG vs. fine-tuning
- Use RAG when knowledge changes often or must be cited/controlled per query.
- Use fine-tuning when you need style/formatting adaptation or domain reasoning patterns that do not rely on specific documents.
- Often, use both: RAG for grounding, light fine-tuning or prompt tuning for output style and task fit.
Who this is for
NLP Engineers and ML practitioners building chatbots, search+chat, assistants, or workflow agents that must be accurate, explainable, and aligned with company knowledge.
Prerequisites
- Comfortable with Python and basic data structures.
- Basic NLP concepts: tokenization, embeddings, cosine similarity.
- Familiarity with HTTP APIs and JSON.
- Understanding of model context windows and rate limits.
Learning path (practical roadmap)
1) Model + Data Fit
Pick a base LLM (weigh latency, cost, and context window). Identify your corpus (docs, FAQs, tickets) and define answer requirements: format, citations, refusal policy (see the config sketch after this roadmap).
2) Indexing & Chunking
Normalize documents, split into chunks with overlap, compute embeddings, and store vectors + metadata (doc_id, title, URL or internal reference).
3) Retrieval & Reranking
Implement top-k retrieval, then rerank for relevance with cosine similarity or a cross-encoder. Test recall vs. latency trade-offs.
4) Prompting & Grounding
Design structured prompts requiring citation brackets (e.g., [doc42]) and a clear refusal rule when evidence is missing.
5) Guardrails
Add content rules (no PII leakage), function calling for tools (search, calculators), schema validation, and safe fallbacks.
6) Offline Evaluation
Build a small gold set with queries, references, and source docs. Evaluate answer faithfulness, citation correctness, context precision/recall.
7) Observe & Iterate
Log prompts, retrieved chunks, outputs, and user feedback. Adjust chunking, reranking, prompts, and tools based on errors.
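For step 1, one way to pin down the answer requirements is a small config that the rest of the pipeline reads. This is a minimal sketch; the field names and values are illustrative assumptions, not a required schema, and they mirror the settings used in the examples below.
ANSWER_REQUIREMENTS = {
    'output_format': 'json',                      # answer, citations[], confidence
    'citation_style': r'\[doc_id\]',              # e.g., [doc42]
    'refusal_message': "I don't know based on the provided context.",
    'max_answer_chars': 1200,
    'top_k_retrieve': 8,
    'top_k_after_rerank': 4,
    'chunk_chars': 800,
    'chunk_overlap': 120,
}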
Worked examples
Example 1: Minimal RAG pipeline (Python)
# Minimal, library-light sketch for clarity
import math

def embed(text: str):
    # Placeholder embedding: character-frequency features.
    # Replace with a real embedding model in production.
    vec = [0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - 97] += 1
    # L2-normalize so a plain dot product equals cosine similarity
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are L2-normalized above, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Build index
documents = [
    {'doc_id': 'doc1', 'text': 'RAG combines retrieval with generation to ground answers.'},
    {'doc_id': 'doc2', 'text': 'Chunk text with overlap to preserve context across boundaries.'},
    {'doc_id': 'doc3', 'text': 'Use citations like [doc1] in the final answer.'},
]
index = [{'doc_id': d['doc_id'], 'text': d['text'], 'vec': embed(d['text'])} for d in documents]

def retrieve(query, k=2):
    qv = embed(query)
    # Keep 'vec' so downstream rerankers (e.g., the MMR example below) can reuse it.
    scored = sorted(
        [{'doc_id': r['doc_id'], 'text': r['text'], 'vec': r['vec'], 'score': cosine(qv, r['vec'])} for r in index],
        key=lambda x: x['score'], reverse=True
    )
    return scored[:k]

SYSTEM_PROMPT = (
    "You are a helpful assistant. Use ONLY the provided context. "
    "Cite sources like [doc_id]. If unsure, say 'I don't know'."
)

def llm_generate(prompt: str):
    # Placeholder: echo with a simple template. Replace with a real LLM call.
    return f"Answer (simulated): Based on context, {prompt[:120]} ... [doc1]"

def answer_query(query):
    ctx = retrieve(query, k=2)
    context_text = "\n\n".join(f"[{c['doc_id']}] {c['text']}" for c in ctx)
    user_prompt = (
        f"{SYSTEM_PROMPT}\n\nContext:\n{context_text}\n\nQuestion: {query}\n"
        f"Instructions: Provide a concise answer with citations."
    )
    return llm_generate(user_prompt)

print(answer_query('How does RAG ensure grounded answers?'))
Example 2: Prompt design for reliability
# Structure messages and enforce output shape
system = (
    "Role: Enterprise QA Assistant.\n"
    "Rules:\n- Use only provided context.\n- Cite sources like [doc_id].\n- If no evidence, say 'I don't know'.\n"
)
user = (
    "Context:\n[doc1] RAG combines retrieval with generation to ground answers.\n\n"
    "Question: Explain RAG grounding. Return JSON with fields: answer, citations (array), confidence (0-1)."
)
# Expected shape guidance (add to prompt):
shape = (
    "Respond EXACTLY as JSON: {\"answer\": string, \"citations\": [string], \"confidence\": number}."
)
# Combine into your model call; validate JSON after generation.
Tip: Keep rules short and explicit, and test them with edge cases (no context, conflicting docs).
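To act on the note above about validating JSON after generation, here is a minimal standard-library sketch (no schema library; the checks simply mirror the answer/citations/confidence shape requested in the prompt):
import json

def validate_response(raw: str):
    # Returns (parsed_dict, error_message); error_message is None when the shape is valid.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"Invalid JSON: {e}"
    if not isinstance(data, dict):
        return None, "Expected a JSON object."
    if not isinstance(data.get('answer'), str):
        return None, "Missing or non-string 'answer'."
    if not isinstance(data.get('citations'), list):
        return None, "Missing or non-list 'citations'."
    conf = data.get('confidence')
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        return None, "'confidence' must be a number in [0, 1]."
    return data, None

# If validation fails, re-prompt once with the error message appended; fall back to a refusal after that.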
Example 3: Context window management (chunking)
def chunk(text, max_chars=800, overlap=120):
    parts = []
    i = 0
    while i < len(text):
        parts.append(text[i:i+max_chars])
        i += max(1, max_chars - overlap)
    return parts

long_doc = "..."  # your source text
chunks = chunk(long_doc, max_chars=800, overlap=120)
# Index each chunk with metadata: doc_id, chunk_id, title
Guideline: start with 500–1000 characters (or 200–400 tokens) and overlap 10–20%.
Example 4: Simple reranking (MMR-style)
def mmr(query_vec, candidates, lambda_mult=0.7, top_k=4):
    # candidates: list of dicts with a 'vec' field (e.g., retrieval results or index entries from Example 1).
    selected = []
    while candidates and len(selected) < top_k:
        best = None
        best_score = -1e9
        for c in candidates:
            relevance = cosine(query_vec, c['vec'])
            diversity = 0.0
            if selected:
                diversity = max(cosine(c['vec'], s['vec']) for s in selected)
            score = lambda_mult * relevance - (1 - lambda_mult) * diversity
            if score > best_score:
                best_score = score
                best = c
        selected.append(best)
        candidates = [c for c in candidates if c is not best]
    return selected
MMR balances relevance and diversity to reduce redundant chunks.
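A usage sketch, reusing embed, retrieve, and cosine from Example 1 (retrieve keeps each result's 'vec' field for exactly this purpose): over-retrieve a larger pool, then let MMR pick a smaller, diverse subset.
query = 'How does RAG ground answers with citations?'
qv = embed(query)
candidates = retrieve(query, k=3)   # over-retrieve a larger pool first
top = mmr(qv, candidates, lambda_mult=0.7, top_k=2)
for c in top:
    print(c['doc_id'], round(c['score'], 3))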
Example 5: Tool use and function calling
# Describe tools the assistant may call
TOOLS = [
{
'name': 'search_docs',
'description': 'Semantic search over indexed enterprise documents',
'input_schema': {'type': 'object', 'properties': {'query': {'type': 'string'}}, 'required': ['query']}
},
{
'name': 'calc',
'description': 'Perform arithmetic operations',
'input_schema': {'type': 'object', 'properties': {'expr': {'type': 'string'}}, 'required': ['expr']}
}
]
# Basic router based on intent heuristics
import re
def route(user_text):
if re.search(r'\b(sum|add|\d+\s*[+\-*/])\b', user_text):
return {'tool': 'calc', 'args': {'expr': user_text}}
if len(user_text.split()) > 6:
return {'tool': 'search_docs', 'args': {'query': user_text}}
return {'tool': None, 'args': {}}
In production, let the LLM propose a tool call by emitting structured JSON with the tool name and arguments, then execute it safely and pass the results back to the model for the final answer.
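A minimal dispatch sketch of that loop, assuming the model's proposal arrives as JSON like {"tool": "calc", "args": {"expr": "2+2"}}; it reuses TOOLS above and retrieve from Example 1, and the handler logic here is illustrative, not a fixed API.
import json

def run_tool(name, args):
    # Map tool names to safe handlers; never eval() arbitrary user expressions.
    if name == 'calc':
        import ast
        import operator as op
        # Extremely restricted arithmetic evaluator: numbers and + - * / only.
        ops = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}
        def ev(node):
            if isinstance(node, ast.Expression):
                return ev(node.body)
            if isinstance(node, ast.BinOp) and type(node.op) in ops:
                return ops[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError('Unsupported expression')
        return ev(ast.parse(args['expr'], mode='eval'))
    if name == 'search_docs':
        return retrieve(args['query'], k=4)  # reuse the retriever from Example 1
    raise ValueError(f'Unknown tool: {name}')

def handle_tool_proposal(raw_proposal: str):
    proposal = json.loads(raw_proposal)
    allowed = {t['name'] for t in TOOLS}
    if proposal.get('tool') not in allowed:
        return {'error': 'Tool not allowed'}
    result = run_tool(proposal['tool'], proposal.get('args', {}))
    # Pass this result back to the LLM in the next message so it can compose the final answer.
    return {'tool': proposal['tool'], 'result': result}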
Example 6: Guardrails and safe fallbacks
BLOCKLIST = [
    r'password', r'api[-_ ]?key', r'ssn', r'credit\s*card'
]

import re

def violates(text):
    return any(re.search(pat, text, flags=re.I) for pat in BLOCKLIST)

def safe_answer(context, draft):
    if violates(draft):
        return "I can't share sensitive information. Please clarify your request without secrets."
    if "I don't know" in draft or len(context.strip()) == 0:
        return "I don't know based on the provided context."
    return draft
Combine policy prompts with programmatic checks. Always prefer refusal over confident guessing.
Drills and exercises
- [ ] Take 10 random queries and verify each answer contains at least one correct citation like [doc_id].
- [ ] Remove top-1 retrieved chunk and see if answers degrade; adjust k and reranking until stable.
- [ ] Stress test: very long question, very short question, ambiguous terms, and conflicting documents.
- [ ] Add a tool (calculator or date parser) and ensure the assistant calls it only when needed.
- [ ] Build a tiny eval set (20 queries) and compute context recall and citation correctness.
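For the last drill, here is a minimal sketch of two of those checks, citation correctness and context recall. It assumes each gold item lists the doc_ids that should ground the answer (the data format is an assumption), and it reuses retrieve and answer_query from Example 1; with the placeholder llm_generate the numbers are meaningless, so swap in a real model before reading them.
import re

def citation_correct(answer: str, expected_doc_ids):
    # True if the answer cites at least one source and every cited [doc_id] is an expected one.
    cited = re.findall(r'\[([^\]]+)\]', answer)
    return bool(cited) and all(c in expected_doc_ids for c in cited)

def context_recall(retrieved_doc_ids, expected_doc_ids):
    # Fraction of expected source docs that actually appear in the retrieved context.
    if not expected_doc_ids:
        return 1.0
    hit = sum(1 for d in expected_doc_ids if d in retrieved_doc_ids)
    return hit / len(expected_doc_ids)

gold = [
    {'query': 'How does RAG ground answers?', 'expected': ['doc1']},
    # ... ~20 items in a real eval set
]
for item in gold:
    ctx = retrieve(item['query'], k=2)
    answer = answer_query(item['query'])
    print(item['query'],
          'recall=', context_recall([c['doc_id'] for c in ctx], item['expected']),
          'citations_ok=', citation_correct(answer, item['expected']))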
Common mistakes and debugging tips
- Mistake: Oversized chunks. Fix: Reduce chunk size, add overlap, and compress with summaries when needed.
- Mistake: Missing refusal rule. Fix: Add an explicit instruction to say 'I don't know' when the context lacks evidence.
- Mistake: Trusting raw retrieval scores. Fix: Add reranking and inspect top candidates.
- Mistake: Unvalidated JSON outputs. Fix: Schema-validate; re-prompt with error messages if invalid.
- Mistake: Hidden prompt leakage. Fix: Keep system rules minimal and logged; avoid secrets in prompts.
- Debug tip: Log query, retrieved chunks, final prompt, and output for every request; sample failures for manual review.
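For the logging tip, a minimal JSON-lines trace sketch; the file name and record fields are assumptions, not a fixed format.
import json, time

def log_interaction(query, retrieved, prompt, output, path='rag_traces.jsonl'):
    # One JSON object per request, append-only, so failures can be sampled for manual review later.
    record = {
        'ts': time.time(),
        'query': query,
        'retrieved_doc_ids': [c['doc_id'] for c in retrieved],
        'prompt': prompt,
        'output': output,
    }
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')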
Mini project: Enterprise policy Q&A with citations
Goal: A console app that answers employee questions about internal policies with citations and safe refusals.
- Ingest: Load 30–100 policy pages, clean text, chunk (e.g., 800 chars, 120 overlap), embed, and index.
- Query: Retrieve top-8, rerank to top-4 with MMR.
- Prompt: Require JSON output: answer, citations[], confidence. Enforce refusal if no evidence.
- Guardrails: Block secrets/PII. Cap answer length. Strip any URLs from output; keep only [doc_id] citations (a small helper for this is sketched after this list).
- Eval: Create 25 questions with expected references. Report: context recall, citation correctness, and faithfulness.
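One way to implement the URL-stripping guardrail above: a regex sketch that assumes the [doc_id] citation style used throughout this section and an illustrative length cap.
import re

def sanitize_answer(text: str, max_chars: int = 1200) -> str:
    # Drop URLs so only [doc_id]-style citations remain, collapse whitespace, cap length.
    no_urls = re.sub(r'https?://\S+|www\.\S+', '', text)
    cleaned = re.sub(r'\s{2,}', ' ', no_urls).strip()
    return cleaned[:max_chars]

print(sanitize_answer('See [doc7] and https://intranet.example.com/policy for details.'))
# -> 'See [doc7] and for details.'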
Acceptance checklist
- [ ] Outputs valid JSON consistently.
- [ ] At least one citation in every supported answer.
- [ ] Correct refusal when evidence absent.
- [ ] Passes offline eval with ≥ 70% faithfulness.
- [ ] Latency under 2s on average for your setup (adjust as needed).
Subskills
- RAG Architecture Basics: Components, pipelines, indexing, and E2E data flow.
- Prompt Design For Reliability: Roles, rules, JSON outputs, refusal policies.
- Context Window Management: Chunking, overlaps, compression, summaries.
- Grounding And Citations Concepts: Evidence-first answers with [doc_id] style citations.
- Reranking And Context Selection: Cosine, cross-encoders, MMR diversity.
- Tool Use And Function Calling Basics: Schemas, routing, safe execution, loopbacks.
- Handling Hallucinations With Guardrails: Policies, validators, constrained generation.
- Offline Evaluation For RAG: Gold sets, faithfulness, citation checks, regression tests.
Next steps
- Grow your eval set to cover real user intents, edge cases, and adversarial prompts.
- Add domain-specific tools (search indices, internal APIs) and monitor tool usage rate.
- Experiment with compression prompts and response length to fit tighter context windows.
- Automate nightly offline eval and alert on regressions.