Who this is for
NLP Engineers, ML Engineers, and advanced Data Scientists building LLM apps and RAG systems who need reliable, safe answers grounded in evidence.
Prerequisites
- Comfort with prompt design and RAG basics (retrieval, chunks, embeddings)
- Understanding of JSON structures and simple schemas
- Familiarity with evaluation metrics (precision/recall) is helpful
Why this matters
In real projects, hallucinations lead to wrong recommendations, fake citations, and loss of trust. As an NLP Engineer, you will:
- Ship Q&A assistants that must quote sources exactly
- Build structured extractors that cannot invent fields
- Deploy support bots that should refuse when evidence is missing
- Comply with safety and privacy constraints
Guardrails reduce hallucinations by combining better grounding, stricter outputs, and automated checks.
Concept explained simply
Hallucinations happen when a model confidently produces content not supported by inputs or reality. Guardrails are rules and checks that limit this behavior.
Mental model: The 3-layer defense
- Layer 1 — Data guardrails: retrieve the right context and expose uncertainty (similarity thresholds, top-k diversity, explicit citations).
- Layer 2 — Model guardrails: clear instructions, safe defaults (refusal), and decoding controls (temperature, max tokens).
- Layer 3 — Output guardrails: structure and verify results (schemas, validators, cross-checkers).
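A minimal sketch of how the three layers compose, assuming hypothetical retrieve_fn and generate_fn callables supplied by your stack:

import json

def guarded_answer(question, retrieve_fn, generate_fn, min_sim=0.62):
    """Minimal 3-layer skeleton; retrieve_fn and generate_fn are hypothetical stand-ins."""
    # Layer 1 - data guardrails: scored snippets, with an answerability gate.
    snippets = retrieve_fn(question)          # -> list of (text, similarity) pairs
    if not snippets or max(score for _, score in snippets) < min_sim:
        return {"answer": "I don't know.", "citations": [], "refused": True}
    # Layer 2 - model guardrails: strict prompt and conservative decoding live in generate_fn.
    raw = generate_fn(question, snippets)     # -> raw JSON string from the model
    # Layer 3 - output guardrails: validate structure before returning.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"answer": "I don't know.", "citations": [], "refused": True}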
Quick checklist for any LLM feature
- Is the model required to cite sources?
- Is there a refusal path when evidence is weak?
- Is output validated (format, units, claims)?
- Is there a second pass verifier or heuristic checker?
- Is uncertainty visible to the user?
Core guardrail techniques
1) Retrieval-anchored prompting with explicit citations
- Instruction: "Answer using only the provided snippets. If the answer is not in them, say 'I don't know.' Cite sources like [S1], [S2]."
- Require a Sources section listing snippet IDs and titles.
- Set a minimum similarity threshold to permit answering.
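A minimal sketch of the answer/refuse gate and the citation-requiring prompt, assuming the retriever returns (text, similarity) pairs; the 0.62 threshold is illustrative:

MIN_SIMILARITY = 0.62  # illustrative; tune on real queries

def build_prompt(question, scored_snippets):
    """Return a citation-requiring prompt, or None to signal refusal."""
    if not scored_snippets or max(score for _, score in scored_snippets) < MIN_SIMILARITY:
        return None  # caller emits the refusal JSON instead of calling the model
    context = "\n".join(f"[S{i + 1}] {text}" for i, (text, _) in enumerate(scored_snippets))
    return (
        "Answer using only the provided snippets. If the answer is not in them, "
        'say "I don\'t know." Cite sources like [S1], [S2].\n\n'
        f"{context}\n\nQuestion: {question}"
    )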
2) Output schema validation
- Force structured output (e.g., JSON with fields: answer, citations[], confidence).
- Reject responses that violate schema and re-ask the model to correct them.
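A stdlib-only sketch of the validation step; a library such as jsonschema or pydantic can replace the hand-rolled checks:

import json

def validate_answer(raw: str):
    """Return (parsed, errors); non-empty errors should trigger a re-ask."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, [f"invalid JSON: {e}"]
    errors = []
    if not isinstance(obj.get("answer"), str):
        errors.append("answer must be a string")
    if not isinstance(obj.get("citations"), list):
        errors.append("citations must be an array")
    conf = obj.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        errors.append("confidence must be a number in [0, 1]")
    return obj, errors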
3) Lightweight fact verification
- Verifier pass: check that every claim is supported by retrieved text.
- Return flags like supported: yes/no, missing_claims: [].
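A crude lexical-overlap sketch of the verifier idea; a second LLM pass (as in Exercise 2 below) is stronger but slower:

import re

def check_support(answer: str, snippets: list[str], min_overlap: float = 0.5):
    """Flag answer sentences whose content words barely appear in any snippet."""
    snippet_words = set(re.findall(r"\w+", " ".join(snippets).lower()))
    missing = []
    for claim in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = set(re.findall(r"\w+", claim.lower()))
        if words and len(words & snippet_words) / len(words) < min_overlap:
            missing.append(claim)
    return {"supported": not missing, "missing_claims": missing}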
4) Safe refusals and scope control
- Explicit refusal policy for low evidence or out-of-domain requests.
- Never guess private or sensitive data; never invent IDs or PII.
5) Retrieval quality guardrails
- Top-k with diversity to avoid near-duplicate chunks.
- Fallback to broader search or ask for clarification if results conflict.
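A simple near-duplicate filter over ranked chunks using Jaccard overlap on word sets; MMR over embeddings is a common, stronger alternative:

def diversify(ranked_chunks: list[str], k: int = 4, max_jaccard: float = 0.8):
    """Keep top-ranked chunks, skipping near-duplicates of already-kept ones."""
    kept, kept_sets = [], []
    for chunk in ranked_chunks:
        words = set(chunk.lower().split())
        if any(len(words & s) / len(words | s) > max_jaccard for s in kept_sets if words | s):
            continue  # near-duplicate of a chunk we already kept
        kept.append(chunk)
        kept_sets.append(words)
        if len(kept) == k:
            break
    return kept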
6) Decoding controls
- Lower temperature for facts; limit max tokens to reduce drift.
- Constrain style: concise, neutral tone.
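These map to request parameters in most APIs; a sketch of conservative defaults (names are illustrative and vary by provider):

# Conservative decoding settings for factual answers; parameter names vary by API.
FACTUAL_DECODING = {
    "temperature": 0.1,  # low randomness for grounded facts
    "max_tokens": 200,   # short answers drift less
}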
7) Tool-use gating
- Only call tools (e.g., calculators) when inputs are present and safe.
- Validate tool outputs before finalizing answer.
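A minimal sketch of gating a calculator-style tool call, assuming the rate must arrive explicitly from context:

def convert_currency(amount: float, rate: float | None):
    """Gate the 'tool' call: require an explicit rate and validate the result."""
    if rate is None or rate <= 0:
        return None  # no safe input: ask the user for the rate instead
    result = amount * rate  # the tool call itself (a calculator here)
    if result < 0:          # validate tool output before using it
        return None
    return round(result, 2)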
8) Monitoring & feedback
- Log refusal rates, schema errors, unsupported claims.
- Sample and review failures; adjust thresholds and prompts.
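A counter-based sketch of the signals worth tracking; a production system would emit these to a metrics backend:

from collections import Counter

metrics = Counter()

def record(outcome: dict):
    """Tally guardrail outcomes for later review."""
    metrics["requests"] += 1
    if outcome.get("refused"):
        metrics["refusals"] += 1
    if outcome.get("schema_errors"):
        metrics["schema_errors"] += 1
    if not outcome.get("supported", True):
        metrics["unsupported_claims"] += 1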
Worked examples
Example 1: RAG Q&A with citations and refusal
Goal: Answer questions about a policy using only retrieved snippets.
- Top-k = 4 with diversity
- Answer allowed only if max similarity >= 0.62
Prompt:

System: You are a precise assistant. Use only the provided snippets. If unsure, say "I don't know".

User context:
[S1] ... title: "Warranty Basics" ...
[S2] ... title: "Claim Process" ...

User question: "Is accidental damage covered?"

Requirements:
- Answer in 2 sentences max.
- Include Sources: [S?] with titles.
- If not supported, say "I don't know" and explain briefly.

Expected output:
{
"answer": "Accidental damage is not covered under the standard warranty.",
"citations": [{"id": "S2", "title": "Claim Process"}],
"confidence": 0.86,
"refused": false
}

If similarity < 0.62, output:
{
"answer": "I don't know. The provided sources do not specify accidental damage.",
"citations": [],
"confidence": 0.30,
"refused": true
}

Example 2: Structured extraction with schema repair

Target schema:
{
"name": "string",
"price": "number",
"currency": "string (ISO 4217)",
"source_ids": "array[string]"
}

Instruction: Extract the product info from the snippets. If a field is missing, leave it null. Never guess.
If currency is not a 3-letter code or price is not a number, reject and ask the model: "Return valid JSON only. Do not invent values; use null if unsupported." The corrected pass should satisfy the schema or refuse.
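A sketch of that correction loop, assuming a hypothetical call_model callable that wraps your LLM client:

import json

REPAIR_INSTRUCTION = ("Return valid JSON only. Do not invent values; "
                      "use null if unsupported.")

def _valid_extraction(raw: str):
    """Parse and schema-check the extraction; return the object or None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    price, cur = obj.get("price"), obj.get("currency")
    if price is not None and not isinstance(price, (int, float)):
        return None
    if cur is not None and not (isinstance(cur, str) and len(cur) == 3 and cur.isalpha()):
        return None
    return obj

def extract_with_repair(prompt: str, call_model):
    """One repair attempt on schema violation; None means refuse."""
    obj = _valid_extraction(call_model(prompt))
    if obj is not None:
        return obj
    return _valid_extraction(call_model(prompt + "\n" + REPAIR_INSTRUCTION))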
Example 3: Tool-use gating with verifier
User asks: "Convert 129.5 EUR to USD using today's rate and summarize the warranty exceptions."
- Allow tool call only if an explicit rate is provided in context; otherwise ask for the rate.
- Verifier checks that summary claims exist in snippets.
If no rate is provided, the assistant responds: "I don't know. I need an exchange rate to convert. I can summarize exceptions with sources if you provide the rate."
Example 4: Ambiguity handling
Question: "How long is coverage?" Context has "12 months" for electronics and "6 months" for accessories.
Guardrail: If multiple conflicting answers exist, ask a clarifying question or present both with sources.
Answer: Coverage varies: 12 months for electronics [S1], 6 months for accessories [S3].
Common mistakes and self-check
- Letting the model answer without citations. Self-check: Every answer has Sources with IDs/titles.
- Using a schema but not validating it. Self-check: Do invalid outputs trigger correction?
- Low similarity threshold (answers on weak context). Self-check: Review false positives; tune threshold.
- No refusal path. Self-check: Can the system say "I don't know" with a reason?
- Ignoring conflicting snippets. Self-check: Do you clarify or present both?
Exercises
Exercise 1: Design a guarded RAG template
Build a prompt + output schema for a policy Q&A assistant:
- Allow answers only if max similarity >= 0.6; otherwise refuse with a brief reason.
- Require at least one citation with id and title.
- Output JSON with fields: answer (string), citations (array of {id, title}), confidence (0-1), refused (boolean).
- Add an instruction to keep facts strictly within provided snippets.
Provide one example of a valid JSON answer and one valid refusal.
Hint
Write refusal logic explicitly in the instructions and enforce it via schema fields refused=true and empty citations.
Exercise 2: Write a verifier pass
Create a second-pass verifier prompt that receives: user_question, model_answer, citations, and snippets[]. It must return:
{
"supported": true|false,
"missing_claims": ["..."],
"format_ok": true|false
}

Acceptance policy: publish only if supported=true and format_ok=true; otherwise trigger correction or refusal.
Hint
Ask the verifier to quote short phrases from snippets that support each claim, and to mark any claim without a matching phrase as missing.
Exercise checklist
- Refusal condition is explicit and testable
- JSON schema covers answer, citations, confidence, refused
- Verifier defines supported/missing_claims clearly
- At least one realistic example for both success and refusal
Practical projects
- Evidence-Based FAQ Bot: Answers from your docs, requires 2+ citations, automatic refusal on low similarity.
- Claim-Checked Summarizer: Produces summaries with claim bullets; each bullet must map to snippet spans.
- Structured Policy Extractor: Extracts fields (coverage_period, exclusions[]) with schema validation and repair.
Learning path
- Start: Retrieval tuning (chunking, top-k, threshold)
- Add: Prompt instructions for citations + refusals
- Enforce: JSON schema and validator + correction loop
- Harden: Verifier pass for claim support
- Operate: Monitoring dashboard for refusals and verifier flags
Next steps
- Integrate verifier outcomes into user messaging (show uncertainty)
- Iterate on thresholds using real queries and error logs
- Add clarification questions when context conflicts
Mini challenge
Take any answer from your RAG system. Highlight each claim and link it to a snippet span. If any claim lacks support, rewrite the answer or refuse. Repeat until every claim has a citation.
Quick Test
Take the test below to check your understanding.