Why this matters
As a Prompt Engineer, you don’t just write prompts—you ship reliable behaviors. When outputs fail, you must quickly identify what kind of error it is and why it happened. A clear error taxonomy plus root cause analysis (RCA) turns random debugging into a repeatable process. You’ll use this skill to triage production issues, improve prompts, design evaluations, and prevent regressions.
- Triage user reports (e.g., “the model ignored the schema”)
- Diagnose failures in data extraction, RAG answers, and tool-using agents
- Design targeted fixes and regression tests that stick
Concept explained simply
Error taxonomy = naming the type of mistake. Root cause analysis = discovering why it occurred so your fix actually works. Together, they help you move from patching symptoms to preventing the next incident.
Mental model
Think of it like a medical diagnosis:
- Symptom: What you observe (e.g., extra text, wrong format)
- Diagnosis: Standardized label (e.g., schema non-compliance)
- Root cause: The underlying reason (e.g., prompt didn’t show a JSON example)
- Treatment: Minimal change that addresses the cause (e.g., add explicit JSON schema + single example)
Error taxonomy for prompt engineers
Use these categories to label issues consistently. One incident can have multiple categories; pick the primary one first.
1) Hallucination / Unsupported claims
The model states facts not grounded in provided context or known sources.
- Signals: Confident but wrong statements; invented citations
- Typical causes: No retrieval; weak grounding instructions; temperature too high
- Remedies: Require citations; constrain to provided context; add refusal policy for missing evidence
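One lightweight guard against invented citations: check that every citation in the answer maps to a chunk you actually supplied. A minimal Python sketch, assuming a bracketed [n] citation convention; the function name and sample strings are illustrative:

```python
# Flag citation markers like [3] that don't correspond to any provided chunk.
# The [n] convention and the sample data are assumptions for illustration.
import re

def find_unsupported_citations(answer: str, context_chunks: list[str]) -> list[int]:
    """Return cited indices that do not map to any provided context chunk."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    valid = set(range(1, len(context_chunks) + 1))
    return sorted(cited - valid)

answer = "Revenue grew 12% [1] and the CEO resigned in March [3]."
chunks = ["Q2 report: revenue grew 12% year over year.",
          "Press release about the new product line."]
print(find_unsupported_citations(answer, chunks))  # [3] -> retry or refuse
```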
2) Omission / Coverage gaps
Required elements are missing (fields, constraints, edge cases).
- Signals: Partial answers; skipped bullet points
- Causes: Overlong instructions; buried requirements; token budget issues
- Remedies: Make requirements explicit and near the end; use checklists; shorten context
3) Instruction-following failure
Model ignores imperative constraints (tone, length, steps).
- Signals: Violated length/voice; missing steps
- Causes: Competing instructions; vague language (“should” vs “must”)
- Remedies: Use MUST/DO NOT; system messages; numbered constraints
4) Reasoning / Logic error
Flawed deductions or arithmetic mistakes.
- Signals: Contradictions; wrong calculations
- Causes: Insufficient chain-of-thought scaffolding; skipped intermediate steps
- Remedies: Ask for steps; require intermediate variables; use verification prompts
5) Format / Schema non-compliance
Output not in required JSON/CSV/Markdown structure.
- Signals: Extra prose; missing keys; invalid JSON
- Causes: No explicit schema; few-shot examples that don't demonstrate the target format; temperature too high
- Remedies: Show exact schema; single JSON example; “Output JSON only” guard; add a validator step
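A validator step can sit between the model and your application so non-compliant outputs never reach users. A minimal sketch, assuming the required keys are name, start_date, and end_date; swap in your real schema:

```python
# Parse the model output as JSON and check required keys before accepting it.
# REQUIRED_KEYS is a stand-in for your actual schema.
import json

REQUIRED_KEYS = {"name", "start_date", "end_date"}

def validate_output(raw: str) -> tuple[bool, str]:
    """Return (ok, reason). Rejects prose wrappers, invalid JSON, and missing keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"

print(validate_output('{"name": "Event", "start_date": "2024-01-01", "end_date": "2024-03-31"}'))
print(validate_output('Here is the JSON: {"name": "Event"}'))  # fails: prose + missing keys
```

On failure you can retry with the error message appended to the prompt, or route to a fallback.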
6) Context / Reference mismatch
Model uses wrong or truncated context.
- Signals: Answers don’t match the provided doc; outdated references
- Causes: Retrieval misses; truncation; order of messages
- Remedies: Improve retrieval; reduce noise; pin key facts in system prompt
7) Ambiguity / Underspecification
Prompt lacks clarity to resolve trade-offs.
- Signals: Inconsistent outputs across runs; varied formats
- Causes: Vague goals; unclear priority
- Remedies: State priorities; provide tie-breakers; add examples with edge cases
8) Safety / Bias issues
Unsafe, biased, or disallowed content.
- Signals: Harmful advice; stereotypes; PII leakage
- Causes: Missing policies; absent refusal patterns
- Remedies: Embed safety rules; red-team prompts; refusal scaffolds
9) Tool-use / API orchestration error
Agent chooses wrong tool or misreads tool output.
- Signals: Repeated retries; tool calls that return empty or null results
- Causes: Tool descriptions unclear; no success criteria
- Remedies: Clarify tool docs; add selection rules; add a verification step before the final answer
10) Non-determinism / Variance
Outputs differ without input changes.
- Signals: Flaky tests; intermittent failures
- Causes: High temperature; sampling randomness; decoding or backend differences
- Remedies: Lower temperature; fix seeds where possible; multiple-sample consensus
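For short, exact-match answers, multiple-sample consensus is easy to sketch; longer outputs would need a semantic comparison instead. The call_model argument below is a stand-in for whatever client you actually use:

```python
# Majority vote over n samples of the same prompt; reduces variance at the
# cost of extra calls. fake_model is a toy stand-in so the sketch runs.
from collections import Counter
import random

def consensus_answer(call_model, prompt: str, n: int = 5) -> str:
    """Sample n completions and return the most common one."""
    samples = [call_model(prompt).strip() for _ in range(n)]
    return Counter(samples).most_common(1)[0][0]

def fake_model(prompt: str) -> str:
    return random.choice(["42", "42", "41"])  # occasionally flaky

print(consensus_answer(fake_model, "What is 6 * 7?"))
```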
11) Performance / Latency / Timeout
Too slow or timeouts prevent correct output.
- Signals: Partial responses; retries
- Causes: Large context; slow tools; network timeouts
- Remedies: Slim context; cache; parallelize retrieval; timeouts with fallbacks
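A timeout-with-fallback wrapper keeps the pipeline responsive when a model or tool call stalls. A rough sketch using the standard library; the timings and fallback string are placeholders:

```python
# Cap how long a call may take; return a fallback instead of hanging.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

def with_timeout(fn, *args, timeout_s: float = 10.0, fallback=None):
    """Run fn(*args); return fallback if it exceeds timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args).result(timeout=timeout_s)
    except FutureTimeout:
        return fallback  # the slow call keeps running in the background
    finally:
        pool.shutdown(wait=False)

def slow_call(query: str) -> str:
    time.sleep(2)  # pretend this is a slow model or tool call
    return f"full answer to: {query}"

print(with_timeout(slow_call, "summarize the doc", timeout_s=0.5,
                   fallback="[timeout] returning cached summary"))
```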
Root cause analysis (RCA) workflow
- Triage – Is it severe, frequent, user-visible? Reproduce once.
- Classify – Apply the taxonomy: pick a primary error type (and secondary if needed).
- Localize – Where is it happening? System vs user prompt, examples, retrieval, tools, parameters.
- Hypothesize – Use 5 Whys and form a minimal, testable hypothesis.
- Experiment – Change one variable at a time; collect before/after metrics (see the harness sketched after this list).
- Fix – Implement the smallest, durable change.
- Guard – Add a regression test to prevent recurrence.
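The Experiment step is easiest to keep honest with a tiny harness that runs each prompt variant over the same cases and reports one metric. Everything below (the fake model call, the case names, the JSON-validity metric) is a toy stand-in to show the shape:

```python
# Compare prompt variants on a fixed case set with a single metric.
import json

def compare_variants(run_prompt, variants, cases, metric):
    """Return pass rate per variant; change exactly one variable between variants."""
    rates = {}
    for name, template in variants.items():
        passes = [metric(run_prompt(template, case)) for case in cases]
        rates[name] = sum(passes) / len(passes)
    return rates

def fake_run(template: str, case: str) -> str:  # stand-in for a real model call
    return '{"answer": "ok"}' if "JSON only" in template else f"Sure! Here is the answer for {case}"

def json_valid(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

print(compare_variants(fake_run,
                       {"baseline": "Answer the question.",
                        "candidate": "Answer the question. Output JSON only."},
                       ["case-1", "case-2", "case-3"],
                       json_valid))
# {'baseline': 0.0, 'candidate': 1.0}
```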
Quick checklist for any incident
- [ ] Symptom captured with a concrete example
- [ ] Error category assigned
- [ ] Single primary suspect located
- [ ] One-variable experiment designed
- [ ] Metric defined (e.g., JSON validity rate)
- [ ] Regression test added after fix
Worked examples
Example 1: Summarization misses constraints
Symptom: “Provide a 3-bullet summary with a title.” Model gives 5 bullets, no title.
- Category: Instruction-following failure; Omission
- Root cause: Constraints are buried mid-paragraph; “should” not “must”
- Fix: Move constraints to end as numbered MUST list; add one ideal example
- Guard: Test that the output has a title line and exactly 3 bullets (see the check sketched after the snippet)
Before/After prompt snippet
- Before: “Please summarize the article. You should aim for 3 bullets and include a title.”
- After: “Output MUST: 1) Title on the first line 2) Exactly 3 bullets 3) No extra text. Example: Title: ... - Bullet 1 - Bullet 2 - Bullet 3”
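The guard for this example is a few lines of Python: assert a title line and exactly three bullets. The "- " bullet prefix is an assumption about your target format:

```python
# Check Example 1's constraints: a non-bullet title line plus exactly 3 bullets.
def check_summary(output: str) -> bool:
    lines = [ln.strip() for ln in output.strip().splitlines() if ln.strip()]
    has_title = bool(lines) and not lines[0].startswith("- ")
    bullets = [ln for ln in lines if ln.startswith("- ")]
    return has_title and len(bullets) == 3

good = "Title: Q2 Results\n- Revenue up\n- Costs flat\n- Guidance raised"
bad = "- Revenue up\n- Costs flat\n- Guidance raised\n- Margins improved"
assert check_summary(good) and not check_summary(bad)
```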
Example 2: JSON extraction drifts into prose
Symptom: Model outputs JSON + an explanation sentence.
- Category: Format / Schema non-compliance
- Root cause: No single JSON-only example; no explicit "Output JSON only" rule
- Fix: Provide a strict schema, one JSON-only example, and an explicit "DO NOT add prose" rule
- Guard: Automatic JSON validator test
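The guard here can be a pytest-style regression test that fails on any prose around the JSON. The extract() stub and its field names are hypothetical placeholders for your real extraction call:

```python
# Regression test: extraction output must be JSON only, no surrounding prose.
import json

def extract(text: str) -> str:
    """Hypothetical placeholder; replace with the real extraction call."""
    return '{"vendor": "ACME Corp", "date": "2024-05-01", "invoice_id": "123"}'

def test_output_is_json_only():
    raw = extract("Invoice #123 dated 2024-05-01 for ACME Corp").strip()
    json.loads(raw)                                   # invalid JSON -> test fails
    assert raw.startswith("{") and raw.endswith("}")  # no prose before or after
```

Run it with pytest on every prompt change so the drift cannot silently return.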
Example 3: Agent picks wrong tool
Symptom: Agent queries a web search tool when a local knowledge base would be more precise.
- Category: Tool-use / API orchestration error
- Root cause: Tool descriptions lack selection criteria
- Fix: Add success criteria and decision rules: “Use KB if query mentions internal product codes; else use Web.”
- Guard: Simulated queries with expected tool choices
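The guard is a table of simulated queries paired with the tool the agent should pick. The choose_tool router and the PRD-/SKU- code patterns below are illustrative stand-ins for the agent's real selection step:

```python
# Simulated queries with expected tool choices; assert the router agrees.
def choose_tool(query: str) -> str:
    """Toy router mirroring the decision rule above; replace with the agent call."""
    return "kb" if any(code in query for code in ("PRD-", "SKU-")) else "web"

EXPECTED = [
    ("What is the warranty for PRD-7731?", "kb"),
    ("Latest news about our competitor's launch", "web"),
    ("Spec sheet for SKU-20419", "kb"),
]

for query, expected in EXPECTED:
    assert choose_tool(query) == expected, f"wrong tool for: {query}"
print("all tool-choice checks passed")
```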
Exercises
Do these now. They mirror the graded exercises below.
Exercise 1: Identify error type(s) and root cause
Scenario: Your extraction prompt requests exact ISO date format and fields {name, start_date, end_date}. The model returns {"name": "Event", "dates": "Jan–Mar 2024"} and sometimes adds a note line.
- Task: 1) Label the primary and secondary error categories. 2) Write a minimal hypothesis about the cause. 3) Propose a one-change fix and a guard test.
Need a nudge?
- Look for mismatched keys vs schema.
- Check whether you showed a single JSON-only example.
Exercise 2: Prevent hallucinated sources
Scenario: In a RAG QA system, answers occasionally include fabricated citations when the context lacks evidence.
- Task: 1) Classify the error. 2) Draft a refusal policy line. 3) Suggest one retrieval tweak. 4) Define a metric to track improvement.
Need a nudge?
- Consider instructing the model to say “No direct evidence found.”
- Think about top-k and context quality.
Common mistakes and self-check
- Mistake: Fixing multiple things at once. Self-check: Did I change exactly one variable per test?
- Mistake: Vague labels (“it’s bad”). Self-check: Did I assign a specific taxonomy category?
- Mistake: Overfitting to one example. Self-check: Did I validate on a set of varied cases?
- Mistake: Ignoring evaluation metrics. Self-check: Do I have a numeric success rate (e.g., JSON validity %)?
- Mistake: Missing guardrails. Self-check: Did I add a regression test that would catch this again?
Practical projects
- Build a tiny “error triage” sheet: columns for Symptom, Category, Hypothesis, Change, Metric, Result, Guard (a CSV version is sketched after this list).
- Create a 10-case dataset for your prompt and track: format compliance, omission rate, hallucination rate.
- Design an agent tool-selection rubric and test with 8 synthetic queries.
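A minimal version of the triage sheet can live in a CSV so it works with any spreadsheet tool. The row below is a placeholder to show the columns in use, not real results:

```python
# Write the error-triage sheet header plus one illustrative (placeholder) row.
import csv

COLUMNS = ["symptom", "category", "hypothesis", "change", "metric", "result", "guard"]

with open("error_triage.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerow({
        "symptom": "5 bullets instead of 3, no title",
        "category": "Instruction-following failure",
        "hypothesis": "Constraints buried mid-paragraph",
        "change": "Numbered MUST list at the end of the prompt",
        "metric": "Constraint pass rate over 10 cases",
        "result": "TBD after re-run",
        "guard": "check_summary regression test",
    })
```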
Who this is for
- Prompt Engineers and Data/ML practitioners who need reliable LLM behavior
- Product folks triaging LLM features and user reports
Prerequisites
- Basic prompt engineering (system/user messages, few-shot examples)
- Familiarity with common LLM parameters (temperature, max tokens)
Learning path
- Learn the error taxonomy and memorize 3–5 core categories
- Practice RCA with 5 Whys and single-variable experiments
- Add metrics and guard tests to your workflow
- Apply to RAG, extraction, and agent tasks
Mini challenge
Given a user complaint “The bot keeps ignoring the word limit and sometimes cites blogs we never provided,” write: 1) two categories, 2) a single root cause hypothesis, 3) one minimal fix, and 4) a guard test.