Why this matters
As a Prompt Engineer, you ship prompts that must work reliably, not just on a few examples. Clear success criteria and test cases let you:
- Decide quickly if a prompt is good enough to ship.
- Compare versions during iteration and A/B tests.
- Catch regressions when models, prompts, or tools change.
- Communicate quality and trade-offs (accuracy, latency, cost, safety) to stakeholders.
Concept explained simply
Success criteria are the rules for what "good" looks like. Test cases are examples you run to check those rules.
- Objective checks: exact fields, numbers, dates, labels.
- Subjective checks: clarity, tone, usefulness (use rubrics to score).
- Operational checks: speed, cost, safety, constraint adherence.
Mental model
Think of success criteria as a contract, and test cases as courtroom evidence. If the prompt meets the contract across your tests, it passes. If not, you iterate. Make the contract SMART: Specific, Measurable, Achievable, Relevant, Time-bound.
Step-by-step method
- Define the job-to-be-done. What user task should the prompt enable? One sentence.
- List failure modes. Hallucination, missed fields, wrong tone, unsafe content, slow/expensive outputs.
- Turn failures into success criteria. Create measurable rules (e.g., "No unsupported claims", "Extract date in YYYY-MM-DD").
- Design a minimal test set. Include normal cases (happy paths), edge cases, and negatives (should reject).
- Choose scoring methods. Exact match, regex, numeric tolerance, rubric (1–5) with clear anchors, checklist of constraints.
- Set thresholds. Example: "≥ 90% exact extraction accuracy on the gold set; 0 safety violations; median latency ≤ 2s; cost ≤ $0.005 per call."
- Pilot and refine. Run on a small set, inspect misses, adjust criteria or tests if they were ambiguous.
- Version and lock. Freeze a gold test set for regression checks. Add new cases only through review.
Pro tip: balance objective and subjective
Automate objective checks first to speed iteration. Use small, well-defined subjective rubrics for qualities like tone or clarity, and keep them consistent across reviewers.
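As a minimal sketch of what "automate objective checks first" can look like, the Python snippet below pairs a few objective check helpers with a threshold gate. The function names (exact_match, valid_iso_date, within_tolerance, passes_threshold), the sample data, and the 90% bar are illustrative assumptions, not part of any specific framework.

```python
import re

# Objective checks: each returns True/False so results can be aggregated.
def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

def valid_iso_date(value: str) -> bool:
    # The prompt is expected to normalize dates to YYYY-MM-DD.
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", value or "") is not None

def within_tolerance(output: float, expected: float, tol: float = 0.01) -> bool:
    return abs(output - expected) <= tol

# Threshold gate: aggregate per-case results and compare against the shipping bar.
def passes_threshold(results: list, threshold: float = 0.90) -> bool:
    return sum(results) / len(results) >= threshold

# Example: three extraction cases scored by exact match against a gold set.
gold = ["2025-03-07", "2025-04-01", "2025-05-20"]
outputs = ["2025-03-07", "2025-04-01", "2025-05-21"]
case_results = [exact_match(o, g) for o, g in zip(outputs, gold)]
print(passes_threshold(case_results))  # False: 2/3 is below the 0.90 bar
```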
Worked examples
Example 1: Meeting note summarizer
- Job: Summarize meeting transcripts into 5 bullet points with decisions and action items.
- Success criteria:
- Format: 5 bullets, each ≤ 25 words (see the check sketch after this example).
- Coverage: All explicit decisions and owners captured (≥ 90% on gold set).
- No invented facts (0 hallucinations across tests).
- Tone: Neutral and concise (rubric ≥ 4/5).
- Test cases: 12 transcripts (8 normal, 2 with no decisions, 2 with overlapping speakers). Expected bullets listed per case.
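A minimal sketch of how the format criterion could be automated, assuming the summarizer emits one bullet per line prefixed with "-"; the function name and sample output are illustrative.

```python
# Format check for Example 1: exactly 5 bullets, each at most 25 words.
def check_summary_format(summary: str, n_bullets: int = 5, max_words: int = 25) -> bool:
    bullets = [line for line in summary.splitlines() if line.strip().startswith("-")]
    if len(bullets) != n_bullets:
        return False
    return all(len(b.lstrip("- ").split()) <= max_words for b in bullets)

sample = "\n".join(f"- Decision {i}: ship the feature next sprint." for i in range(1, 6))
print(check_summary_format(sample))  # True
```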
Example 2: Invoice field extraction
- Job: Extract vendor_name, invoice_number, invoice_date (YYYY-MM-DD), total_amount.
- Success criteria:
- Schema compliance: All fields present; no extras.
- Exact match for invoice_number; date valid and normalized; amount numeric with two decimals (see the validator sketch after this example).
- Accuracy: ≥ 95% field-level accuracy on gold set.
- Latency: p95 ≤ 1.5s; Cost ≤ $0.003 per invoice.
- Test cases: 30 invoices (scanned text, different formats, missing PO, European dates). Expected JSON per case.
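A minimal sketch of the schema and normalization checks for this example, assuming the extraction prompt returns a flat JSON object; the helper name validate_invoice and the sample values are illustrative.

```python
import re
from datetime import datetime

REQUIRED_FIELDS = {"vendor_name", "invoice_number", "invoice_date", "total_amount"}

def validate_invoice(record: dict, expected: dict) -> dict:
    """Per-field pass/fail report for one extracted invoice."""
    report = {"schema": set(record) == REQUIRED_FIELDS}  # all fields present, no extras
    report["invoice_number"] = record.get("invoice_number") == expected["invoice_number"]
    date = record.get("invoice_date") or ""
    try:
        datetime.strptime(date, "%Y-%m-%d")  # valid, normalized date
        report["invoice_date"] = date == expected["invoice_date"]
    except ValueError:
        report["invoice_date"] = False
    amount = str(record.get("total_amount", ""))
    report["total_amount"] = (re.fullmatch(r"\d+\.\d{2}", amount) is not None
                              and amount == expected["total_amount"])
    return report

extracted = {"vendor_name": "Apex Labs", "invoice_number": "INV-1042",
             "invoice_date": "2025-03-07", "total_amount": "149.00"}
gold = dict(extracted)  # in a real test set the gold values are labeled by hand
print(validate_invoice(extracted, gold))  # every check True
```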
Example 3: Toxicity moderation
- Job: Classify user text as safe or toxic.
- Success criteria:
- Recall on toxic: ≥ 98% (prioritize safety).
- Precision on toxic: ≥ 90% (both metrics are computed in the sketch after this example).
- No unsafe content generated in rationales (0 violations).
- Test cases: 500 labeled texts (balanced, adversarial spelling, coded slurs). Expected labels provided.
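A minimal sketch of how precision and recall on the toxic class could be computed and compared to these thresholds; the tiny label lists are illustrative, not real evaluation data.

```python
# Precision and recall on the toxic class, compared against the Example 3 thresholds.
def toxic_metrics(predicted, gold):
    tp = sum(p == "toxic" and g == "toxic" for p, g in zip(predicted, gold))
    fp = sum(p == "toxic" and g == "safe" for p, g in zip(predicted, gold))
    fn = sum(p == "safe" and g == "toxic" for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = ["toxic", "toxic", "safe", "safe", "toxic"]
pred = ["toxic", "toxic", "toxic", "safe", "toxic"]
precision, recall = toxic_metrics(pred, gold)
print(recall >= 0.98, precision >= 0.90)  # True, False (recall 1.0, precision 0.75)
```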
Exercises
Do these now. Then compare with the solutions.
Exercise 1: Write success criteria for a product review summarizer
Create 6–8 measurable criteria for a prompt that summarizes a set of product reviews into a short buyer guide. Include at least:
- Format and length constraints
- Coverage and factuality
- Tone and bias checks
- Latency and cost
Sample solution
Example criteria:
- Output sections: Pros, Cons, Best for (exact headers).
- Length: ≤ 120 words total.
- Coverage: Mention top 3 recurring pros and top 3 cons (≥ 90% match to labeled set).
- Factuality: 0 claims not supported by the input reviews.
- Tone: Helpful, neutral (rubric ≥ 4/5).
- Bias: No absolute claims (avoid "perfect", "flawless"). Violations = 0 (see the check sketch after this list).
- Latency: p95 ≤ 2s; Cost ≤ $0.004 per request.
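A minimal sketch of how the header, length, and banned-absolutes criteria could be checked, assuming the buyer guide places each section header on its own line; the function name and sample output are illustrative.

```python
import re

HEADERS = ["Pros", "Cons", "Best for"]                         # exact section headers
BANNED = re.compile(r"\b(perfect|flawless)\b", re.IGNORECASE)  # absolute claims

def check_buyer_guide(text: str, max_words: int = 120) -> dict:
    lines = [line.strip() for line in text.splitlines()]
    return {
        "headers": all(h in lines for h in HEADERS),
        "length": len(text.split()) <= max_words,
        "no_absolutes": BANNED.search(text) is None,
    }

sample = "Pros\n- Long battery life\nCons\n- Heavy\nBest for\n- Commuters"
print(check_buyer_guide(sample))  # {'headers': True, 'length': True, 'no_absolutes': True}
```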
Exercise 2: Design a minimal test set for entity extraction
For extracting person, organization, and date from customer emails, propose 12 test cases with expected outputs. Include:
- 8 normal cases with clear entities
- 2 edge cases with ambiguous names/dates
- 2 negative cases with no entities
Sample solution
Structure each case as Input and Expected JSON (a runnable test-set sketch follows the list):
- Input: "Meeting with Dana White from Apex Labs on 03/07/2025." Expected: {"person":"Dana White","organization":"Apex Labs","date":"2025-03-07"}
- Input: "Spoke to Mr. Jordan at Northwind." Expected: {"person":"Jordan","organization":"Northwind","date":null}
- Input: "Let's sync next Friday." Expected: {"person":null,"organization":null,"date":null} (negative)
- Include ambiguous date formats (03/07 can be dd/mm or mm/dd) and specify normalization rules in expectations.
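A minimal sketch of how such a test set could be stored and run, assuming the extraction prompt is asked to return JSON; extract_fn stands in for your model call and is stubbed with the gold answers here.

```python
import json

# A few gold cases as (input, expected) pairs; extend to the full 12-case set.
GOLD_CASES = [
    ("Meeting with Dana White from Apex Labs on 03/07/2025.",
     {"person": "Dana White", "organization": "Apex Labs", "date": "2025-03-07"}),
    ("Spoke to Mr. Jordan at Northwind.",
     {"person": "Jordan", "organization": "Northwind", "date": None}),
    ("Let's sync next Friday.",
     {"person": None, "organization": None, "date": None}),
]

def run_test_set(extract_fn):
    """Run every case through the extraction prompt and return the pass rate."""
    passed = 0
    for text, expected in GOLD_CASES:
        try:
            output = json.loads(extract_fn(text))  # the prompt is asked to return JSON
        except json.JSONDecodeError:
            continue  # malformed JSON counts as a failure
        if output == expected:
            passed += 1
    return passed / len(GOLD_CASES)

# extract_fn would wrap the model call; here it is stubbed with the gold answers.
stub = {text: json.dumps(expected) for text, expected in GOLD_CASES}
print(run_test_set(lambda text: stub[text]))  # 1.0
```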
Checklist: good success criteria
- Specific: One idea per criterion.
- Measurable: Objective rule, regex, exact match, rubric with anchors.
- Achievable: Thresholds match realistic performance.
- Relevant: Tied to user job-to-be-done.
- Time-bound/Operational: Latency and cost constraints are explicit.
- Safety: Clear do-not-do constraints with zero tolerance when needed.
Common mistakes and how to self-check
- Vague language ("good," "high-quality"). Self-check: Would two reviewers score it the same way? If not, rewrite.
- Only happy-path tests. Self-check: Do you have edge and negative cases?
- No frozen gold set. Self-check: Are you changing tests while comparing versions?
- Ignoring cost/latency. Self-check: Did you set p95 latency and max cost per call?
- Missing safety gates. Self-check: Are unsafe outputs explicitly disallowed with a check?
- Thresholds without rationale. Self-check: Note why each number is chosen (baseline + desired uplift).
Mini self-audit
- At least 3 failure modes identified
- ≥ 10 test cases with expected outputs
- Automated checks defined for objective parts
- Rubric criteria have 1–5 anchors described
Practical projects
- Build a 20-case gold set for a FAQ answerer with exact-match answers and a 1–5 helpfulness rubric. Set ship/no-ship thresholds.
- Create a red-team set for a code assistant (prompt injection, harmful requests). Define zero-tolerance rules and negative tests.
- Design a regression pack for an email classifier (spam/ham/marketing). Include drift indicators and a p95 latency budget.
Mini challenge
Pick any prompt you use weekly. Write 5 success criteria and 8 test cases (including 2 edge cases and 1 negative case). Run once, record misses, and adjust two criteria to reduce ambiguity.
Who this is for
- Prompt Engineers defining quality bars before shipping prompts.
- Data scientists and QA engineers evaluating LLM features.
- Product managers who need clear acceptance criteria for AI behavior.
Prerequisites
- Basic prompt design (instructions, examples, constraints).
- Familiarity with evaluation types: exact match, regex, simple rubrics.
- Comfort reading JSON and writing clear expected outputs.
Learning path
- Before: Problem framing and objective selection.
- Now: Define success criteria and build test cases.
- Next: Automate evaluations, run A/B tests, and track regressions over time.
Next steps
- Automate objective checks with simple scripts or rules.
- Add 5 adversarial cases monthly to your gold set.
- Version your prompts and compare against the frozen gold set before release (see the regression sketch below).
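A minimal sketch of the regression comparison, assuming you already have per-case pass/fail results for the shipped prompt and the candidate on the same frozen gold set; the case IDs and values are illustrative.

```python
# Per-case pass/fail results for the shipped prompt and a candidate, both run
# on the same frozen gold set. Case IDs and values are illustrative.
shipped = {"case_01": True, "case_02": True, "case_03": False, "case_04": True}
candidate = {"case_01": True, "case_02": False, "case_03": True, "case_04": True}

regressions = [cid for cid, ok in shipped.items() if ok and not candidate[cid]]
fixes = [cid for cid, ok in shipped.items() if not ok and candidate[cid]]

print("regressions:", regressions)  # ['case_02'] -> inspect before release
print("fixes:", fixes)              # ['case_03']
print("ship?", not regressions)     # False: a previously passing case now fails
```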