Why this matters
As a Prompt Engineer, you ship prompts that must work reliably, not just on a few examples. Clear success criteria and test cases let you:
- Decide quickly if a prompt is good enough to ship.
- Compare versions during iteration and A/B tests.
- Catch regressions when models, prompts, or tools change.
- Communicate quality and trade-offs (accuracy, latency, cost, safety) to stakeholders.
Concept explained simply
Success criteria are the rules for what "good" looks like. Test cases are examples you run to check those rules.
- Objective checks: exact fields, numbers, dates, labels.
- Subjective checks: clarity, tone, usefulness (use rubrics to score).
- Operational checks: speed, cost, safety, constraint adherence.
Mental model
Think of success criteria as a contract, and test cases as courtroom evidence. If the prompt meets the contract across your tests, it passes. If not, you iterate. Make the contract SMART: Specific, Measurable, Achievable, Relevant, Time-bound.
Step-by-step method
- Define the job-to-be-done. What user task should the prompt enable? One sentence.
- List failure modes. Hallucination, missed fields, wrong tone, unsafe content, slow/expensive outputs.
- Turn failures into success criteria. Create measurable rules (e.g., "No unsupported claims", "Extract date in YYYY-MM-DD").
- Design a minimal test set. Include normal cases (happy paths), edge cases, and negatives (should reject).
- Choose scoring methods. Exact match, regex, numeric tolerance, rubric (1–5) with clear anchors, checklist of constraints.
- Set thresholds. Example: "≥ 90% exact extraction accuracy on the gold set; 0 safety violations; median latency ≤ 2s; cost ≤ $0.005 per call."
- Pilot and refine. Run on a small set, inspect misses, adjust criteria or tests if they were ambiguous.
- Version and lock. Freeze a gold test set for regression checks. Add new cases only through review.
Pro tip: balance objective and subjective
Automate objective checks first to speed iteration. Use small, well-defined subjective rubrics for qualities like tone or clarity, and keep them consistent across reviewers.
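As a minimal sketch of what "automate objective checks first" can look like, the Python snippet below pairs a few objective check helpers with a threshold gate. The function names (exact_match, valid_iso_date, within_tolerance, passes_threshold), the sample data, and the 90% bar are illustrative assumptions, not part of any specific framework.

```python
import re

# Objective checks: each returns True/False so results can be aggregated.
def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

def valid_iso_date(value: str) -> bool:
    # The prompt is expected to normalize dates to YYYY-MM-DD.
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", value or "") is not None

def within_tolerance(output: float, expected: float, tol: float = 0.01) -> bool:
    return abs(output - expected) <= tol

# Threshold gate: aggregate per-case results and compare against the shipping bar.
def passes_threshold(results: list, threshold: float = 0.90) -> bool:
    return sum(results) / len(results) >= threshold

# Example: three extraction cases scored by exact match against a gold set.
gold = ["2025-03-07", "2025-04-01", "2025-05-20"]
outputs = ["2025-03-07", "2025-04-01", "2025-05-21"]
case_results = [exact_match(o, g) for o, g in zip(outputs, gold)]
print(passes_threshold(case_results))  # False: 2/3 is below the 0.90 bar
```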
Worked examples
Example 1: Meeting note summarizer
- Job: Summarize meeting transcripts into 5 bullet points with decisions and action items.
- Success criteria:
- Format: 5 bullets, each ≤ 25 words (see the check sketch after this example).
- Coverage: All explicit decisions and owners captured (≥ 90% on gold set).
- No invented facts (0 hallucinations across tests).
- Tone: Neutral and concise (rubric ≥ 4/5).
- Test cases: 12 transcripts (8 normal, 2 with no decisions, 2 with overlapping speakers). Expected bullets listed per case.
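A minimal sketch of how the format criterion could be automated, assuming the summarizer emits one bullet per line prefixed with "-"; the function name and sample output are illustrative.

```python
# Format check for Example 1: exactly 5 bullets, each at most 25 words.
def check_summary_format(summary: str, n_bullets: int = 5, max_words: int = 25) -> bool:
    bullets = [line for line in summary.splitlines() if line.strip().startswith("-")]
    if len(bullets) != n_bullets:
        return False
    return all(len(b.lstrip("- ").split()) <= max_words for b in bullets)

sample = "\n".join(f"- Decision {i}: ship the feature next sprint." for i in range(1, 6))
print(check_summary_format(sample))  # True
```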
Example 2: Invoice field extraction
- Job: Extract vendor_name, invoice_number, invoice_date (YYYY-MM-DD), total_amount.
- Success criteria:
- Schema compliance: All fields present; no extras.
- Exact match for invoice_number; date valid and normalized; amount numeric with two decimals (see the validator sketch after this example).
- Accuracy: ≥ 95% field-level accuracy on gold set.
- Latency: p95 ≤ 1.5s; Cost ≤ $0.003 per invoice.
- Test cases: 30 invoices (scanned text, different formats, missing PO, European dates). Expected JSON per case.
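A minimal sketch of the schema and normalization checks for this example, assuming the extraction prompt returns a flat JSON object; the helper name validate_invoice and the sample values are illustrative.

```python
import re
from datetime import datetime

REQUIRED_FIELDS = {"vendor_name", "invoice_number", "invoice_date", "total_amount"}

def validate_invoice(record: dict, expected: dict) -> dict:
    """Per-field pass/fail report for one extracted invoice."""
    report = {"schema": set(record) == REQUIRED_FIELDS}  # all fields present, no extras
    report["invoice_number"] = record.get("invoice_number") == expected["invoice_number"]
    date = record.get("invoice_date") or ""
    try:
        datetime.strptime(date, "%Y-%m-%d")  # valid, normalized date
        report["invoice_date"] = date == expected["invoice_date"]
    except ValueError:
        report["invoice_date"] = False
    amount = str(record.get("total_amount", ""))
    report["total_amount"] = (re.fullmatch(r"\d+\.\d{2}", amount) is not None
                              and amount == expected["total_amount"])
    return report

extracted = {"vendor_name": "Apex Labs", "invoice_number": "INV-1042",
             "invoice_date": "2025-03-07", "total_amount": "149.00"}
gold = dict(extracted)  # in a real test set the gold values are labeled by hand
print(validate_invoice(extracted, gold))  # every check True
```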
Example 3: Toxicity moderation
- Job: Classify user text as safe or toxic.
- Success criteria:
- Recall on toxic: ≥ 98% (prioritize safety).
- Precision on toxic: ≥ 90% (both metrics are computed in the sketch after this example).
- No unsafe content generated in rationales (0 violations).
- Test cases: 500 labeled texts (balanced, adversarial spelling, coded slurs). Expected labels provided.
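A minimal sketch of how precision and recall on the toxic class could be computed and compared to these thresholds; the tiny label lists are illustrative, not real evaluation data.

```python
# Precision and recall on the toxic class, compared against the Example 3 thresholds.
def toxic_metrics(predicted, gold):
    tp = sum(p == "toxic" and g == "toxic" for p, g in zip(predicted, gold))
    fp = sum(p == "toxic" and g == "safe" for p, g in zip(predicted, gold))
    fn = sum(p == "safe" and g == "toxic" for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = ["toxic", "toxic", "safe", "safe", "toxic"]
pred = ["toxic", "toxic", "toxic", "safe", "toxic"]
precision, recall = toxic_metrics(pred, gold)
print(recall >= 0.98, precision >= 0.90)  # True, False (recall 1.0, precision 0.75)
```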
Exercises
Do these now. Then compare with the solutions.
Exercise 1: Write success criteria for a product review summarizer
Create 6–8 measurable criteria for a prompt that summarizes a set of product reviews into a short buyer guide. Include at least:
- Format and length constraints
- Coverage and factuality
- Tone and bias checks
- Latency and cost
Sample solution
Example criteria:
- Output sections: Pros, Cons, Best for (exact headers).
- Length: ≤ 120 words total.
- Coverage: Mention top 3 recurring pros and top 3 cons (≥ 90% match to labeled set).
- Factuality: 0 claims not supported by the input reviews.
- Tone: Helpful, neutral (rubric ≥ 4/5).
- Bias: No absolute claims (avoid "perfect", "flawless"). Violations = 0 (see the check sketch after this list).
- Latency: p95 ≤ 2s; Cost ≤ $0.004 per request.
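A minimal sketch of how the header, length, and banned-absolutes criteria could be checked, assuming the buyer guide places each section header on its own line; the function name and sample output are illustrative.

```python
import re

HEADERS = ["Pros", "Cons", "Best for"]                         # exact section headers
BANNED = re.compile(r"\b(perfect|flawless)\b", re.IGNORECASE)  # absolute claims

def check_buyer_guide(text: str, max_words: int = 120) -> dict:
    lines = [line.strip() for line in text.splitlines()]
    return {
        "headers": all(h in lines for h in HEADERS),
        "length": len(text.split()) <= max_words,
        "no_absolutes": BANNED.search(text) is None,
    }

sample = "Pros\n- Long battery life\nCons\n- Heavy\nBest for\n- Commuters"
print(check_buyer_guide(sample))  # {'headers': True, 'length': True, 'no_absolutes': True}
```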
Exercise 2: Design a minimal test set for entity extraction
For extracting person, organization, and date from customer emails, propose 12 test cases with expected outputs. Include:
- 8 normal cases with clear entities
- 2 edge cases with ambiguous names/dates
- 2 negative cases with no entities
Sample solution
Structure each case as Input and Expected JSON (a runnable test-set sketch follows the list):
- Input: "Meeting with Dana White from Apex Labs on 03/07/2025." Expected: {"person":"Dana White","organization":"Apex Labs","date":"2025-03-07"}
- Input: "Spoke to Mr. Jordan at Northwind." Expected: {"person":"Jordan","organization":"Northwind","date":null}
- Input: "Let's sync next Friday." Expected: {"person":null,"organization":null,"date":null} (negative)
- Include ambiguous date formats (03/07 can be dd/mm or mm/dd) and specify normalization rules in expectations.
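A minimal sketch of how such a test set could be stored and run, assuming the extraction prompt is asked to return JSON; extract_fn stands in for your model call and is stubbed with the gold answers here.

```python
import json

# A few gold cases as (input, expected) pairs; extend to the full 12-case set.
GOLD_CASES = [
    ("Meeting with Dana White from Apex Labs on 03/07/2025.",
     {"person": "Dana White", "organization": "Apex Labs", "date": "2025-03-07"}),
    ("Spoke to Mr. Jordan at Northwind.",
     {"person": "Jordan", "organization": "Northwind", "date": None}),
    ("Let's sync next Friday.",
     {"person": None, "organization": None, "date": None}),
]

def run_test_set(extract_fn):
    """Run every case through the extraction prompt and return the pass rate."""
    passed = 0
    for text, expected in GOLD_CASES:
        try:
            output = json.loads(extract_fn(text))  # the prompt is asked to return JSON
        except json.JSONDecodeError:
            continue  # malformed JSON counts as a failure
        if output == expected:
            passed += 1
    return passed / len(GOLD_CASES)

# extract_fn would wrap the model call; here it is stubbed with the gold answers.
stub = {text: json.dumps(expected) for text, expected in GOLD_CASES}
print(run_test_set(lambda text: stub[text]))  # 1.0
```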
Checklist: good success criteria
- Specific: One idea per criterion.
- Measurable: Objective rule, regex, exact match, rubric with anchors.
- Achievable: Thresholds match realistic performance.
- Relevant: Tied to user job-to-be-done.
- Time-bound/Operational: Latency and cost constraints are explicit.
- Safety: Clear do-not-do constraints with zero tolerance when needed.
Common mistakes and how to self-check
- Vague language ("good," "high-quality"). Self-check: Would two reviewers score it the same way? If not, rewrite.
- Only happy-path tests. Self-check: Do you have edge and negative cases?
- No frozen gold set. Self-check: Are you changing tests while comparing versions?
- Ignoring cost/latency. Self-check: Did you set p95 latency and max cost per call?
- Missing safety gates. Self-check: Are unsafe outputs explicitly disallowed with a check?
- Thresholds without rationale. Self-check: Note why each number is chosen (baseline + desired uplift).
Mini self-audit
- At least 3 failure modes identified
- ≥ 10 test cases with expected outputs
- Automated checks defined for objective parts
- Rubric criteria have 1–5 anchors described
Practical projects
- Build a 20-case gold set for a FAQ answerer with exact-match answers and a 1–5 helpfulness rubric. Set ship/no-ship thresholds.
- Create a red-team set for a code assistant (prompt injection, harmful requests). Define zero-tolerance rules and negative tests.
- Design a regression pack for an email classifier (spam/ham/marketing). Include drift indicators and a p95 latency budget.
Mini challenge
Pick any prompt you use weekly. Write 5 success criteria and 8 test cases (including 2 edge cases and 1 negative case). Run once, record misses, and adjust two criteria to reduce ambiguity.
Who this is for
- Prompt Engineers defining quality bars before shipping prompts.
- Data scientists and QA engineers evaluating LLM features.
- Product managers who need clear acceptance criteria for AI behavior.
Prerequisites
- Basic prompt design (instructions, examples, constraints).
- Familiarity with evaluation types: exact match, regex, simple rubrics.
- Comfort reading JSON and writing clear expected outputs.
Learning path
- Before: Problem framing and objective selection.
- Now: Define success criteria and build test cases.
- Next: Automate evaluations, run A/B tests, and track regressions over time.
Next steps
- Automate objective checks with simple scripts or rules.
- Add 5 adversarial cases monthly to your gold set.
- Version your prompts and compare against the frozen gold set before release (see the regression sketch below).
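A minimal sketch of the regression comparison, assuming you already have per-case pass/fail results for the shipped prompt and the candidate on the same frozen gold set; the case IDs and values are illustrative.

```python
# Per-case pass/fail results for the shipped prompt and a candidate, both run
# on the same frozen gold set. Case IDs and values are illustrative.
shipped = {"case_01": True, "case_02": True, "case_03": False, "case_04": True}
candidate = {"case_01": True, "case_02": False, "case_03": True, "case_04": True}

regressions = [cid for cid, ok in shipped.items() if ok and not candidate[cid]]
fixes = [cid for cid, ok in shipped.items() if not ok and candidate[cid]]

print("regressions:", regressions)  # ['case_02'] -> inspect before release
print("fixes:", fixes)              # ['case_03']
print("ship?", not regressions)     # False: a previously passing case now fails
```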