Why Evaluation and Iteration Matter for Prompt Engineers
Evaluation and iteration turn prompting from guesswork into an engineering practice. As a Prompt Engineer, you will define success criteria, create test sets, compare prompts through A/B tests, analyze failures, control costs and latency, and version your changes. This skill lets you ship prompts that are reliable, safe, fast, and cost-effective—without breaking what already works.
Who this is for
- Prompt Engineers and ML/AI practitioners who need measurable improvements.
- Data Scientists and Product Managers responsible for LLM features.
- QA Engineers and Analysts building evaluation harnesses.
Prerequisites
- Basic Python or another scripting language to run evaluations.
- Comfort with metrics (accuracy, precision/recall, latency, cost per request).
- Familiarity with LLM prompting patterns (system/user messages, few-shot, and formatting).
Learning path
- Define success criteria and test cases — Write measurable objectives and golden examples with clear expected outputs.
- Build a regression set — Collect diverse, representative prompts and expected outputs; tag them by scenario.
- Baseline metrics — Run the current best prompt on the regression set; record quality, cost, and latency.
- A/B test prompts — Randomly assign traffic or samples to prompts; compute statistical significance.
- Error taxonomy and root cause — Categorize failures to guide targeted fixes.
- Iterative refinement — Change one thing at a time; re-run regression; document outcomes.
- Versioning — Track prompt versions, diffs, and decision notes.
- Cost/latency-quality trade-offs — Optimize to hit SLAs and budgets without losing quality.
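The first two steps above boil down to data you can check in code: a success spec plus golden cases with expected outputs. A minimal sketch (the field names here are illustrative, not a standard):

```python
# A minimal success spec and golden cases for an intent classifier.
# Field names are illustrative; adapt them to your own harness.
success_spec = {
    "task": "intent_classification",
    "labels": ["billing", "tech_support", "cancellation"],
    "pass_criteria": {"min_accuracy": 0.90, "p95_latency_s": 2.0},
}

golden_cases = [
    {"id": "g1", "input": "I need to change my credit card", "label": "billing", "tags": ["short"]},
    {"id": "g2", "input": "App keeps crashing on launch", "label": "tech_support", "tags": ["short"]},
]

def validate_case(case, spec):
    """A case is usable only if its label is one the spec allows."""
    return case["label"] in spec["labels"]

assert all(validate_case(c, success_spec) for c in golden_cases)
```

Validating cases against the spec up front catches label typos before they pollute your metrics.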
Milestones you can tick off
- Have a written success definition and acceptance criteria.
- A labeled regression set with at least 100 varied cases.
- A baseline report with quality, latency (p50/p95), and cost.
- One successful A/B test with a documented decision.
- An error taxonomy with frequency counts and example cases.
- Prompt version history with rationale for each change.
Worked examples
Example 1 — Evaluate classification prompts with accuracy/F1
Task: Classify user intents (billing, tech_support, cancellation). We use a golden labeled set and compute accuracy and macro-F1.
```python
# Pseudo-code; replace call_model with your LLM client
import time
from collections import Counter

gold = [
    {"input": "I need to change my credit card", "label": "billing"},
    {"input": "App keeps crashing on launch", "label": "tech_support"},
    {"input": "Please close my account", "label": "cancellation"},
]

SYSTEM = "Classify the user intent as one of: billing, tech_support, cancellation. Output only the label."

def predict_intent(text):
    prompt = f"System: {SYSTEM}\nUser: {text}\nAssistant:"
    out = call_model(prompt)  # returns a string label
    return out.strip().lower()

preds, labels = [], []
start = time.time()
for ex in gold:
    preds.append(predict_intent(ex["input"]))
    labels.append(ex["label"])
latency = (time.time() - start) / len(gold)

# Confusion counts: cm[true_label][predicted_label]
classes = sorted(set(labels))
cm = {c: Counter() for c in classes}
for y_true, y_pred in zip(labels, preds):
    cm[y_true][y_pred] += 1

def f1_for(c):
    tp = cm[c][c]
    fp = sum(cm[o][c] for o in classes if o != c)
    fn = sum(cm[c][o] for o in classes if o != c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

acc = sum(cm[c][c] for c in classes) / len(labels)
macro_f1 = sum(f1_for(c) for c in classes) / len(classes)
report = {"accuracy": round(acc, 3), "macro_f1": round(macro_f1, 3), "avg_latency_s": round(latency, 3)}
print(report)
```
Tip: Keep labels constrained with instructions like “Output one of {…}” to reduce free-form errors.
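That constraint can also be enforced in post-processing: validate the raw model output against the allowed label set and map anything else to a sentinel. A sketch (the `ALLOWED` set and `unknown` fallback are assumptions for this example):

```python
ALLOWED = {"billing", "tech_support", "cancellation"}

def normalize_label(raw: str, allowed=ALLOWED, fallback="unknown"):
    """Strip whitespace/punctuation, lowercase, and reject anything
    outside the allowed label set instead of trusting free-form output."""
    label = raw.strip().strip(".\"'").lower().replace(" ", "_")
    return label if label in allowed else fallback

print(normalize_label("  Billing. "))            # billing
print(normalize_label("I think it's billing"))   # unknown
```

Counting how often the fallback fires is itself a useful metric: it tells you how well the prompt's output constraint is holding.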
Example 2 — LLM-as-judge for summarization quality
Task: Compare two prompts (A, B) for producing helpful summaries. We use a rubric and an LLM judge that outputs a structured verdict.
```python
judge_system = "You are a strict evaluator. Compare two summaries against the source based on the rubric. Output: WINNER: A|B and REASONS: ..."

rubric = """
Criteria (1-5 each):
- Faithfulness (no fabrication)
- Coverage (key points included)
- Clarity (concise, readable)
Overall winner must be chosen (no ties).
"""

def judge(source, summary_a, summary_b):
    prompt = (
        f"System: {judge_system}\nRubric:\n{rubric}\n"
        f"SOURCE:\n{source}\n---\nSUMMARY A:\n{summary_a}\n---\nSUMMARY B:\n{summary_b}\n"
    )
    out = call_model(prompt)  # pseudo-code; replace with your LLM client
    return out  # parse out the WINNER line in post-processing
```
Record win rates over a representative set, and swap the A/B presentation order for half the cases to control for position bias. Use a minimum sample size (e.g., 100+ cases) before deciding.
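Post-processing the verdict can be a one-line regex scan; this sketch assumes the judge follows the `WINNER: A|B` format requested above and returns `None` for malformed output:

```python
import re

def parse_winner(judge_output: str):
    """Extract 'A' or 'B' from a 'WINNER: X' line; None if absent or malformed."""
    match = re.search(r"WINNER:\s*([AB])\b", judge_output)
    return match.group(1) if match else None

print(parse_winner("WINNER: B\nREASONS: better coverage"))  # B
print(parse_winner("I cannot decide"))                      # None
```

Track the `None` rate separately: frequent malformed verdicts mean the judge prompt itself needs iteration.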
Example 3 — A/B test with significance for binary pass/fail
We randomly assign each test case to Prompt A or B and compute pass rates using a strict checker function.
```python
# Pseudo-code; run_prompt and test_set come from your harness
import math, random

def checker(output, expected):
    # strict match or simple normalized comparison
    return output.strip().lower() == expected.strip().lower()

n = 200
assignments = ["A" if random.random() < 0.5 else "B" for _ in range(n)]
results = []
for i, arm in enumerate(assignments):
    out = run_prompt(arm, test_set[i]["input"])  # produce output via A or B
    ok = checker(out, test_set[i]["expected"])   # pass/fail
    results.append((arm, ok))

# Per-arm pass counts and sample sizes
pa = sum(ok for arm, ok in results if arm == "A")
na = sum(1 for arm, ok in results if arm == "A")
pb = sum(ok for arm, ok in results if arm == "B")
nb = sum(1 for arm, ok in results if arm == "B")
p_a, p_b = pa / na, pb / nb

# Two-proportion z-test
p_pool = (pa + pb) / (na + nb)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / na + 1 / nb))
z = (p_b - p_a) / se if se else 0.0
# Two-sided p-value ~ 2 * (1 - Phi(|z|)) – approximate or look up
```
Decide based on effect size and practical thresholds, not p-value alone.
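The two-sided p-value mentioned in the code comment needs no lookup table: Python's `math.erf` gives the standard normal CDF directly.

```python
import math

def two_sided_p_value(z: float) -> float:
    """p = 2 * (1 - Phi(|z|)), with Phi the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

print(round(two_sided_p_value(1.96), 3))  # 0.05
print(round(two_sided_p_value(0.5), 3))   # 0.617
```

At z ≈ 1.96 the p-value crosses the conventional 0.05 threshold, which is why that number shows up so often; pair it with an effect-size check before acting.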
Example 4 — Cost/latency-quality trade-offs
We compare two prompts and compute estimated cost and latency per 1,000 requests. Prices vary by provider and model; treat the numbers below as rough examples.
```python
# Named "arms" to avoid shadowing the built-in eval()
arms = {
    "A": {"quality": 0.81, "avg_tokens": 900, "avg_latency_s": 1.2},
    "B": {"quality": 0.84, "avg_tokens": 1400, "avg_latency_s": 1.7},
}

# Suppose $1.50 per 1K tokens (rough; example only)
cost_per_1k = 1.50
for k, v in arms.items():
    cost_per_req = (v["avg_tokens"] / 1000.0) * cost_per_1k
    cost_1k_reqs = cost_per_req * 1000
    print(k, {
        "quality": v["quality"],
        "latency_p50_s": v["avg_latency_s"],
        "est_cost_per_1k_reqs_usd": round(cost_1k_reqs, 2),
    })
```
Choose the prompt that meets your minimum quality while hitting your latency and budget goals.
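That decision rule — satisfy the constraints, then maximize quality — can be sketched as a small selection function (threshold values are examples, and the metric dict mirrors the one above):

```python
def pick_prompt(candidates, min_quality=0.80, max_latency_s=1.5, max_cost_usd=2000.0):
    """Among prompts that meet every constraint, return the highest-quality one."""
    feasible = {
        name: m for name, m in candidates.items()
        if m["quality"] >= min_quality
        and m["avg_latency_s"] <= max_latency_s
        and m["est_cost_per_1k_reqs_usd"] <= max_cost_usd
    }
    if not feasible:
        return None
    return max(feasible, key=lambda name: feasible[name]["quality"])

candidates = {
    "A": {"quality": 0.81, "avg_latency_s": 1.2, "est_cost_per_1k_reqs_usd": 1350.0},
    "B": {"quality": 0.84, "avg_latency_s": 1.7, "est_cost_per_1k_reqs_usd": 2100.0},
}
print(pick_prompt(candidates))  # A (B violates the latency and cost limits)
```

Returning `None` when nothing is feasible is deliberate: it forces an explicit decision to relax a constraint rather than silently shipping the least-bad option.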
Example 5 — Regression set layout and versioning
Keep a simple, auditable structure:
```
data/
  intents_v1/
    meta.json        # {"task": "intent", "labels": [...], "created": "2026-01-08"}
    cases.jsonl      # each line: {"id": "...", "input": "...", "label": "...", "tags": ["edge", "short"]}
prompts/
  intent_prompt_v1.txt
  intent_prompt_v2.txt           # diff: tightened label output and examples
runs/
  2026-01-08_baseline_v1.json    # metrics, win rates, latency, cost
  2026-01-10_abtest_v1_v2.json   # assignments, results, decision notes
```
Each run stores: prompt version, dataset version, metrics, and a short decision note.
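Loading that layout takes only the standard library; this sketch parses cases.jsonl-style records and summarizes tag coverage (the helper names and sample records are illustrative):

```python
import json
from collections import Counter

def load_cases(lines):
    """Parse JSONL case records (one JSON object per non-empty line)."""
    return [json.loads(line) for line in lines if line.strip()]

def tag_counts(cases):
    """Count how often each tag appears, to spot thin scenario coverage."""
    return Counter(tag for case in cases for tag in case.get("tags", []))

sample = [
    '{"id": "c1", "input": "close my account", "label": "cancellation", "tags": ["short"]}',
    '{"id": "c2", "input": "app crashes", "label": "tech_support", "tags": ["short", "edge"]}',
]
print(tag_counts(load_cases(sample)))  # Counter({'short': 2, 'edge': 1})
```

Printing tag counts after every dataset update is a cheap guard against a regression set drifting toward one scenario.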
Drills and exercises
- Create a success spec for any LLM task you own. Include inputs, outputs, constraints, and pass criteria.
- Assemble a 100+ case regression set with at least 5 tags (e.g., long, short, adversarial, multilingual, noisy).
- Run a baseline evaluation and record accuracy/F1 (or rubric scores), p50/p95 latency, and estimated cost.
- Design one A/B test plan: sample size, primary metric, and a practical decision threshold.
- Draft an error taxonomy with 5–8 buckets; label at least 30 failures.
- Document a prompt change, rationale, and before/after metrics.
Common mistakes and debugging tips
- Undefined success criteria — Fix by writing explicit pass/fail rules and examples.
- Changing multiple variables at once — Fix by isolating changes per iteration.
- Overfitting to a tiny test set — Fix by expanding and refreshing the regression set with diverse tags.
- Judge leakage — Ensure the judge sees only what it needs; avoid revealing expected answers.
- Ignoring cost/latency — Track tokens and timing in every run; set thresholds.
- No version control — Save each prompt and run with version identifiers and decision notes.
Debugging checklist
- Reproduce a failure case locally with the exact prompt and seed.
- Toggle one instruction at a time (formatting, few-shot examples, output schema).
- Add explicit constraints (allowed labels, max length, JSON schema).
- Strengthen examples that target the top 2–3 error buckets.
- Re-run the regression set and confirm non-regressions.
Mini project: Intent router with evaluation harness
Goal: Build, evaluate, and iterate a prompt that routes user messages to one of several intents while meeting quality and latency targets.
Scope and steps
- Define success: at least 90% accuracy on a 200-case set; p95 latency ≤ 2.0s; estimated cost ≤ $3 per 1k requests (rough example).
- Create dataset: 200 labeled messages, balanced across 5 intents, with edge cases.
- Baseline: Run Prompt v1; record quality, latency, and cost.
- Error analysis: Build taxonomy; find top two error buckets.
- Iterate: Introduce constrained output and 3 few-shot examples targeting errors.
- A/B test: Compare v1 vs v2; compute win rate and significance.
- Versioning: Save versions, diffs, and a decision note.
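The acceptance targets above can be checked mechanically at the end of each run. A sketch, assuming a report dict with the metric names used in this section (a nearest-rank p95 helper is included since the targets reference it):

```python
import math

def p95(latencies):
    """95th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]

def meets_targets(report, min_acc=0.90, max_p95_s=2.0, max_cost_usd=3.0):
    """True only if every acceptance criterion from the success spec holds."""
    return (
        report["accuracy"] >= min_acc
        and report["p95_latency_s"] <= max_p95_s
        and report["cost_per_1k_usd"] <= max_cost_usd
    )

report = {"accuracy": 0.92, "p95_latency_s": 1.8, "cost_per_1k_usd": 2.4}
print(meets_targets(report))  # True
```

Encoding the pass criteria as a function means the A/B decision in the final step can be made (and audited) automatically.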
Deliverables
- success.md describing goals and pass criteria
- cases.jsonl with tags
- prompt_v1.txt, prompt_v2.txt
- report.json with metrics, costs, latency, and error breakdown
- decision.md summarizing A/B results and the chosen prompt
Practical projects
- Summarization judge: Build an LLM-as-judge rubric for meeting notes and report inter-rater agreement (human vs judge).
- Red-teaming pack: Create adversarial prompts for data extraction, jailbreaks, and prompt injection; evaluate containment responses.
- Prompt registry: A simple folder or spreadsheet that tracks prompt IDs, versions, owners, and last evaluation metrics.
Subskills
- Defining Success Criteria And Test Cases — Make outcomes measurable and unambiguous.
- Prompt Benchmarking And Regression Sets — Build representative datasets and keep them versioned.
- A/B Testing Prompts — Compare prompts with statistical rigor and practical thresholds.
- Error Taxonomy And Root Cause Analysis — Cluster failures and target high-impact fixes.
- Red Teaming Prompts For Failures — Stress-test prompts with adversarial and edge inputs.
- Iterative Refinement Process — Apply small, controlled changes; re-measure each time.
- Tracking Prompt Versions And Changes — Keep diffs, notes, and run artifacts.
- Measuring Cost, Latency, and Quality — Balance quality with speed and spend; track all three.
Next steps
- Adopt a weekly evaluation cadence with a fixed dataset and a rotating “error of the week.”
- Automate: a script that runs your regression set, aggregates metrics, and writes a report.
- Scale responsibly: add multilingual cases, safety checks, and stress tests before shipping.
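That automation step can start as a single function: run every case through a checker, collect pass/fail and latency, and emit a report dict. A sketch where `predict` and `checker` are stand-ins for your own harness:

```python
import json, time

def run_regression(cases, predict, checker):
    """Run every case, collect pass/fail and latency, and return a report dict."""
    results, latencies = [], []
    for case in cases:
        start = time.time()
        output = predict(case["input"])
        latencies.append(time.time() - start)
        results.append(checker(output, case["label"]))
    return {
        "n_cases": len(cases),
        "pass_rate": round(sum(results) / len(results), 3),
        "avg_latency_s": round(sum(latencies) / len(latencies), 4),
    }

# Toy harness: an "echo" predictor against trivially checkable cases.
cases = [{"input": "billing", "label": "billing"}, {"input": "cancel", "label": "cancellation"}]
report = run_regression(cases, predict=lambda x: x, checker=lambda out, exp: out == exp)
print(json.dumps(report))  # pass_rate is 0.5 on this toy pair
```

Writing the report to a dated file under runs/ gives you the week-over-week trail the evaluation cadence depends on.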
Skill exam
This exam checks practical understanding. Everyone can take it for free. If you log in, your progress and results are saved.