Evaluation And Iteration

Learn Evaluation and Iteration for Prompt Engineers for free: a roadmap, worked examples, subskills, and a skill exam.

Published: January 8, 2026 | Updated: January 8, 2026

Why Evaluation and Iteration matters for Prompt Engineers

Evaluation and iteration turn prompting from guesswork into an engineering practice. As a Prompt Engineer, you will define success criteria, create test sets, compare prompts through A/B tests, analyze failures, control costs and latency, and version your changes. This skill lets you ship prompts that are reliable, safe, fast, and cost-effective—without breaking what already works.

Who this is for

  • Prompt Engineers and ML/AI practitioners who need measurable improvements.
  • Data Scientists and Product Managers responsible for LLM features.
  • QA Engineers and Analysts building evaluation harnesses.

Prerequisites

  • Basic Python or another scripting language to run evaluations.
  • Comfort with metrics (accuracy, precision/recall, latency, cost per request).
  • Familiarity with LLM prompting patterns (system/user messages, few-shot, and formatting).

Learning path

  1. Define success criteria and test cases — Write measurable objectives and golden examples with clear expected outputs.
  2. Build a regression set — Collect diverse, representative prompts and expected outputs; tag them by scenario.
  3. Baseline metrics — Run the current best prompt on the regression set; record quality, cost, and latency.
  4. A/B test prompts — Randomly assign traffic or samples to prompts; compute statistical significance.
  5. Error taxonomy and root cause — Categorize failures to guide targeted fixes.
  6. Iterative refinement — Change one thing at a time; re-run regression; document outcomes.
  7. Versioning — Track prompt versions, diffs, and decision notes.
  8. Cost/latency-quality trade-offs — Optimize to hit SLAs and budgets without losing quality.
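
One pass of this loop can be sketched in a few lines. This is a minimal sketch; `evaluate` and the `run_prompt(version, text)` client are hypothetical names, not part of any specific library:

```python
# Minimal evaluation loop; run_prompt is a hypothetical LLM client you supply.
def evaluate(prompt_version, cases, run_prompt):
    """Run one prompt version over a regression set; return the pass rate."""
    passed = 0
    for case in cases:
        output = run_prompt(prompt_version, case["input"])
        if output.strip().lower() == case["expected"].strip().lower():
            passed += 1
    return {"version": prompt_version, "pass_rate": passed / len(cases)}

# Usage with a stubbed client:
cases = [{"input": "Please close my account", "expected": "cancellation"}]
stub = lambda version, text: "cancellation"
print(evaluate("v1", cases, stub))  # {'version': 'v1', 'pass_rate': 1.0}
```

Everything downstream (baselines, A/B tests, regressions) is a variation on this loop with different metrics attached.
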

Milestones you can tick off
  • Have a written success definition and acceptance criteria.
  • A labeled regression set with at least 100 varied cases.
  • A baseline report with quality, latency (p50/p95), and cost.
  • One successful A/B test with a documented decision.
  • An error taxonomy with frequency counts and example cases.
  • Prompt version history with rationale for each change.

Worked examples

Example 1 — Evaluate classification prompts with accuracy/F1

Task: Classify user intents (billing, tech_support, cancellation). We use a golden labeled set and compute accuracy and macro-F1.

# Pseudo-code; replace call_model with your LLM client
import time
from collections import Counter

gold = [
  {"input": "I need to change my credit card", "label": "billing"},
  {"input": "App keeps crashing on launch", "label": "tech_support"},
  {"input": "Please close my account", "label": "cancellation"},
]

SYSTEM = "Classify the user intent as one of: billing, tech_support, cancellation. Output only the label."

def predict_intent(text):
    prompt = f"System: {SYSTEM}\nUser: {text}\nAssistant:"
    out = call_model(prompt)  # returns a string label
    return out.strip().lower()

preds, labels = [], []
start = time.time()
for ex in gold:
    p = predict_intent(ex["input"])
    preds.append(p)
    labels.append(ex["label"])
latency = (time.time() - start) / len(gold)

# Compute accuracy and macro-F1
classes = sorted(set(labels))
cm = {c: Counter() for c in classes}
for y_true, y_pred in zip(labels, preds):
    cm[y_true][y_pred] += 1

def f1_for(c):
    tp = cm[c][c]
    fp = sum(cm[o][c] for o in classes if o != c)
    # Count every non-c prediction for true class c, including invalid labels
    # the model invented; summing only over known classes would undercount fn.
    fn = sum(n for pred, n in cm[c].items() if pred != c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2*precision*recall / (precision+recall) if precision+recall else 0.0

acc = sum(cm[c][c] for c in classes) / len(labels)
macro_f1 = sum(f1_for(c) for c in classes) / len(classes)

report = {"accuracy": round(acc, 3), "macro_f1": round(macro_f1, 3), "avg_latency_s": round(latency, 3)}
print(report)

Tip: Keep labels constrained with instructions like “Output one of {…}” to reduce free-form errors.
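
A small post-processing validator makes that constraint enforceable. This is a sketch; the `ALLOWED` set and the substring fallback are illustrative assumptions, not a fixed recipe:

```python
# Hypothetical label validator for the intent task above.
ALLOWED = {"billing", "tech_support", "cancellation"}

def normalize_label(raw):
    """Map model output onto an allowed label, or flag it as invalid."""
    label = raw.strip().lower().rstrip(".")
    if label in ALLOWED:
        return label
    # Fall back to substring matching for outputs like "Label: billing"
    for candidate in ALLOWED:
        if candidate in label:
            return candidate
    return "invalid"

print(normalize_label("Billing."))             # billing
print(normalize_label("Label: tech_support"))  # tech_support
print(normalize_label("not sure"))             # invalid
```

Counting "invalid" outputs separately also gives you a useful metric: how often the prompt fails to follow its own output format.
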

Example 2 — LLM-as-judge for summarization quality

Task: Compare two prompts (A, B) for producing helpful summaries. We use a rubric and an LLM judge that outputs a structured verdict.

judge_system = "You are a strict evaluator. Compare two summaries against the source based on the rubric. Output: WINNER: A|B and REASONS: ..."

rubric = """
Criteria (1-5 each):
- Faithfulness (no fabrication)
- Coverage (key points included)
- Clarity (concise, readable)
Overall winner must be chosen (no ties).
"""

def judge(source, summary_a, summary_b):
    prompt = f"System: {judge_system}\nRubric:\n{rubric}\nSOURCE:\n{source}\n---\nSUMMARY A:\n{summary_a}\n---\nSUMMARY B:\n{summary_b}\n"
    out = call_model(prompt)
    return out

# Parse out WINNER line in post-processing

Record win rates over a representative set. Use a minimum sample size (e.g., 100+ cases) before deciding.
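
A minimal parser for that verdict format might look like the sketch below; the `WINNER:` pattern follows the judge instruction above, and `parse_winner` is a hypothetical helper:

```python
import re

def parse_winner(judge_output):
    """Extract 'A' or 'B' from a line like 'WINNER: A'; None if missing."""
    match = re.search(r"WINNER:\s*([AB])\b", judge_output)
    return match.group(1) if match else None

print(parse_winner("WINNER: B\nREASONS: more faithful"))  # B
print(parse_winner("no verdict"))                         # None
```

To reduce position bias, consider judging each pair twice with A and B swapped and only counting verdicts that agree across both orders.
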

Example 3 — A/B test with significance for binary pass/fail

We randomly assign each test case to Prompt A or B and compute pass rates using a strict checker function.

import math, random

def checker(output, expected):
    # strict match or simple normalized comparison
    return output.strip().lower() == expected.strip().lower()

n = 200
assignments = ["A" if random.random() < 0.5 else "B" for _ in range(n)]
results = []
for i, arm in enumerate(assignments):
    out = run_prompt(arm, test_set[i]["input"])  # produce output via A or B
    ok = checker(out, test_set[i]["expected"])   # pass/fail
    results.append((arm, ok))

pa = sum(ok for arm, ok in results if arm == "A")
na = sum(1 for arm, ok in results if arm == "A")
pb = sum(ok for arm, ok in results if arm == "B")
nb = sum(1 for arm, ok in results if arm == "B")

p_a, p_b = pa/na, pb/nb
# two-proportion z-test
p_pool = (pa+pb)/(na+nb)
se = math.sqrt(p_pool*(1-p_pool)*(1/na + 1/nb))
z = (p_b - p_a) / se if se else 0
# Two-sided p-value via the normal CDF: Phi(x) = 0.5*(1 + erf(x/sqrt(2)))
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

Decide based on effect size and practical thresholds, not p-value alone.

Example 4 — Cost/latency-quality trade-offs

We compare two prompts and compute estimated cost and latency per 1,000 requests. Prices and latencies vary by provider and model; treat these as rough, illustrative numbers.

arms = {  # named "arms" because "eval" would shadow the Python builtin
  "A": {"quality": 0.81, "avg_tokens": 900, "avg_latency_s": 1.2},
  "B": {"quality": 0.84, "avg_tokens": 1400, "avg_latency_s": 1.7}
}

# Suppose $1.50 per 1K tokens (rough; example only)
cost_per_1k = 1.50

for k, v in arms.items():
    cost_per_req = (v["avg_tokens"]/1000.0) * cost_per_1k
    cost_1k_reqs = cost_per_req * 1000
    print(k, {
        "quality": v["quality"],
        "avg_latency_s": v["avg_latency_s"],  # an average, not a true p50
        "est_cost_per_1k_reqs_usd": round(cost_1k_reqs, 2)
    })

Choose the prompt that meets your minimum quality while hitting your latency and budget goals.
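
That selection rule can be encoded directly, so the decision is reproducible rather than ad hoc. A sketch, with example thresholds that are illustrative assumptions, not recommendations:

```python
def pick_prompt(arms, min_quality=0.80, max_latency_s=1.5, max_cost_per_1k=2000.0):
    """Return the cheapest arm meeting quality and latency floors, or None."""
    cost_per_1k_tokens = 1.50  # example price, USD; assumption
    eligible = []
    for name, stats in arms.items():
        cost_1k_reqs = stats["avg_tokens"] / 1000.0 * cost_per_1k_tokens * 1000
        if (stats["quality"] >= min_quality
                and stats["avg_latency_s"] <= max_latency_s
                and cost_1k_reqs <= max_cost_per_1k):
            eligible.append((cost_1k_reqs, name))
    return min(eligible)[1] if eligible else None

arms = {
    "A": {"quality": 0.81, "avg_tokens": 900, "avg_latency_s": 1.2},
    "B": {"quality": 0.84, "avg_tokens": 1400, "avg_latency_s": 1.7},
}
print(pick_prompt(arms))  # A (B misses the latency budget despite higher quality)
```

Writing the rule down also forces you to state the minimum quality you will accept, which is itself part of the success definition.
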

Example 5 — Regression set layout and versioning

Keep a simple, auditable structure:

data/
  intents_v1/
    meta.json  # {"task":"intent", "labels":[...], "created":"2026-01-08"}
    cases.jsonl  # each line: {"id":"...","input":"...","label":"...","tags":["edge","short"]}

prompts/
  intent_prompt_v1.txt
  intent_prompt_v2.txt  # diff: tightened label output and examples

runs/
  2026-01-08_baseline_v1.json  # metrics, win rates, latency, cost
  2026-01-10_abtest_v1_v2.json  # assignments, results, decision notes

Each run stores: prompt version, dataset version, metrics, and a short decision note.
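
A loader for this layout can stay very small. The sketch below assumes the `cases.jsonl` fields shown above; error handling is omitted:

```python
import json
from pathlib import Path

def load_cases(path, tags=None):
    """Read a cases.jsonl file, optionally keeping only cases with given tags."""
    cases = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        if tags is None or set(tags) & set(case.get("tags", [])):
            cases.append(case)
    return cases
```

Filtering by tag lets you re-run just the "edge" or "adversarial" slices during fast iterations, then the full set before a decision.
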

Drills and exercises

  • Create a success spec for any LLM task you own. Include inputs, outputs, constraints, and pass criteria.
  • Assemble a 100+ case regression set with at least 5 tags (e.g., long, short, adversarial, multilingual, noisy).
  • Run a baseline evaluation and record accuracy/F1 (or rubric scores), p50/p95 latency, and estimated cost.
  • Design one A/B test plan: sample size, primary metric, and a practical decision threshold.
  • Draft an error taxonomy with 5–8 buckets; label at least 30 failures.
  • Document a prompt change, rationale, and before/after metrics.
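
For the baseline drill, p50/p95 latency can be computed from per-request timings with the standard library's `statistics.quantiles`; `latency_percentiles` is a hypothetical helper name:

```python
import statistics

def latency_percentiles(timings_s):
    """Return p50 and p95 from a list of per-request latencies in seconds."""
    # n=100 yields 99 cut points; index 49 is p50, index 94 is p95.
    qs = statistics.quantiles(timings_s, n=100, method="inclusive")
    return {"p50_s": qs[49], "p95_s": qs[94]}

timings = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 1.8, 2.4, 3.0]
print(latency_percentiles(timings))
```

Track p95 as well as p50: tail latency is usually what breaks an SLA, and averages hide it.
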

Common mistakes and debugging tips

  • Undefined success criteria — Fix by writing explicit pass/fail rules and examples.
  • Changing multiple variables at once — Fix by isolating changes per iteration.
  • Overfitting to a tiny test set — Fix by expanding and refreshing the regression set with diverse tags.
  • Judge leakage — Ensure the judge sees only what it needs; avoid revealing expected answers.
  • Ignoring cost/latency — Track tokens and timing in every run; set thresholds.
  • No version control — Save each prompt and run with version identifiers and decision notes.

Debugging checklist
  • Reproduce a failure case locally with the exact prompt and seed.
  • Toggle one instruction at a time (formatting, few-shot examples, output schema).
  • Add explicit constraints (allowed labels, max length, JSON schema).
  • Strengthen examples that target the top 2–3 error buckets.
  • Re-run the regression set and confirm non-regressions.

Mini project: Intent router with evaluation harness

Goal: Build, evaluate, and iterate a prompt that routes user messages to one of several intents while meeting quality and latency targets.

Scope and steps
  1. Define success: at least 90% accuracy on a 200-case set; p95 latency ≤ 2.0s; estimated cost ≤ $3 per 1k requests (rough example).
  2. Create dataset: 200 labeled messages, balanced across 5 intents, with edge cases.
  3. Baseline: Run Prompt v1; record quality, latency, and cost.
  4. Error analysis: Build taxonomy; find top two error buckets.
  5. Iterate: Introduce constrained output and 3 few-shot examples targeting errors.
  6. A/B test: Compare v1 vs v2; compute win rate and significance.
  7. Versioning: Save versions, diffs, and a decision note.

Deliverables
  • success.md describing goals and pass criteria
  • cases.jsonl with tags
  • prompt_v1.txt, prompt_v2.txt
  • report.json with metrics, costs, latency, and error breakdown
  • decision.md summarizing A/B results and the chosen prompt
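
The report.json deliverable can be assembled at the end of each run. A sketch; the field names mirror the deliverables list and are assumptions, not a fixed schema:

```python
import json
from datetime import date

def write_report(path, prompt_version, dataset_version, metrics, errors):
    """Persist one run's results so decisions stay auditable."""
    report = {
        "date": date.today().isoformat(),
        "prompt_version": prompt_version,
        "dataset_version": dataset_version,
        "metrics": metrics,          # e.g. accuracy, macro_f1, p95 latency, cost
        "error_breakdown": errors,   # e.g. {"ambiguous_intent": 7, ...}
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)

write_report("report.json", "v2", "intents_v1",
             {"accuracy": 0.91, "p95_latency_s": 1.8}, {"ambiguous_intent": 7})
```

Pinning both the prompt version and the dataset version in every report is what makes later comparisons trustworthy.
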

Practical projects

  • Summarization judge: Build an LLM-as-judge rubric for meeting notes and report inter-rater agreement (human vs judge).
  • Red-teaming pack: Create adversarial prompts for data extraction, jailbreaks, and prompt injection; evaluate containment responses.
  • Prompt registry: A simple folder or spreadsheet that tracks prompt IDs, versions, owners, and last evaluation metrics.

Subskills

  • Defining Success Criteria And Test Cases — Make outcomes measurable and unambiguous.
  • Prompt Benchmarking And Regression Sets — Build representative datasets and keep them versioned.
  • A/B Testing Prompts — Compare prompts with statistical rigor and practical thresholds.
  • Error Taxonomy And Root Cause Analysis — Cluster failures and target high-impact fixes.
  • Red Teaming Prompts For Failures — Stress-test prompts with adversarial and edge inputs.
  • Iterative Refinement Process — Apply small, controlled changes; re-measure each time.
  • Tracking Prompt Versions And Changes — Keep diffs, notes, and run artifacts.
  • Measuring Cost, Latency, and Quality — Balance quality with speed and spend; track all three.

Next steps

  • Adopt a weekly evaluation cadence with a fixed dataset and a rotating “error of the week.”
  • Automate: a script that runs your regression set, aggregates metrics, and writes a report.
  • Scale responsibly: add multilingual cases, safety checks, and stress tests before shipping.

Skill exam

This exam checks practical understanding. Everyone can take it for free. If you log in, your progress and results are saved.

Evaluation And Iteration — Skill Exam

15 questions. ~15–20 minutes. Open-notes allowed. Passing score: 70%. You can retake it anytime.
