Step-by-step setup
- Define the decision: What will you do differently based on results? Set numeric thresholds.
- Translate goals to a rubric: Create observable criteria with 2–5 anchors and examples.
- Plan sampling: Stratify by intent, difficulty, language, and user segments. Include edge cases (see the sampling sketch after this list).
- Draft tasks: Clear instructions, anonymized outputs, a time estimate per item; don't reveal model identity.
- Choose raters: Internal SMEs for nuance; crowd for scale. Provide training and practice rounds.
- Quality controls: Gold questions, attention checks, calibration sessions, IRR measurement.
- Run a pilot: 10–15% of the total. Adjust the rubric until IRR is acceptable.
- Scale and aggregate: Collect the full set, compute metrics, slice by segment, and make a clear decision.
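For the sampling step, here is a minimal sketch of stratified sampling in Python, assuming your logged interactions sit in a pandas DataFrame with hypothetical `intent`, `difficulty`, `language`, and `is_edge_case` columns; adapt the strata and counts to your product.

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, per_stratum: int, seed: int = 42) -> pd.DataFrame:
    """Draw up to `per_stratum` items from every intent/difficulty/language cell."""
    return (
        df.groupby(["intent", "difficulty", "language"], group_keys=False)
          .apply(lambda g: g.sample(n=min(per_stratum, len(g)), random_state=seed))
    )

# Usage with a hypothetical `interactions` DataFrame:
# sample = stratified_sample(interactions, per_stratum=25)
# Top up with known edge cases so rare-but-important segments are always represented:
# sample = pd.concat([sample, interactions[interactions["is_edge_case"]]]).drop_duplicates()
```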
Quality control tips
- Blind the model identity and randomize order.
- Use gold items (pre-labeled) to monitor rater accuracy in-flight (see the sketch after this list).
- Track IRR at least daily; pause raters below threshold for retraining.
- Keep dev, validation, and test samples strictly separate.
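As a sketch of the in-flight gold check (the field names and the 0.8 threshold are illustrative assumptions, not prescribed values), you can score each rater against the pre-labeled gold answers and flag anyone who falls below the bar for retraining:

```python
from collections import defaultdict

GOLD_ACCURACY_THRESHOLD = 0.8  # illustrative; calibrate from your pilot

def flag_low_accuracy_raters(judgments, gold_labels):
    """judgments: iterable of (rater_id, item_id, label); gold_labels: {item_id: correct_label}."""
    correct, total = defaultdict(int), defaultdict(int)
    for rater_id, item_id, label in judgments:
        if item_id in gold_labels:                 # only gold items count toward accuracy
            total[rater_id] += 1
            correct[rater_id] += int(label == gold_labels[item_id])
    return [
        rater for rater, n in total.items()
        if correct[rater] / n < GOLD_ACCURACY_THRESHOLD
    ]

# Raters returned here get paused and routed back to calibration before rejoining the run.
```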
Practical projects
- Project 1: Build a 3-criterion rubric (helpfulness, correctness, tone) and evaluate 100 model responses from your product’s top intent. Compute mean per criterion and IRR.
- Project 2: Run a pairwise preference test between two prompt versions (n ≥ 400 judgments). Report the win rate with a confidence interval and a segment breakdown (see the sketch after this list).
- Project 3: Safety audit — create a 200-prompt adversarial set, measure pass rate, and propose policy/rubric changes from failure tags.
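For Project 2's report, here is a minimal sketch of a win rate with a 95% confidence interval, assuming you have the count of non-tied judgments won by the candidate prompt; the Wilson score interval used here is one reasonable choice, not the only one.

```python
import math

def win_rate_with_ci(wins: int, total: int, z: float = 1.96):
    """Return the win rate and a Wilson score confidence interval."""
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return p, (center - half, center + half)

# Example with made-up numbers: prompt B wins 230 of 400 non-tied judgments.
rate, (lo, hi) = win_rate_with_ci(230, 400)
print(f"win rate {rate:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

Run the same calculation per segment to produce the breakdown; small segments will show wide intervals, which is itself useful information.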
Exercises (hands-on)
Try these now.
Exercise 1 — Design a rater rubric for AI email reply suggestions
Your product suggests email replies to customer messages. Create a rubric and task plan.
- Define 4–5 criteria with 1–5 anchors and examples.
- Specify pass/fail guardrails for safety.
- Describe sampling across easy/ambiguous cases.
- Draft rater instructions (what to look for, what to ignore).
Hints
- Keep anchors observable (e.g., “contains actionable next step”).
- Include brevity/clarity and tone as separate criteria.
- Add at least 10 gold questions.
Show solution
Example rubric and plan:
- Relevance: 1 off-topic → 5 directly addresses the user’s question.
- Correctness: 1 factual errors → 5 factually accurate; cites policy or steps correctly.
- Actionability: 1 no next step → 5 clear, specific next step or question.
- Tone: 1 rude → 5 professional, empathetic, concise.
- Safety (guardrail): Fail if contains prohibited content or discloses private data.
- Sampling: 300 messages, including 60 ambiguous and 30 multilingual.
- Instructions: Blind model identity; rate each criterion 1–5; mark Fail if safety is violated; note any ambiguity.
- Quality controls: Include 12 gold items; IRR target kappa ≥ 0.6.
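If you plan to reuse this rubric across experiments (see the later note about turning it into a template), one option is to encode it as a plain config that rating tools and aggregation scripts can read. This is only a sketch; the schema and field names are made up for illustration.

```python
# Illustrative encoding of the example rubric above; not a required schema.
EMAIL_REPLY_RUBRIC = {
    "criteria": {
        "relevance":     {"scale": (1, 5), "anchor_low": "off-topic", "anchor_high": "directly addresses the user's question"},
        "correctness":   {"scale": (1, 5), "anchor_low": "factual errors", "anchor_high": "accurate; cites policy or steps correctly"},
        "actionability": {"scale": (1, 5), "anchor_low": "no next step", "anchor_high": "clear, specific next step or question"},
        "tone":          {"scale": (1, 5), "anchor_low": "rude", "anchor_high": "professional, empathetic, concise"},
    },
    "guardrails": ["prohibited content", "private data disclosure"],  # any hit is an automatic Fail
    "gold_items": 12,
    "irr_target_kappa": 0.6,
}
```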
Exercise 2 — Compute agreement and decide next actions
Two raters (A and B) judged 20 outputs as Pass/Fail. Joint counts (A's label / B's label):
- Pass/Pass: 9
- Pass/Fail: 3
- Fail/Pass: 2
- Fail/Fail: 6
Tasks
- Compute percent agreement and Cohen’s kappa.
- Decide if IRR is acceptable (target ≥ 0.6).
- Propose 2 actions to improve if below target.
Hints
- Agreement = (9+6)/20.
- Expected agreement = (A pass rate × B pass rate) + (A fail rate × B fail rate).
- Kappa = (observed − expected) / (1 − expected).
Show solution
Observed agreement = (9+6)/20 = 0.75.
A pass rate = (9+3)/20 = 0.60; A fail rate = 0.40.
B pass rate = (9+2)/20 = 0.55; B fail rate = 0.45.
Expected = 0.60×0.55 + 0.40×0.45 = 0.33 + 0.18 = 0.51.
Kappa = (0.75 − 0.51) / (1 − 0.51) = 0.24 / 0.49 ≈ 0.49.
IRR below 0.6 → not acceptable. Actions: clarify rubric anchors with examples; run a calibration round and remove ambiguous items from the gold set; provide rater training and feedback.
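To double-check the arithmetic (or to compute agreement at scale), here is a minimal sketch that derives percent agreement and Cohen's kappa directly from the 2×2 counts:

```python
def agreement_and_kappa(pp: int, pf: int, fp: int, ff: int):
    """pp = both Pass, pf = A Pass/B Fail, fp = A Fail/B Pass, ff = both Fail."""
    n = pp + pf + fp + ff
    observed = (pp + ff) / n
    a_pass, b_pass = (pp + pf) / n, (pp + fp) / n
    expected = a_pass * b_pass + (1 - a_pass) * (1 - b_pass)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

observed, kappa = agreement_and_kappa(9, 3, 2, 6)
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")  # 0.75 and 0.49
```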
Pre-launch checklist
- Decision and numeric thresholds defined.
- Rubric has clear, observable anchors with examples.
- Sampling covers segments, difficulty, and edge cases.
- Raters trained; model identity blinded.
- Gold questions and attention checks included.
- IRR ≥ target on pilot; rubric adjusted as needed.
- Aggregation and reporting plan defined.
Common mistakes and how to self-check
- Vague rubrics: If two raters disagree often, your anchors are not observable. Add examples per score point.
- Sampling bias: Over-reliance on easy cases. Compare segment performance and re-balance.
- Rater leakage: If raters can guess the model identity or prompt version, their judgments are biased. Always anonymize outputs and randomize order.
- Mixing datasets: Using the same items for tuning and testing. Keep strict separation; rotate fresh test sets.
- Ignoring IRR: High average scores can still be unreliable. Track kappa and pause if it drops.
- No error taxonomy: Without tagging failure types, you won’t know what to fix. Create 5–7 clear error tags.
- Underestimating cost/time: Estimate judgments × time per item (e.g., 1,000 items × 3 raters × 2 minutes ≈ 100 rater-hours) and budget upfront.
Who this is for
- AI Product Managers shipping LLM or ML features.
- Data PMs and UX researchers supporting AI quality.
- Engineers who own model evaluation and monitoring.
Prerequisites
- Basic understanding of your AI use case and policies.
- Familiarity with evaluation metrics (accuracy, precision/recall) and A/B testing concepts.
- Comfort with spreadsheets for aggregation and IRR calculations.
Learning path
- Define decision and success criteria.
- Draft rubric with concrete anchors and examples.
- Pilot with 50–100 items; reach IRR target.
- Scale to full run; slice results by segment and error type.
- Integrate ongoing sampling for post-launch monitoring.
Next steps
- Turn your rubric into a template for future experiments.
- Automate weekly sampling for production monitoring.
- Share a one-page report with thresholds, IRR, and decisions.
Mini challenge
In 15 minutes, draft a one-page HITL plan for your product: decision, rubric (3 criteria), sample size, IRR target, and a go/no-go threshold. Share it with your team for feedback.