
Human Evaluation Design

Learn Human Evaluation Design for free with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Why this matters

As an Applied Scientist, you will often ship models whose success cannot be judged by automated metrics alone. Human evaluation is essential when outcomes involve quality, usefulness, safety, or preference. Typical tasks:

  • Comparing two model versions via blind pairwise preference.
  • Judging faithfulness, toxicity, or helpfulness of LLM outputs.
  • Rating search/recommendation relevance for new ranking algorithms.
  • Validating whether generated content follows style and policy.

Well-designed human evaluations give reliable, defensible decisions. Poorly designed ones waste time, money, and can mislead product choices.

Concept explained simply

Human evaluation design is the process of specifying what people will judge, how they will judge it, who the raters are, and how you will analyze the judgments to answer a product question.

A simple mental model

  • Ask the right question: Define the decision you want to make (ship A or B? acceptable risk?).
  • Build the right task: Clear instructions, examples, rubric, and unbiased presentation.
  • Collect from the right people: Trained raters, representative users, or domain experts.
  • Analyze correctly: Reliability checks, aggregation, uncertainty, and significance testing.
Quick glossary
  • Rubric: Rules and anchors explaining how to rate.
  • Inter-rater reliability (IRR): Agreement beyond chance (e.g., Cohen’s kappa, Krippendorff’s alpha).
  • Blinding: Hiding system identity from raters.
  • Counterbalancing: Rotating order to remove position bias.
  • Gold/attention checks: Items with known correct answers to verify attention.

Key design components

1) Objective and decision

  • Decision first: e.g., “Ship B if it is preferred over A at least 55% of the time, with a 95% confidence interval of ±5 percentage points.”
  • Metric family: preference, Likert ratings, pass/fail, ranking, or categorical labeling.

2) Units and sampling

  • Define evaluation unit: query, prompt, conversation turn, or document.
  • Sampling: stratify by user segment, topic, language, or difficulty to match production mix.
  • Size planning: estimate needed judgments using expected effect size and variance; pilot first.
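For size planning, a back-of-the-envelope calculation is often enough before a formal power analysis. Below is a minimal sketch, assuming a one-sided, one-sample proportion test of a pairwise win rate against 50%; the function name items_needed and the target rates are illustrative choices, not prescribed by this lesson.

```python
# Rough item count for a pairwise preference study (normal approximation).
# Assumption: H0 win rate = 0.5 vs H1 win rate = p1 (e.g., 0.55), one-sided test.
from math import ceil, sqrt

from scipy.stats import norm


def items_needed(p1: float, alpha: float = 0.05, power: float = 0.80, p0: float = 0.5) -> int:
    """Approximate number of judged items (one item = one aggregated judgment)."""
    z_a = norm.ppf(1 - alpha)   # one-sided critical value
    z_b = norm.ppf(power)
    numerator = z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)


print(items_needed(0.55))  # ≈ 617 items to detect a 55% win rate at 80% power
```

A pilot is still worth running: it gives a variance estimate for Likert-style metrics, where this binomial shortcut does not apply.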

3) Task, instructions, and rubric

  • Instructions: concise, with good/bad examples.
  • Rubric shape: binary, 5-point Likert, pairwise preference, or ranking.
  • Anchors: concrete, behavior-based, and mutually exclusive.
Example Likert anchors for "Faithfulness"
  • 1: Mostly incorrect or fabricated.
  • 2: Several factual errors or contradictions.
  • 3: Minor inaccuracies; generally consistent.
  • 4: Accurate with negligible issues.
  • 5: Completely accurate and consistent with source.

4) Bias controls

  • Blind systems; randomize order; counterbalance positions.
  • Hide metadata that could reveal model identity.
  • Use neutral wording, not “new” vs “old”.
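As one way to wire these controls into tooling, here is a minimal sketch that blinds system identity and randomizes A/B positions per item; the field names and the fixed seed are illustrative.

```python
# Sketch: blind system identity and counterbalance A/B positions per item.
import random


def build_pairwise_tasks(items, seed=7):
    """items: list of dicts with 'prompt', 'output_a', 'output_b'.
    Returns rater-facing tasks with neutral labels and a hidden de-blinding key."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    tasks = []
    for i, it in enumerate(items):
        a_on_left = rng.random() < 0.5                # randomize position per item
        left, right = ("A", it["output_a"]), ("B", it["output_b"])
        if not a_on_left:
            left, right = right, left
        tasks.append({
            "item_id": i,
            "prompt": it["prompt"],
            "left": left[1], "right": right[1],        # raters see only neutral left/right
            "_key": {"left": left[0], "right": right[0]},  # kept server-side for analysis
        })
    return tasks
```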

5) Raters

  • Choose expertise level (domain experts vs crowd vs internal).
  • Train with calibration items and feedback.
  • Incentives: pay fairly; limit session length (e.g., ≤20–30 minutes).

6) Quality and reliability

  • Gold questions, duplicated items, time-on-task checks.
  • IRR targets: kappa/alpha ≥ 0.6 is acceptable; ≥ 0.75 is strong (context-dependent).
  • Filter inattentive raters before final analysis.
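A minimal sketch of the reliability and filtering checks, assuming two raters labeled the same items in the same order and that gold answers are stored as a dict; the 0.7 filtering threshold mentioned in the comment is an illustrative choice, not a rule from this lesson. Cohen's kappa comes from scikit-learn; Krippendorff's alpha needs a separate package.

```python
# Sketch: inter-rater reliability plus gold-based rater filtering (thresholds illustrative).
from sklearn.metrics import cohen_kappa_score


def pairwise_kappa(rater1_labels, rater2_labels):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    return cohen_kappa_score(rater1_labels, rater2_labels)


def gold_pass_rate(rater_answers: dict, gold: dict) -> float:
    """Share of gold items a rater got right; consider dropping raters below ~0.7."""
    judged_golds = [i for i in rater_answers if i in gold]
    if not judged_golds:
        return 1.0  # rater saw no golds; handle separately
    correct = sum(1 for i in judged_golds if rater_answers[i] == gold[i])
    return correct / len(judged_golds)


print(pairwise_kappa(["pass", "fail", "pass", "pass"], ["pass", "fail", "fail", "pass"]))
```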

7) Ethics and privacy

  • Remove personally identifiable information from stimuli when possible.
  • Warn about sensitive content; allow opt-out.
  • Respect local norms and accessibility needs.

8) Analysis plan (before running)

  • Aggregation: majority vote for categorical, mean/median for Likert, win-rate for pairwise.
  • Uncertainty: confidence intervals via normal approx or bootstrap.
  • Significance: binomial test (pairwise), t-test or Wilcoxon (Likert), permutation/bootstrap for robust comparisons.
  • Report: effect size, CI, p-value, and practical decision (ship/hold).
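To make the analysis plan concrete, here is a minimal sketch for the pairwise case, assuming per-item outcomes are already aggregated to a single 0/1 win for B (ties excluded); the data below are random placeholders.

```python
# Sketch: pairwise win-rate with a binomial test and a bootstrap CI over items.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
wins = rng.binomial(1, 0.57, size=300)  # placeholder: 1 = B won the item, 0 = A won

k, n = int(wins.sum()), len(wins)
test = binomtest(k, n, p=0.5)
print(f"win rate = {k / n:.3f}, p = {test.pvalue:.4f}")
print("binomial 95% CI:", test.proportion_ci(confidence_level=0.95))

# Bootstrap over items gives a CI that is easy to extend to stratified or weighted designs.
boot = [rng.choice(wins, size=n, replace=True).mean() for _ in range(5000)]
print("bootstrap 95% CI:", np.percentile(boot, [2.5, 97.5]))
```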

Worked examples

Example 1: LLM summarization A/B

  • Objective: Decide if B is better than A on faithfulness and overall preference.
  • Sampling: 300 articles stratified by length (short/med/long) and topic (news, sports, tech).
  • Task: Pairwise preference (A vs B) with mandatory 1-sentence justification; separate 5-point faithfulness rating for the chosen summary.
  • Bias controls: Blind labels, randomize order, counterbalance positions.
  • Raters: Trained crowd workers fluent in English; 5 raters per item.
  • Quality: 10% gold items, minimum time-on-task of 15s, duplicate 5% of items to measure consistency.
  • Analysis: Compute win-rate of B; binomial test with 95% CI. For faithfulness, use Wilcoxon signed-rank on per-item medians. Filter raters failing gold or extreme speeders.
  • Decision rule: Ship if B win-rate ≥ 55% and faithfulness median difference ≥ +0.3 with p ≤ 0.05.
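The faithfulness part of this example might be analyzed as follows; a minimal sketch with random placeholder ratings, assuming each array holds the per-item median 5-point rating for one system.

```python
# Sketch for Example 1: Wilcoxon signed-rank on per-item median faithfulness ratings.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)
faith_a = rng.integers(2, 6, size=300)                           # placeholder medians, 1-5 scale
faith_b = np.clip(faith_a + rng.integers(0, 2, size=300), 1, 5)  # B nudged slightly higher

stat, p = wilcoxon(faith_b, faith_a)          # paired test across the same 300 articles
delta = np.median(faith_b - faith_a)
print(f"median per-item difference = {delta:+.1f}, p = {p:.4f}")
# Per the decision rule above: ship only if delta >= +0.3, p <= 0.05, and win rate >= 55%.
```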

Example 2: ASR usability for domain calls

  • Objective: Is the new ASR transcript acceptable for customer support workflows?
  • Sampling: 200 call clips covering accents and noise levels.
  • Task: Binary acceptability (usable/not usable) + 5-point clarity rating with anchors.
  • Raters: Support agents (domain experts), 3 per clip.
  • Quality: Calibration with 20 labeled clips; IRR target kappa ≥ 0.6.
  • Analysis: Majority vote acceptability; compare proportions via two-proportion z-test; clarity via Mann–Whitney U.
  • Decision: Roll out if the acceptable rate improves by 5–8 pp with a CI excluding 0.
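A minimal analysis sketch for this example, assuming acceptability has already been majority-voted per clip and clarity is compared across independent clip sets; the data are random placeholders and the tests come from statsmodels and scipy.

```python
# Sketch for Example 2: two-proportion z-test on acceptability, Mann-Whitney U on clarity.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(3)
acc_old = rng.binomial(1, 0.70, size=200)   # placeholder: majority-voted "usable" per clip
acc_new = rng.binomial(1, 0.78, size=200)

count = np.array([acc_new.sum(), acc_old.sum()])
nobs = np.array([acc_new.size, acc_old.size])
z, p = proportions_ztest(count, nobs)
print(f"acceptable: new {count[0] / nobs[0]:.2f} vs old {count[1] / nobs[1]:.2f}, p = {p:.4f}")

clarity_old = rng.integers(2, 6, size=200)  # placeholder 1-5 clarity ratings
clarity_new = rng.integers(3, 6, size=200)
u, p_u = mannwhitneyu(clarity_new, clarity_old, alternative="greater")
print(f"Mann-Whitney U = {u:.0f}, p = {p_u:.4f}")
```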

Example 3: Search ranking relevance

  • Objective: Compare baseline vs new ranker on top-5 relevance.
  • Pooling: For 500 queries, pool top-5 from each system; deduplicate.
  • Task: 3-point relevance (Not/Partially/Highly) with clear examples.
  • Presentation: Judge document relevance to the query without system identity.
  • Raters: 4 trained judges per query-document pair.
  • Quality: 8% golds; Krippendorff’s alpha target ≥ 0.65.
  • Analysis: Compute NDCG-like offline metric using human labels; bootstrap CI across queries; paired comparison per query.
  • Decision: Ship if median per-query delta > 0 with 95% CI > 0.
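One way the offline metric in this example could be computed; a minimal sketch assuming 3-point labels are mapped to gains 0/1/2 and that the ideal ranking is taken from the judged top-5 itself (in practice, use all pooled judgments for the query). Data are random placeholders.

```python
# Sketch for Example 3: NDCG@5 from human relevance labels, bootstrap CI over queries.
import numpy as np


def dcg(gains):
    return sum(g / np.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_5(labels_in_rank_order):
    """labels_in_rank_order: labels (0/1/2) of a system's top-5 for one query."""
    ideal = sorted(labels_in_rank_order, reverse=True)
    return dcg(labels_in_rank_order[:5]) / max(dcg(ideal[:5]), 1e-9)


rng = np.random.default_rng(5)
base = [list(rng.integers(0, 3, size=5)) for _ in range(500)]  # placeholder baseline labels
new = [list(rng.integers(0, 3, size=5)) for _ in range(500)]   # placeholder new-ranker labels

deltas = np.array([ndcg_at_5(n) - ndcg_at_5(b) for n, b in zip(new, base)])
boot_medians = [np.median(rng.choice(deltas, size=deltas.size, replace=True))
                for _ in range(2000)]
ci = np.percentile(boot_medians, [2.5, 97.5])
print(f"median per-query delta = {np.median(deltas):+.3f}, 95% bootstrap CI = {ci}")
```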

How to analyze results

  • Clean: Remove failed golds, extreme speeders, or inconsistent raters (pre-declared rules).
  • IRR: Report kappa/alpha and note limitations.
  • Aggregate: Per item first, then average across items/strata to avoid rater imbalance.
  • Uncertainty: Use bootstrap over items to get robust CIs.
  • Sensitivity: Re-run with/without stricter filters to check stability.
Choosing a test
  • Pairwise win-rate: binomial test or bootstrap CI.
  • Likert (ordinal): Wilcoxon signed-rank or Mann–Whitney U; report effect size (e.g., rank-biserial).
  • Categorical label accuracy: McNemar’s test for paired designs.
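For the paired categorical case, McNemar's test focuses on the items where the two systems disagree. A minimal sketch using statsmodels, with random placeholder labels; in a real study the two pass/fail vectors would come from the same judged items.

```python
# Sketch: McNemar's test for paired pass/fail labels from systems A and B on the same items.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(11)
a_pass = rng.random(400) < 0.62   # placeholder per-item pass/fail for system A
b_pass = rng.random(400) < 0.70   # placeholder per-item pass/fail for system B

table = [[np.sum(a_pass & b_pass),  np.sum(a_pass & ~b_pass)],
         [np.sum(~a_pass & b_pass), np.sum(~a_pass & ~b_pass)]]
result = mcnemar(table, exact=True)  # exact binomial test on the discordant cells
print(f"McNemar p = {result.pvalue:.4f}")
```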

Who this is for

  • Applied Scientists deciding if a model or prompt should ship.
  • ML Engineers validating changes that impact user perception.
  • Data Scientists running A/Bs where automatic metrics are insufficient.

Prerequisites

  • Basic statistics: confidence intervals, p-values, effect sizes.
  • Familiarity with your product’s users and tasks.
  • Comfort with spreadsheets or Python/R for simple analyses.

Learning path

  1. Define decision and success criteria.
  2. Draft task, rubric, and examples.
  3. Plan sampling and size; run a small pilot (20–50 items).
  4. Train raters; calibrate with feedback.
  5. Launch; monitor quality; pause if IRR is low.
  6. Analyze with pre-registered plan; report results and decision.

Exercises

Do these now. A sample solution follows each exercise.

Exercise 1: Design a pairwise human evaluation

Goal: Write a one-page spec to compare two LLM summarization systems.

  • Include: objective/decision rule, sampling plan, task design, rubric, raters, quality controls, analysis plan.
  • Keep it realistic and concise.
Solution

Example spec:

  • Objective: Decide whether B outperforms A in user preference by ≥5 pp, with 95% confidence.
  • Sampling: 300 news articles, stratified by topic and length; 5 raters/item.
  • Task: Blind pairwise A vs B; choose better summary; provide 1-sentence reason; tie allowed.
  • Rubric: Prefer summaries that are faithful, concise, and comprehensive; ties when indistinguishable.
  • Raters: English-fluent crowd; trained with 10 examples.
  • Quality: 10% golds; 5% duplicates; min 15s/item; block raters failing ≥30% golds.
  • Analysis: B win-rate, ties excluded; binomial 95% CI; bootstrap sensitivity including ties as 0.5; report effect size and CI.
  • Decision: Ship if lower CI bound ≥ +5 pp.
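The tie-handling sensitivity check in this spec could look like the following; a minimal sketch with random placeholder outcomes, where the primary analysis excludes ties and the sensitivity run scores each tie as 0.5.

```python
# Sketch: win rate with ties excluded vs ties scored as 0.5, plus a bootstrap CI.
import numpy as np

rng = np.random.default_rng(9)
outcomes = rng.choice(["B", "A", "tie"], size=300, p=[0.48, 0.38, 0.14])  # placeholder

b_wins = np.sum(outcomes == "B")
a_wins = np.sum(outcomes == "A")
ties = np.sum(outcomes == "tie")

rate_excl = b_wins / (b_wins + a_wins)             # primary analysis: ties excluded
rate_half = (b_wins + 0.5 * ties) / outcomes.size  # sensitivity: ties as half-wins

scores = np.where(outcomes == "B", 1.0, np.where(outcomes == "tie", 0.5, 0.0))
boot = [rng.choice(scores, size=scores.size, replace=True).mean() for _ in range(5000)]
print(f"ties excluded: {rate_excl:.3f}; ties as 0.5: {rate_half:.3f}")
print("bootstrap 95% CI (ties as 0.5):", np.percentile(boot, [2.5, 97.5]))
```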

Exercise 2: Craft a 5-point rubric

Create a 5-point rubric for "Helpfulness" of chatbot responses.

  • Provide clear anchors 1–5 with examples of what qualifies.
  • Make anchors observable and non-overlapping.
Solution
  • 1: Not helpful. Off-topic or refuses without reason; missing key info.
  • 2: Low helpfulness. Partially addresses request but major gaps or unclear steps.
  • 3: Adequate. Addresses main request with basic steps; minor gaps allowed.
  • 4: Helpful. Complete, actionable, correctly prioritized steps; minor polish issues.
  • 5: Highly helpful. Complete, concise, tailored to user context; includes pitfalls and next actions.

Exercise checklist

  • Your design names a decision rule with CI or significance.
  • Your rubric uses concrete, observable anchors.
  • You included blinding and randomization.
  • You planned golds/duplicates and IRR measurement.
  • You specified aggregation and statistical test.

Common mistakes and self-check

  • Vague success criteria. Fix: Predefine effect size and CI or p-value thresholds.
  • No blinding or randomization. Fix: Hide system IDs and rotate positions.
  • Unrepresentative sampling. Fix: Stratify to match production distribution.
  • Treating mean Likert scores as interval data. Fix: Prefer medians or nonparametric tests; report the distribution.
  • Ignoring reliability. Fix: Measure kappa/alpha; retrain or refine rubric if low.
  • Insufficient quality controls. Fix: Add golds, duplicates, time checks.
  • Underpowered study. Fix: Pilot to estimate variance; increase items or raters.
Self-check mini list
  • Can a stranger reproduce your study from the spec?
  • Would your decision be the same if you drop low-quality raters?
  • Did you define how to handle ties and missing data?

Practical projects

  • Build a small evaluation of two prompt variants for code generation using 100 programming tasks and pairwise preference with failure categories.
  • Design and run a relevance labeling task for a toy search dataset (200 queries), compute IRR, and compare two rankers with bootstrap CIs.
  • Create a safety evaluation for chatbot responses with a pass/fail rubric, high-precision golds, and an escalation path for borderline cases.

Next steps

  • Run a 30–50 item pilot of your design; refine instructions and examples.
  • Automate data capture and randomization in your tooling; keep an analysis notebook template.
  • Share a one-page readout template: objective, method, results (effect + CI), decision, risks.

Mini challenge

In two sentences, define a ship/hold rule for your current project that includes both effect size and uncertainty. Then list two threats to validity and how you will detect or mitigate them.


Human Evaluation Design — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

