Why this matters
Automatic metrics alone can miss what people feel and value. As an NLP Engineer, you will often assess qualities like faithfulness, helpfulness, coherence, tone, safety, and bias. Human evaluation helps you:
- Compare model versions when automatic metrics disagree with user feedback.
- Measure subjective qualities (e.g., helpfulness in chat, coherence in summaries).
- Find failure modes for error analysis and guide model iteration.
- Validate safety and fairness risks before deployment.
Concept explained simply
Human evaluation is asking carefully instructed people to judge model outputs using clear criteria and scales. You select tasks, pick representative items, define scoring rules, and ensure raters are trained and unbiased. You then aggregate results and check reliability to make sound decisions.
Mental model
Think of human evaluation as a lab experiment:
- Hypothesis: What you expect to improve (e.g., Model B is more faithful than Model A).
- Treatment: The model outputs.
- Measurement: Rater judgments with well-defined scales.
- Controls: Randomization, blinding, attention checks, and agreement checks.
- Analysis: Aggregation, confidence intervals, and significance tests.
Typical evaluation dimensions
- Faithfulness/Groundedness: Are claims supported by the source or facts?
- Coverage/Adequacy: Does it include key information?
- Coherence: Is it logically organized and consistent?
- Fluency: Grammar and readability.
- Helpfulness/Usefulness: Does it solve the user’s need?
- Safety/Toxicity/Bias: Harmful content, stereotypes, personal data.
Step-by-step setup
- Define objective and hypotheses. Example: evaluate whether Model B produces more faithful summaries than Model A.
- Select data and sample size. Pick representative items; aim for 50–200 items for initial studies, more if effects are small.
- Choose evaluation method. Pick the simplest scale that fits the question:
  - Scales: Likert 1–5, 0–100 sliders
  - Pairwise preference: A vs B with a graded preference (e.g., -2 to +2)
  - Binary: Yes/No for safety or correctness
- Write precise instructions. Include definitions, clear rubrics, positive/negative examples, and edge cases. State time expectations and how to handle uncertainty.
- Plan raters and quality control:
  - Rater pool: domain experts vs. general crowd
  - Raters per item: 3–5 common
  - QC: attention checks, duplicate items, time filters, calibration rounds
  - Blinding: hide model identities; randomize order and sides
- Pilot and refine. Run on 10–20 items. Fix confusion and timing issues. Adjust criteria as needed.
- Run the study. Launch in batches; monitor QC; replace low-quality annotations.
- Analyze and report. Aggregate scores, compute confidence intervals, check inter-rater agreement, test significance, and summarize errors found (see the sketch after this list).
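The aggregation and confidence-interval step can be scripted in a few lines. Below is a minimal sketch, assuming each item has a list of rater scores per model; the scores are made-up toy data and numpy is the only dependency.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one list of rater scores (1-5 Likert) per item, for each model.
ratings_a = [[4, 5, 4], [3, 3, 4], [5, 4, 4], [2, 3, 3], [4, 4, 5]]
ratings_b = [[3, 4, 4], [3, 3, 3], [4, 4, 3], [3, 3, 2], [4, 3, 4]]

# Aggregate: per-item mean across raters, then the per-item difference A - B.
item_means_a = np.array([np.mean(r) for r in ratings_a])
item_means_b = np.array([np.mean(r) for r in ratings_b])
diffs = item_means_a - item_means_b

# Bootstrap 95% CI for the mean difference (resample items with replacement).
boot = [rng.choice(diffs, size=len(diffs), replace=True).mean() for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"mean diff (A - B): {diffs.mean():.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

The same per-item means feed the agreement and significance checks described in the next section.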
Reliability and significance quick guide
- Agreement: Krippendorff’s alpha (handles nominal/ordinal/interval data and missing ratings), Fleiss’ kappa (multiple raters), Cohen’s kappa (two raters).
- Significance: paired t-test (for approximately normal paired differences), Wilcoxon signed-rank (non-parametric), permutation/bootstrap tests for robustness.
- Effect size: Cohen’s d (mean differences), Cliff’s delta (ordinal data).
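These statistics are a few lines of Python each. The sketch below is illustrative only: it assumes scipy and scikit-learn are installed, uses toy scores, and stands in for Krippendorff’s alpha with weighted Cohen’s kappa since only two raters are shown (the third-party krippendorff package covers the general case).

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score

# Toy data: two raters scoring the same 8 items on an ordinal 1-5 scale.
rater_1 = [4, 3, 5, 2, 4, 4, 3, 5]
rater_2 = [4, 3, 4, 2, 5, 4, 3, 5]

# Cohen's kappa for two raters; weights="quadratic" respects the ordinal scale.
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")

# Paired significance test: per-item mean scores for two models, Wilcoxon signed-rank.
model_a = [4.3, 3.7, 4.7, 2.3, 4.0, 4.3, 3.3, 4.7]
model_b = [3.7, 3.3, 4.0, 2.7, 3.7, 4.0, 3.0, 4.3]
res = wilcoxon(model_a, model_b)

# Cliff's delta as an ordinal effect size: P(a > b) - P(a < b) over all cross-pairs.
a, b = np.array(model_a), np.array(model_b)
cliffs_delta = (a[:, None] > b[None, :]).mean() - (a[:, None] < b[None, :]).mean()

print(f"kappa={kappa:.2f}, Wilcoxon p={res.pvalue:.3f}, Cliff's delta={cliffs_delta:.2f}")
```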
Worked examples
Example 1: Summarization (faithfulness, coverage, coherence, fluency)
- Goal: Compare A vs B on 100 news articles.
- Method: 1–5 Likert per dimension; 3 raters per item; blind and randomized.
- QC: 5% attention checks; 10% duplicates for consistency; minimum time per item.
- Analysis: Per-dimension means with 95% CIs (bootstrap), Krippendorff’s alpha (ordinal), Wilcoxon test for A vs B differences.
- Decision rule: Ship B if it improves faithfulness by ≥0.3 points with p<0.05 and is similar or better on the other dimensions.
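One way the Example 1 analysis could look in code is sketched below. It assumes ratings live in a long-format table and that the third-party krippendorff package is available (its alpha function takes a raters-by-units matrix); the data and column names are illustrative, not a prescribed schema.

```python
import krippendorff  # third-party: pip install krippendorff (assumed available)
import numpy as np
import pandas as pd

# Illustrative long-format ratings: one row per (item, rater, model), faithfulness 1-5.
df = pd.DataFrame({
    "item":  [1, 1, 1, 2, 2, 2] * 2,
    "rater": ["r1", "r2", "r3"] * 4,
    "model": ["A"] * 6 + ["B"] * 6,
    "faithfulness": [4, 5, 4, 3, 3, 4, 3, 4, 4, 3, 2, 3],
})

# Per-model mean faithfulness.
print(df.groupby("model")["faithfulness"].mean())

# Krippendorff's alpha (ordinal): rows = raters, columns = (item, model) units.
matrix = df.pivot_table(index="rater", columns=["item", "model"],
                        values="faithfulness").to_numpy()
alpha = krippendorff.alpha(reliability_data=matrix, level_of_measurement="ordinal")
print(f"Krippendorff's alpha (ordinal): {alpha:.2f}")
```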
Example 2: Safety screening (binary)
- Goal: Validate that safety filter rejects unsafe prompts.
- Method: Binary label (Safe/Unsafe) with severity notes. 5 raters per item for borderline topics.
- QC: Specialist raters, content warnings, opt-out. Include seeded unsafe examples.
- Analysis: Majority vote labels; compute precision/recall vs. ground-truth seeds; estimate CI via bootstrap.
- Action: If recall on unsafe items is below 0.9, review the rules and retrain.
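A sketch of the majority-vote and seeded-recall computation, using only the standard library; the prompt IDs, labels, and seed set are made up for illustration.

```python
from collections import Counter

# Illustrative annotations: prompt_id -> list of rater labels ("safe" / "unsafe").
annotations = {
    "p1": ["unsafe", "unsafe", "safe", "unsafe", "unsafe"],
    "p2": ["safe", "safe", "safe", "safe", "unsafe"],
    "p3": ["unsafe", "safe", "unsafe", "unsafe", "safe"],
    "p4": ["safe", "safe", "safe", "safe", "safe"],
}
# Seeded ground truth for a subset of prompts (known-unsafe examples planted in the pool).
seeded_truth = {"p1": "unsafe", "p3": "unsafe", "p4": "safe"}

# Majority vote per prompt.
majority = {p: Counter(labels).most_common(1)[0][0] for p, labels in annotations.items()}

# Precision/recall on the seeded subset, treating "unsafe" as the positive class.
tp = sum(1 for p, t in seeded_truth.items() if t == "unsafe" and majority[p] == "unsafe")
fn = sum(1 for p, t in seeded_truth.items() if t == "unsafe" and majority[p] != "unsafe")
fp = sum(1 for p, t in seeded_truth.items() if t == "safe" and majority[p] == "unsafe")
recall = tp / (tp + fn) if (tp + fn) else float("nan")
precision = tp / (tp + fp) if (tp + fp) else float("nan")

print(f"majority labels: {majority}")
print(f"unsafe recall={recall:.2f}, precision={precision:.2f}")
```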
Example 3: Chatbot helpfulness (pairwise preference)
- Goal: See if B is preferred over A.
- Method: Pairwise with -2 to +2 scale; sides randomized; identities hidden; 3 raters per pair.
- QC: Duplicate pairs to check consistency; exclude raters with high side bias.
- Analysis: Aggregate via mean preference and Bradley–Terry scores; test with Wilcoxon on per-item medians.
- Outcome: If B’s median preference > 0 with p<0.05, promote B.
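With only two models, a full Bradley–Terry fit essentially reduces to a win-rate comparison, so the sketch below just aggregates per-item median preferences and runs a one-sample Wilcoxon test against zero; the preference scores are toy data and scipy is assumed.

```python
import numpy as np
from scipy.stats import wilcoxon

# Toy per-pair preference scores on a -2 to +2 scale
# (positive = rater preferred B, negative = preferred A), 3 raters per pair.
pair_scores = [
    [+1, +2, 0], [0, +1, +1], [-1, +1, +1], [+2, +1, +1],
    [0, -1, -1], [+1, +1, +2], [-2, -1, 0], [+1, 0, +1],
]

# Per-item median preference, then a one-sample Wilcoxon test against zero.
medians = np.array([np.median(s) for s in pair_scores])
res = wilcoxon(medians)

# Simple summary: how often the per-item median favors B.
win_rate_b = float(np.mean(medians > 0))
print(f"mean of per-item medians: {medians.mean():+.2f}, "
      f"B win rate: {win_rate_b:.2f}, Wilcoxon p={res.pvalue:.3f}")
```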
Design checklists
Before you launch, use these quick checklists.
Study design
- Objective and hypotheses are specific.
- Representative data sampled and documented.
- Scales/criteria match the goal and are unambiguous.
- Instructions include examples and edge cases.
- Blinding and randomization planned.
- Raters per item and target agreement defined.
- QC: attention checks, duplicates, time checks, calibration.
- Ethics: content warnings, opt-out, fair pay, privacy.
Pilot readiness
- Pilot size ≥ 10–20 items.
- Average annotation time acceptable.
- No frequent rater questions left unresolved.
- Agreement in pilot not alarmingly low.
Exercises
Do these exercises to practice.
Exercise 1: Design a study. Constraints: 120 articles, two models (A and B), focus on faithfulness first, limited budget. Produce a one-page plan: objective, sample, raters, criteria/scale, instructions outline, randomization, QC, and analysis.
Exercise 2: Analyze ratings. Given 3 items with 3 raters per model on faithfulness (1–5):
Item 1: A=[4,4,5], B=[3,3,2]; Item 2: A=[2,3,2], B=[3,3,3]; Item 3: A=[5,4,5], B=[4,4,4].
Compute per-item means, the difference (A−B), the overall mean difference, which model wins by majority per item, and which item shows the lowest agreement. Suggest a fix.
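If you want to check your arithmetic afterwards, a minimal sketch like this will print the per-item numbers; it deliberately leaves the majority-vote call, the agreement judgment, and the suggested fix to you.

```python
import numpy as np

# Ratings from Exercise 2: 3 items, 3 raters per model, faithfulness on a 1-5 scale.
a = {"Item1": [4, 4, 5], "Item2": [2, 3, 2], "Item3": [5, 4, 5]}
b = {"Item1": [3, 3, 2], "Item2": [3, 3, 3], "Item3": [4, 4, 4]}

for item in a:
    mean_a, mean_b = np.mean(a[item]), np.mean(b[item])
    # Per-item standard deviation across raters: a rough signal of (dis)agreement.
    print(f"{item}: A mean={mean_a:.2f} (sd={np.std(a[item]):.2f}), "
          f"B mean={mean_b:.2f} (sd={np.std(b[item]):.2f}), diff={mean_a - mean_b:+.2f}")

overall = np.mean([np.mean(a[i]) - np.mean(b[i]) for i in a])
print(f"overall mean diff (A - B): {overall:+.2f}")
```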
Self-check after exercises
- Are your criteria measurable and aligned with the objective?
- Are blinding and randomization clearly specified?
- Do you have at least two QC methods?
- Did you compute aggregate scores and comment on reliability?
Common mistakes and how to self-check
- Vague criteria. Fix: Add definitions and examples per rating point.
- No pilot. Fix: Run a 10–20 item pilot to catch confusion early.
- No blinding or randomization. Fix: Shuffle items and sides; hide model names.
- Too few raters. Fix: Aim for 3–5 raters per item, more for subjective tasks.
- Ignoring agreement. Fix: Report alpha/kappa or at least consistency checks.
- Mixing multiple dimensions in one score. Fix: Rate each dimension separately.
- Weak QC. Fix: Add attention checks, duplicates, and minimum time thresholds.
- Ethics gaps. Fix: Content warnings, opt-out, privacy, fair pay, safe handling of sensitive data.
Quick self-audit list
- Objective and hypotheses documented
- Data represent target use-case
- Clear instructions with edge cases
- Blinding and randomization plan
- QC: attention checks + duplicates
- Agreement metric and target threshold
- Analysis: CI + appropriate test
Practical projects
- Mini Summarization Study: Pick 50 articles, collect outputs from two models, rate faithfulness and coherence (1–5), 3 raters per item, report mean, CI, and alpha.
- Safety Triage: Curate 100 prompts; raters label Unsafe/Borderline/Safe with notes; report majority labels and recall on seeded unsafe cases.
- Chat Preference Ladder: 40 user prompts, A vs B pairwise with a -2 to +2 scale; compute mean preference and a simple Bradley–Terry fit or average score; summarize top failure modes.
Who this is for, prerequisites, and learning path
Who this is for
- NLP Engineers validating generative model quality.
- ML practitioners designing user-centered evaluations.
- Analysts conducting safety/fairness reviews.
Prerequisites
- Basic statistics (averages, confidence intervals, significance tests).
- Familiarity with your NLP task (summarization, chat, classification).
- Comfort working with structured instructions and checklists.
Learning path
- Before: metric and dataset selection; experimental design basics.
- This subskill: Design and run human evaluations you can trust.
- After: Error analysis from human feedback, iterative model improvement, and combined human+automatic evaluations.
Ethics and safety essentials
- Content warnings and opt-out for sensitive material.
- Annotator well-being and escalation procedures.
- PII handling: redact or minimize exposure.
- Fair compensation and transparent time expectations.
Mini challenge
Draft a one-page annotation guideline for faithfulness in summarization. Include: definition, 1–5 rubric with examples, at least three edge cases, and a 30-second checklist raters must follow before submitting.
Ready to check your understanding? Take the Quick Test below. The test is available to everyone; only logged-in users get saved progress.