Who this is for
- NLP engineers and data scientists evaluating classifiers, sequence labelers, and generation models.
- ML practitioners moving from model training to reliable, decision-focused evaluation.
- Anyone preparing to compare models fairly and diagnose errors with the right metrics.
Prerequisites
- Basic understanding of common NLP tasks (classification, NER, QA, MT, summarization).
- Know confusion matrix terms (TP, FP, TN, FN) and the idea of precision/recall.
- Comfort reading simple formulas and calculating ratios.
Why this matters
Picking the wrong metric gives you the wrong winner. In real NLP work, model choices drive product behavior, safety, and user trust. For example:
- Content moderation: Missing harmful content (FN) can be worse than flagging safe content (FP). Optimize recall or Fbeta with beta > 1, not accuracy.
- NER for contracts: Span-level exact matches matter to extract clauses correctly. Use entity-level F1 with strict matching.
- Summarization: Users care about factuality and coverage. ROUGE helps with overlap, but you must also sample-check faithfulness.
Concept explained simply
Task-appropriate metrics match what success means for your use case. Ask: What mistake is costly? What output form does the model produce?
Mental model
- Decisions and risks first: Identify which errors (FP or FN) are more harmful.
- Output type drives metric family:
  - Classification: accuracy, precision/recall/F1, ROC AUC, PR AUC, calibration (ECE/Brier).
  - Sequence labeling: entity-level precision/recall/F1 (strict vs partial), micro vs macro.
  - Ranking/Retrieval: Precision@k, Recall@k, MRR, NDCG@k.
  - Span extraction/QA: Exact Match (EM), token-level F1.
  - Generation (MT/Summarization): BLEU, ROUGE, chrF, BERTScore/COMET. Complement with spot checks for faithfulness.
- Report more than one view: Threshold-dependent (F1), threshold-free (PR/ROC AUC), and calibration if probabilities are used.
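To make the last point concrete, here is a minimal sketch (assuming scikit-learn and tiny hypothetical label/score arrays) that reports a threshold-dependent view, two threshold-free views, and a calibration score side by side:

```python
# Minimal sketch: three complementary views of one binary classifier.
# Assumes scikit-learn; y_true/y_prob are tiny hypothetical arrays.
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             f1_score, roc_auc_score)

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])        # gold labels
y_prob = np.array([0.10, 0.40, 0.80, 0.30, 0.60,
                   0.90, 0.20, 0.70, 0.55, 0.05])         # predicted P(positive)
y_pred = (y_prob >= 0.5).astype(int)                      # one fixed threshold

print("F1 @ 0.5:   ", f1_score(y_true, y_pred))                  # threshold-dependent
print("PR AUC:     ", average_precision_score(y_true, y_prob))   # threshold-free
print("ROC AUC:    ", roc_auc_score(y_true, y_prob))             # threshold-free
print("Brier score:", brier_score_loss(y_true, y_prob))          # calibration
```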
Quick recipes
- Rare positives, cost of misses high: prioritize recall, Fbeta with beta > 1, and PR AUC.
- Balanced binary classification: F1 or accuracy, plus ROC AUC for ranking.
- Multi-label: per-label F1, macro/weighted F1, and coverage error/Hamming loss.
- NER/extraction: entity-level micro F1 (strict spans and types).
- QA (extractive): EM and token-level F1.
- Summarization: ROUGE-1/2/L; add a small factuality checklist.
- MT: chrF or COMET for quality; BLEU for tradition; always sample-check edge cases.
- Retrieval: NDCG@k or MRR, report Recall@k if missing relevant items is costly.
- Calibrated predictions needed: report ECE or Brier score.
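Brier score comes straight from scikit-learn, but ECE is usually computed by hand. Below is a minimal sketch of one common formulation for binary probabilities; the function name and toy data are illustrative, not a standard API:

```python
# Minimal sketch of expected calibration error (ECE) for binary probabilities.
# One common formulation: equal-width bins over predicted probability; each bin
# contributes |observed positive rate - mean predicted probability|, weighted
# by the bin's share of examples.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = (y_prob <= hi) if hi == 1.0 else (y_prob < hi)
        in_bin = (y_prob >= lo) & upper
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
            ece += in_bin.mean() * gap   # in_bin.mean() = bin's share of examples
    return ece

# Toy example with hypothetical labels and probabilities.
print(expected_calibration_error([0, 0, 1, 1, 1, 0],
                                 [0.2, 0.4, 0.9, 0.7, 0.6, 0.1], n_bins=5))
```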
Worked examples
1) Imbalanced toxic-content detection (binary classification)
Suppose: TP=18, FP=20, TN=930, FN=32 (1,000 examples; 50 positives).
- Accuracy = (TP+TN)/N = (18+930)/1000 = 0.948
- Precision = TP/(TP+FP) = 18/38 ≈ 0.474
- Recall = TP/(TP+FN) = 18/50 = 0.36
- F1 = 2PR/(P+R) ≈ 0.409
Insight: Accuracy looks great but recall is low. For safety, optimize recall or Fbeta (beta > 1) and review PR curves; PR AUC is more informative than ROC AUC for rare positives.
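The same numbers drop straight into a few lines of Python, which also shows how F2 (beta = 2) punishes the low recall more sharply than F1 does:

```python
# Re-deriving the metrics above directly from the confusion counts.
TP, FP, TN, FN = 18, 20, 930, 32

accuracy  = (TP + TN) / (TP + FP + TN + FN)                 # 0.948
precision = TP / (TP + FP)                                  # ~0.474
recall    = TP / (TP + FN)                                  # 0.36
f1        = 2 * precision * recall / (precision + recall)   # ~0.409

beta = 2   # beta > 1 weights recall more heavily
f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)  # ~0.378

print(accuracy, precision, recall, f1, f2)
```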
2) NER with strict entity matching
Gold: ORG(Apple), PERSON(Tim), GPE(California). Pred: ORG(Apple), PERSON(Tim), LOC(California), ORG(Apple hired).
- TP=2 (Apple, Tim)
- FP=2 (LOC California wrong type; ORG Apple hired wrong span)
- FN=1 (missed correct GPE California)
- Precision=2/4=0.5, Recall=2/3≈0.667, F1≈0.571
Insight: Use entity-level micro F1 with strict spans and types for extraction-sensitive applications.
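With entities represented as (type, span) pairs, strict matching is just set arithmetic. Here is a minimal sketch using the example's surface strings; real evaluations typically compare character offsets or use a library such as seqeval:

```python
# Strict entity-level scoring: a prediction counts only if both the type and
# the exact span match a gold entity. Spans are surface strings here for
# readability; offset pairs are more robust in practice.
gold = {("ORG", "Apple"), ("PERSON", "Tim"), ("GPE", "California")}
pred = {("ORG", "Apple"), ("PERSON", "Tim"),
        ("LOC", "California"), ("ORG", "Apple hired")}

tp = len(gold & pred)   # 2: Apple, Tim
fp = len(pred - gold)   # 2: wrong type (LOC California), wrong span (Apple hired)
fn = len(gold - pred)   # 1: missed GPE California

precision = tp / (tp + fp)                                  # 0.5
recall    = tp / (tp + fn)                                  # ~0.667
f1        = 2 * precision * recall / (precision + recall)   # ~0.571
print(precision, recall, f1)
```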
3) Extractive QA
Gold answer: "san francisco"; Prediction: "san francisco bay area" (lowercased, punctuation stripped, whitespace-tokenized).
- EM = 0 (not exact)
- Overlap tokens = {san, francisco}, so overlap count = 2
- Precision = 2/4 = 0.5; Recall = 2/2 = 1.0; F1 ≈ 0.667
Insight: Report EM for strictness and token F1 for partial overlap quality.
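A minimal sketch of this computation, assuming the answers are already lowercased with punctuation stripped (the standard SQuAD script additionally removes articles and takes the max over multiple gold answers):

```python
# SQuAD-style Exact Match and token-level F1 for a single gold/prediction pair.
from collections import Counter

def em_and_token_f1(gold: str, pred: str):
    gold_toks, pred_toks = gold.split(), pred.split()
    em = int(gold_toks == pred_toks)
    overlap = sum((Counter(gold_toks) & Counter(pred_toks)).values())
    if overlap == 0:
        return em, 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return em, 2 * precision * recall / (precision + recall)

print(em_and_token_f1("san francisco", "san francisco bay area"))  # (0, ~0.667)
```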
4) Summarization
For short news summaries, ROUGE-1/2/L capture n-gram and sequence overlap. If coverage matters, emphasize ROUGE recall. Always spot-check factuality; overlap metrics cannot guarantee truthfulness.
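For intuition, a simplified, unstemmed ROUGE-1 is just clipped unigram overlap. The sketch below is illustrative only; for real evaluations, prefer an established implementation such as the rouge-score package:

```python
# Simplified ROUGE-1: clipped unigram overlap between candidate and reference.
# Illustrative only -- no stemming, no ROUGE-2/L, single reference.
from collections import Counter

def rouge1(reference: str, candidate: str):
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())          # clipped unigram matches
    recall = overlap / sum(ref.values())          # emphasize this for coverage
    precision = overlap / sum(cand.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

print(rouge1("the cat sat on the mat", "the cat lay on the mat"))
```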
5) Retrieval ranking
Compare two retrievers with NDCG@10 and Recall@10. If missing relevant items is costly, favor higher Recall@k even if MRR is similar.
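A minimal sketch of both metrics from binary relevance judgments, assuming `retrieved` is ranked best-first and `relevant` is the gold set (the document IDs are hypothetical):

```python
# Recall@k and NDCG@k with binary relevance (gain 1 for relevant, 0 otherwise).
import math

def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    dcg = sum(1.0 / math.log2(rank + 2)                    # ranks are 0-based
              for rank, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

retrieved = ["d3", "d7", "d1", "d9", "d2"]   # hypothetical ranking, best first
relevant  = {"d1", "d2", "d4"}
print(recall_at_k(retrieved, relevant, k=5))  # ~0.667: d4 was never retrieved
print(ndcg_at_k(retrieved, relevant, k=5))    # penalizes d1, d2 appearing late
```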
How to choose the right metric in 5 steps
- Define the decision: What will the model’s output be used for? Who is affected by errors?
- Rank error costs: Is FP or FN more costly? Set beta in Fbeta accordingly.
- Match output type: Classification vs sequence labeling vs generation/ranking.
- Select primary metric: One clear winner aligned to risk (e.g., Recall@k, F2, entity-F1).
- Add complementary views: Threshold-free curve (PR/ROC), calibration (ECE/Brier), and qualitative spot checks.
Example decision-to-metric mappings
- Flag harmful content: prioritize recall/F2, monitor precision and PR AUC.
- Contract clause extraction: entity-level micro F1 (strict), error breakdown by type/span.
- Customer support QA: EM and token F1; track failure modes (no-answer, partial, wrong passage).
Exercises
These mirror the worked examples above; redo them from scratch before checking the solutions.
Exercise 1: Imbalanced classifier metrics
Given TP=18, FP=20, TN=930, FN=32:
- Compute Accuracy, Precision, Recall, and F1.
- Which metric would you optimize and why?
Hint: Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2PR/(P+R).
Exercise 2: NER strict entity F1
Gold: ORG(Apple), PERSON(Tim), GPE(California). Pred: ORG(Apple), PERSON(Tim), LOC(California), ORG(Apple hired).
- Count TP, FP, FN at entity level (exact span and type).
- Compute Precision, Recall, F1.
Hint: Wrong type or wrong span counts as FP, and the missed gold is an FN.
Exercise 3: QA EM and token F1
Gold: "san francisco". Pred: "san francisco bay area" (lowercase, whitespace tokenization).
- Compute EM and token-level Precision, Recall, F1.
Hint: Overlap size over predicted size gives precision; overlap over gold size gives recall.
Checklist before you peek:
- I wrote the formulas I used.
- I double-checked denominators.
- I explained which metric I’d optimize and why.
Common mistakes and self-check
- Using accuracy on imbalanced data. Self-check: Does a trivial majority baseline get similar accuracy? (See the baseline sketch after this list.)
- Reporting token-level NER F1 instead of entity-level. Self-check: Are spans and types evaluated exactly?
- Comparing models at different thresholds. Self-check: Fix a threshold or compare with PR/ROC AUC.
- Ignoring calibration. Self-check: If probabilities drive decisions, include ECE or Brier score.
- Trusting overlap metrics for factuality. Self-check: Add a small human spot-check for truthfulness.
- Macro vs micro confusion. Self-check: If label frequencies vary, is macro/weighted F1 reported?
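The majority-baseline check from the first bullet takes two lines; with the class balance from worked example 1, the trivial baseline already beats the model on accuracy:

```python
# Trivial majority-class baseline for the class balance in worked example 1.
positives, negatives = 50, 950
baseline_accuracy = max(positives, negatives) / (positives + negatives)
print(baseline_accuracy)   # 0.95 -- higher than the model's 0.948 accuracy
```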
Practical projects
- Error-aware moderation: Build a small classifier on an imbalanced dataset; report PR AUC, F1, and a threshold chosen to hit Recall ≥ 0.8 (see the threshold-picking sketch after this list). Summarize the trade-offs.
- Extraction audit: Evaluate a NER model with strict entity-level F1. Produce a short error taxonomy: wrong type, boundary errors, missed entities.
- Summarization triage: Compute ROUGE-1/2/L for 50 articles, then manually check 10 for factuality. Write two rules to catch likely hallucinations.
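For the moderation project, one way to pick the operating point is to scan the precision-recall curve and keep the threshold with the best precision among those that still reach the recall target. A minimal sketch, assuming scikit-learn and hypothetical score arrays:

```python
# Pick the highest-precision threshold that still achieves Recall >= 0.8.
# Assumes scikit-learn; y_true/y_prob are hypothetical stand-ins for real data,
# and at least one threshold is assumed to meet the recall target.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.10, 0.40, 0.80, 0.30, 0.60,
                   0.90, 0.20, 0.70, 0.55, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# precision/recall have one more entry than thresholds; drop the final point.
meets_target = recall[:-1] >= 0.8
best = int(np.argmax(precision[:-1] * meets_target))
print("threshold:", thresholds[best],
      "precision:", precision[best],
      "recall:", recall[best])
```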
Learning path
- Refresh classification metrics (precision/recall/F1, PR vs ROC AUC, calibration).
- Learn sequence labeling evaluation (entity-level strict vs partial, micro/macro).
- Study ranking metrics (MRR, NDCG@k, Recall@k) and when to prefer each.
- Learn QA/MT/summarization metrics and their limitations.
- Practice by re-evaluating one of your past models with better metrics.
Next steps
- Adopt a primary metric aligned with risk and at least one complementary metric.
- Add a short qualitative review step to your evaluation checklist.
- Proceed to the quick test to confirm understanding.
Mini challenge
You are evaluating a retrieval-augmented QA system where missing relevant documents is more costly than occasionally retrieving an extra irrelevant one. Which primary and secondary metrics will you report at k=10, and why?
A possible answer
Primary: Recall@10 (ensure relevant docs are retrieved). Secondary: NDCG@10 or MRR to assess ranking quality among retrieved items. If using confidence thresholds for answerability, also track calibration (ECE).
Ready for the quick test?
Take the quick test below.