
NLP Evaluation And Error Analysis

Learn NLP Evaluation And Error Analysis for the NLP Engineer role for free: roadmap, worked examples, subskills, and a skill exam.

Published: January 5, 2026 | Updated: January 5, 2026

Why this skill matters for NLP Engineers

Strong evaluation and error analysis is how NLP engineers ship reliable models. It helps you pick the right metrics for the task, find failure patterns, compare versions fairly, and decide what to fix next. With this skill you will: choose metrics that reflect product goals, diagnose errors with confusion and slice analysis, run human evaluations for generation tasks, design regression tests, check robustness and fairness, and track model performance over time.

What you’ll learn

  • Choose task-appropriate metrics (classification, sequence labeling, retrieval, QA, generation).
  • Run confusion and slice analysis to find where models fail.
  • Audit labels and conduct qualitative reviews to spot data issues.
  • Test robustness to noise and simple adversarial edits.
  • Run basic bias/fairness checks and report group gaps.
  • Build and maintain regression test sets.
  • Design human evaluation rubrics and annotation workflows.
  • Track model changes over time with clear versioning and gates.

Who this is for

  • NLP Engineers and ML practitioners shipping text models to production.
  • Data scientists moving from model prototyping to reliable deployment.
  • QA/Analyst roles supporting ML evaluation.

Prerequisites

  • Python basics and ability to run notebooks.
  • Familiarity with common NLP tasks (classification, NER, QA, summarization).
  • Basic statistics: precision/recall/F1, averages, confidence intervals.

Learning path

  1. Pick the right metric: Map product goals to metrics (accuracy vs macro-F1, ROUGE vs human ratings, etc.).
  2. Quantify errors: Compute confusion matrices and per-slice metrics; inspect worst slices first.
  3. Qualitative pass: Read 50–100 errors; label error themes; audit gold labels.
  4. Stress tests: Add noise/perturbations; check robustness and fairness slices.
  5. Lock coverage: Build regression test sets for key scenarios.
  6. Human eval: Design a rubric and run a small pilot.
  7. Track versions: Log metrics with dataset/model versions; define accept/reject gates.

Worked examples

Example 1: Text classification — macro vs micro F1, confusion and slice analysis

Use macro-F1 when classes are imbalanced and each class should count equally; use micro-F1 when you care about overall instance-level performance (it is dominated by the frequent classes).

# Example: news topic classification
from sklearn.metrics import confusion_matrix, classification_report, f1_score
import numpy as np

y_true = ["sports","sports","politics","tech","tech","tech","health","health","health","health"]
y_pred = ["sports","tech","politics","tech","tech","sports","health","health","tech","health"]

labels = sorted(set(y_true))
print(classification_report(y_true, y_pred, labels=labels, digits=3))
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("Confusion:\n", confusion_matrix(y_true, y_pred, labels=labels))

# Simple slice analysis by presence of a keyword
docs = [
    "Local sports team wins cup", "Trade rumors in soccer", "Debate in parliament",
    "New GPU features", "AI startup raises", "Coach resigns",
    "Healthy diet tips", "Mental health awareness", "Sleep science", "Hospital funding"
]
contains_health_word = [int("health" in d.lower()) for d in docs]

# Compute per-slice F1
slices = {"has_health_word": [i for i,v in enumerate(contains_health_word) if v==1],
          "no_health_word": [i for i,v in enumerate(contains_health_word) if v==0]}
for s, idxs in slices.items():
    yt = [y_true[i] for i in idxs]
    yp = [y_pred[i] for i in idxs]
    print(s, "macro-F1:", f1_score(yt, yp, average="macro"))

Action: Prioritize slices with the largest drop vs overall macro-F1 and read a few misclassified examples to find patterns.

Example 2: NER — entity-level evaluation (span precision/recall/F1)

Evaluate exact-match entities, not tokens, when span correctness matters.

# Gold and predicted entity spans: (start, end, label) with end exclusive
GOLD = {"Alice lives in Paris": [(0,5,"PER"),(15,20,"LOC")]}
PRED = {"Alice lives in Paris": [(0,5,"PER"),(15,20,"ORG")]}

def prf_entity(gold_spans, pred_spans):
    gold = set(gold_spans)
    pred = set(pred_spans)
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    prec = tp/(tp+fp) if tp+fp else 0
    rec = tp/(tp+fn) if tp+fn else 0
    f1 = 2*prec*rec/(prec+rec) if prec+rec else 0
    return prec, rec, f1

for txt in GOLD:
    p, r, f = prf_entity(GOLD[txt], PRED[txt])
    print("Entity P/R/F1:", round(p,3), round(r,3), round(f,3))

The model predicted Paris as ORG instead of LOC, causing a span-level mismatch.

Example 3: Summarization — ROUGE and human evaluation rubric

Automatic metrics like ROUGE correlate imperfectly with quality. Combine with a short human rubric.

# Quick ROUGE-1 recall approximation (use a full ROUGE implementation for real reporting)
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep word characters so "committee." matches "committee"
    return re.findall(r"\w+", text.lower())

def rouge1_recall(ref, hyp):
    ref_unigrams = Counter(tokenize(ref))
    hyp_unigrams = Counter(tokenize(hyp))
    overlap = sum((ref_unigrams & hyp_unigrams).values())
    total = sum(ref_unigrams.values())
    return overlap / total if total else 0

ref = "The committee approved the budget after revisions."
hyp = "The budget was approved by the committee."
print("ROUGE-1 recall:", round(rouge1_recall(ref, hyp), 3))

Human rubric (rate 1–5 each):

  • Faithfulness: no unsupported facts.
  • Coverage: important points included.
  • Clarity: easy to understand.

Report average scores and agreement between annotators.
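
As a minimal sketch of that last step, the snippet below assumes two annotators rated the same six summaries on the 1–5 faithfulness scale (the scores are invented for illustration) and reports per-annotator means plus weighted Cohen's kappa as a simple agreement measure. Low kappa usually means the rubric definitions need tightening before a larger run.

# Hypothetical 1-5 faithfulness ratings from two annotators on the same items
import numpy as np
from sklearn.metrics import cohen_kappa_score

annotator_a = [5, 4, 4, 3, 5, 2]
annotator_b = [5, 4, 3, 3, 4, 2]

print("Mean (A):", round(float(np.mean(annotator_a)), 2))
print("Mean (B):", round(float(np.mean(annotator_b)), 2))
# Quadratic weights penalize large disagreements more than near-misses on an ordinal scale
print("Cohen's kappa:", round(cohen_kappa_score(annotator_a, annotator_b, weights="quadratic"), 3))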

Example 4: Robustness — add typos and measure accuracy drop
import random

def add_typos(text, p=0.05):
    chars = list(text)
    for i in range(len(chars)-1):
        if random.random() < p and chars[i].isalpha() and chars[i+1].isalpha():
            chars[i], chars[i+1] = chars[i+1], chars[i]
    return ''.join(chars)

texts = ["I loved the movie", "This product is terrible", "Great quality"]
labels = [1, 0, 1]

# Replace evaluate() with calls into your model API (e.g., a predict() that
# returns 0/1 labels). The version below is a dummy baseline that predicts
# positive for every input, so the numbers are placeholders.

def evaluate(texts, labels):
    preds = [1] * len(texts)
    acc = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
    return acc

acc_clean = evaluate(texts, labels)
acc_noisy = evaluate([add_typos(t, p=0.1) for t in texts], labels)
print("Accuracy clean vs noisy:", acc_clean, acc_noisy, "delta:", acc_noisy - acc_clean)

Track the delta. If the drop is large, add normalization or data augmentation.
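
If the gap is large, one option is a light normalization step at inference time and another is training on perturbed copies. A rough sketch of both, reusing the add_typos helper above; the normalize function here is illustrative, not a fixed recipe.

import re

def normalize(text):
    # Lowercase, collapse whitespace, and squeeze repeated punctuation
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    return text.strip()

# Augmentation: extend the training set with typo-perturbed copies
train_texts = ["I loved the movie", "This product is terrible"]
train_labels = [1, 0]
aug_texts = train_texts + [add_typos(t, p=0.1) for t in train_texts]
aug_labels = train_labels + train_labels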

Example 5: Fairness — TPR gap across groups
# Group by a simple heuristic (presence of pronouns as a proxy slice example)
from collections import defaultdict

docs = [
  ("He was satisfied with service", 1, 1),
  ("She disliked the delay", 0, 1),
  ("They are happy", 1, 1),
  ("She is not impressed", 0, 0),
]
# each tuple: (text, true_label, pred_label)

def group(text):
    t = text.lower()
    if " she " in " " + t + " ": return "she"
    if " he " in " " + t + " ": return "he"
    if " they " in " " + t + " ": return "they"
    return "other"

by_g = defaultdict(list)
for txt, y, yhat in docs:
    by_g[group(txt)].append((y, yhat))

def tpr(pairs):
    tp = sum(1 for y,yhat in pairs if y==1 and yhat==1)
    pos = sum(1 for y,_ in pairs if y==1)
    return tp/pos if pos else 0

for g, pairs in by_g.items():
    print(g, "TPR:", round(tpr(pairs),3))

Report the largest gap across groups and whether it is acceptable for your use case.
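
To turn that into a single reportable number, a short continuation of the example above (assuming the same by_g dictionary and tpr function) computes the largest pairwise gap:

# Largest TPR gap across the groups computed above
tprs = {g: tpr(pairs) for g, pairs in by_g.items()}
gap = max(tprs.values()) - min(tprs.values())
print("Max TPR gap:", round(gap, 3))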

Drills and exercises

  • Compute micro and macro F1 for a 3-class dataset; explain which you would report and why.
  • Build a confusion matrix; list top 2 confusions and 1 hypothesis for each.
  • Create two slices (e.g., short vs long texts) and compare metrics; propose one fix.
  • Read 50 errors; tag themes (ambiguous label, OOD, truncation, bad label).
  • Design a 5-item human rubric for a generation task; pilot with 2 annotators and compute agreement.
  • Add a simple noise transform (typos or casing) and measure metric deltas.
  • Draft a 30–100 example regression test set with expected outputs.
  • Create a CSV log with model_version, data_version, metric(s), date; add one new row per run.

Common mistakes and debugging tips

Using the wrong metric

Detect: Stakeholders care about minority class, but you report accuracy. Fix: Report macro-F1 and per-class metrics; align with business goals.

Evaluating NER at token-level only

Detect: Token F1 looks high but spans look wrong. Fix: Report entity-level exact-match F1 and per-entity breakdown.

Ignoring slices

Detect: Overall metrics are stable but complaints increase in a subgroup. Fix: Define slices by length, domain, language, and demographic proxies where appropriate; compare gaps.

Skipping label audits

Detect: Many “errors” are actually mislabeled. Fix: Sample 50–100 items and re-annotate; compute estimated label error rate.

No regression tests

Detect: Previously fixed bugs reappear after an update. Fix: Add curated test cases into a permanent suite and block releases when failing.

One-off evaluation without versioning

Detect: Can’t reproduce last week’s numbers. Fix: Log model version, data version, seed, metrics, and date every run.

Mini project: Ship an evaluation suite for a sentiment model

  1. Define metrics: macro-F1, per-class F1, and calibration (ECE; see the sketch after this list).
  2. Confusion + slices: At least 3 slices (short/long, contains negation, domain).
  3. Qualitative pass: Read 100 errors; tag themes; audit 50 labels.
  4. Robustness: Add typo noise and casing changes; report metric deltas.
  5. Fairness: Compute TPR gap across at least 2 groups (only if appropriate and safe for your use case and data).
  6. Regression tests: 60 curated examples with expected labels.
  7. Tracking: Append a row to metrics.csv (model_version, data_version, macro_f1, date).
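
Since step 1 asks for ECE, here is a minimal expected calibration error sketch using the usual equal-width-bin definition; the probabilities and labels are invented, and a production setup would likely rely on a calibration library instead.

import numpy as np

def ece(probs, labels, n_bins=10):
    # Bin positive-class probabilities and compare per-bin confidence to per-bin accuracy
    probs, labels = np.asarray(probs), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            conf = probs[mask].mean()
            acc = (labels[mask] == (probs[mask] > 0.5)).mean()
            total += mask.mean() * abs(acc - conf)
    return total

print("ECE:", round(ece([0.9, 0.8, 0.3, 0.6], [1, 1, 0, 0]), 3))
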
Deliverables checklist
  • evaluation_report.md with metrics and key findings.
  • errors.md with 5 recurring themes and example IDs.
  • regression/ folder with JSONL of test cases.
  • metrics.csv with at least two runs.

Regression test sets

Keep a compact, high-signal suite that covers previously broken cases and critical scenarios.

regression.jsonl (one example per line):
{"text": "I wouldn't recommend it", "expected": 0}
{"text": "Absolutely fantastic!", "expected": 1}
{"text": "Not bad", "expected": 1}

Rule: If any regression test fails, stop and investigate before release.
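
A minimal runner for such a suite, assuming the regression.jsonl format above; predict() here is a stand-in for your model's API, and the non-zero exit code is what lets CI block the release.

import json, sys

def predict(text):
    # Stand-in prediction; replace with your model call
    return 0 if "not" in text.lower() or "n't" in text.lower() else 1

failures = []
with open("regression.jsonl") as f:
    for line in f:
        case = json.loads(line)
        got = predict(case["text"])
        if got != case["expected"]:
            failures.append((case["text"], case["expected"], got))

for text, expected, got in failures:
    print(f"FAIL: {text!r} expected {expected}, got {got}")
if failures:
    sys.exit(1)  # block the release
print("All regression tests passed")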

Human evaluation setup

  • Rubric: 3–5 criteria, 5-point scales, clear definitions and examples.
  • Pilot: Run a small batch, compute agreement, refine guidelines.
  • Sampling: Stratify by slices (length, domain) to cover variety.
  • Aggregation: Report per-criterion means and confidence intervals.
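
For the aggregation step, a small sketch of one criterion's mean with a bootstrap 95% confidence interval; the ratings are invented placeholders.

# Bootstrap 95% CI for one criterion's mean rating (hypothetical 1-5 scores)
import numpy as np

rng = np.random.default_rng(0)
faithfulness = np.array([5, 4, 4, 3, 5, 2, 4, 3, 5, 4])

boot_means = [rng.choice(faithfulness, size=len(faithfulness), replace=True).mean()
              for _ in range(2000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Faithfulness mean={faithfulness.mean():.2f}, 95% CI=({lo:.2f}, {hi:.2f})")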

Tracking model changes over time

Log every run with versions and metrics. A simple CSV is enough to start.

metrics.csv (header row plus one line per run):

date,model_version,data_version,macro_f1,acc,ece,notes
2026-01-01,m_v1,d_v1,0.823,0.870,0.045,baseline
2026-01-03,m_v2,d_v1,0.841,0.882,0.042,add bigram features

# Quick trend check (adapt to your stack)
import csv

with open("metrics.csv", newline="") as f:
    rows = sorted(csv.DictReader(f), key=lambda r: r["date"])
for r in rows:
    print(r["date"], r["model_version"], r["macro_f1"])

Define acceptance gates, e.g., “Do not ship if macro-F1 drops > 0.5% absolute or any regression test fails.”
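
A gate like that can be a few lines in CI. The sketch below assumes the metrics.csv format above and fails the build when the newest macro-F1 drops more than 0.005 absolute versus the previous run.

# Acceptance gate: compare the latest run to the previous one
import csv, sys

with open("metrics.csv", newline="") as f:
    runs = sorted(csv.DictReader(f), key=lambda r: r["date"])

if len(runs) >= 2:
    prev, curr = float(runs[-2]["macro_f1"]), float(runs[-1]["macro_f1"])
    if curr < prev - 0.005:
        print(f"Gate failed: macro-F1 dropped from {prev:.3f} to {curr:.3f}")
        sys.exit(1)
print("Gate passed")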

Practical projects

  • Evaluate a multilingual intent classifier with macro-F1 and per-language slices. Add typos and script-mixing noise.
  • Design a small human evaluation for headline generation (faithfulness, clarity). Compare with ROUGE.
  • Build a fairness check for toxicity detection using group-specific keywords; report TPR/TNR gaps and mitigation ideas.

Next steps

  • Expand stress tests: paraphrase, emoji, punctuation, code-mixing.
  • Add calibration plots and thresholds optimized for business cost.
  • Introduce error-driven data augmentation pipelines.
  • Set up periodic evaluation jobs and dashboards for trend monitoring.

Subskills

Task Appropriate Metrics

Choose and justify metrics that match task and product goals (e.g., macro-F1 for imbalance, ROUGE+human ratings for generation).

Confusion And Slice Analysis

Use confusion matrices and per-slice metrics to pinpoint failure patterns and prioritize fixes.

Qualitative Review And Label Audits

Read errors, tag themes, and verify label quality to separate model issues from dataset problems.

Robustness To Noise

Create simple perturbations (typos, casing, punctuation) and measure metric deltas to ensure stability.

Bias And Fairness Checks Basics

Compute group metrics (e.g., TPR/TNR gaps) and report disparities with context and care.

Regression Test Sets

Build curated examples representing critical and previously broken behaviors; block releases on failures.

Human Evaluation Setup

Design clear rubrics, pilot annotation, measure agreement, and aggregate scores consistently.

Tracking Model Changes Over Time

Log model/data versions with metrics; monitor trends and enforce acceptance gates before deployment.

Skill exam

The exam is available to everyone. Logged-in learners get their progress saved. Aim for 70% to pass. You can retake it anytime.

NLP Evaluation And Error Analysis — Skill Exam

14 questions · 70% to pass
