
Inter Annotator Agreement Basics

Learn Inter Annotator Agreement Basics for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

Inter-annotator agreement (IAA) tells you how consistently humans apply your labeling rules. In NLP, poor IAA means noisy training data, weak models, and unreliable evaluations.

  • Decide if guidelines are clear enough before scaling labeling.
  • Compare human consistency with model performance goals.
  • Identify ambiguous classes and fix them early.
  • Track quality during large production labeling runs.

Concept explained simply

What is IAA?

IAA measures how often annotators independently agree on labels, beyond what would happen by chance. A high IAA means rules are clear and reproducible.

Mental model

Imagine each annotator as a camera photographing the same object. Agreement is how similar the photos are. Chance agreement is when two blurry photos just happen to look alike. IAA metrics discount this coincidence to reveal true clarity.

More on chance agreement

If one class is very common, two annotators can agree a lot just by frequently choosing it. Chance-corrected metrics (like kappa and alpha) adjust for this.
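
A quick, hypothetical illustration in Python (the 90/10 split and the assumption that both annotators label independently with those frequencies are made up for the sketch):

  # Hypothetical imbalanced task: both annotators choose "Not Spam" 90% of the time.
  p_not_spam, p_spam = 0.9, 0.1

  # Chance agreement Pe: the probability that two independent annotators pick the same class.
  pe = p_not_spam * p_not_spam + p_spam * p_spam
  print(pe)  # ~0.82 -> 82% agreement expected by chance alone

With Pe at 0.82, even 90% raw agreement yields κ = (0.90 − 0.82) / (1 − 0.82) ≈ 0.44, which is only moderate.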

Core metrics at a glance

  • Percent agreement (Po): Simple ratio of agreements to total items. Fast, intuitive, no chance correction.
  • Cohen's kappa (κ): Two annotators, nominal (unordered) categories; adjusts for chance.
  • Weighted kappa: Two annotators, ordinal categories; partial credit for near-misses using linear or quadratic weights.
  • Fleiss' kappa: Three or more annotators, same number of ratings per item; chance-corrected.
  • Krippendorff's alpha (α): Flexible; supports any number of annotators, missing labels, and different data types (nominal, ordinal, interval). Great for real projects with uneven coverage.

Quick formulas (in words)
  • Percent agreement Po = (agreements) / (total items)
  • Kappa κ = (Po − Pe) / (1 − Pe), where Pe is chance agreement from label frequencies.
  • Weighted kappa replaces Po and Pe with their weighted versions.
  • Fleiss' kappa computes per-item agreement then averages; chance from overall category proportions.
  • Alpha uses distance-based disagreement; supports missing data elegantly.
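
A minimal Python sketch of the first two formulas (the function and variable names, such as percent_agreement and labels_a, are illustrative, not from this page):

  from collections import Counter

  def percent_agreement(labels_a, labels_b):
      # Po: fraction of items where the two annotators chose the same label.
      agreements = sum(a == b for a, b in zip(labels_a, labels_b))
      return agreements / len(labels_a)

  def cohens_kappa(labels_a, labels_b):
      # Kappa = (Po - Pe) / (1 - Pe); Pe comes from each annotator's label frequencies.
      n = len(labels_a)
      po = percent_agreement(labels_a, labels_b)
      freq_a, freq_b = Counter(labels_a), Counter(labels_b)
      pe = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b))
      return (po - pe) / (1 - pe)

  # Ten items, eight agreements, balanced labels (mirrors Examples 1 and 2 below).
  a = ["Spam"] * 5 + ["NotSpam"] * 5
  b = ["Spam"] * 4 + ["NotSpam"] * 5 + ["Spam"]
  print(percent_agreement(a, b))        # 0.8
  print(round(cohens_kappa(a, b), 2))   # 0.6

With balanced labels, Pe is 0.5, so κ lands at 0.6 even though raw agreement is 0.8.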

Worked examples

Example 1: Percent agreement (binary)

10 items, two annotators label Spam vs Not Spam. They agree on 8 out of 10.

Steps
  1. Agreements = 8
  2. Total items = 10
  3. Percent agreement = 8/10 = 0.80 (80%)

Example 2: Cohen's kappa (2 annotators, nominal)

Confusion counts over 50 items (A vs B) for Positive/Negative:

  • A=Pos, B=Pos: 20
  • A=Pos, B=Neg: 5
  • A=Neg, B=Pos: 5
  • A=Neg, B=Neg: 20

Compute κ
  1. Po = (20 + 20) / 50 = 40 / 50 = 0.80
  2. Marginals: A(Pos)=25/50=0.5, A(Neg)=0.5; B(Pos)=0.5, B(Neg)=0.5
  3. Pe = 0.5×0.5 + 0.5×0.5 = 0.50
  4. κ = (0.80 − 0.50) / (1 − 0.50) = 0.30 / 0.50 = 0.60
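
The same κ can be cross-checked in code, assuming scikit-learn is available; the label lists below are reconstructed from the confusion counts above:

  from sklearn.metrics import cohen_kappa_score

  # Rebuild per-item labels from the four cells of the confusion table.
  a = ["Pos"] * 25 + ["Neg"] * 25                              # annotator A: 25 Pos, then 25 Neg
  b = ["Pos"] * 20 + ["Neg"] * 5 + ["Pos"] * 5 + ["Neg"] * 20  # annotator B, aligned cell by cell

  print(cohen_kappa_score(a, b))  # ~0.60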

Example 3: Fleiss' kappa (≥3 annotators)

3 annotators, 4 items, 3 categories (C1, C2, C3). Counts per item [C1, C2, C3]:

  • Item1: [3,0,0]
  • Item2: [0,3,0]
  • Item3: [0,2,1]
  • Item4: [2,1,0]

Compute κ (sketch)
  1. Per-item agreement P: Item1=1, Item2=1, Item3=2/6≈0.333, Item4=2/6≈0.333
  2. Average P̄ = (1+1+0.333+0.333)/4 ≈ 0.667
  3. Category proportions p: C1=5/12≈0.417, C2=6/12=0.5, C3=1/12≈0.083
  4. Pe = 0.417² + 0.5² + 0.083² ≈ 0.431
  5. κ = (0.667 − 0.431) / (1 − 0.431) ≈ 0.415
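
A sketch of the same Fleiss' κ computation, assuming numpy and a counts layout of items × categories (the fleiss_kappa helper is illustrative):

  import numpy as np

  def fleiss_kappa(counts):
      # counts: items x categories matrix; every item has the same number of ratings.
      counts = np.asarray(counts, dtype=float)
      n_raters = counts[0].sum()

      # Per-item agreement, then the average P-bar.
      p_items = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
      p_bar = p_items.mean()

      # Chance agreement from the overall category proportions.
      p_cat = counts.sum(axis=0) / counts.sum()
      p_e = np.square(p_cat).sum()

      return (p_bar - p_e) / (1 - p_e)

  counts = [[3, 0, 0], [0, 3, 0], [0, 2, 1], [2, 1, 0]]
  print(round(fleiss_kappa(counts), 3))  # ~0.415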

Example 4: Weighted kappa (ordinal)

Two annotators rate 20 items on a 1–3 scale, using quadratic weights. Count matrix (rows = annotator A, columns = annotator B):

  • Row 1: [6, 2, 0]
  • Row 2: [1, 6, 1]
  • Row 3: [0, 2, 2]

Compute κw (summary)
  1. Observed weighted agreement Pow = 0.925
  2. Expected weighted agreement Pew ≈ 0.745
  3. κw = (0.925 − 0.745) / (1 − 0.745) ≈ 0.706
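
A sketch of κw for the matrix above, assuming numpy, with rows as annotator A, columns as annotator B, and quadratic agreement weights 1 − (i − j)²/(k − 1)² (the weighted_kappa helper is illustrative):

  import numpy as np

  def weighted_kappa(confusion, kind="quadratic"):
      # confusion: k x k count matrix (annotator A in rows, annotator B in columns).
      o = np.asarray(confusion, dtype=float)
      k, n = o.shape[0], o.sum()

      # Agreement weights: 1 on the diagonal, partial credit for near-misses, 0 at maximum distance.
      i, j = np.indices((k, k))
      power = 2 if kind == "quadratic" else 1
      w = 1 - np.abs(i - j) ** power / (k - 1) ** power

      p_obs = o / n                                            # observed joint proportions
      p_exp = np.outer(o.sum(axis=1), o.sum(axis=0)) / n ** 2  # expected under independence

      p_ow, p_ew = (w * p_obs).sum(), (w * p_exp).sum()
      return (p_ow - p_ew) / (1 - p_ew)

  m = [[6, 2, 0], [1, 6, 1], [0, 2, 2]]
  print(round(weighted_kappa(m), 3))  # ~0.706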

How to plan an annotation round

  1. Define task + labels: Keep classes mutually exclusive, collectively exhaustive. Add examples and counterexamples.
  2. Pilot: Label 30–100 items with 2–3 annotators.
  3. Measure IAA: Use Cohen's κ for two fixed annotators; Fleiss' κ or α for more/uneven coverage.
  4. Calibrate: Discuss disagreements; refine rules; add tie-breakers and edge cases.
  5. Re-run pilot: Aim for stable, interpretable agreement (see thresholds below).
  6. Scale up: Monitor IAA on rolling samples; retrain annotators if drift appears.
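
For step 6, a minimal monitoring sketch, assuming scikit-learn; the batch data and the 0.6 alert threshold are purely illustrative:

  from sklearn.metrics import cohen_kappa_score

  # Hypothetical rolling batches of doubly annotated items: (annotator A, annotator B) labels.
  batches = {
      "week_1": (["Pos", "Neg", "Pos", "Neg"], ["Pos", "Neg", "Pos", "Neg"]),
      "week_2": (["Neg", "Neg", "Pos", "Pos"], ["Neg", "Pos", "Pos", "Pos"]),
  }

  for name, (a, b) in batches.items():
      kappa = cohen_kappa_score(a, b)
      flag = "  <- review guidelines / retrain" if kappa < 0.6 else ""
      print(f"{name}: kappa = {kappa:.2f}{flag}")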

Thresholds and interpretation

  • Heuristic bands (often cited): κ/α ≤ 0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect.
  • Use domain judgment: medical/financial tasks may require ≥0.80; exploratory tasks can accept 0.60–0.75 to iterate quickly.
  • High prevalence or class imbalance can depress κ even with high raw agreement; check confusion patterns.
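
Those heuristic bands can be encoded as a small lookup; the cutoffs mirror the list above and remain rules of thumb, not hard requirements:

  def agreement_band(score):
      # Map a chance-corrected score (kappa or alpha) to the heuristic band label.
      if score <= 0.20:
          return "slight"
      if score <= 0.40:
          return "fair"
      if score <= 0.60:
          return "moderate"
      if score <= 0.80:
          return "substantial"
      return "almost perfect"

  print(agreement_band(0.60))   # moderate     (Example 2)
  print(agreement_band(0.706))  # substantial  (Example 4)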

Exercises

Do these before the quick test. Anyone can take the test; if you log in, your progress is saved.

Exercise 1: Compute Cohen's kappa from a confusion table

Two annotators label 50 items as Positive/Negative. Counts:

  • A=Pos, B=Pos: 20
  • A=Pos, B=Neg: 5
  • A=Neg, B=Pos: 5
  • A=Neg, B=Neg: 20

Task: Calculate percent agreement and Cohen's κ. Round to 2 decimals.

Hints
  • Percent agreement is (diagonal sum) / total items.
  • Compute marginal proportions for each class and annotator.
  • Chance agreement Pe = sum over classes of (pA_class × pB_class).
  • κ = (Po − Pe) / (1 − Pe).

Solution

Po = (20+20)/50 = 0.80; pA(Pos)=0.5, pA(Neg)=0.5; pB(Pos)=0.5, pB(Neg)=0.5; Pe = 0.5×0.5 + 0.5×0.5 = 0.50; κ = (0.80 − 0.50)/(1 − 0.50) = 0.60.

Self-check checklist

  • I computed Po and Pe separately.
  • I verified marginals sum to 1.
  • I can explain why Pe is high when a class dominates.

Common mistakes and how to self-check

  • Using percent agreement alone. Self-check: Also compute a chance-corrected metric (κ/α).
  • Ignoring label imbalance. Self-check: Inspect class distributions and confusion matrix; compare Po vs κ.
  • Wrong metric for setup. Self-check: Two annotators? Use Cohen/weighted κ. Variable annotators or missing labels? Use α.
  • Overfitting guidelines to a tiny pilot. Self-check: Validate on a fresh batch after revisions.
  • Treating ordinal labels as nominal. Self-check: Use weighted κ or α with an appropriate distance function (see the sketch after this list).
  • Not documenting edge cases. Self-check: After adjudication, record examples and final decisions.
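
To see why the ordinal point matters, compare unweighted and quadratic-weighted κ on the Example 4 data, assuming scikit-learn (the per-item 1–3 labels are reconstructed from that count matrix):

  from sklearn.metrics import cohen_kappa_score

  # Per-item ratings rebuilt from the Example 4 matrix (rows = annotator A, columns = annotator B).
  a = [1] * 8 + [2] * 8 + [3] * 4
  b = [1] * 6 + [2] * 2 + [1] * 1 + [2] * 6 + [3] * 1 + [2] * 2 + [3] * 2

  print(round(cohen_kappa_score(a, b), 3))                       # ~0.524 (treated as nominal)
  print(round(cohen_kappa_score(a, b, weights="quadratic"), 3))  # ~0.706 (ordinal, partial credit)

The weighted variant gives partial credit for near-misses, so it sits well above the nominal value on the same data.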

Practical projects

  • Design a labeling guideline for 3-class sentiment, run a 50-item pilot with 3 annotators, and report percent agreement, weighted κ, and key disagreements.
  • Build a light QA workbook: per-annotator confusion patterns, rolling κ over time, and flagged items for adjudication.
  • Reconcile a low-κ dataset: propose guideline changes, re-run a pilot, and quantify improvement (before/after κ/α and error types).

Who this is for

  • NLP Engineers and MLEs preparing datasets.
  • Data annotators and label ops leads.
  • Product analysts designing human-in-the-loop systems.

Prerequisites

  • Basic probability and ratios.
  • Understanding of your labeling task and classes.
  • Ability to read a confusion matrix.

Learning path

  1. Define labels and write concise guidelines with examples.
  2. Run a small pilot with multiple annotators.
  3. Compute IAA (percent agreement + κ/α) and diagnose disagreements.
  4. Refine guidelines and retrain annotators.
  5. Scale labeling with periodic IAA checks and adjudication.

Next steps

  • Implement weighted κ for your ordinal tasks.
  • Set up a weekly IAA review with annotators.
  • Document decision rules and add them to your guideline appendix.

Mini challenge

You have 4 annotators labeling product reviews into 5 sentiments (strongly negative to strongly positive), with some items labeled by only 2 or 3 annotators. Choose the best metric and explain why, then list two steps to improve IAA if it is low.

Suggested answer

Use Krippendorff's alpha with ordinal distance: handles variable annotators and ordered labels. Improve IAA by clarifying boundary cases (e.g., neutral vs slightly positive) and running a calibration session with adjudication examples.
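
A minimal sketch of that setup, assuming the third-party krippendorff package (pip install krippendorff); rows are annotators, columns are items, np.nan marks items an annotator skipped, and the ratings themselves are made up:

  import numpy as np
  import krippendorff  # assumed: the PyPI "krippendorff" package

  # 4 annotators x 6 reviews on a 1-5 ordinal sentiment scale; np.nan = not labeled by that annotator.
  ratings = np.array([
      [1, 2, 3, 3, 5, np.nan],
      [1, 2, 3, 4, 5, 4],
      [np.nan, 2, 3, 3, 4, 4],
      [1, 3, np.nan, 3, 5, 4],
  ])

  alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
  print(round(alpha, 3))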

Ready? Take the quick test

The quick test is available to everyone; only logged-in users have their progress saved.

Practice Exercises

1 exercise to complete

Instructions

Two annotators label 50 items as Positive/Negative. Counts:

  • A=Pos, B=Pos: 20
  • A=Pos, B=Neg: 5
  • A=Neg, B=Pos: 5
  • A=Neg, B=Neg: 20

Task: Calculate percent agreement and Cohen's κ. Round to 2 decimals.

Expected Output
Percent agreement = 0.80; Cohen's kappa ≈ 0.60

Inter Annotator Agreement Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

