
Inter Annotator Agreement Basics

Learn Inter Annotator Agreement Basics for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

Inter-annotator agreement (IAA) tells you how consistently humans apply your labeling rules. In NLP, poor IAA means noisy training data, weak models, and unreliable evaluations.

  • Decide if guidelines are clear enough before scaling labeling.
  • Compare human consistency with model performance goals.
  • Identify ambiguous classes and fix them early.
  • Track quality during large production labeling runs.

Concept explained simply

What is IAA?

IAA measures how often annotators independently agree on labels, beyond what would happen by chance. A high IAA means rules are clear and reproducible.

Mental model

Imagine each annotator as a camera photographing the same object. Agreement is how similar the photos are. Chance agreement is when two blurry photos just happen to look alike. IAA metrics discount this coincidence to reveal true clarity.

More on chance agreement

If one class is very common, two annotators can agree a lot just by frequently choosing it. Chance-corrected metrics (like kappa and alpha) adjust for this.
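
A quick, hypothetical illustration in Python (the 90/10 split and the assumption that both annotators label independently with those frequencies are made up for the sketch):

  # Hypothetical imbalanced task: both annotators choose "Not Spam" 90% of the time.
  p_not_spam, p_spam = 0.9, 0.1

  # Chance agreement Pe: the probability that two independent annotators pick the same class.
  pe = p_not_spam * p_not_spam + p_spam * p_spam
  print(pe)  # ~0.82 -> 82% agreement expected by chance alone

With Pe at 0.82, even 90% raw agreement yields κ = (0.90 − 0.82) / (1 − 0.82) ≈ 0.44, which is only moderate.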

Core metrics at a glance

  • Percent agreement (Po): Simple ratio of agreements to total items. Fast, intuitive, no chance correction.
  • Cohen's kappa (κ): Two annotators, nominal (unordered) categories; adjusts for chance.
  • Weighted kappa: Two annotators, ordinal categories; partial credit for near-misses using linear or quadratic weights.
  • Fleiss' kappa: Three or more annotators, same number of ratings per item; chance-corrected.
  • Krippendorff's alpha (α): Flexible; supports any number of annotators, missing labels, and different data types (nominal, ordinal, interval). Great for real projects with uneven coverage.

Quick formulas (in words)
  • Percent agreement Po = (agreements) / (total items)
  • Kappa κ = (Po − Pe) / (1 − Pe), where Pe is chance agreement from label frequencies.
  • Weighted kappa replaces Po and Pe with their weighted versions.
  • Fleiss' kappa computes per-item agreement then averages; chance from overall category proportions.
  • Alpha uses distance-based disagreement; supports missing data elegantly.
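
A minimal Python sketch of the first two formulas (the function and variable names, such as percent_agreement and labels_a, are illustrative, not from this page):

  from collections import Counter

  def percent_agreement(labels_a, labels_b):
      # Po: fraction of items where the two annotators chose the same label.
      agreements = sum(a == b for a, b in zip(labels_a, labels_b))
      return agreements / len(labels_a)

  def cohens_kappa(labels_a, labels_b):
      # Kappa = (Po - Pe) / (1 - Pe); Pe comes from each annotator's label frequencies.
      n = len(labels_a)
      po = percent_agreement(labels_a, labels_b)
      freq_a, freq_b = Counter(labels_a), Counter(labels_b)
      pe = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b))
      return (po - pe) / (1 - pe)

  # Ten items, eight agreements, balanced labels (mirrors Examples 1 and 2 below).
  a = ["Spam"] * 5 + ["NotSpam"] * 5
  b = ["Spam"] * 4 + ["NotSpam"] * 5 + ["Spam"]
  print(percent_agreement(a, b))        # 0.8
  print(round(cohens_kappa(a, b), 2))   # 0.6

With balanced labels, Pe is 0.5, so κ lands at 0.6 even though raw agreement is 0.8.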

Worked examples

Example 1: Percent agreement (binary)

10 items, two annotators label Spam vs Not Spam. They agree on 8 out of 10.

Steps
  1. Agreements = 8
  2. Total items = 10
  3. Percent agreement = 8/10 = 0.80 (80%)

Example 2: Cohen's kappa (2 annotators, nominal)

Confusion counts over 50 items (A vs B) for Positive/Negative:

  • A=Pos, B=Pos: 20
  • A=Pos, B=Neg: 5
  • A=Neg, B=Pos: 5
  • A=Neg, B=Neg: 20

Compute κ
  1. Po = (20 + 20) / 50 = 40 / 50 = 0.80
  2. Marginals: A(Pos)=25/50=0.5, A(Neg)=0.5; B(Pos)=0.5, B(Neg)=0.5
  3. Pe = 0.5×0.5 + 0.5×0.5 = 0.50
  4. κ = (0.80 − 0.50) / (1 − 0.50) = 0.30 / 0.50 = 0.60
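
The same κ can be cross-checked in code, assuming scikit-learn is available; the label lists below are reconstructed from the confusion counts above:

  from sklearn.metrics import cohen_kappa_score

  # Rebuild per-item labels from the four cells of the confusion table.
  a = ["Pos"] * 25 + ["Neg"] * 25                              # annotator A: 25 Pos, then 25 Neg
  b = ["Pos"] * 20 + ["Neg"] * 5 + ["Pos"] * 5 + ["Neg"] * 20  # annotator B, aligned cell by cell

  print(cohen_kappa_score(a, b))  # ~0.60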

Example 3: Fleiss' kappa (≥3 annotators)

3 annotators, 4 items, 3 categories (C1, C2, C3). Counts per item [C1, C2, C3]:

  • Item1: [3,0,0]
  • Item2: [0,3,0]
  • Item3: [0,2,1]
  • Item4: [2,1,0]

Compute κ (sketch)
  1. Per-item agreement P: Item1=1, Item2=1, Item3=2/6≈0.333, Item4=2/6≈0.333
  2. Average P̄ = (1+1+0.333+0.333)/4 ≈ 0.667
  3. Category proportions p: C1=5/12≈0.417, C2=6/12=0.5, C3=1/12≈0.083
  4. Pe = 0.417² + 0.5² + 0.083² ≈ 0.431
  5. κ = (0.667 − 0.431) / (1 − 0.431) ≈ 0.415
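
A sketch of the same Fleiss' κ computation, assuming numpy and a counts layout of items × categories (the fleiss_kappa helper is illustrative):

  import numpy as np

  def fleiss_kappa(counts):
      # counts: items x categories matrix; every item has the same number of ratings.
      counts = np.asarray(counts, dtype=float)
      n_raters = counts[0].sum()

      # Per-item agreement, then the average P-bar.
      p_items = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
      p_bar = p_items.mean()

      # Chance agreement from the overall category proportions.
      p_cat = counts.sum(axis=0) / counts.sum()
      p_e = np.square(p_cat).sum()

      return (p_bar - p_e) / (1 - p_e)

  counts = [[3, 0, 0], [0, 3, 0], [0, 2, 1], [2, 1, 0]]
  print(round(fleiss_kappa(counts), 3))  # ~0.415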

Example 4: Weighted kappa (ordinal)

Two annotators rate 20 items on a 1–3 scale, using quadratic weights. Count matrix (rows = annotator A, columns = annotator B):

  • Row 1: [6, 2, 0]
  • Row 2: [1, 6, 1]
  • Row 3: [0, 2, 2]

Compute κw (summary)
  1. Observed weighted agreement Pow = 0.925
  2. Expected weighted agreement Pew ≈ 0.745
  3. κw = (0.925 − 0.745) / (1 − 0.745) ≈ 0.706
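
A sketch of κw for the matrix above, assuming numpy, with rows as annotator A, columns as annotator B, and quadratic agreement weights 1 − (i − j)²/(k − 1)² (the weighted_kappa helper is illustrative):

  import numpy as np

  def weighted_kappa(confusion, kind="quadratic"):
      # confusion: k x k count matrix (annotator A in rows, annotator B in columns).
      o = np.asarray(confusion, dtype=float)
      k, n = o.shape[0], o.sum()

      # Agreement weights: 1 on the diagonal, partial credit for near-misses, 0 at maximum distance.
      i, j = np.indices((k, k))
      power = 2 if kind == "quadratic" else 1
      w = 1 - np.abs(i - j) ** power / (k - 1) ** power

      p_obs = o / n                                            # observed joint proportions
      p_exp = np.outer(o.sum(axis=1), o.sum(axis=0)) / n ** 2  # expected under independence

      p_ow, p_ew = (w * p_obs).sum(), (w * p_exp).sum()
      return (p_ow - p_ew) / (1 - p_ew)

  m = [[6, 2, 0], [1, 6, 1], [0, 2, 2]]
  print(round(weighted_kappa(m), 3))  # ~0.706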

How to plan an annotation round

  1. Define task + labels: Keep classes mutually exclusive, collectively exhaustive. Add examples and counterexamples.
  2. Pilot: Label 30–100 items with 2–3 annotators.
  3. Measure IAA: Use Cohen's κ for two fixed annotators; Fleiss' κ or α for more/uneven coverage.
  4. Calibrate: Discuss disagreements; refine rules; add tie-breakers and edge cases.
  5. Re-run pilot: Aim for stable, interpretable agreement (see thresholds below).
  6. Scale up: Monitor IAA on rolling samples; retrain annotators if drift appears.
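
For step 6, a minimal monitoring sketch, assuming scikit-learn; the batch data and the 0.6 alert threshold are purely illustrative:

  from sklearn.metrics import cohen_kappa_score

  # Hypothetical rolling batches of doubly annotated items: (annotator A, annotator B) labels.
  batches = {
      "week_1": (["Pos", "Neg", "Pos", "Neg"], ["Pos", "Neg", "Pos", "Neg"]),
      "week_2": (["Neg", "Neg", "Pos", "Pos"], ["Neg", "Pos", "Pos", "Pos"]),
  }

  for name, (a, b) in batches.items():
      kappa = cohen_kappa_score(a, b)
      flag = "  <- review guidelines / retrain" if kappa < 0.6 else ""
      print(f"{name}: kappa = {kappa:.2f}{flag}")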

Thresholds and interpretation

  • Heuristic bands (often cited): κ/α ≤ 0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect.
  • Use domain judgment: medical/financial tasks may require ≥0.80; exploratory tasks can accept 0.60–0.75 to iterate quickly.
  • High prevalence or class imbalance can depress κ even with high raw agreement; check confusion patterns.
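
Those heuristic bands can be encoded as a small lookup; the cutoffs mirror the list above and remain rules of thumb, not hard requirements:

  def agreement_band(score):
      # Map a chance-corrected score (kappa or alpha) to the heuristic band label.
      if score <= 0.20:
          return "slight"
      if score <= 0.40:
          return "fair"
      if score <= 0.60:
          return "moderate"
      if score <= 0.80:
          return "substantial"
      return "almost perfect"

  print(agreement_band(0.60))   # moderate     (Example 2)
  print(agreement_band(0.706))  # substantial  (Example 4)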

Exercises

Do these before the quick test. Anyone can take the test; if you log in, your progress is saved.

Exercise 1: Compute Cohen's kappa from a confusion table

Two annotators label 50 items as Positive/Negative. Counts:

  • A=Pos, B=Pos: 20
  • A=Pos, B=Neg: 5
  • A=Neg, B=Pos: 5
  • A=Neg, B=Neg: 20

Task: Calculate percent agreement and Cohen's κ. Round to 2 decimals.

Hints
  • Percent agreement is (diagonal sum) / total items.
  • Compute marginal proportions for each class and annotator.
  • Chance agreement Pe = sum over classes of (pA_class × pB_class).
  • κ = (Po − Pe) / (1 − Pe).

Solution

Po = (20+20)/50 = 0.80; pA(Pos)=0.5, pA(Neg)=0.5; pB(Pos)=0.5, pB(Neg)=0.5; Pe = 0.5×0.5 + 0.5×0.5 = 0.50; κ = (0.80 − 0.50)/(1 − 0.50) = 0.60.

Self-check checklist

  • I computed Po and Pe separately.
  • I verified marginals sum to 1.
  • I can explain why Pe is high when a class dominates.

Common mistakes and how to self-check

  • Using percent agreement alone. Self-check: Also compute a chance-corrected metric (κ/α).
  • Ignoring label imbalance. Self-check: Inspect class distributions and confusion matrix; compare Po vs κ.
  • Wrong metric for setup. Self-check: Two annotators? Use Cohen/weighted κ. Variable annotators or missing labels? Use α.
  • Overfitting guidelines to a tiny pilot. Self-check: Validate on a fresh batch after revisions.
  • Treating ordinal labels as nominal. Self-check: Use weighted κ or α with an appropriate distance function (see the sketch after this list).
  • Not documenting edge cases. Self-check: After adjudication, record examples and final decisions.
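
To see why the ordinal point matters, compare unweighted and quadratic-weighted κ on the Example 4 data, assuming scikit-learn (the per-item 1–3 labels are reconstructed from that count matrix):

  from sklearn.metrics import cohen_kappa_score

  # Per-item ratings rebuilt from the Example 4 matrix (rows = annotator A, columns = annotator B).
  a = [1] * 8 + [2] * 8 + [3] * 4
  b = [1] * 6 + [2] * 2 + [1] * 1 + [2] * 6 + [3] * 1 + [2] * 2 + [3] * 2

  print(round(cohen_kappa_score(a, b), 3))                       # ~0.524 (treated as nominal)
  print(round(cohen_kappa_score(a, b, weights="quadratic"), 3))  # ~0.706 (ordinal, partial credit)

The weighted variant gives partial credit for near-misses, so it sits well above the nominal value on the same data.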

Practical projects

  • Design a labeling guideline for 3-class sentiment, run a 50-item pilot with 3 annotators, and report percent agreement, weighted κ, and key disagreements.
  • Build a light QA workbook: per-annotator confusion patterns, rolling κ over time, and flagged items for adjudication.
  • Reconcile a low-κ dataset: propose guideline changes, re-run a pilot, and quantify improvement (before/after κ/α and error types).

Who this is for

  • NLP Engineers and MLEs preparing datasets.
  • Data annotators and label ops leads.
  • Product analysts designing human-in-the-loop systems.

Prerequisites

  • Basic probability and ratios.
  • Understanding of your labeling task and classes.
  • Ability to read a confusion matrix.

Learning path

  1. Define labels and write concise guidelines with examples.
  2. Run a small pilot with multiple annotators.
  3. Compute IAA (percent agreement + κ/α) and diagnose disagreements.
  4. Refine guidelines and retrain annotators.
  5. Scale labeling with periodic IAA checks and adjudication.

Next steps

  • Implement weighted κ for your ordinal tasks.
  • Set up a weekly IAA review with annotators.
  • Document decision rules and add them to your guideline appendix.

Mini challenge

You have 4 annotators labeling product reviews into 5 sentiments (strongly negative to strongly positive), with some items labeled by only 2 or 3 annotators. Choose the best metric and explain why, then list two steps to improve IAA if it is low.

Suggested answer

Use Krippendorff's alpha with ordinal distance: handles variable annotators and ordered labels. Improve IAA by clarifying boundary cases (e.g., neutral vs slightly positive) and running a calibration session with adjudication examples.
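
A minimal sketch of that setup, assuming the third-party krippendorff package (pip install krippendorff); rows are annotators, columns are items, np.nan marks items an annotator skipped, and the ratings themselves are made up:

  import numpy as np
  import krippendorff  # assumed: the PyPI "krippendorff" package

  # 4 annotators x 6 reviews on a 1-5 ordinal sentiment scale; np.nan = not labeled by that annotator.
  ratings = np.array([
      [1, 2, 3, 3, 5, np.nan],
      [1, 2, 3, 4, 5, 4],
      [np.nan, 2, 3, 3, 4, 4],
      [1, 3, np.nan, 3, 5, 4],
  ])

  alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
  print(round(alpha, 3))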

Ready? Take the quick test

The quick test is available to everyone; only logged-in users have their progress saved.

Practice Exercises

1 exercise to complete

Instructions

Two annotators label 50 items as Positive/Negative. Counts:

  • A=Pos, B=Pos: 20
  • A=Pos, B=Neg: 5
  • A=Neg, B=Pos: 5
  • A=Neg, B=Neg: 20

Task: Calculate percent agreement and Cohen's κ. Round to 2 decimals.

Expected Output
Percent agreement = 0.80; Cohen's kappa ≈ 0.60

Inter Annotator Agreement Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

