Who this is for
Applied Scientists, Data Scientists, and ML Engineers who need reliable labels for training, evaluating, and monitoring ML systems. Also useful for product managers and annotator leads who define labeling rules.
Prerequisites
- Basic understanding of classification, detection, or sequence labeling tasks.
- Familiarity with precision/recall and dataset splits.
- Some experience reviewing annotated data (even small samples).
Learning path
- Define the problem and user outcome.
- Design a label taxonomy (classes, hierarchy, constraints).
- Draft clear labeling guidelines (definitions, edge cases, examples).
- Pilot with 50–200 items, measure inter-annotator agreement (IAA), revise.
- Operationalize QA: gold checks, spot audits, versioning.
- Monitor drift and iterate.
Why this matters
In real projects you will:
- Decide whether a task is multi-class or multi-label, and what the valid label choices are.
- Write guidelines so multiple annotators label consistently.
- Handle ambiguous items with an "Uncertain" or "Abstain" class.
- Track taxonomy versions so model metrics remain comparable over time.
Good taxonomy and guidelines increase label consistency, boost model performance, reduce rework, and cut costs.
Concept explained simply
Label taxonomy = the set of labels your model can predict and the rules that relate them. Labeling guidelines = the instructions for humans (or heuristics) to apply those labels in a consistent way.
Think of the taxonomy as the "menu" and the guidelines as the "recipe" for each item on the menu.
Mental model
- Contract: The taxonomy and guidelines are a contract between data, model, and evaluators.
- Entropy reducer: Each rule reduces uncertainty and disagreement.
- Versioned API: Changing labels is a breaking change; version and communicate it.
Key elements of a solid label taxonomy
- Task type: multi-class (one label), multi-label (many labels), binned regression, sequence labeling, detection.
- Granularity: enough detail to be useful, but simple enough for consistent labeling.
- Constraints: mutually exclusive sets, hierarchical rules, dependencies.
- Coverage: include catch-all (Other/Unknown) and uncertainty/abstain.
- Measurability: clear definitions that allow reliable agreement.
- Versioning: ID, date, change log.
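These elements are easier to keep honest when the taxonomy header is machine-readable. A minimal sketch in Python (the field names and changelog notes are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaxonomySpec:
    """Machine-readable taxonomy header; all field names are illustrative."""
    name: str
    version: str        # bump on any breaking label change
    task_type: str      # e.g. "multi-class", "multi-label", "detection"
    labels: tuple       # full label set, including Other/Unknown and Uncertain
    changelog: tuple = ()  # one human-readable note per version (invented below)

support_intents = TaxonomySpec(
    name="Customer Support Intents",
    version="v1.2",
    task_type="multi-class",
    labels=("Billing_Issue", "Cancel_Subscription", "Technical_Bug",
            "Feature_Request", "Account_Access", "Other", "Uncertain"),
    changelog=("v1.2: added Account_Access",),
)
```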
Guidelines structure (recommended)
- Purpose and success criteria.
- Label list with short definitions.
- Decision rules and tie-breakers.
- Positive/negative examples per label.
- Ambiguity policy: when to use Uncertain/Abstain and how to flag issues.
- QA policy: gold questions, double-labeling rate, IAA target, audit process.
- Version header: taxonomy version, last updated, owner.
Worked examples
Example 1 – Sentiment for product reviews (multi-class)
Taxonomy: Positive, Neutral, Negative, Uncertain.
- Mutual exclusivity: choose exactly one label; use Uncertain when you cannot decide.
- Decision rule: if both praise and complaint, prefer the strongest expressed emotion; if balanced, Neutral.
- Edge case: sarcasm → Uncertain unless clear.
Guideline snippet:
- Positive: clear praise without major complaints. Example: "Love the battery life!"
- Neutral: facts or mixed views with equal weight. Example: "Battery is ok, screen is ok."
- Negative: complaint, frustration. Example: "Battery dies in 2 hours."
- Uncertain: language unclear, off-topic, or sarcasm not resolvable.
Example 2 – Content moderation (multi-label)
Taxonomy: Spam, Harassment, Adult/NSFW, Hate, Safe.
- Multi-label: multiple harmful categories can co-occur; Safe must not co-occur with any harm category.
- Decision rule: If any harm category is present → do not select Safe.
- Ambiguity: sexual health education → do not label Adult/NSFW; mark Safe.
QA: 10% gold items with known answers; Cohen's kappa target ≥ 0.7.
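The Safe co-occurrence constraint can be enforced mechanically before an annotation is accepted. A minimal sketch, assuming each annotation arrives as a Python set of label strings:

```python
HARM_LABELS = {"Spam", "Harassment", "Adult/NSFW", "Hate"}
ALL_LABELS = HARM_LABELS | {"Safe"}

def validate_moderation_labels(labels: set) -> list:
    """Return guideline violations for one annotation (empty list = valid)."""
    errors = []
    if not labels:
        errors.append("at least one label is required")
    if "Safe" in labels and labels & HARM_LABELS:
        errors.append("Safe must not co-occur with any harm category")
    if labels - ALL_LABELS:
        errors.append(f"labels outside the taxonomy: {sorted(labels - ALL_LABELS)}")
    return errors

print(validate_moderation_labels({"Safe", "Spam"}))
# -> ['Safe must not co-occur with any harm category']
```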
Example 3 – Object detection for vehicles (bounding boxes)
Taxonomy: Car, Bus, Truck, Motorcycle, Bicycle, UnknownVehicle.
- Constraints: one class per box; min box size 20×20 px; occlusion allowed if ≥ 30% visible.
- Edge rule: vans → Truck; scooters → Motorcycle; ambiguous → UnknownVehicle.
- Box policy: tight around visible extents; no padding; overlapping allowed if different objects.
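The mechanical parts of these constraints (class set, minimum size) can be checked automatically before a box is accepted; the occlusion rule (≥ 30% visible) still needs human judgment. A minimal sketch, assuming boxes arrive as (x, y, width, height) pixel tuples:

```python
VEHICLE_CLASSES = {"Car", "Bus", "Truck", "Motorcycle", "Bicycle", "UnknownVehicle"}
MIN_SIDE_PX = 20  # guideline: minimum box size 20x20 px

def validate_box(label: str, box: tuple) -> list:
    """Check one (x, y, width, height) box against the mechanical constraints."""
    x, y, w, h = box
    errors = []
    if label not in VEHICLE_CLASSES:  # one class per box, from the defined set
        errors.append(f"unknown class: {label}")
    if w < MIN_SIDE_PX or h < MIN_SIDE_PX:
        errors.append(f"{w}x{h} px box is below the {MIN_SIDE_PX}x{MIN_SIDE_PX} px minimum")
    return errors

print(validate_box("Van", (10, 10, 15, 40)))
# -> ['unknown class: Van', '15x40 px box is below the 20x20 px minimum']
```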
Example 4 – Intent classification for support tickets
Taxonomy (flat, multi-class): Billing_Issue, Cancel_Subscription, Technical_Bug, Feature_Request, Account_Access, Other, Uncertain.
- Tie-breaker: If a ticket clearly requests cancellation, choose Cancel_Subscription even if billing is mentioned.
- Other: on-topic but not in the defined set; Uncertain: not enough information.
- Sampling: include at least 10–20 examples per label in the guideline appendix.
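One way to make a tie-breaker like this reproducible is an explicit precedence order that annotators, reviewers, and tools all share. A sketch; only the Cancel_Subscription-over-Billing_Issue rule comes from the guideline above, the rest of the ordering is assumed for illustration:

```python
# Precedence order for resolving multi-intent tickets; earlier entries win.
INTENT_PRIORITY = [
    "Cancel_Subscription",  # overrides Billing_Issue per the tie-breaker
    "Account_Access",
    "Technical_Bug",
    "Billing_Issue",
    "Feature_Request",
    "Other",
]

def resolve_intent(candidates: set) -> str:
    """Pick exactly one label from the intents an annotator considered."""
    for intent in INTENT_PRIORITY:
        if intent in candidates:
            return intent
    return "Uncertain"  # no candidate matched: not enough information

print(resolve_intent({"Billing_Issue", "Cancel_Subscription"}))
# -> Cancel_Subscription
```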
Quality, agreement, and iteration
- Pilot: label 100–200 items with 2–3 annotators each.
- Measure IAA: Cohen's kappa (two annotators) or Krippendorff's alpha (multiple). Target ≥ 0.7; revise if lower.
- Analyze disagreements: add rules or examples where confusion is highest.
- Gold checks: embed known-answer items; review annotator-specific confusion.
- Versioning: increment version and note changes; keep old-to-new mapping for metrics continuity.
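For the two-annotator case, scikit-learn's cohen_kappa_score computes kappa directly (for several annotators, a Krippendorff's alpha implementation such as the krippendorff package is one option). A minimal sketch on toy pilot labels:

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same pilot items, aligned by index.
annotator_a = ["Positive", "Negative", "Neutral", "Positive", "Uncertain"]
annotator_b = ["Positive", "Negative", "Positive", "Positive", "Neutral"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

if kappa < 0.7:  # the target stated above
    print("Below target: analyze disagreements and revise the guidelines.")
```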
Practical templates
Taxonomy header template
Taxonomy name: Customer Support Intents
Version: v1.2
Owner: Applied Science Team
Last updated: 2026-01-07
Task type: Multi-class
Labels: [Billing_Issue, Cancel_Subscription, Technical_Bug, Feature_Request, Account_Access, Other, Uncertain]
Constraints: Exactly one label per ticket; Cancel_Subscription overrides Billing_Issue
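Once the header lives as data, labeling tools can enforce its constraints automatically. A minimal sketch, assuming each annotated ticket is submitted as a list of chosen labels:

```python
TAXONOMY_HEADER = {
    "version": "v1.2",
    "labels": {"Billing_Issue", "Cancel_Subscription", "Technical_Bug",
               "Feature_Request", "Account_Access", "Other", "Uncertain"},
}

def validate_annotation(chosen: list) -> list:
    """Enforce the header's multi-class constraint on one annotated ticket."""
    errors = []
    if len(chosen) != 1:
        errors.append("multi-class: exactly one label per ticket")
    for label in chosen:
        if label not in TAXONOMY_HEADER["labels"]:
            errors.append(f"not in taxonomy {TAXONOMY_HEADER['version']}: {label}")
    return errors

print(validate_annotation(["Billing_Issue", "Technical_Bug"]))
# -> ['multi-class: exactly one label per ticket']
```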
Guideline rule template
Label: Technical_Bug
Definition: Customer reports product malfunction or error messages.
Positive examples: "App crashes on login", "Error 503 when saving"
Negative examples: "How do I use feature X?" (not a bug)
Tie-breaker: If a ticket contains both a request and a bug report, pick Technical_Bug when the error is blocking.
Abstain: If there is not enough information to determine whether an error exists.
Exercises
Do these with a small sample (15–30 items) to pressure-test your design.
Exercise 1 – Design a flat taxonomy for support intents
Create a multi-class taxonomy for a SaaS support inbox. Include 6–8 labels, a catch-all, and Uncertain. Write three tie-breaker rules and three examples per label.
- Deliverable: taxonomy list, short definitions, rules, and examples.
Exercise 2 – Draft moderation guidelines with QA
Define multi-label categories: Spam, Harassment, Adult/NSFW, Hate, Safe. Add edge-case guidance, 2–3 gold questions per label, and an IAA target with a plan to measure it on a 100-item pilot.
Post-task checklist
- Each label has a clear definition and 2+ positive and negative examples
- Constraints (mutual exclusivity, overrides) are stated
- Catch-all and Uncertain/Abstain policies are defined
- You have a pilot plan with IAA metric and threshold
- Version header and change log are included
Common mistakes and how to self-check
- Too many labels: agreement collapses. Self-check: can two annotators each pick a label in under 10 seconds per item?
- No catch-all: forces wrong labels. Self-check: sample ambiguous items; do they fit somewhere sensible?
- Undefined tie-breakers: inconsistent labels. Self-check: list your top 5 confusion pairs; add an explicit rule for each.
- Skipping Uncertain/Abstain: encourages guessing. Self-check: measure the Uncertain rate; target 3–10% on pilots (see the sketch after this list).
- No versioning: metric drift. Self-check: does the header include version, date, and mapping notes?
- Insufficient examples: rules feel abstract. Self-check: 10–20 examples per label, with diverse edge cases.
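The Uncertain-rate self-check mentioned above takes a few lines once pilot labels are in a list. A minimal sketch on toy data:

```python
from collections import Counter

pilot_labels = ["Positive", "Uncertain", "Negative", "Positive", "Uncertain",
                "Neutral", "Positive", "Negative", "Positive", "Neutral"]

rate = Counter(pilot_labels)["Uncertain"] / len(pilot_labels)
print(f"Uncertain rate: {rate:.0%}")  # 20% here, above the 3-10% target:
# annotators are guessing or the guidelines need sharper decision rules.
```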
Practical projects
- Build a v1 taxonomy and guidelines for your product's top user-facing ML task.
- Run a 150-item double-annotated pilot, compute IAA, and iterate to v1.1.
- Create a gold set of 50 items covering every label and tricky edge case.
- Ship a labeling QA playbook: sampling strategy, audit frequency, escalation path.
Next steps
- Integrate your taxonomy into labeling tools and enforce constraints.
- Establish a quarterly review of drift and taxonomy updates.
- Document an old-to-new mapping if you ever merge or split labels, to preserve historical metrics (see the sketch below).
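A merge or split is least painful when recorded as an explicit old-to-new mapping that can be replayed over historical labels. A minimal sketch; Refund_Request and Payments are invented label names used only for illustration:

```python
# v1 -> v2 mapping after a hypothetical merge of payment-related intents.
V1_TO_V2 = {
    "Billing_Issue": "Payments",
    "Refund_Request": "Payments",      # merged into Payments in v2
    "Technical_Bug": "Technical_Bug",  # unchanged
}

def remap_to_v2(labels: list) -> list:
    """Project historical v1 labels onto v2 so metrics stay comparable."""
    return [V1_TO_V2.get(label, "Other") for label in labels]

print(remap_to_v2(["Billing_Issue", "Refund_Request", "Technical_Bug"]))
# -> ['Payments', 'Payments', 'Technical_Bug']
```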
Mini challenge
Pick any public forum thread or app reviews (20 items). Apply your taxonomy. Track disagreements with a peer and propose two new rules that would have prevented them.