Label Taxonomy And Guidelines

Learn Label Taxonomy And Guidelines for free with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Who this is for

Applied Scientists, Data Scientists, and ML Engineers who need reliable labels for training, evaluating, and monitoring ML systems. Also useful for product managers and annotator leads who define labeling rules.

Prerequisites

  • Basic understanding of classification, detection, or sequence labeling tasks.
  • Familiarity with precision/recall and dataset splits.
  • Some experience reviewing annotated data (even small samples).

Learning path

  1. Define the problem and user outcome.
  2. Design a label taxonomy (classes, hierarchy, constraints).
  3. Draft clear labeling guidelines (definitions, edge cases, examples).
  4. Pilot with 50–200 items, measure inter-annotator agreement (IAA), revise.
  5. Operationalize QA: gold checks, spot audits, versioning.
  6. Monitor drift and iterate.

Why this matters

In real projects you will:

  • Decide if a task is multi-class vs. multi-label and what the valid choices are.
  • Write guidelines so multiple annotators label consistently.
  • Handle ambiguous items with an "Uncertain" or "Abstain" class.
  • Track taxonomy versions so model metrics remain comparable over time.

Good taxonomy and guidelines increase label consistency, boost model performance, reduce rework, and cut costs.

Concept explained simply

Label taxonomy = the set of labels your model can predict and the rules that relate them. Labeling guidelines = the instructions for humans (or heuristics) to apply those labels in a consistent way.

Think of the taxonomy as the "menu" and the guidelines as the "recipe" for each item on the menu.
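
One way to make the "menu" concrete is to represent the taxonomy as a small data structure that tooling can check against. A minimal Python sketch (the class and field names are illustrative, not a standard API):

from dataclasses import dataclass

@dataclass(frozen=True)
class LabelTaxonomy:
    name: str
    version: str
    labels: frozenset          # the "menu" of allowed labels
    mutually_exclusive: bool   # True for multi-class, False for multi-label

    def is_valid(self, assigned: set) -> bool:
        # Reject empty assignments, unknown labels, and multi-class violations.
        if not assigned or not assigned <= self.labels:
            return False
        return not (self.mutually_exclusive and len(assigned) > 1)

sentiment = LabelTaxonomy(
    name="review_sentiment",
    version="v1.0",
    labels=frozenset({"Positive", "Neutral", "Negative", "Uncertain"}),
    mutually_exclusive=True,
)
assert sentiment.is_valid({"Positive"})
assert not sentiment.is_valid({"Positive", "Negative"})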

Mental model

  • Contract: The taxonomy and guidelines are a contract between data, model, and evaluators.
  • Entropy reducer: Each rule reduces uncertainty and disagreement.
  • Versioned API: Changing labels is a breaking change; version and communicate it.

Key elements of a solid label taxonomy

  • Task type: multi-class (one label), multi-label (many), regression + bins, sequence, detection.
  • Granularity: enough detail to be useful, but simple enough for consistent labeling.
  • Constraints: mutually exclusive sets, hierarchical rules, dependencies (see the sketch after this list).
  • Coverage: include catch-all (Other/Unknown) and uncertainty/abstain.
  • Measurability: clear definitions that allow reliable agreement.
  • Versioning: ID, date, change log.
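
Hierarchical rules and dependencies can also be enforced mechanically. A sketch, assuming a hypothetical two-level hierarchy in which a child label is only valid alongside its parent:

# Hypothetical parent map; None marks a root label.
HIERARCHY = {
    "Harmful": None,
    "Hate": "Harmful",
    "Harassment": "Harmful",
    "Safe": None,
}

def violates_hierarchy(assigned: set) -> bool:
    # A child label without its parent breaks the taxonomy's constraints.
    for label in assigned:
        parent = HIERARCHY.get(label)
        if parent is not None and parent not in assigned:
            return True
    return False

assert violates_hierarchy({"Hate"})                # parent "Harmful" missing
assert not violates_hierarchy({"Hate", "Harmful"})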

Guidelines structure (recommended)

  1. Purpose and success criteria.
  2. Label list with short definitions.
  3. Decision rules and tie-breakers.
  4. Positive/negative examples per label.
  5. Ambiguity policy: when to use Uncertain/Abstain and how to flag issues.
  6. QA policy: gold questions, double-labeling rate, IAA target, audit process.
  7. Version header: taxonomy version, last updated, owner.

Worked examples

Example 1 β€” Sentiment for product reviews (multi-class)

Taxonomy: Positive, Neutral, Negative, Uncertain.

  • Mutual exclusivity: choose exactly one, unless Uncertain.
  • Decision rule: if both praise and complaint are present, prefer the strongest expressed emotion; if balanced, Neutral (encoded in the sketch below).
  • Edge case: sarcasm β†’ Uncertain unless clear.

Guideline snippet:

  • Positive: clear praise without major complaints. Example: "Love the battery life!"
  • Neutral: facts or mixed views with equal weight. Example: "Battery is ok, screen is ok."
  • Negative: complaint, frustration. Example: "Battery dies in 2 hours."
  • Uncertain: language unclear, off-topic, or sarcasm not resolvable.
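
The decision rules above can be written down as a tiny function, which is useful for documenting tie-breakers unambiguously and for checking gold items. A toy sketch (the integer strength scores are hypothetical annotator judgments, not part of the original guideline):

def resolve_sentiment(praise: int, complaint: int, sarcasm_unresolved: bool = False) -> str:
    # Encodes: sarcasm -> Uncertain unless clear; balanced mix -> Neutral;
    # otherwise the strongest expressed emotion wins.
    if sarcasm_unresolved:
        return "Uncertain"
    if praise == complaint:
        return "Neutral"
    return "Positive" if praise > complaint else "Negative"

assert resolve_sentiment(praise=3, complaint=1) == "Positive"
assert resolve_sentiment(praise=2, complaint=2) == "Neutral"
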
Example 2 β€” Content moderation (multi-label)

Taxonomy: Spam, Harassment, Adult/NSFW, Hate, Safe.

  • Multi-label: multiple harmful categories can co-occur; Safe must not co-occur with any harm category.
  • Decision rule: If any harm is present → do not select Safe (enforced in the sketch below).
  • Ambiguity: sexual health education β†’ do not label Adult/NSFW; mark Safe.

QA: 10% gold items with known answers; Cohen's kappa target β‰₯ 0.7.
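
The co-occurrence constraint is easy to enforce in tooling so that invalid combinations never reach the training set. A minimal sketch:

HARM_LABELS = {"Spam", "Harassment", "Adult/NSFW", "Hate"}

def moderation_violations(assigned: set) -> list:
    # Returns human-readable constraint violations; empty list means valid.
    errors = []
    if not assigned:
        errors.append("at least one label is required")
    if "Safe" in assigned and assigned & HARM_LABELS:
        errors.append("Safe must not co-occur with a harm category")
    return errors

assert moderation_violations({"Spam", "Harassment"}) == []
assert moderation_violations({"Safe", "Hate"}) != []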

Example 3 β€” Object detection for vehicles (bounding boxes)

Taxonomy: Car, Bus, Truck, Motorcycle, Bicycle, UnknownVehicle.

  • Constraints: one class per box; min box size 20×20 px; occlusion allowed if ≥ 30% visible (checked in the sketch below).
  • Edge rule: vans β†’ Truck; scooters β†’ Motorcycle; ambiguous β†’ UnknownVehicle.
  • Box policy: tight around visible extents; no padding; overlapping allowed if different objects.
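
These box constraints can be checked automatically before annotations are accepted. A sketch, assuming boxes arrive as (x_min, y_min, x_max, y_max) pixel tuples and annotators record an estimated visible fraction (both format assumptions, not from the policy above):

CLASSES = {"Car", "Bus", "Truck", "Motorcycle", "Bicycle", "UnknownVehicle"}

def box_violations(box: tuple, visible_fraction: float, label: str) -> list:
    x0, y0, x1, y1 = box
    errors = []
    if label not in CLASSES:
        errors.append(f"unknown class: {label}")
    if (x1 - x0) < 20 or (y1 - y0) < 20:       # min box size 20x20 px
        errors.append("box below 20x20 px minimum")
    if visible_fraction < 0.30:                # occlusion rule
        errors.append("object under 30% visible; should not be boxed")
    return errors

assert box_violations((0, 0, 50, 50), 0.8, "Car") == []
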
Example 4 β€” Intent classification for support tickets

Taxonomy (flat, multi-class): Billing_Issue, Cancel_Subscription, Technical_Bug, Feature_Request, Account_Access, Other, Uncertain.

  • Tie-breaker: If a ticket clearly requests cancellation, choose Cancel_Subscription even if billing is mentioned (see the priority sketch below).
  • Other: on-topic but not in the defined set; Uncertain: not enough information.
  • Sampling: include 10–20 examples per label in the guideline appendix.
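
Override rules like this are naturally expressed as a priority order. A sketch; only the Cancel_Subscription-over-Billing_Issue rule comes from the guideline above, and the rest of the order is illustrative:

# Highest-priority intent wins when several rules match a ticket.
INTENT_PRIORITY = [
    "Cancel_Subscription",   # overrides Billing_Issue per the tie-breaker
    "Account_Access",
    "Technical_Bug",
    "Billing_Issue",
    "Feature_Request",
    "Other",
]

def resolve_intent(matched: set) -> str:
    if not matched:
        return "Uncertain"
    for intent in INTENT_PRIORITY:
        if intent in matched:
            return intent
    return "Uncertain"       # matched something outside the taxonomy

assert resolve_intent({"Billing_Issue", "Cancel_Subscription"}) == "Cancel_Subscription"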

Quality, agreement, and iteration

  • Pilot: label 100–200 items with 2–3 annotators each.
  • Measure IAA: Cohen's kappa (two annotators) or Krippendorff's alpha (multiple). Target ≥ 0.7; revise if lower (a sketch follows this list).
  • Analyze disagreements: add rules or examples where confusion is highest.
  • Gold checks: embed known-answer items; review annotator-specific confusion.
  • Versioning: increment version and note changes; keep old-to-new mapping for metrics continuity.
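
Cohen's kappa is available in scikit-learn; Krippendorff's alpha is provided by third-party packages such as krippendorff. A minimal sketch of the pilot check, with toy data standing in for real annotations:

from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same pilot items (toy values).
annotator_a = ["Positive", "Negative", "Neutral", "Positive", "Uncertain"]
annotator_b = ["Positive", "Negative", "Positive", "Positive", "Uncertain"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
if kappa < 0.7:
    print("Below target: analyze disagreements and revise the guidelines.")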

Practical templates

Taxonomy header template
Taxonomy name: Customer Support Intents
Version: v1.2
Owner: Applied Science Team
Last updated: 2026-01-07
Task type: Multi-class
Labels: [Billing_Issue, Cancel_Subscription, Technical_Bug, Feature_Request, Account_Access, Other, Uncertain]
Constraints: Exactly one label except Uncertain; Cancel_Subscription overrides Billing_Issue
    
Guideline rule template
Label: Technical_Bug
Definition: Customer reports product malfunction or error messages.
Positive examples: "App crashes on login", "Error 503 when saving"
Negative examples: "How do I use feature X?" (not a bug)
Tie-breaker: If both request and bug, pick Technical_Bug if error is blocking.
Abstain: If not enough info to determine if error exists.
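
Keeping a machine-readable mirror of the header makes its constraints enforceable in labeling tools. A sketch using a plain Python dict (field names mirror the template above; the validation shown is illustrative):

TAXONOMY_HEADER = {
    "name": "Customer Support Intents",
    "version": "v1.2",
    "owner": "Applied Science Team",
    "last_updated": "2026-01-07",
    "task_type": "multi-class",
    "labels": ["Billing_Issue", "Cancel_Subscription", "Technical_Bug",
               "Feature_Request", "Account_Access", "Other", "Uncertain"],
}

REQUIRED_FIELDS = {"name", "version", "owner", "last_updated", "task_type", "labels"}

def header_violations(header: dict) -> list:
    # Catch missing metadata before a taxonomy version ships.
    missing = REQUIRED_FIELDS - header.keys()
    return [f"missing field: {name}" for name in sorted(missing)]

assert header_violations(TAXONOMY_HEADER) == []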
    

Exercises

Do these with a small sample (15–30 items) to pressure-test your design.

Exercise 1 β€” Design a flat taxonomy for support intents

Create a multi-class taxonomy for a SaaS support inbox. Include 6–8 labels, a catch-all, and Uncertain. Write three tie-breaker rules and three examples per label.

  • Deliverable: taxonomy list, short definitions, rules, and examples.

Exercise 2 β€” Draft moderation guidelines with QA

Define multi-label categories: Spam, Harassment, Adult/NSFW, Hate, Safe. Add edge-case guidance, 2–3 gold questions per label, and an IAA target with a plan to measure it on a 100-item pilot.

Post-task checklist

  • Each label has a clear definition and 2+ positive and negative examples
  • Constraints (mutual exclusivity, overrides) are stated
  • Catch-all and Uncertain/Abstain policies are defined
  • You have a pilot plan with IAA metric and threshold
  • Version header and change log are included

Common mistakes and how to self-check

  • Too many labels: collapses agreement. Self-check: can two annotators apply it in under 10 seconds per item?
  • No catch-all: forces wrong labels. Self-check: sample ambiguous items; do they fit somewhere sensible?
  • Undefined tie-breakers: creates chaos. Self-check: list top 5 confusions; add explicit rules.
  • Skipping Uncertain/Abstain: encourages guessing. Self-check: measure Uncertain rate; target 3–10% on pilots.
  • No versioning: metric drift. Self-check: does the header include version, date, and mapping notes?
  • Insufficient examples: rules feel abstract. Self-check: 10–20 examples per label, with diverse edge cases.

Practical projects

  • Build a v1 taxonomy and guidelines for your product’s top user-facing ML task.
  • Run a 150-item double-annotated pilot, compute IAA, and iterate to v1.1.
  • Create a gold set of 50 items covering every label and tricky edge case.
  • Ship a labeling QA playbook: sampling strategy, audit frequency, escalation path.

Next steps

  • Integrate your taxonomy into labeling tools and enforce constraints.
  • Establish a quarterly review of drift and taxonomy updates.
  • Document a mapping if you ever merge or split labels, to preserve historical metrics (see the sketch below).
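
A label mapping can be as simple as a dictionary applied to historical annotations. A sketch with a hypothetical v1-to-v2 merge (the merged labels are an example only):

# Hypothetical v1 -> v2 mapping after merging two labels into "Billing".
LABEL_MAP_V1_TO_V2 = {
    "Billing_Issue": "Billing",
    "Cancel_Subscription": "Billing",
}

def remap(labels: list) -> list:
    # Labels absent from the map are unchanged across versions.
    return [LABEL_MAP_V1_TO_V2.get(label, label) for label in labels]

assert remap(["Billing_Issue", "Technical_Bug"]) == ["Billing", "Technical_Bug"]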

Mini challenge

Pick any public forum thread or app reviews (20 items). Apply your taxonomy. Track disagreements with a peer and propose two new rules that would have prevented them.

Practice Exercises

Instructions

Create a multi-class taxonomy for a SaaS support inbox with 6–8 labels, plus Other and Uncertain. Provide:

  • Label list with one-sentence definitions
  • Three tie-breaker rules
  • Three positive and three negative examples per label

Expected Output

A concise taxonomy (8–10 total labels including Other and Uncertain), clear definitions, rules that resolve common conflicts (e.g., cancel vs. billing), and examples that reflect real tickets.

Label Taxonomy And Guidelines β€” Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

