Why this matters
As an NLP Engineer, you ship models that are only as good as the labels they learn from. A clear task definition and label schema reduce ambiguity, speed up annotation, and lead to higher model accuracy and easier evaluation.
- Product: Prioritize features using sentiment and intent labels from user feedback.
- Support: Route tickets with multi-label intents (billing, technical, cancellation).
- Risk: Flag toxic or PII content using consistent categories and span rules.
- Search: Improve relevance with entity labels and query intent types.
Concept explained simply
The task definition says what the model should do; the label schema defines how text is categorized or marked so the model can learn from it.
Mental model
Think of your task like a map. The destination is your business outcome. The label schema is the legend: it defines symbols (labels), boundaries (rules), and rare cases (edge decisions). If the legend is fuzzy, everyone reads the map differently.
Common NLP task types and label schemas
- Single-label classification: One best label per item (e.g., sentiment: positive/neutral/negative).
- Multi-label classification: Multiple labels can apply (e.g., toxicity categories: insult, threat, sexual, identity hate).
- Ordinal classification: Labels have order (e.g., star ratings 1–5). Metrics should respect order.
- Sequence labeling: Mark tokens/spans with BIO/BILOU tags (e.g., PERSON, ORG). Decide tokenization and valid span rules.
- Span extraction/Q&A: Extract answer spans; define what counts as an answer vs no-answer.
- Pairwise/relationship tasks: Decide whether two texts are duplicates, whether one entails the other, or how they are related.
- Generation with constrained tags: Free text output plus required tags; define exact required tags and validation rules.
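To make these task types concrete, here is a sketch of how each one might be represented as data records; the field names and examples are illustrative, not a required format.

```python
# Illustrative label structures for common task types (field names are examples, not a standard).

# Single-label classification: exactly one label per item.
single_label = {"text": "Love the new dark mode!", "label": "positive"}

# Multi-label classification: a set of zero or more labels.
multi_label = {"text": "I can't log in and please cancel my plan",
               "labels": ["account_access", "cancellation"]}

# Ordinal classification: labels carry an order that metrics should respect.
ordinal = {"text": "Decent app, some bugs.", "rating": 3}  # 1-5 stars

# Sequence labeling: one BIO/BILOU tag per token.
sequence = {"tokens": ["Acme", "Inc.", "hires", "in", "Berlin"],
            "tags":   ["B-ORG", "L-ORG", "O", "O", "U-LOC"]}

# Span extraction / Q&A: character offsets (end-exclusive), or no answer at all.
span = {"context": "The refund was issued on May 2.",
        "question": "When was the refund issued?",
        "answer": {"start": 25, "end": 30, "text": "May 2"}}

# Pairwise / relationship: a label over a pair of texts.
pairwise = {"text_a": "Reset my password", "text_b": "I can't log in",
            "label": "duplicate"}
```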
Quality criteria for label schemas
- Mutually exclusive where intended: If single-label, only one should reasonably fit.
- Collectively exhaustive: Include an Other/Unknown when coverage is incomplete.
- Operational: Clear definitions, decision rules, examples, and counter-examples.
- Learnable: Distinctions observable in text; not requiring hidden knowledge.
- Balanced enough: Avoid extreme class imbalance when possible; merge or reweight if needed.
- Measurable: Each label maps to metrics you can compute (precision/recall per class, span-level F1, etc.).
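To make the "measurable" criterion concrete, here is a minimal sketch of per-class metrics with scikit-learn; the labels and predictions below are invented for illustration.

```python
# A minimal sketch of per-class precision/recall/F1, assuming scikit-learn is installed.
from sklearn.metrics import classification_report

labels = ["positive", "neutral", "negative"]
y_true = ["positive", "negative", "neutral", "negative", "positive", "neutral"]
y_pred = ["positive", "neutral",  "neutral", "negative", "positive", "negative"]

# zero_division=0 avoids warnings when a class receives no predictions.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```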
Quick checklist
- Does every real example fit one or more labels?
- Would two annotators agree using the guide?
- Can you score the model for each label?
- Are edge cases documented with final decisions?
Step-by-step: define a label schema
- Clarify outcome: What decision will this model support? How will success be measured?
- Choose task type: Classification, sequence labeling, span extraction, pairwise, or generation.
- Draft labels: Start broad; ensure coverage. Prefer fewer, clearer labels over many fuzzy ones.
- Write label definitions: One-sentence purpose, inclusion/exclusion rules, 3–5 examples each.
- Decide structural rules: Single vs multi-label, BIO/BILOU format, allowed overlaps, hierarchy.
- Edge case policy: Ambiguous sentiment, sarcasm, nested entities, emojis, URLs, code, and multilingual text.
- Pilot and revise: Annotate 50–100 samples; measure agreement; refine unclear parts.
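For the pilot step, one quick agreement check on a classification task is Cohen's kappa between two annotators; a minimal sketch with scikit-learn, using made-up annotator labels.

```python
# A minimal sketch of measuring inter-annotator agreement on a pilot batch,
# assuming two annotators labeled the same items (labels below are invented).
from sklearn.metrics import cohen_kappa_score

annotator_a = ["positive", "negative", "neutral", "negative", "positive"]
annotator_b = ["positive", "negative", "neutral", "positive", "positive"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # rough rule of thumb: revise the guide if well below ~0.6-0.7
```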
Annotation guide template
For each label:
- Name and short definition
- Inclusion rules
- Exclusion rules
- Positive examples (3–5)
- Near-miss examples (2–3) and what to choose instead
- Notes on punctuation, emojis, hashtags, code, or other artifacts
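One way (among many) to keep guide entries consistent is to store each one as structured data; a sketch using the Negative label from the sentiment example below, with illustrative field names.

```python
# Sketch of one annotation-guide entry as a plain dict (field names are illustrative).
negative_label = {
    "name": "Negative",
    "definition": "The review expresses dissatisfaction with the app.",
    "include": ["crashes, bugs, or data loss", "complaints about pricing or support"],
    "exclude": ["neutral feature requests", "questions without a complaint"],
    "positive_examples": [
        "Crashes every time I open it.",
        "Support never answered my ticket.",
        "Lost all my notes after the update.",
    ],
    "near_misses": [
        {"text": "Could you add a dark mode?", "use_instead": "Neutral"},
        {"text": "App is okay, nothing special.", "use_instead": "Neutral"},
    ],
    "notes": "Emojis alone (e.g., an angry face) count as Negative only with supporting text.",
}
```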
Worked examples
Example 1: App review sentiment (single-label classification)
Goal: Track user satisfaction.
- Labels: Positive, Neutral, Negative, Other-language/Unreadable
- Rules: Label sarcasm by its literal sentiment unless the negative intent is clear from context. For mixed sentiment, choose the most prominent sentiment.
- Examples:
- "Love the new dark mode!" → Positive
- "App is okay, nothing special." → Neutral
- "Crashes every time I open it." → Negative
- "Excelente aplicación" (non-target language) → Other-language/Unreadable
Example 2: NER for job postings (sequence labeling, BILOU)
Goal: Extract ROLE, ORG, LOC, SKILL.
- Labels: ROLE, ORG, LOC, SKILL using BILOU scheme.
- Rules: Hyphenated skills stay in one span. Company suffixes (Inc., LLC) included in ORG. Remote is not LOC unless a city/region is named.
- Text: "Senior Data Scientist at Acme Inc. in Berlin, Python required."
- Tokens and tags: Senior/B-ROLE Data/I-ROLE Scientist/L-ROLE at/O Acme/B-ORG Inc./L-ORG in/O Berlin/U-LOC ,/O Python/U-SKILL required/O ./O
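The same annotation as (token, tag) pairs, plus a small helper that collapses BILOU tags back into spans; a minimal sketch without error handling for malformed tag sequences.

```python
# The worked example as (token, tag) pairs under the BILOU scheme.
tagged = [
    ("Senior", "B-ROLE"), ("Data", "I-ROLE"), ("Scientist", "L-ROLE"),
    ("at", "O"), ("Acme", "B-ORG"), ("Inc.", "L-ORG"),
    ("in", "O"), ("Berlin", "U-LOC"), (",", "O"),
    ("Python", "U-SKILL"), ("required", "O"), (".", "O"),
]

def bilou_to_spans(pairs):
    """Collapse BILOU tags into (entity_type, text) spans; assumes well-formed tag sequences."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in pairs:
        if tag.startswith("U-"):
            spans.append((tag[2:], token))
        elif tag.startswith("B-"):
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        elif tag.startswith("L-"):
            current_tokens.append(token)
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    return spans

print(bilou_to_spans(tagged))
# [('ROLE', 'Senior Data Scientist'), ('ORG', 'Acme Inc.'), ('LOC', 'Berlin'), ('SKILL', 'Python')]
```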
Example 3: Support chatbot intents (multi-label classification)
Goal: Route tickets automatically.
- Labels: Billing, Technical Issue, Cancellation, Account Access, Abuse/Spam
- Rules: Multiple labels allowed. If several intents appear, select all; if the message is abusive, always add Abuse/Spam.
- Text: "I can't log in and please cancel my plan" → Account Access + Cancellation
Exercises
Complete these, then check your work.
Exercise 1 — Sentiment schema and edge cases
Create a single-label sentiment schema for mobile app reviews. Define labels and rules, then assign labels to these texts:
- 1) "Works fine, but drains battery fast."
- 2) "Finally fixed! Awesome update."
- 3) "meh"
- 4) "App keeps freezing after login"
- 5) "Excelente app!" (non-target language)
Hints
- Include an Other/Unreadable label.
- Document how to treat mixed sentiment.
Exercise 2 — NER span rules
Design a BILOU schema for job postings with ROLE, ORG, LOC, and SKILL. Annotate the sentence: "Hiring Lead ML Engineer at BrightFuture LLC, remote from Toronto, strong PyTorch skills."
Hints
- Include company suffixes (LLC) inside ORG.
- Treat city names as LOC. "Remote" alone is not a LOC.
Common mistakes and how to self-check
- Too many labels: Merge rarely used labels or move them under Other until volume justifies a split.
- Vague definitions: Add inclusion/exclusion rules and counter-examples.
- No edge-case policy: List at least 5 recurring tricky cases and final decisions.
- Ignoring class imbalance: Cap sampling or reweight; track per-class metrics.
- Inconsistent span boundaries: Specify punctuation, hyphens, and suffix rules; use BIO/BILOU consistently.
- Skipping pilot: Always run a small pilot and measure agreement before full labeling.
Self-check mini-audit
- ☐ Can two people independently reach the same label?
- ☐ Do you have at least 3 positive and 2 near-miss examples per label?
- ☐ Are span boundaries unambiguous across examples?
- ☐ Does every label map to a metric you will track?
Practical projects
- Product feedback classifier: Build a 4-label sentiment and topic schema; annotate 300 reviews; train a baseline and report per-class F1.
- Job posting NER: Define ROLE, ORG, LOC, SKILL with BILOU; annotate 200 sentences; evaluate span-level F1.
- Toxicity moderation: Multi-label schema (insult, threat, profanity, sexual); annotate 250 comments; measure micro/macro F1 and analyze confusion.
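For the toxicity project, micro and macro F1 weight classes differently under imbalance; a minimal multi-label scoring sketch with scikit-learn, using invented indicator matrices.

```python
# A minimal sketch of micro vs. macro F1 on multi-label predictions,
# assuming scikit-learn is installed; the indicator matrices below are invented.
import numpy as np
from sklearn.metrics import f1_score

# Rows = comments, columns = (insult, threat, profanity, sexual); 1 = label applies.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 1, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 0, 1, 1]])

print("micro F1:", f1_score(y_true, y_pred, average="micro"))  # pools all label decisions
print("macro F1:", f1_score(y_true, y_pred, average="macro"))  # averages per-label F1 equally
```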
Who this is for and prerequisites
- Who: NLP Engineers, Data Scientists, Analysts defining labeling projects.
- Prerequisites: Basic NLP concepts (tokens, spans), evaluation metrics (precision/recall/F1), and comfort reading guidelines.
Learning path
- Define outcome and task type.
- Draft labels and rules using the template.
- Pilot with 50–100 samples and measure agreement.
- Refine and finalize schema.
- Scale labeling and set up quality checks.
Mini challenge
Given: "This update fixed the bug, but notifications are still delayed." Make a single-label sentiment decision using your rules and write a one-sentence justification.
Tip
Choose the most prominent sentiment expressed and note mixed-sentiment handling.
Next steps
- Run a 50-sample pilot with your schema.
- Measure inter-annotator agreement (e.g., Cohen's kappa for classification, span-level agreement for NER; see the span-match sketch after this list).
- Iterate definitions where disagreement is high.
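For NER pilots, one simple (if strict) agreement check is exact span matching between two annotators; a sketch assuming each annotation is a set of (start, end, label) tuples, with invented offsets.

```python
# A rough sketch of span-level agreement via exact-match F1 between two annotators,
# assuming each annotation is a set of (start, end, label) tuples; offsets are invented.
def span_f1(spans_a, spans_b):
    """Exact-match F1 between two sets of spans (strict: boundaries and label must match)."""
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0
    matched = len(a & b)
    precision = matched / len(a) if a else 0.0
    recall = matched / len(b) if b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

annotator_a = {(0, 21, "ROLE"), (25, 34, "ORG"), (38, 44, "LOC")}
annotator_b = {(0, 21, "ROLE"), (25, 34, "ORG")}
print(f"Span agreement (F1): {span_f1(annotator_a, annotator_b):.2f}")  # 0.80
```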