Why this matters
Most real NLP datasets are imbalanced: a few rare but important classes (toxic, fraud, self-harm, critical incidents) appear far less often than the majority class. As an NLP Engineer, you will:
- Collect and label data so rare classes are actually present in training.
- Choose sampling and annotation strategies that surface rare examples early.
- Evaluate with metrics that reflect real risk and cost.
- Deploy thresholds that meet recall/precision targets for the minority class.
Concept explained simply
Imbalance means one class overwhelms the others. A naive model can get high accuracy by always predicting the majority class but be useless for the rare class.
Mental model
Think of a crowded room (majority) and a quiet corner (minority). If you only look where it’s loud, you miss the quiet voices. Handling imbalance is about two things:
- Exposure: ensure the model sees enough minority examples.
- Attention: make the learning process care about them (weights/thresholds/metrics).
Core toolkit
Data-level strategies (before and during labeling)
- Stratified seeding: start with rules/heuristics to pull likely minority examples (keywords, regex, simple classifier) into the first labeling batches; see the sketch after this list.
- Source quotas: if data comes from multiple channels, set per-source quotas to avoid overfilling with easy majority sources.
- Deduplication: remove near-duplicates so oversampling doesn’t create leakage between train/validation.
- Active learning: periodically train a simple model; sample uncertain or diverse items to label next; cap per-class to avoid drift.
- Minority-aware batching: ensure each labeler batch contains some likely minority candidates to maintain rater attention and consistency.
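Here is a minimal sketch of the stratified-seeding idea, assuming a fraud-style minority class; the keyword patterns, the example texts, and the likely_minority helper are illustrative assumptions, not a fixed recipe.

```python
# Minimal sketch of stratified seeding: route texts that match weak keyword/regex
# heuristics into the first labeling batches. Patterns and texts are illustrative.
import re

SEED_PATTERNS = [
    re.compile(r"\bchargeback\b", re.IGNORECASE),
    re.compile(r"\bunauthori[sz]ed\b", re.IGNORECASE),
    re.compile(r"\brefund\b.{0,30}\bnever\b", re.IGNORECASE),
]

def likely_minority(text: str) -> bool:
    """True if any weak heuristic fires; used only to prioritize labeling."""
    return any(p.search(text) for p in SEED_PATTERNS)

unlabeled = [
    "There is an unauthorized charge on my card.",
    "How do I change my shipping address?",
    "I asked for a refund weeks ago and never received it.",
]
seed_batch = [t for t in unlabeled if likely_minority(t)]
print(seed_batch)  # candidates routed to the first labeling wave
```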
Data-level strategies (after labeling)
- Oversampling: repeat minority examples in the training loader (sketched after this list). Do not oversample validation/test sets.
- Undersampling: downsample the majority class only if you can afford to lose data; pair it with multiple random undersamples so results do not depend on a single draw.
- Augmentation for text: back-translation, synonym/phrase replacement, template-based generation. Keep label semantics intact; review a sample for quality.
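A minimal sketch of training-only oversampling by repeating minority indices; the label counts and the 2x factor are illustrative assumptions.

```python
# Minimal sketch: roughly double the minority class in the training split only.
# Validation/test indices are never touched. Counts and factor are illustrative.
import numpy as np

rng = np.random.default_rng(seed=0)
y_train = np.array([0] * 960 + [1] * 40)                # 4% minority, as an example
minority_idx = np.flatnonzero(y_train == 1)
extra = rng.choice(minority_idx, size=len(minority_idx), replace=True)  # ~2x total
train_idx = np.concatenate([np.arange(len(y_train)), extra])
rng.shuffle(train_idx)

print(np.bincount(y_train[train_idx]))                   # e.g., [960  80]
```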
Model/learning strategies
- Class-weighted loss: weight classes inversely to their frequency, or via the effective number of samples. Works well with cross-entropy; see the sketch after this list.
- Focal loss: reduces loss on easy negatives so the model focuses on hard, rare positives.
- Threshold moving: choose a decision threshold to meet precision/recall targets on a validation set.
- Calibration: use Platt scaling or temperature scaling on a validation set so probabilities are meaningful.
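A minimal sketch of the two loss-level ideas, assuming PyTorch: inverse-frequency class weights for cross-entropy and a simple binary focal loss. The alpha and gamma values and the toy batch are illustrative assumptions.

```python
# Minimal sketch (assumes PyTorch): inverse-frequency class weights for
# cross-entropy, plus a binary focal loss that down-weights easy examples.
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, targets, class_counts):
    # w_c = N / (K * n_c): rarer classes get larger weights.
    counts = torch.as_tensor(class_counts, dtype=torch.float32)
    weights = counts.sum() / (len(counts) * counts)
    return F.cross_entropy(logits, targets, weight=weights)

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Standard focal-loss form; larger gamma shrinks the loss on easy examples.
    bce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    p_t = torch.exp(-bce)                      # model's probability of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.randn(8, 2)                     # 8 examples, 2 classes
targets = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
print(weighted_cross_entropy(logits, targets, class_counts=[18_400, 1_600]))
print(binary_focal_loss(torch.randn(8), targets))
```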
Evaluation strategies
- Use stratified splits; when examples share a group (the same user, thread, or document), use group-aware splits combined with stratification.
- Prefer PR-AUC, macro F1, per-class precision/recall over accuracy. ROC-AUC can look high even when minority performance is poor.
- Report class-wise confusion matrices and cost-sensitive metrics if mistakes have different costs.
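A minimal sketch of imbalance-aware reporting with scikit-learn; the toy arrays are stand-ins for your validation labels and predicted probabilities.

```python
# Minimal sketch of imbalance-aware evaluation with scikit-learn.
# y_true / y_prob are stand-ins for validation labels and scores.
import numpy as np
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix, f1_score)

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.05, 0.4, 0.35, 0.8, 0.45, 0.6, 0.15])
y_pred = (y_prob >= 0.5).astype(int)

print("PR-AUC (average precision):", average_precision_score(y_true, y_prob))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))
```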
Worked examples
Example 1 — Toxic comment detection (rare positive)
Data: 50,000 comments; 4% toxic. Goal: catch toxic content with recall ≥ 0.85 while keeping precision ≥ 0.6.
- Split: stratified 80/10/10 with deduplication.
- Train: class-weighted cross-entropy, slight minority oversampling (2x).
- Tune the threshold on validation to reach recall ≥ 0.85 while keeping precision ≥ 0.6.
- Report: PR-AUC, macro F1, per-class recall/precision, confusion matrix.
Why this works
Weights and oversampling improve attention to toxic examples; threshold tuning enforces the operational target.
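Below is a minimal sketch of the threshold-moving step for this example, assuming scikit-learn: on validation scores, keep the highest threshold whose recall still meets 0.85, then check that precision clears 0.6. The pick_threshold helper and the toy arrays are illustrative assumptions.

```python
# Minimal sketch: pick the highest validation threshold that still reaches the
# recall target, which tends to give the best precision at that recall.
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_val, p_val, min_recall=0.85):
    precision, recall, thresholds = precision_recall_curve(y_val, p_val)
    ok = recall[:-1] >= min_recall           # the last PR point has no threshold
    if not ok.any():
        return None                          # target unreachable; revisit the model
    i = np.flatnonzero(ok)[-1]               # highest qualifying threshold
    return {"threshold": thresholds[i], "precision": precision[i], "recall": recall[i]}

# Usage with toy validation scores; confirm precision >= 0.6 before shipping.
y_val = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
p_val = np.array([0.1, 0.4, 0.8, 0.3, 0.7, 0.9, 0.2, 0.55, 0.65, 0.05])
print(pick_threshold(y_val, p_val))
```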
Example 2 — Intent classification in support (many classes, some tiny)
Data: 30 intents; 5 rare intents have < 1% each.
- Data plan: per-intent quotas during labeling; use weak rules to find candidates for rare intents.
- Model: label smoothing + class weights; macro F1 for early stopping.
- Augment: template-based paraphrases for rare intents.
Outcome
Macro F1 improves because the rare intents, not just the popular ones, now get decent recall.
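A minimal sketch of the modeling choice above, assuming PyTorch 1.10+ (label_smoothing in CrossEntropyLoss): label smoothing combined with inverse-frequency class weights for a 30-intent classifier. The intent counts and smoothing value are illustrative assumptions.

```python
# Minimal sketch (assumes PyTorch >= 1.10): label smoothing plus class weights
# for a 30-intent classifier. Counts and smoothing value are illustrative.
import torch
import torch.nn as nn

intent_counts = torch.tensor([4_000.0] * 25 + [120.0] * 5)   # 5 rare intents
weights = intent_counts.sum() / (len(intent_counts) * intent_counts)
criterion = nn.CrossEntropyLoss(weight=weights, label_smoothing=0.1)

logits = torch.randn(16, 30)                  # batch of 16, 30 intents
targets = torch.randint(0, 30, (16,))
print(criterion(logits, targets))
```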
Example 3 — NER with rare entity type
Data: General news NER with a new, rare entity type (e.g., DRUG).
- Sampling: mine sentences using domain cues (drug lists, suffixes), then human-verify.
- Training: upweight spans of DRUG; optionally focal loss.
- Evaluation: per-entity F1; report DRUG F1 separately; use document-level stratified splits.
Note
Entity-level imbalance requires both example-level and token/span-level handling.
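A minimal sketch of the span-level idea, assuming PyTorch and BIO tags: upweight the rare DRUG tags in the token-classification loss. The tag set and the 3x factor are illustrative assumptions.

```python
# Minimal sketch (assumes PyTorch, BIO tagging): upweight the rare DRUG tags in
# a token-classification loss. Tag set and 3x factor are illustrative.
import torch
import torch.nn as nn

tags = ["O", "B-PER", "I-PER", "B-DRUG", "I-DRUG"]
weights = torch.ones(len(tags))
weights[tags.index("B-DRUG")] = 3.0
weights[tags.index("I-DRUG")] = 3.0

# ignore_index=-100 skips padding / special-token positions, as in common setups.
criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)

logits = torch.randn(2, 16, len(tags))        # (batch, seq_len, num_tags)
labels = torch.randint(0, len(tags), (2, 16))
print(criterion(logits.view(-1, len(tags)), labels.view(-1)))
```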
Step-by-step playbook
- Quantify imbalance: compute class counts and minority ratios (sketch after this list).
- Choose data strategy: combine stratified seeding + quotas + dedupe.
- Start labeling in waves: include likely minority items each wave; monitor per-class counts.
- Train baseline: class-weighted loss; avoid oversampling in validation/test.
- Tune threshold: sweep thresholds; pick one matching your precision/recall target.
- Report right metrics: PR-AUC, macro F1, per-class metrics, confusion matrices.
- Iterate with active learning: pull uncertain items, maintain diversity and per-source caps.
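A minimal sketch of the first playbook step; the label list is a stand-in for your own training labels.

```python
# Minimal sketch of step 1: class counts, per-class ratios, and imbalance ratio.
from collections import Counter

labels = ["ok"] * 9_600 + ["toxic"] * 400     # stand-in for your training labels
counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    print(f"{cls}: {n} ({n / total:.2%})")
print("Imbalance ratio (majority/minority):",
      max(counts.values()) / min(counts.values()))
```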
Common mistakes and self-check
- Mistake: Oversampling in validation. Self-check: Verify val/test are untouched and stratified.
- Mistake: Reporting only accuracy/ROC-AUC. Self-check: Always include PR-AUC and per-class metrics.
- Mistake: Ignoring duplicates/leakage. Self-check: Deduplicate and group-aware split before training.
- Mistake: Over-augmenting with label drift. Self-check: Manually review a random sample of augmented data.
- Mistake: One global threshold for all classes in multi-label when costs differ. Self-check: Consider per-class thresholds.
Exercises
These mirror the interactive exercises below. Try here first, then compare with the solutions.
Exercise 1 — Compute class weights
You have a binary dataset: 18,400 non-toxic (class 0) and 1,600 toxic (class 1). Compute inverse-frequency class weights suitable for cross-entropy and pick two evaluation metrics.
- Assume weights w_c = N / (K * n_c), where N = total samples, K = number of classes, n_c = class count.
- Round weights to 2 decimals.
Nudge
N=20,000, K=2. Compute w_0 and w_1 and suggest PR-AUC + macro F1.
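If you want to check your arithmetic after working it out by hand, here is a minimal sketch that applies the formula above; the rounding to 2 decimals matches the exercise.

```python
# Minimal check for Exercise 1 using w_c = N / (K * n_c).
counts = {0: 18_400, 1: 1_600}
N, K = sum(counts.values()), len(counts)
weights = {c: round(N / (K * n), 2) for c, n in counts.items()}
print(weights)   # compare with your hand computation
```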
Exercise 2 — Labeling plan under budget
You have 20,000 unlabeled support tickets from three sources: Email (60%), Chat (30%), Social (10%). The rare class "fraud" is ~2% overall but under-represented in Email. Budget: 2,000 labels. Design a first labeling wave (1,000 items) that increases the odds of finding fraud, reduces duplicates, and keeps data diverse. Provide numeric quotas per source and describe selection criteria.
Hint
Use source quotas, weak-rule seeding for likely fraud, deduplication, and uncertainty after a small seed model.
Practical projects
- Toxicity classifier: collect, stratify, label in waves with active learning; target recall ≥ 0.85 at precision ≥ 0.6.
- Rare intent detection: build paraphrase augmentation pipeline; report macro F1 and per-intent F1.
- NER for rare entities: mine candidates, annotate spans, upweight rare entity in loss; report per-entity F1.
Learning path
- Start: Dataset exploration and deduplication.
- Then: Stratified sampling and labeling guidelines for minority classes.
- Next: Active learning loops and augmentation for text.
- Finally: Threshold tuning, calibration, and cost-sensitive evaluation.
Who this is for
- NLP Engineers and Data Scientists preparing datasets for classification, multi-label tagging, or NER where rare classes matter.
Prerequisites
- Basic probability and classification metrics (precision, recall, F1).
- Familiarity with train/validation/test splits and avoiding data leakage.
Next steps
- Add calibration for stable thresholds over time.
- Monitor class distribution shift and re-run active learning when drift occurs.
- Document class-specific costs to guide metric and threshold choices.
Mini challenge
Given a 1% positive rate abuse-detection dataset, design a three-wave labeling plan (each wave 1,000 items) that doubles positive coverage by wave 2 without harming diversity. List sources, selection rules, and how you will measure success after each wave.
Quick test