Who this is for
- NLP Engineers and ML practitioners who need accurate Named Entity Recognition (NER) in production.
- Data scientists moving from classic CRF/feature-based NER to transformer-based NER.
- Developers integrating PII redaction, resume parsing, or domain entity extraction (e.g., medical, legal).
Prerequisites
- Comfort with Python and PyTorch.
- Basic understanding of transformers (BERT/RoBERTa) and tokenization.
- Familiarity with classification loss (cross-entropy) and train/val/test splits.
Why this matters
Production use cases such as PII redaction, resume parsing, and domain entity extraction stand or fall on entity-level accuracy; the sections below collect the settings, pitfalls, and checks that matter most when fine-tuning transformer-based NER for those systems.
Hyperparameters that usually work
- LR: 2e-5 to 5e-5; effective batch size 16–32 (see the TrainingArguments sketch after this list).
- Epochs: 3–5 (watch for overfitting on small datasets).
- Warmup: ~10%; weight decay: 0.01; gradient clipping: 1.0.
- Max length: 128–256; truncate carefully (or use sliding windows for long docs).
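A minimal sketch of these defaults expressed as Hugging Face TrainingArguments; the output directory and exact values are placeholders to adapt to your setup:

```python
# Hedged sketch: typical NER fine-tuning defaults as Hugging Face TrainingArguments.
# Paths and exact values are placeholders; tune them on your validation set.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ner-checkpoints",        # placeholder path
    learning_rate=3e-5,                  # within the 2e-5 to 5e-5 range
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,       # effective batch size of 32
    num_train_epochs=4,                  # 3-5; watch for overfitting on small datasets
    warmup_ratio=0.1,                    # ~10% warmup
    weight_decay=0.01,
    max_grad_norm=1.0,                   # gradient clipping at 1.0
)
```

Note that max sequence length (128–256) is set at tokenization time, not in TrainingArguments.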
Imbalanced labels
- Rare entities: try upsampling, focal loss, or class-weighted loss.
- Don’t force balance if it hurts precision on frequent classes; validate the trade-off.
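If you go the class-weighted route, here is a minimal sketch; the weight values and label order are hypothetical and must match your own id2label mapping:

```python
import torch
import torch.nn as nn

# Hypothetical weights: upweight rare entity labels, keep "O" at 1.0.
# The order must match your id2label mapping exactly.
class_weights = torch.tensor([1.0, 5.0, 5.0, 3.0, 3.0])  # e.g. O, B-DRUG, I-DRUG, B-DOSE, I-DOSE

loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

def token_classification_loss(logits, labels):
    # logits: (batch, seq_len, num_labels); labels: (batch, seq_len) with -100
    # on special tokens and non-first subwords, so those positions are excluded.
    return loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```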
Common mistakes and self-check
- Forgetting to ignore subword labels in loss: check that non-first subwords are -100.
- Mismatched label maps: ensure label2id/id2label are saved with the model (see the sketch after this list).
- Using token-level accuracy instead of entity-level F1: verify metric implementation.
- Truncating entities at sequence boundary: consider sliding windows with overlap.
- Case-insensitive base model for NER: prefer cased models unless you validated otherwise.
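For the label-map point above, a minimal sketch of baking label2id/id2label into the checkpoint; the base model and label set are illustrative:

```python
# Hedged sketch: store the label maps in the model config so they are saved
# and reloaded with the checkpoint. The label set here is illustrative.
from transformers import AutoModelForTokenClassification

label2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4}
id2label = {i: label for label, i in label2id.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",            # a cased base model, per the note above
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id,
)
# model.save_pretrained("ner-checkpoints")  # writes config.json including the label maps
```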
Self-check checklist
- Your training loss excludes -100 positions.
- Validation uses entity-level F1 and matches manual spot checks (see the seqeval sketch after this checklist).
- Confusions (e.g., LOC vs ORG) are visible in error analysis.
- You can reconstruct entities from predictions deterministically.
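One way to check the entity-level F1 point, assuming the seqeval package; inputs are lists of BIO tag sequences, one per sentence:

```python
# Hedged sketch: entity-level metrics with seqeval. A prediction only counts
# if both the span boundaries and the entity type match exactly.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"]]

print(f1_score(y_true, y_pred))              # ~0.67: PER matched, LOC missed
print(classification_report(y_true, y_pred)) # per-entity precision/recall/F1
```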
Practical projects
- Resume NER: fine-tune on a small labeled set for skills, roles, companies, dates; evaluate per-entity F1.
- Healthcare NER: extract medications and dosages; evaluate on a held-out clinical corpus (de-identified).
- PII redaction microservice: serve a NER model with thresholds and audit logs; measure latency and F1.
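For the PII-redaction idea, a minimal serving-side sketch with a confidence threshold; the model name and threshold value are placeholders:

```python
# Hedged sketch: aggregated token-classification pipeline plus a score threshold.
# The checkpoint name and threshold are placeholders.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="your-finetuned-ner-model",   # placeholder checkpoint
    aggregation_strategy="simple",      # merge subword pieces into entity spans
)

def redact(text: str, threshold: float = 0.85) -> str:
    entities = ner(text)
    # Replace from the end so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["score"] >= threshold:
            text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
    return text
```

Logging each redaction decision (entity type, score, character offsets) gives you the audit trail mentioned above.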
Exercises
Do these to lock in the core skills. A solution follows each task.
Exercise 1 — Label alignment with subwords
Given words and BIO labels:
Words: [John, lives, in, New, York, City, .]
Labels: [B-PER, O, O, B-LOC, I-LOC, I-LOC, O]
And the tokenizer output with special tokens included (assume no subword splits for simplicity):
Tokens: [CLS, John, lives, in, New, York, City, ., SEP]
word_ids: [None, 0, 1, 2, 3, 4, 5, 6, None]
Align labels per token using -100 for special tokens and non-first subwords.
Solution
Aligned labels:
[-100, B-PER, O, O, B-LOC, I-LOC, I-LOC, O, -100]
Since no subword splits occur, each token except [CLS] and [SEP] receives its word's label directly.
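A more general alignment sketch that also handles subword splits, using a fast tokenizer's word_ids(); the checkpoint name is an assumption, any fast tokenizer works:

```python
# Hedged sketch: align word-level BIO labels to tokens, assigning -100 to
# special tokens and non-first subwords so the loss ignores them.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint

words = ["John", "lives", "in", "New", "York", "City", "."]
labels = ["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
label2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4}

encoding = tokenizer(words, is_split_into_words=True)
aligned, previous_word = [], None
for word_id in encoding.word_ids():
    if word_id is None:                  # [CLS], [SEP], padding
        aligned.append(-100)
    elif word_id != previous_word:       # first subword keeps the word's label
        aligned.append(label2id[labels[word_id]])
    else:                                # non-first subwords are ignored in the loss
        aligned.append(-100)
    previous_word = word_id

print(aligned)  # matches the exercise's alignment when no subword splits occur
```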
Exercise 2 — From BIO tags to spans
Convert BIO tags to entity spans (start, end, type) using end-exclusive indexing.
Tokens: [Barack, Obama, visited, Paris, .]
Tags: [B-PER, I-PER, O, B-LOC, O]
Expected spans?
Solution
[(0, 2, 'PER'), (3, 4, 'LOC')]
Explanation: Barack + Obama form a PER span from 0 to 2; Paris forms a LOC span from 3 to 4.
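A small decoding sketch for this conversion; it is a plain function, and you may want stricter handling of malformed I-tags than the lenient choice shown here:

```python
# Hedged sketch: convert a BIO tag sequence into (start, end, type) spans,
# end-exclusive. An I-tag with a mismatched type is treated as a new entity.
def bio_to_spans(tags):
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and ent_type != tag[2:]):
            if ent_type is not None:
                spans.append((start, i, ent_type))
            start, ent_type = i, tag[2:]
        elif tag == "O" and ent_type is not None:
            spans.append((start, i, ent_type))
            start, ent_type = None, None
    if ent_type is not None:                      # entity running to the end
        spans.append((start, len(tags), ent_type))
    return spans

print(bio_to_spans(["B-PER", "I-PER", "O", "B-LOC", "O"]))  # [(0, 2, 'PER'), (3, 4, 'LOC')]
```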
Before you move on, check:
- You can produce word_ids and align labels with -100 correctly.
- You can reconstruct entity spans from BIO predictions.
- You understand how to compute entity-level F1.
Mini challenge
Train a small NER model on a subset of your data with two entity types (e.g., PER and ORG). Target at least 85% entity-level F1 on validation. Add a threshold or rule to reduce a common false positive you observe, and document the trade-off.
Learning path
- Refresh: tokenization and attention masks.
- Learn: label schemes and alignment; seqeval-style metrics.
- Practice: baseline fine-tune → PEFT (LoRA) → domain augmentation.
- Deploy: inference spans, offsets, logging, and monitoring.
Next steps
- Try BIOES for sharper boundaries; compare against BIO.
- Add confidence calibration and thresholds per entity type.
- Experiment with sliding windows for long documents and merge spans post-hoc.
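For the sliding-window idea, a minimal tokenization sketch using overlapping windows and character offsets; the checkpoint name, max_length, and stride are assumptions:

```python
# Hedged sketch: split a long document into overlapping windows; the returned
# character offsets let you map predictions back to the source text and
# merge or deduplicate spans that appear in more than one window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed fast tokenizer
long_text = "Your long document text goes here."              # placeholder

encoded = tokenizer(
    long_text,
    truncation=True,
    max_length=256,
    stride=64,                        # tokens of overlap between consecutive windows
    return_overflowing_tokens=True,   # one window per row instead of hard truncation
    return_offsets_mapping=True,      # (start_char, end_char) per token for span merging
)
print(len(encoded["input_ids"]))      # number of windows produced
```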