Why this matters
As an NLP Engineer, you decide how raw text becomes model inputs. Good vocabulary and subword choices reduce out-of-vocabulary issues and strike the right balance between sequence length, memory use, and accuracy. Real tasks include training tokenizers for new domains, choosing vocab size for latency budgets, and handling user text with typos, emojis, and multiple languages.
- Build a tokenizer for a chatbot that understands slang and misspellings.
- Fine-tune a model on medical notes without exploding vocabulary size.
- Deploy a model that must handle Unicode, emojis, and rare names consistently.
Concept explained simply
Words are split into smaller pieces called subwords. A vocabulary is the set of pieces your model knows. With subwords, even unseen words can be built from known pieces (like 'photo' + '##graphy').
Mental model
Think of subwords as Lego bricks. A large brick set (big vocabulary) builds sentences with fewer pieces (shorter sequences) but costs more memory. A small set (small vocabulary) is cheaper but needs more pieces (longer sequences). Your job is to choose the right brick set and how to snap pieces together.
Core concepts
- Token: a piece of text your model sees (word, subword, byte, or char).
- Vocabulary: the finite list of tokens with integer IDs (includes special tokens like [CLS], [SEP], [PAD], [UNK]).
- Out-of-vocabulary (OOV): a string not directly in the vocab. Subwords minimize OOV by composing words from pieces.
- Normalization: lowercasing, Unicode normalization, removing accents, handling punctuation, etc.
- Pretokenization: splitting on spaces or punctuation before subword training.
- Subword algorithms: BPE (byte pair encoding), WordPiece, Unigram LM. All learn frequent segments; details differ.
- Byte-level BPE: operates on UTF-8 bytes; robust to any Unicode input (emojis, all scripts), with virtually no OOV (see the byte sketch after this list).
- Casing: cased vs uncased vocab; task-dependent choice.
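To see why byte-level tokenization is so robust, here is a minimal plain-Python sketch (no tokenizer library involved) showing that any string, accents and emojis included, reduces to UTF-8 bytes drawn from a fixed set of 256 values:

```python
# Any string decomposes into UTF-8 bytes, and there are only 256 possible
# byte values, so a byte-level base vocabulary never meets an OOV symbol.
text = "café 🙂"
byte_values = list(text.encode("utf-8"))
print(byte_values)  # [99, 97, 102, 195, 169, 32, 240, 159, 153, 130]
print(len(text), "characters ->", len(byte_values), "bytes")  # 6 characters -> 10 bytes
```

A byte-level BPE then learns merges over these byte values, so frequent characters and character sequences still end up as single tokens.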
Worked examples
Example 1: WordPiece-style segmentation
Suppose the vocabulary includes: 'un', '##believ', '##ab', '##ly', 'help', '##ful'.
Input: unbelievably helpful
Tokens: un ##believ ##ab ##ly help ##ful
Why this works
The algorithm matches the longest possible known piece at the current position, then continues from where that piece ends, trying progressively shorter pieces whenever a long one does not match. If a word cannot be fully covered by known pieces, the whole word falls back to [UNK].
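The sketch below implements this longest-match-first loop in plain Python with the toy vocabulary above; real WordPiece implementations add details (such as a cap on word length), but the control flow is the same idea:

```python
# Greedy longest-match-first segmentation over a toy WordPiece-style vocabulary.
# Continuation pieces start with '##'; if a word cannot be fully covered by
# known pieces, the whole word falls back to [UNK].
VOCAB = {"un", "##believ", "##ab", "##ly", "help", "##ful"}

def wordpiece(word, vocab=VOCAB, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:                   # shrink the candidate until it matches
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece         # mark continuation of the same word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:                    # no known piece covers this span
            return [unk]
        pieces.append(match)
        start = end                          # continue after the matched piece
    return pieces

print(wordpiece("unbelievably"))  # ['un', '##believ', '##ab', '##ly']
print(wordpiece("helpful"))       # ['help', '##ful']
print(wordpiece("gracefully"))    # ['[UNK]'] -- nothing in the toy vocab matches
```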
Example 2: Tiny BPE merges
Corpus: 'low lower lowest new newer newest'. Start as characters. Frequent merges might be:
1) l + o -> lo
2) lo + w -> low
3) n + e -> ne
4) ne + w -> new
5) e + s -> es
Now: lowest -> low es t, newest -> new es t
Note
BPE learns merges by frequency across the corpus. With different corpora, merges differ.
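The sketch below shows this merge-learning loop in plain Python: count adjacent pairs, merge the most frequent pair, repeat. It breaks frequency ties arbitrarily and ignores word counts and word-boundary markers, so its exact merges can differ from the hand-worked list above; the procedure is fixed, but the learned merges are data- and detail-dependent.

```python
from collections import Counter

# Minimal BPE merge learning on the toy corpus: count adjacent symbol pairs,
# merge the most frequent pair everywhere, and repeat for a fixed number of
# merges. Ties are broken arbitrarily, so exact merges vary.
corpus = "low lower lowest new newer newest".split()
words = [list(w) for w in corpus]            # start from characters

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

merges = []
for _ in range(5):                           # learn 5 merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    merged = pair[0] + pair[1]
    for w in words:                          # apply the new merge in every word
        i = 0
        while i < len(w) - 1:
            if (w[i], w[i + 1]) == pair:
                w[i:i + 2] = [merged]
            else:
                i += 1

print("Learned merges:", merges)
print("Segmented corpus:", [" ".join(w) for w in words])
```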
Example 3: Map tokens to IDs
Given IDs: [UNK]=0, [CLS]=101, [SEP]=102, 'un'=2001, '##believ'=2002, '##ab'=2003, '##ly'=2006, 'help'=2007, '##ful'=2008, 'people'=2009, '##s'=2010, '##ness'=2013, '.'=2011, '!'=2012.
Text: 'Unbelievably helpful!'
Normalize: lowercase
Tokens: [CLS] un ##believ ##ab ##ly help ##ful ! [SEP]
IDs: 101 2001 2002 2003 2006 2007 2008 2012 102
Detokenization tip
Special join rules (like removing '##' and merging) reconstruct text without extra spaces.
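Here is a small sketch of both directions, using the toy IDs from Example 3: tokens to IDs via a dictionary lookup, then a simplified join that removes '##' markers and glues pieces back together (real detokenizers handle many more punctuation and spacing cases):

```python
# Map tokens to IDs with the toy vocabulary, then join the pieces back to text.
vocab = {"[UNK]": 0, "[CLS]": 101, "[SEP]": 102, "un": 2001, "##believ": 2002,
         "##ab": 2003, "##ly": 2006, "help": 2007, "##ful": 2008, "!": 2012}

tokens = ["[CLS]", "un", "##believ", "##ab", "##ly", "help", "##ful", "!", "[SEP]"]
ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]   # unknown tokens fall back to [UNK]
print(ids)  # [101, 2001, 2002, 2003, 2006, 2007, 2008, 2012, 102]

# Detokenization: drop special tokens, glue '##' continuations onto the
# previous piece, and (as a toy rule) attach closing punctuation without a space.
text = ""
for t in tokens:
    if t in ("[CLS]", "[SEP]", "[PAD]"):
        continue
    if t.startswith("##"):
        text += t[2:]
    elif t in {".", ",", "!", "?"}:
        text += t
    else:
        text += (" " if text else "") + t
print(text)  # unbelievably helpful!
```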
Designing a vocabulary
- Vocabulary size trade-off (a memory sketch follows this list):
  - Smaller vocab: longer sequences, smaller embedding matrix, better generalization to rare forms.
  - Larger vocab: shorter sequences (fewer tokens to process per input), but a larger embedding matrix and a heavier model.
- Language/script:
  - Space-delimited languages (e.g., English): subwords work well at roughly 20k–50k pieces.
  - Morphologically rich languages (e.g., Turkish): lean toward smaller pieces to capture affixes.
  - Scripts without spaces (e.g., Chinese, Japanese): character- or byte-level handling helps; learn merges on characters/bytes.
- Normalization: decide on lowercasing, accent folding, Unicode form. In domains where case matters (NER, German nouns), keep cased.
- Special tokens: include [PAD], [UNK], [CLS]/[BOS], [SEP]/[EOS], [MASK] if needed.
- Robustness: byte-level BPE avoids OOV for emojis and rare symbols.
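To put numbers on the memory side of the vocabulary size trade-off, here is a back-of-the-envelope sketch; the 768-dimensional hidden size and fp32 weights are illustrative assumptions, not fixed requirements:

```python
# Embedding matrix cost alone: parameters = vocab_size * hidden_dim,
# bytes = parameters * bytes_per_param (4 bytes per parameter for fp32).
def embedding_megabytes(vocab_size, hidden_dim, bytes_per_param=4):
    return vocab_size * hidden_dim * bytes_per_param / (1024 ** 2)

for vocab_size in (16_000, 32_000, 50_000, 100_000):
    print(f"{vocab_size:>7} tokens -> {embedding_megabytes(vocab_size, hidden_dim=768):7.1f} MB")
# 16k ~ 46.9 MB, 32k ~ 93.8 MB, 50k ~ 146.5 MB, 100k ~ 293.0 MB at fp32
```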
Practical steps to build/use subwords
1) Collect data: use representative domain data; remove obvious leaks (e.g., evaluation sets).
2) Normalize and pretokenize: decide on casing; split on spaces/punctuation as appropriate.
3) Train: choose BPE, WordPiece, or Unigram and set a target vocab size (e.g., 32k); a training sketch follows these steps.
4) Inspect: check for meaningful stems/suffixes; ensure special tokens exist.
5) Stress-test: try noisy text, rare words, emojis, and multiple languages.
6) Measure: average tokens per sentence, OOV rate (ideally near 0), and the memory footprint of the embeddings.
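The sketch below walks through steps 2-5 with the Hugging Face tokenizers library. It assumes the tokenizers package is installed and that corpus.txt is a representative, leak-free domain corpus; any comparable toolkit (e.g., SentencePiece) follows the same outline.

```python
# A sketch of training and stress-testing a BPE tokenizer with the
# Hugging Face `tokenizers` library (assumes `pip install tokenizers`
# and a local, representative corpus.txt).
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFKC(), normalizers.Lowercase()]         # step 2: normalization choices
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()      # step 2: split on spaces/punctuation

trainer = BpeTrainer(
    vocab_size=32_000,                                     # step 3: target vocab size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(["corpus.txt"], trainer=trainer)

# Step 5: stress test on noisy, emoji-laden, mixed-language input.
for text in ["Unbelievably helpfull!!", "café 🙂", "naïve résumé", "買い物リスト"]:
    print(text, "->", tokenizer.encode(text).tokens)

tokenizer.save("my_tokenizer.json")
```

Inspect the printed segments (step 4): frequent stems and suffixes should appear as single pieces, and [UNK] should be rare.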
Common mistakes and self-check
- Training tokenizer on evaluation data. Self-check: Confirm held-out sets are excluded.
- Overly large vocabulary (e.g., 100k+) bloating memory. Self-check: compute embedding params (vocab_size x hidden_dim).
- Ignoring casing when case is informative. Self-check: sample NER outputs with cased vs uncased.
- Too aggressive normalization deleting useful symbols. Self-check: audit before/after examples with emojis, currency, math signs.
- Not handling [UNK]. Self-check: ensure truly unknown input falls back to [UNK] only as a last resort (see the audit sketch below).
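To turn the last two self-checks into numbers, here is a small audit sketch. It assumes the my_tokenizer.json file saved by the earlier training sketch; the sample strings are just illustrative noisy inputs.

```python
# Audit how often a trained tokenizer falls back to [UNK] on noisy samples.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")
samples = ["Pls halp 🙏", "naïve café", "C'est génial !", "price: €9.99"]

total_tokens = unk_tokens = 0
for text in samples:
    tokens = tokenizer.encode(text).tokens
    total_tokens += len(tokens)
    unk_tokens += tokens.count("[UNK]")
    if "[UNK]" in tokens:
        print("UNK in:", text, "->", tokens)     # inspect what got lost

print(f"[UNK] rate: {unk_tokens / max(total_tokens, 1):.2%}")
```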
Exercises
Try these by hand. Use the hints if you get stuck.
Exercise 1 — Hand-train 5 BPE merges
Corpus: 'low lower lowest new newer newest'. Start from characters (including word boundary spaces). Perform 5 merges you think are most frequent, then tokenize 'lowest' and 'newest' using your merges.
Hints
- Focus on 'lo' 'low' and 'ne' 'new' patterns.
- Merging 'e' + 's' helps 'est'.
Solution
Possible merges:
1) l + o -> lo
2) lo + w -> low
3) n + e -> ne
4) ne + w -> new
5) e + s -> es
Tokenization: lowest -> low es t, newest -> new es t
Exercise 2 — WordPiece tokenization to IDs
Assume normalization: lowercase, strip apostrophes, and keep other punctuation. Vocab IDs: [UNK]=0, [CLS]=101, [SEP]=102, 'un'=2001, '##believ'=2002, '##ab'=2003, 'amazing'=2005, '##ly'=2006, 'help'=2007, '##ful'=2008, 'people'=2009, '##s'=2010, '##ness'=2013, '.'=2011, '!'=2012.
- Tokenize 'Unbelievably helpful!'
- Tokenize "People's amazing helpfulness."
Hints
- Use longest-match-first.
- "people's" becomes people ##s (the apostrophe is stripped by normalization).
- 'unbelievably' becomes un ##believ ##ab ##ly.
- 'helpfulness' becomes help ##ful ##ness.
Solution
"Unbelievably helpful!"
Tokens: [CLS] un ##believ ##ab ##ly help ##ful ! [SEP]
IDs: 101 2001 2002 2003 2006 2007 2008 2012 102
"People's amazing helpfulness."
Tokens: [CLS] people ##s amazing help ##ful ##ness . [SEP]
IDs: 101 2009 2010 2005 2007 2008 2013 2011 102
Checklist
- I can explain why subwords reduce OOV.
- I can describe trade-offs of vocabulary size.
- I can manually segment a word using a given vocab.
- I can map tokens to IDs including special tokens.
Mini challenge
You need a tokenizer for a mobile on-device NLU model with a 64MB total memory budget and a tight latency target. The hidden size is fixed, and every additional 16k vocabulary entries add noticeable embedding memory. Pick a target vocab size and a normalization strategy, and justify your choices in 3 sentences. Consider emojis and multilingual inputs.
Practical projects
- Train two tokenizers (16k vs 32k) on a domain corpus and compare: avg tokens per sentence, embedding size, and downstream F1 on a small classifier.
- Build a byte-level BPE tokenizer and test it on mixed-language text with emojis; report OOV rate and typical segment lengths.
- Create a diagnostic script that highlights tokens mapped to [UNK] and suggests merges to reduce them.
Learning path
- Next: Text normalization strategies and Unicode handling.
- Then: Embeddings and how token IDs become vectors.
- After: Positional encodings and transformer inputs.
Who this is for
- Aspiring NLP Engineers learning tokenizer design.
- ML practitioners fine-tuning language models on new domains.
- Data scientists shipping text models to production.
Prerequisites
- Basic Python and ML familiarity.
- Understanding of embeddings and sequence models at a high level.
- Comfort with reading simple algorithmic steps.
Next steps
- Finish the quick test below to confirm understanding.
- Apply a tokenizer to your own dataset and measure sequence lengths.
- Iterate on vocab size and normalization based on metrics and errors.
Quick Test