Why this matters
As an NLP Engineer, you decide how raw text becomes model inputs. Good vocabulary and subword choices reduce out-of-vocabulary issues and strike the right balance between sequence length, memory use, and accuracy. Real tasks include training tokenizers for new domains, choosing vocab size for latency budgets, and handling user text with typos, emojis, and multiple languages.
- Build a tokenizer for a chatbot that understands slang and misspellings.
- Fine-tune a model on medical notes without exploding vocabulary size.
- Deploy a model that must handle Unicode, emojis, and rare names consistently.
Concept explained simply
Words are split into smaller pieces called subwords. A vocabulary is the set of pieces your model knows. With subwords, even unseen words can be built from known pieces (like 'photo' + '##graphy').
Mental model
Think of subwords as Lego bricks. A large brick set (big vocabulary) builds sentences with fewer pieces (shorter sequences) but costs more memory. A small set (small vocabulary) is cheaper but needs more pieces (longer sequences). Your job is to choose the right brick set and how to snap pieces together.
Core concepts
- Token: a piece of text your model sees (word, subword, byte, or char).
- Vocabulary: the finite list of tokens with integer IDs (includes special tokens like [CLS], [SEP], [PAD], [UNK]).
- Out-of-vocabulary (OOV): a string not directly in the vocab. Subwords minimize OOV by composing words from pieces.
- Normalization: lowercasing, Unicode normalization, removing accents, handling punctuation, etc.
- Pretokenization: splitting on spaces or punctuation before subword training.
- Subword algorithms: BPE (byte pair encoding), WordPiece, Unigram LM. All learn frequent segments; details differ.
- Byte-level BPE: operates on UTF-8 bytes; robust to any Unicode input (emojis, all scripts), with virtually no OOV (see the byte sketch after this list).
- Casing: cased vs uncased vocab; task-dependent choice.
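To see why byte-level tokenization is so robust, here is a minimal plain-Python sketch (no tokenizer library involved) showing that any string, accents and emojis included, reduces to UTF-8 bytes drawn from a fixed set of 256 values:

```python
# Any string decomposes into UTF-8 bytes, and there are only 256 possible
# byte values, so a byte-level base vocabulary never meets an OOV symbol.
text = "café 🙂"
byte_values = list(text.encode("utf-8"))
print(byte_values)  # [99, 97, 102, 195, 169, 32, 240, 159, 153, 130]
print(len(text), "characters ->", len(byte_values), "bytes")  # 6 characters -> 10 bytes
```

A byte-level BPE then learns merges over these byte values, so frequent characters and character sequences still end up as single tokens.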
Worked examples
Example 1: WordPiece-style segmentation
Suppose the vocabulary includes: 'un', '##believ', '##ab', '##ly', 'help', '##ful'.
Input: unbelievably helpful
Tokens: un ##believ ##ab ##ly help ##ful
Why this works
The algorithm matches the longest possible known piece at the current position, then continues from where that piece ends, trying progressively shorter pieces whenever a long one does not match. If a word cannot be fully covered by known pieces, the whole word falls back to [UNK].
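The sketch below implements this longest-match-first loop in plain Python with the toy vocabulary above; real WordPiece implementations add details (such as a cap on word length), but the control flow is the same idea:

```python
# Greedy longest-match-first segmentation over a toy WordPiece-style vocabulary.
# Continuation pieces start with '##'; if a word cannot be fully covered by
# known pieces, the whole word falls back to [UNK].
VOCAB = {"un", "##believ", "##ab", "##ly", "help", "##ful"}

def wordpiece(word, vocab=VOCAB, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:                   # shrink the candidate until it matches
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece         # mark continuation of the same word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:                    # no known piece covers this span
            return [unk]
        pieces.append(match)
        start = end                          # continue after the matched piece
    return pieces

print(wordpiece("unbelievably"))  # ['un', '##believ', '##ab', '##ly']
print(wordpiece("helpful"))       # ['help', '##ful']
print(wordpiece("gracefully"))    # ['[UNK]'] -- nothing in the toy vocab matches
```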
Example 2: Tiny BPE merges
Corpus: 'low lower lowest new newer newest'. Start as characters. Frequent merges might be:
1) l + o -> lo
2) lo + w -> low
3) n + e -> ne
4) ne + w -> new
5) e + s -> es
Now: lowest -> low es t, newest -> new es t
Note
BPE learns merges by frequency across the corpus. With different corpora, merges differ.
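The sketch below shows this merge-learning loop in plain Python: count adjacent pairs, merge the most frequent pair, repeat. It breaks frequency ties arbitrarily and ignores word counts and word-boundary markers, so its exact merges can differ from the hand-worked list above; the procedure is fixed, but the learned merges are data- and detail-dependent.

```python
from collections import Counter

# Minimal BPE merge learning on the toy corpus: count adjacent symbol pairs,
# merge the most frequent pair everywhere, and repeat for a fixed number of
# merges. Ties are broken arbitrarily, so exact merges vary.
corpus = "low lower lowest new newer newest".split()
words = [list(w) for w in corpus]            # start from characters

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

merges = []
for _ in range(5):                           # learn 5 merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    merged = pair[0] + pair[1]
    for w in words:                          # apply the new merge in every word
        i = 0
        while i < len(w) - 1:
            if (w[i], w[i + 1]) == pair:
                w[i:i + 2] = [merged]
            else:
                i += 1

print("Learned merges:", merges)
print("Segmented corpus:", [" ".join(w) for w in words])
```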
Example 3: Map tokens to IDs
Given IDs: [UNK]=0, [CLS]=101, [SEP]=102, 'un'=2001, '##believ'=2002, '##ab'=2003, '##ly'=2006, 'help'=2007, '##ful'=2008, 'people'=2009, '##s'=2010, '##ness'=2013, '.'=2011, '!'=2012.
Text: 'Unbelievably helpful!'
Normalize: lowercase
Tokens: [CLS] un ##believ ##ab ##ly help ##ful ! [SEP]
IDs: 101 2001 2002 2003 2006 2007 2008 2012 102
Detokenization tip
Special join rules (like removing '##' and merging) reconstruct text without extra spaces.
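Here is a small sketch of both directions, using the toy IDs from Example 3: tokens to IDs via a dictionary lookup, then a simplified join that removes '##' markers and glues pieces back together (real detokenizers handle many more punctuation and spacing cases):

```python
# Map tokens to IDs with the toy vocabulary, then join the pieces back to text.
vocab = {"[UNK]": 0, "[CLS]": 101, "[SEP]": 102, "un": 2001, "##believ": 2002,
         "##ab": 2003, "##ly": 2006, "help": 2007, "##ful": 2008, "!": 2012}

tokens = ["[CLS]", "un", "##believ", "##ab", "##ly", "help", "##ful", "!", "[SEP]"]
ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]   # unknown tokens fall back to [UNK]
print(ids)  # [101, 2001, 2002, 2003, 2006, 2007, 2008, 2012, 102]

# Detokenization: drop special tokens, glue '##' continuations onto the
# previous piece, and (as a toy rule) attach closing punctuation without a space.
text = ""
for t in tokens:
    if t in ("[CLS]", "[SEP]", "[PAD]"):
        continue
    if t.startswith("##"):
        text += t[2:]
    elif t in {".", ",", "!", "?"}:
        text += t
    else:
        text += (" " if text else "") + t
print(text)  # unbelievably helpful!
```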
Designing a vocabulary
- Vocabulary size trade-off (a memory sketch follows this list):
  - Smaller vocab: longer sequences, smaller embedding matrix, better generalization to rare forms.
  - Larger vocab: shorter sequences (fewer tokens to process per input), but a larger embedding matrix and a heavier model.
- Language/script:
  - Space-delimited languages (e.g., English): subwords work well at roughly 20k–50k pieces.
  - Morphologically rich languages (e.g., Turkish): lean toward smaller pieces to capture affixes.
  - Scripts without spaces (e.g., Chinese, Japanese): character- or byte-level handling helps; learn merges on characters/bytes.
- Normalization: decide on lowercasing, accent folding, Unicode form. In domains where case matters (NER, German nouns), keep cased.
- Special tokens: include [PAD], [UNK], [CLS]/[BOS], [SEP]/[EOS], [MASK] if needed.
- Robustness: byte-level BPE avoids OOV for emojis and rare symbols.
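To put numbers on the memory side of the vocabulary size trade-off, here is a back-of-the-envelope sketch; the 768-dimensional hidden size and fp32 weights are illustrative assumptions, not fixed requirements:

```python
# Embedding matrix cost alone: parameters = vocab_size * hidden_dim,
# bytes = parameters * bytes_per_param (4 bytes per parameter for fp32).
def embedding_megabytes(vocab_size, hidden_dim, bytes_per_param=4):
    return vocab_size * hidden_dim * bytes_per_param / (1024 ** 2)

for vocab_size in (16_000, 32_000, 50_000, 100_000):
    print(f"{vocab_size:>7} tokens -> {embedding_megabytes(vocab_size, hidden_dim=768):7.1f} MB")
# 16k ~ 46.9 MB, 32k ~ 93.8 MB, 50k ~ 146.5 MB, 100k ~ 293.0 MB at fp32
```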
Practical steps to build/use subwords
1) Collect data: use representative domain data; remove obvious leaks (e.g., evaluation sets).
2) Normalize and pretokenize: decide on casing; split on spaces/punctuation as appropriate.
3) Train: choose BPE, WordPiece, or Unigram and set a target vocab size (e.g., 32k); a training sketch follows these steps.
4) Inspect: check for meaningful stems/suffixes; ensure special tokens exist.
5) Stress-test: try noisy text, rare words, emojis, and multiple languages.
6) Measure: average tokens per sentence, OOV rate (ideally near 0), and the memory footprint of the embeddings.
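The sketch below walks through steps 2-5 with the Hugging Face tokenizers library. It assumes the tokenizers package is installed and that corpus.txt is a representative, leak-free domain corpus; any comparable toolkit (e.g., SentencePiece) follows the same outline.

```python
# A sketch of training and stress-testing a BPE tokenizer with the
# Hugging Face `tokenizers` library (assumes `pip install tokenizers`
# and a local, representative corpus.txt).
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFKC(), normalizers.Lowercase()]         # step 2: normalization choices
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()      # step 2: split on spaces/punctuation

trainer = BpeTrainer(
    vocab_size=32_000,                                     # step 3: target vocab size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(["corpus.txt"], trainer=trainer)

# Step 5: stress test on noisy, emoji-laden, mixed-language input.
for text in ["Unbelievably helpfull!!", "café 🙂", "naïve résumé", "買い物リスト"]:
    print(text, "->", tokenizer.encode(text).tokens)

tokenizer.save("my_tokenizer.json")
```

Inspect the printed segments (step 4): frequent stems and suffixes should appear as single pieces, and [UNK] should be rare.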
Common mistakes and self-check
- Training tokenizer on evaluation data. Self-check: Confirm held-out sets are excluded.
- Overly large vocabulary (e.g., 100k+) bloating memory. Self-check: compute embedding params (vocab_size x hidden_dim).
- Ignoring casing when case is informative. Self-check: sample NER outputs with cased vs uncased.
- Too aggressive normalization deleting useful symbols. Self-check: audit before/after examples with emojis, currency, math signs.
- Not handling [UNK]. Self-check: ensure truly unknown input falls back to [UNK] only as a last resort (see the audit sketch below).
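To turn the last two self-checks into numbers, here is a small audit sketch. It assumes the my_tokenizer.json file saved by the earlier training sketch; the sample strings are just illustrative noisy inputs.

```python
# Audit how often a trained tokenizer falls back to [UNK] on noisy samples.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("my_tokenizer.json")
samples = ["Pls halp 🙏", "naïve café", "C'est génial !", "price: €9.99"]

total_tokens = unk_tokens = 0
for text in samples:
    tokens = tokenizer.encode(text).tokens
    total_tokens += len(tokens)
    unk_tokens += tokens.count("[UNK]")
    if "[UNK]" in tokens:
        print("UNK in:", text, "->", tokens)     # inspect what got lost

print(f"[UNK] rate: {unk_tokens / max(total_tokens, 1):.2%}")
```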
Exercises
Try these by hand. Use the hints if you get stuck.
Exercise 1 — Hand-train 5 BPE merges
Corpus: 'low lower lowest new newer newest'. Start from characters (including word boundary spaces). Perform 5 merges you think are most frequent, then tokenize 'lowest' and 'newest' using your merges.
Hints
- Focus on 'lo' 'low' and 'ne' 'new' patterns.
- Merging 'e' + 's' helps 'est'.
Solution
Possible merges:
1) l + o -> lo
2) lo + w -> low
3) n + e -> ne
4) ne + w -> new
5) e + s -> es
Tokenization: lowest -> low es t, newest -> new es t
Exercise 2 — WordPiece tokenization to IDs
Assume normalization: lowercase, strip apostrophes, and keep other punctuation. Vocab IDs: [UNK]=0, [CLS]=101, [SEP]=102, 'un'=2001, '##believ'=2002, '##ab'=2003, 'amazing'=2005, '##ly'=2006, 'help'=2007, '##ful'=2008, 'people'=2009, '##s'=2010, '##ness'=2013, '.'=2011, '!'=2012.
- Tokenize 'Unbelievably helpful!'
- Tokenize "People's amazing helpfulness."
Hints
- Use longest-match-first.
- "people's" becomes people ##s (the apostrophe is stripped by normalization).
- 'unbelievably' becomes un ##believ ##ab ##ly.
- 'helpfulness' becomes help ##ful ##ness.
Solution
"Unbelievably helpful!"
Tokens: [CLS] un ##believ ##ab ##ly help ##ful ! [SEP]
IDs: 101 2001 2002 2003 2006 2007 2008 2012 102
"People's amazing helpfulness."
Tokens: [CLS] people ##s amazing help ##ful ##ness . [SEP]
IDs: 101 2009 2010 2005 2007 2008 2013 2011 102
Checklist
- I can explain why subwords reduce OOV.
- I can describe trade-offs of vocabulary size.
- I can manually segment a word using a given vocab.
- I can map tokens to IDs including special tokens.
Mini challenge
You need a tokenizer for a mobile on-device NLU model with a 64MB total memory budget and a tight latency target. The hidden size is fixed, and every additional 16k vocabulary entries add noticeable embedding memory. Pick a target vocab size and a normalization strategy, and justify your choices in 3 sentences. Consider emojis and multilingual inputs.
Practical projects
- Train two tokenizers (16k vs 32k) on a domain corpus and compare: avg tokens per sentence, embedding size, and downstream F1 on a small classifier.
- Build a byte-level BPE tokenizer and test it on mixed-language text with emojis; report OOV rate and typical segment lengths.
- Create a diagnostic script that highlights tokens mapped to [UNK] and suggests merges to reduce them.
Learning path
- Next: Text normalization strategies and Unicode handling.
- Then: Embeddings and how token IDs become vectors.
- After: Positional encodings and transformer inputs.
Who this is for
- Aspiring NLP Engineers learning tokenizer design.
- ML practitioners fine-tuning language models on new domains.
- Data scientists shipping text models to production.
Prerequisites
- Basic Python and ML familiarity.
- Understanding of embeddings and sequence models at a high level.
- Comfort with reading simple algorithmic steps.
Next steps
- Finish the quick test below to confirm understanding.
- Apply a tokenizer to your own dataset and measure sequence lengths.
- Iterate on vocab size and normalization based on metrics and errors.
Quick Test