Why this matters
N-grams and character features remain strong baselines for many NLP tasks, especially when data is limited or when you need fast, interpretable models. As an NLP Engineer, you will use them to:
- Build quick, reliable baselines for sentiment, topic, or spam detection.
- Handle noisy text (typos, slang, product codes) using character n-grams.
- Support classic models (Logistic Regression, SVM, Naive Bayes) that train and iterate quickly.
- Compare against deep models to prove value or justify complexity.
Who this is for
- Early-career NLP/ML engineers building strong baselines.
- Data scientists needing quick, interpretable text features.
- Engineers working with noisy, short, or multilingual text.
Prerequisites
- Basic Python and familiarity with text preprocessing (lowercasing, tokenization).
- Understanding of classification tasks and train/validation/test splits.
- Familiarity with vectorization concepts (bag-of-words, TF-IDF) is helpful but not required.
Concept explained simply
N-grams are sequences of N consecutive items. In text, those items can be words (word n-grams) or characters (character n-grams).
- Word unigrams: single words (e.g., "great").
- Word bigrams: pairs of consecutive words (e.g., "not great").
- Character trigrams: sequences of 3 characters (e.g., "fan", "ant" from "fantastic").
Why use them?
- Word bigrams capture local context ("not good" vs "good").
- Character n-grams handle typos, inflections, hashtags, and unseen words.
- They are fast to compute and work well with linear models.
Mental model
Imagine a sliding window moving across text. At each step, it records what's inside (a word pair or a character group). Then you count how often each windowed pattern appears. Optionally, you reweight counts with TF-IDF so patterns common across all documents matter less than patterns distinctive to a few.
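A minimal sketch of that sliding window in plain Python (the helper names here are illustrative, not from any library):

```python
from collections import Counter

def word_ngrams(tokens, n):
    # Slide a window of n tokens across the list and count each window.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def char_ngrams(text, n):
    # Slide a window of n characters across the string and count each window.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

tokens = "not good at all but good service".split()
print(word_ngrams(tokens, 2))       # e.g. ('not', 'good'): 1, ('good', 'service'): 1
print(char_ngrams("fantastic", 3))  # e.g. 'fan': 1, 'ant': 1
```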
Key feature types and knobs
- Word n-grams: set ngram_range=(1,2) for unigrams + bigrams. Good for context and phrases.
- Character n-grams: common ranges are (3,5). Robust to typos, prefixes/suffixes, and mixed scripts.
- Weighting:
- Counts: raw frequency.
- Binary: presence/absence (useful when repetition doesn't add meaning).
- TF-IDF: downweights ubiquitous n-grams, highlights distinctive ones.
- Vocabulary control: min_df to drop rare n-grams; max_df to drop very common ones.
- Hashing trick: fixed-size feature space without storing a vocabulary. Collisions can happen but are usually tolerable with a large dimensionality.
- Preprocessing: lowercasing, basic normalization, stopword handling; keep punctuation if it carries meaning (e.g., "!", emojis).
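If you use scikit-learn (an assumption; any comparable vectorizer works), these knobs map directly onto vectorizer parameters. The documents and parameter values below are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["not good at all", "good service", "fantstic support!!"]

# Word unigrams + bigrams, TF-IDF weighted, with vocabulary control.
word_vec = TfidfVectorizer(
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=1,            # drop n-grams below this document frequency
    max_df=0.95,         # drop n-grams present in more than 95% of documents
)

# Character 3-5-grams within word boundaries, binary presence/absence.
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5), binary=True)

X_word = word_vec.fit_transform(docs)
X_char = char_vec.fit_transform(docs)
print(X_word.shape, X_char.shape)  # (3, n_word_features) (3, n_char_features)
```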
Edge cases and tips
- Short texts: prefer character n-grams and binary weighting.
- Noisy user input: character n-grams are often more robust than words.
- Domain jargon/codes (e.g., SKUs): character n-grams shine.
- Multilingual data: character n-grams often transfer better across languages.
Worked examples
Example 1 – Word unigrams and bigrams
Sentence: "not good at all but good service"
- Tokens: [not, good, at, all, but, good, service]
- Unigrams (counts): not:1, good:2, at:1, all:1, but:1, service:1
- Bigrams (counts): "not good":1, "good at":1, "at all":1, "all but":1, "but good":1, "good service":1
Notice how bigrams differentiate "not good" from the positive word "good" alone.
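The same counts can be reproduced with a vectorizer; here is a sketch assuming scikit-learn's CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))  # word unigrams + bigrams
X = vec.fit_transform(["not good at all but good service"])

# Pair each n-gram with its count in the single document.
counts = dict(zip(vec.get_feature_names_out(), X.toarray()[0]))
print(counts["good"])      # 2
print(counts["not good"])  # 1
```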
Example 2 – Character trigrams
Text: "fantastic!" (lowercased)
- Characters: f a n t a s t i c !
- 3-grams: fan, ant, nta, tas, ast, sti, tic, ic!
Even if a user types "fantstic!" by mistake, several trigrams (fan, ant, sti, tic) still overlap, helping the model recover meaning.
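A quick way to see that overlap is to compare the two trigram sets directly; this plain-Python sketch uses an illustrative helper:

```python
def char_ngrams(text, n=3):
    # Set of all contiguous n-character windows in the text.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

correct = char_ngrams("fantastic!")
typo = char_ngrams("fantstic!")

print(sorted(correct & typo))                   # shared trigrams survive the typo
print(len(correct & typo), "of", len(correct))  # overlap out of the correct spelling's trigrams
```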
Example 3 – Tiny TF-IDF
Corpus (3 docs):
D1: "good service"
D2: "not good"
D3: "good price"
Unigrams: good, service, not, price
- DF: good=3, service=1, not=1, price=1
- IDF = log((N+1)/(DF+1)) + 1 (natural log), with N = 3:
- good: log(4/4) + 1 = 1.0
- service: log(4/2) + 1 ≈ 1.693
- not: log(4/2) + 1 ≈ 1.693
- price: log(4/2) + 1 ≈ 1.693
- TF (raw counts). In D2: "not" = 1, "good" = 1, so the TF-IDF weights for D2 are:
- good: 1 × 1.0 = 1.0
- not: 1 × 1.693 ≈ 1.693
"not" becomes more influential than the ubiquitous "good" after TF-IDF weighting.
Example 4 – Hashing trick intuition
Map each n-gram to an index via a hash function modulo a large dimension (e.g., 2^20). Two different n-grams may collide, but with a large space and sparse features, it rarely hurts performance. You avoid storing a growing vocabulary and keep memory predictable.
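A sketch of the hashing trick with scikit-learn's HashingVectorizer (assumed available; the dimensionality is illustrative):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Fixed 2^20-dimensional space: no vocabulary is stored, so memory stays
# predictable even when new n-grams appear at prediction time.
vec = HashingVectorizer(ngram_range=(1, 2), n_features=2**20, alternate_sign=False)

X = vec.transform(["not good at all", "good service"])
print(X.shape)  # (2, 1048576)
print(X.nnz)    # only the non-zero entries are stored (sparse matrix)
```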
Practical steps (from text to model)
- Clean, formal text → word n-grams (1–2).
- Noisy or short text → character n-grams (3–5), or combine them with word unigrams.
- Use min_df to drop ultra-rare n-grams and max_df to drop stopword-like items. A minimal end-to-end sketch follows this list.
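Assuming scikit-learn (the texts and labels below are toy placeholders), combining word and character views with a linear model might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union

texts = ["not good at all", "great service", "fantstic product!!", "terrible support"]
labels = [0, 1, 1, 0]  # toy sentiment labels

# Word 1-2-grams and character 3-5-grams, concatenated, then a linear classifier.
features = make_union(
    TfidfVectorizer(ngram_range=(1, 2)),
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
)
model = make_pipeline(features, LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["not grate service"]))
```

On real data you would tune min_df, max_df, and the regularization strength C on a validation split.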
Common mistakes and self-check
- Mistake: Using only unigrams on short/noisy text. Fix: add char 3–5-grams.
- Mistake: Letting the vocabulary explode. Fix: set min_df, limit ngram_range, or use hashing.
- Mistake: Dropping punctuation/emojis that carry sentiment. Fix: keep them or extract them as character n-grams.
- Mistake: Overfitting with too many features and a high C (weak regularization). Fix: increase regularization (lower C) or reduce features.
- Mistake: Misinterpreting TF-IDF. Fix: remember it downweights across-corpus frequent terms.
Self-check prompts
- Did validation metrics improve when adding bigrams or char n-grams?
- Is the feature space size stable and memory-safe?
- Do top weighted n-grams make sense for the task?
Exercises
Complete these exercises here, then compare with the solutions.
Exercise 1 – Word bigrams with TF-IDF (paper exercise)
Corpus:
D1: "great product and great support"
D2: "not great product"
D3: "great price and support"
Task:
- Use word n-grams with ngram_range=(1,2).
- Compute TF-IDF using IDF = log((N+1)/(DF+1)) + 1 (natural log), N = 3.
- For D2, list TF-IDF weights for: "great", "product", "not", and bigrams "not great", "great product" (round to 3 decimals).
Exercise 2 – Character 3-grams and robustness
Words: "battery", "battary", "batery" (lowercased). Extract char 3-grams for each and compute Jaccard similarity between:
- battery vs battary
- battery vs batery
Jaccard(A,B) = |A ∩ B| / |A ∪ B|. List the sets and the two similarity scores (round to 2 decimals).
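If you want to verify your paper answers afterwards, here is a short plain-Python checker (the helper names are illustrative):

```python
def char_ngrams(word, n=3):
    # Set of contiguous n-character sequences in the word.
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

for typo in ("battary", "batery"):
    score = jaccard(char_ngrams("battery"), char_ngrams(typo))
    print("battery vs", typo, "->", round(score, 2))
```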
Checklist before you submit:
- Used the provided IDF formula and rounding.
- Included both unigrams and bigrams where requested.
- For char 3-grams, counted only contiguous sequences.
Practical projects
- Tweet sentiment baseline: compare unigram+bigram TF-IDF vs char 3–5-gram TF-IDF on short texts with emojis.
- Spam filter for support tickets: combine word unigrams with char 4-grams; evaluate precision/recall on rare spam phrases.
- Product category classifier: test hashing trick with 2^18 dimensions vs explicit vocabulary; report impact of collisions.
Mini challenge
You must classify app reviews as positive/negative. Data is noisy and short. You try three setups with Logistic Regression:
- Word unigrams only, TF-IDF.
- Word unigrams+bigrams, TF-IDF.
- Character 3–5-grams, binary.
Predict which two will perform best and why. Then justify whether adding both word bigrams and char 3–5-grams could be complementary or redundant in this setting.
Learning path
- Now: N-grams and character features (this lesson).
- Next: Feature selection and dimensionality control (min_df/max_df, chi-square selection).
- Then: Linguistic enrichments (lemmatization, POS tags as features) for classical models.
- Finally: Compare with neural embeddings; learn when classical features win on speed/interpretability.
Next steps
- Take the quick test below to lock in the concepts.
- Apply to a small dataset you know. Aim for a strong, simple baseline first.