Why this matters
N-grams and character features remain strong baselines for many NLP tasks, especially when data is limited or when you need fast, interpretable models. As an NLP Engineer, you will use them to:
- Build quick, reliable baselines for sentiment, topic, or spam detection.
- Handle noisy text (typos, slang, product codes) using character n-grams.
- Support classic models (Logistic Regression, SVM, Naive Bayes) that train and iterate quickly.
- Compare against deep models to prove value or justify complexity.
Who this is for
- Early-career NLP/ML engineers building strong baselines.
- Data scientists needing quick, interpretable text features.
- Engineers working with noisy, short, or multilingual text.
Prerequisites
- Basic Python and familiarity with text preprocessing (lowercasing, tokenization).
- Understanding of classification tasks and train/validation/test splits.
- Familiarity with vectorization concepts (bag-of-words, TF-IDF) is helpful but not required.
Concept explained simply
N-grams are sequences of N consecutive items. In text, those items can be words (word n-grams) or characters (character n-grams).
- Word unigrams: single words (e.g., "great").
- Word bigrams: pairs of consecutive words (e.g., "not great").
- Character trigrams: sequences of 3 characters (e.g., "fan", "ant" from "fantastic").
Why use them?
- Word bigrams capture local context ("not good" vs "good").
- Character n-grams handle typos, inflections, hashtags, and unseen words.
- They are fast to compute and work well with linear models.
Mental model
Imagine a sliding window moving across text. At each step, it records what's inside (a word pair or a character group). Then you count how often each windowed pattern appears. Optionally, you reweight counts with TF-IDF so patterns common across all documents matter less than patterns distinctive to a few.
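A minimal sketch of that sliding window in plain Python (the helper names here are illustrative, not from any library):

```python
from collections import Counter

def word_ngrams(tokens, n):
    # Slide a window of n tokens across the list and count each window.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def char_ngrams(text, n):
    # Slide a window of n characters across the string and count each window.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

tokens = "not good at all but good service".split()
print(word_ngrams(tokens, 2))       # e.g. ('not', 'good'): 1, ('good', 'service'): 1
print(char_ngrams("fantastic", 3))  # e.g. 'fan': 1, 'ant': 1
```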
Key feature types and knobs
- Word n-grams: set ngram_range=(1,2) for unigrams + bigrams. Good for context and phrases.
- Character n-grams: common ranges are (3,5). Robust to typos, prefixes/suffixes, and mixed scripts.
- Weighting:
- Counts: raw frequency.
- Binary: presence/absence (useful when repetition doesn't add meaning).
- TF-IDF: downweights ubiquitous n-grams, highlights distinctive ones.
- Vocabulary control: min_df to drop rare n-grams; max_df to drop very common ones.
- Hashing trick: fixed-size feature space without storing a vocabulary. Collisions can happen but are usually tolerable with a large dimensionality.
- Preprocessing: lowercasing, basic normalization, stopword handling; keep punctuation if it carries meaning (e.g., "!", emojis).
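If you use scikit-learn (an assumption; any comparable vectorizer works), these knobs map directly onto vectorizer parameters. The documents and parameter values below are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["not good at all", "good service", "fantstic support!!"]

# Word unigrams + bigrams, TF-IDF weighted, with vocabulary control.
word_vec = TfidfVectorizer(
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=1,            # drop n-grams below this document frequency
    max_df=0.95,         # drop n-grams present in more than 95% of documents
)

# Character 3-5-grams within word boundaries, binary presence/absence.
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5), binary=True)

X_word = word_vec.fit_transform(docs)
X_char = char_vec.fit_transform(docs)
print(X_word.shape, X_char.shape)  # (3, n_word_features) (3, n_char_features)
```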
Edge cases and tips
- Short texts: prefer character n-grams and binary weighting.
- Noisy user input: character n-grams are often more robust than words.
- Domain jargon/codes (e.g., SKUs): character n-grams shine.
- Multilingual data: character n-grams often transfer better across languages.
Worked examples
Example 1 – Word unigrams and bigrams
Sentence: "not good at all but good service"
- Tokens: [not, good, at, all, but, good, service]
- Unigrams (counts): not:1, good:2, at:1, all:1, but:1, service:1
- Bigrams (counts): "not good":1, "good at":1, "at all":1, "all but":1, "but good":1, "good service":1
Notice how bigrams differentiate "not good" from the positive word "good" alone.
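The same counts can be reproduced with a vectorizer; here is a sketch assuming scikit-learn's CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))  # word unigrams + bigrams
X = vec.fit_transform(["not good at all but good service"])

# Pair each n-gram with its count in the single document.
counts = dict(zip(vec.get_feature_names_out(), X.toarray()[0]))
print(counts["good"])      # 2
print(counts["not good"])  # 1
```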
Example 2 – Character trigrams
Text: "fantastic!" (lowercased)
- Characters: f a n t a s t i c !
- 3-grams: fan, ant, nta, tas, ast, sti, tic, ic!
Even if a user types "fantstic!" by mistake, several trigrams (fan, ant, sti, tic) still overlap, helping the model recover meaning.
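A quick way to see that overlap is to compare the two trigram sets directly; this plain-Python sketch uses an illustrative helper:

```python
def char_ngrams(text, n=3):
    # Set of all contiguous n-character windows in the text.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

correct = char_ngrams("fantastic!")
typo = char_ngrams("fantstic!")

print(sorted(correct & typo))                   # shared trigrams survive the typo
print(len(correct & typo), "of", len(correct))  # overlap out of the correct spelling's trigrams
```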
Example 3 – Tiny TF-IDF
Corpus (3 docs):
D1: "good service"
D2: "not good"
D3: "good price"
Unigrams: good, service, not, price
- DF: good=3, service=1, not=1, price=1
- IDF = log((N+1)/(DF+1)) + 1 (natural log), with N = 3:
- good: log(4/4) + 1 = 1.0
- service: log(4/2) + 1 ≈ 1.693
- not: log(4/2) + 1 ≈ 1.693
- price: log(4/2) + 1 ≈ 1.693
- TF (raw counts). In D2: "not" = 1, "good" = 1, so the TF-IDF weights for D2 are:
- good: 1 × 1.0 = 1.0
- not: 1 × 1.693 ≈ 1.693
"not" becomes more influential than the ubiquitous "good" after TF-IDF weighting.
Example 4 – Hashing trick intuition
Map each n-gram to an index via a hash function modulo a large dimension (e.g., 2^20). Two different n-grams may collide, but with a large space and sparse features, it rarely hurts performance. You avoid storing a growing vocabulary and keep memory predictable.
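A sketch of the hashing trick with scikit-learn's HashingVectorizer (assumed available; the dimensionality is illustrative):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Fixed 2^20-dimensional space: no vocabulary is stored, so memory stays
# predictable even when new n-grams appear at prediction time.
vec = HashingVectorizer(ngram_range=(1, 2), n_features=2**20, alternate_sign=False)

X = vec.transform(["not good at all", "good service"])
print(X.shape)  # (2, 1048576)
print(X.nnz)    # only the non-zero entries are stored (sparse matrix)
```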
Practical steps (from text to model)
- Clean, formal text → word n-grams (1–2).
- Noisy or short text → character n-grams (3–5), or combine them with word unigrams.
- Use min_df to drop ultra-rare n-grams and max_df to drop stopword-like items. A minimal end-to-end sketch follows this list.
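Assuming scikit-learn (the texts and labels below are toy placeholders), combining word and character views with a linear model might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union

texts = ["not good at all", "great service", "fantstic product!!", "terrible support"]
labels = [0, 1, 1, 0]  # toy sentiment labels

# Word 1-2-grams and character 3-5-grams, concatenated, then a linear classifier.
features = make_union(
    TfidfVectorizer(ngram_range=(1, 2)),
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
)
model = make_pipeline(features, LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["not grate service"]))
```

On real data you would tune min_df, max_df, and the regularization strength C on a validation split.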
Common mistakes and self-check
- Mistake: Using only unigrams on short/noisy text. Fix: add char 3–5-grams.
- Mistake: Letting the vocabulary explode. Fix: set min_df, limit ngram_range, or use hashing.
- Mistake: Dropping punctuation/emojis that carry sentiment. Fix: keep them or extract them as character n-grams.
- Mistake: Overfitting with too many features and a high C (weak regularization). Fix: increase regularization (lower C) or reduce features.
- Mistake: Misinterpreting TF-IDF. Fix: remember it downweights across-corpus frequent terms.
Self-check prompts
- Did validation metrics improve when adding bigrams or char n-grams?
- Is the feature space size stable and memory-safe?
- Do top weighted n-grams make sense for the task?
Exercises
Complete these exercises here, then compare with the solutions.
Exercise 1 – Word bigrams with TF-IDF (paper exercise)
Corpus:
D1: "great product and great support"
D2: "not great product"
D3: "great price and support"
Task:
- Use word n-grams with ngram_range=(1,2).
- Compute TF-IDF using IDF = log((N+1)/(DF+1)) + 1 (natural log), N = 3.
- For D2, list TF-IDF weights for: "great", "product", "not", and bigrams "not great", "great product" (round to 3 decimals).
Exercise 2 – Character 3-grams and robustness
Words: "battery", "battary", "batery" (lowercased). Extract char 3-grams for each and compute Jaccard similarity between:
- battery vs battary
- battery vs batery
Jaccard(A,B) = |A ∩ B| / |A ∪ B|. List the sets and the two similarity scores (round to 2 decimals).
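If you want to verify your paper answers afterwards, here is a short plain-Python checker (the helper names are illustrative):

```python
def char_ngrams(word, n=3):
    # Set of contiguous n-character sequences in the word.
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

for typo in ("battary", "batery"):
    score = jaccard(char_ngrams("battery"), char_ngrams(typo))
    print("battery vs", typo, "->", round(score, 2))
```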
Checklist before you submit:
- Used the provided IDF formula and rounding.
- Included both unigrams and bigrams where requested.
- For char 3-grams, counted only contiguous sequences.
Practical projects
- Tweet sentiment baseline: compare unigram+bigram TF-IDF vs char 3–5-gram TF-IDF on short texts with emojis.
- Spam filter for support tickets: combine word unigrams with char 4-grams; evaluate precision/recall on rare spam phrases.
- Product category classifier: test hashing trick with 2^18 dimensions vs explicit vocabulary; report impact of collisions.
Mini challenge
You must classify app reviews as positive/negative. Data is noisy and short. You try three setups with Logistic Regression:
- Word unigrams only, TF-IDF.
- Word unigrams+bigrams, TF-IDF.
- Character 3–5-grams, binary.
Predict which two will perform best and why. Then justify whether adding both word bigrams and char 3–5-grams could be complementary or redundant in this setting.
Learning path
- Now: N-grams and character features (this lesson).
- Next: Feature selection and dimensionality control (min_df/max_df, chi-square selection).
- Then: Linguistic enrichments (lemmatization, POS tags as features) for classical models.
- Finally: Compare with neural embeddings; learn when classical features win on speed/interpretability.
Next steps
- Take the quick test below to lock in the concepts.
- Apply to a small dataset you know. Aim for a strong, simple baseline first.