
N-Grams and Character Features

Learn n-grams and character features for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

N-grams and character features remain strong baselines for many NLP tasks, especially when data is limited or when you need fast, interpretable models. As an NLP Engineer, you will use them to:

  • Build quick, reliable baselines for sentiment, topic, or spam detection.
  • Handle noisy text (typos, slang, product codes) using character n-grams.
  • Support classic models (Logistic Regression, SVM, Naive Bayes) that train and iterate quickly.
  • Compare against deep models to prove value or justify complexity.

Who this is for

  • Early-career NLP/ML engineers building strong baselines.
  • Data scientists needing quick, interpretable text features.
  • Engineers working with noisy, short, or multilingual text.

Prerequisites

  • Basic Python and familiarity with text preprocessing (lowercasing, tokenization).
  • Understanding of classification tasks and train/validation/test splits.
  • Familiarity with vectorization concepts (bag-of-words, TF-IDF) is helpful but not required.

Concept explained simply

N-grams are sequences of N consecutive items. In text, those items can be words (word n-grams) or characters (character n-grams).

  • Word unigrams: single words (e.g., "great").
  • Word bigrams: pairs of consecutive words (e.g., "not great").
  • Character trigrams: sequences of 3 characters (e.g., "fan", "ant" from "fantastic").

Why use them?

  • Word bigrams capture local context ("not good" vs "good").
  • Character n-grams handle typos, inflections, hashtags, and unseen words.
  • They are fast to compute and work well with linear models.

Mental model

Imagine a sliding window moving across text. At each step, it records what's inside (a word pair or a character group). Then you count how often each windowed pattern appears. Optionally, you reweight counts with TF-IDF so patterns common across all documents matter less than patterns distinctive to a few.
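
A minimal sketch of the sliding-window idea in plain Python (the function names below are illustrative, not from any library):

from collections import Counter

def word_ngrams(text, n):
    # Slide a window of n tokens across the whitespace-tokenized text.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    # Slide a window of n characters across the raw string.
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(Counter(word_ngrams("not good at all but good service", 2)))
print(Counter(char_ngrams("fantastic!", 3)))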

Key feature types and knobs

  • Word n-grams: set ngram_range=(1,2) for unigrams+bigrams. Good for context and phrases.
  • Character n-grams: common ranges are (3,5). Robust to typos, prefixes/suffixes, and mixed scripts.
  • Weighting:
    • Counts: raw frequency.
    • Binary: presence/absence (useful when repetition doesn't add meaning).
    • TF-IDF: downweights ubiquitous n-grams, highlights distinctive ones.
  • Vocabulary control: min_df to drop rare n-grams; max_df to drop very common ones.
  • Hashing trick: fixed-size feature space without storing a vocabulary. Collisions can happen but are usually tolerable with a large dimensionality.
  • Preprocessing: lowercasing, basic normalization, stopword handling; keep punctuation if it carries meaning (e.g., "!", emojis).
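
A minimal scikit-learn sketch wiring several of these knobs together (the toy corpus and parameter values are placeholders, assuming scikit-learn is installed):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["not good at all but good service", "great price and support"]  # toy corpus

# Word unigrams + bigrams, TF-IDF weighted, with vocabulary control.
word_vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, max_df=0.95, lowercase=True)
X_word = word_vec.fit_transform(docs)

# Character 3-5-grams bounded at word edges ("char_wb"), binary presence/absence.
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5), binary=True)
X_char = char_vec.fit_transform(docs)

print(X_word.shape, X_char.shape)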

Edge cases and tips

  • Short texts: prefer character n-grams and binary weighting.
  • Noisy user input: character n-grams are often more robust than words.
  • Domain jargon/codes (e.g., SKUs): character n-grams shine.
  • Multilingual data: character n-grams often transfer better across languages.

Worked examples

Example 1 — Word unigrams and bigrams

Sentence: "not good at all but good service"

  • Tokens: [not, good, at, all, but, good, service]
  • Unigrams (counts): not:1, good:2, at:1, all:1, but:1, service:1
  • Bigrams (counts): "not good":1, "good at":1, "at all":1, "all but":1, "but good":1, "good service":1

Notice how bigrams differentiate "not good" from the positive word "good" alone.
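
These counts can be reproduced with scikit-learn's CountVectorizer (a quick verification sketch, assuming scikit-learn is available):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))  # word unigrams + bigrams
X = vec.fit_transform(["not good at all but good service"])

# Map each n-gram to its count in this single document.
counts = dict(zip(vec.get_feature_names_out(), X.toarray()[0]))
print(counts["good"])       # 2
print(counts["not good"])   # 1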

Example 2 — Character trigrams

Text: "fantastic!" (lowercased)

  • Characters: f a n t a s t i c !
  • 3-grams: fan, ant, nta, tas, ast, sti, tic, ic!

Even if a user types "fantstic!" by mistake, several trigrams (fan, ant, sti, tic, ic!) still overlap with the correct spelling, helping the model recover meaning.
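
A quick plain-Python sketch to verify the overlap (no external dependencies):

def char_ngram_set(text, n=3):
    # Contiguous character n-grams from the lowercased string, as a set.
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

correct = char_ngram_set("fantastic!")
typo = char_ngram_set("fantstic!")
print(sorted(correct & typo))  # ['ant', 'fan', 'ic!', 'sti', 'tic']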

Example 3 — Tiny TF-IDF

Corpus (3 docs):

D1: "good service"
D2: "not good"
D3: "good price"

Unigrams: good, service, not, price

  • DF: good=3, service=1, not=1, price=1
  • IDF (log((N+1)/(DF+1))+1, with N=3):
    • good: log(4/4) + 1 = 1.0
    • service: log(4/2) + 1 ≈ 1.693
    • not: log(4/2) + 1 ≈ 1.693
    • price: log(4/2) + 1 ≈ 1.693
  • TF (raw counts). In D2: "not"=1, "good"=1 → TF-IDF for D2:
    • good: 1 × 1.0 = 1.0
    • not: 1 × 1.693 ≈ 1.693

"not" becomes more influential than the ubiquitous "good" after TF-IDF weighting.

Example 4 — Hashing trick intuition

Map each n-gram to an index via a hash function modulo a large dimension (e.g., 2^20). Two different n-grams may collide, but with a large space and sparse features, it rarely hurts performance. You avoid storing a growing vocabulary and keep memory predictable.
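
A minimal HashingVectorizer sketch (the 2^20 dimensionality mirrors the example above; other parameters are left at their defaults):

from sklearn.feature_extraction.text import HashingVectorizer

# Fixed-size feature space: no vocabulary is stored, so memory stays predictable.
vec = HashingVectorizer(n_features=2**20, analyzer="char_wb", ngram_range=(3, 5))
X = vec.transform(["fantastic service", "fantstic service"])  # stateless: no fit needed
print(X.shape)  # (2, 1048576), with only a handful of non-zero columns per row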

Practical steps (from text to model)

Step 1. Normalize text: lowercase, standardize whitespace. Keep punctuation/emojis if useful.
Step 2. Choose feature family:
  • Clean formal text → word n-grams (1–2).
  • Noisy/short text → character n-grams (3–5) or combine with word unigrams.
Step 3. Weighting: start with TF-IDF; consider binary for very short texts.
Step 4. Control vocabulary: set min_df to drop ultra-rare n-grams; max_df to drop stopword-like items.
Step 5. Train a linear model (Logistic Regression/SVM). Use regularization to handle high dimensionality; in scikit-learn, smaller C means stronger regularization for LR/SVM.
Step 6. Evaluate with a validation split. Compare word vs char features and combinations.
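
An end-to-end sketch of these steps (the toy texts, labels, and parameter values are illustrative placeholders; any labeled text dataset works):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["great service", "not good at all", "good price", "terrible support",
         "loved it", "awful experience", "works great", "not worth it"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # toy labels: 1 = positive, 0 = negative

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# Steps 2-5: word unigrams + bigrams, TF-IDF, vocabulary control, regularized linear model.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1, lowercase=True),
    LogisticRegression(C=1.0, max_iter=1000),
)
model.fit(X_train, y_train)

# Step 6: evaluate on the held-out split; rerun with a char_wb vectorizer to compare.
print(classification_report(y_val, model.predict(X_val)))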

Common mistakes and self-check

  • Mistake: Using only unigrams on short/noisy text. Fix: add char 3–5-grams.
  • Mistake: Letting the vocabulary explode. Fix: set min_df, limit ngram_range, or use hashing.
  • Mistake: Dropping punctuation/emojis that carry sentiment. Fix: keep them or extract as characters.
  • Mistake: Overfitting with too many features and high C. Fix: increase regularization or reduce features.
  • Mistake: Misinterpreting TF-IDF. Fix: remember it downweights across-corpus frequent terms.

Self-check prompts

  • Did validation metrics improve when adding bigrams or char n-grams?
  • Is the feature space size stable and memory-safe?
  • Do top weighted n-grams make sense for the task?
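
To answer the last prompt, the learned weights can be inspected directly (a sketch assuming a TfidfVectorizer plus LogisticRegression setup like the one in the practical steps; the toy data is illustrative):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great service", "not good at all", "good price", "awful support"]
labels = [1, 0, 1, 0]  # toy labels

vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

# The largest coefficients push toward the positive class, the smallest toward the negative.
features = vec.get_feature_names_out()
order = np.argsort(clf.coef_[0])
print("most negative:", features[order[:5]])
print("most positive:", features[order[-5:]])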

Exercises

Complete these exercises, then compare with the solutions. Your progress can be saved if you are logged in.

Exercise 1 — Word bigrams with TF-IDF (paper exercise)

Corpus:

D1: "great product and great support"
D2: "not great product"
D3: "great price and support"

Task:

  • Use word n-grams with ngram_range=(1,2).
  • Compute TF-IDF using IDF = log((N+1)/(DF+1)) + 1, N=3.
  • For D2, list TF-IDF weights for: "great", "product", "not", and bigrams "not great", "great product" (round to 3 decimals).

Exercise 2 — Character 3-grams and robustness

Words: "battery", "battary", "batery" (lowercased). Extract char 3-grams for each and compute Jaccard similarity between:

  • battery vs battary
  • battery vs batery

Jaccard(A,B) = |A ∩ B| / |A ∪ B|. List the sets and the two similarity scores (round to 2 decimals).

  • Checklist before you submit:
    • Used the provided IDF formula and rounding.
    • Included both unigrams and bigrams where requested.
    • For char 3-grams, counted only contiguous sequences.

Practical projects

  • Tweet sentiment baseline: compare unigram+bigram TF-IDF vs char 3–5-gram TF-IDF on short texts with emojis.
  • Spam filter for support tickets: combine word unigrams with char 4-grams; evaluate precision/recall on rare spam phrases.
  • Product category classifier: test hashing trick with 2^18 dimensions vs explicit vocabulary; report impact of collisions.

Mini challenge

You must classify app reviews as positive/negative. Data is noisy and short. You try three setups with Logistic Regression:

  1. Word unigrams only, TF-IDF.
  2. Word unigrams+bigrams, TF-IDF.
  3. Character 3–5-grams, binary.

Predict which two will perform best and why. Then justify whether adding both word bigrams and char 3–5-grams could be complementary or redundant in this setting.

Learning path

  • Now: N-grams and character features (this lesson).
  • Next: Feature selection and dimensionality control (min_df/max_df, chi-square selection).
  • Then: Linguistic enrichments (lemmatization, POS tags as features) for classical models.
  • Finally: Compare with neural embeddings; learn when classical features win on speed/interpretability.

Next steps

  • Take the quick test below to lock in the concepts. Anyone can take it; only logged-in users get saved progress.
  • Apply to a small dataset you know. Aim for a strong, simple baseline first.


N-Grams and Character Features — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
