Why this matters
As an NLP Engineer, you often need a fast, reliable baseline to classify text (spam vs. ham, sentiment, topic routing) or to compute document similarity (search, deduplication). Bag-of-Words (BoW) and TF-IDF are the classic features that power these tasks, especially when data is small-to-medium or when deep models are overkill. They are transparent, quick to train, and easy to debug.
- Create baseline classifiers for new text problems in minutes.
- Build interpretable features for compliance-sensitive use cases.
- Enable search and ranking via cosine similarity over TF-IDF vectors.
Who this is for
Beginners to intermediate practitioners who want strong, interpretable NLP baselines and who need to understand how token-level statistics become features.
Prerequisites
- Comfort with Python basics (lists, dicts)
- Familiarity with simple math (counts, logs)
- Basic understanding of what a token/word is
Concept explained simply
BoW turns each document into a long vector of word counts. The vector has one position per vocabulary term. TF-IDF improves this by down-weighting words that appear in many documents (like "the") and up-weighting words that are distinctive (like "refund").
Mental model
Imagine a big spreadsheet:
- Columns = words (vocabulary)
- Rows = documents
- BoW cell = how many times that word appears in that document
- TF-IDF cell = how important that word is in that document relative to the entire corpus
Common words shrink; rare-but-relevant words stand out. Then you train a classifier or compute cosine similarity on these vectors.
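A minimal sketch of this spreadsheet view, using scikit-learn's CountVectorizer plus pandas for display (the three toy documents are the same ones used in the worked example below):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
docs = ["love pizza love", "hate pizza", "love pasta"]
# Rows = documents, columns = vocabulary terms, cells = raw counts (BoW).
cv = CountVectorizer()
counts = cv.fit_transform(docs)
table = pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out(), index=["d1", "d2", "d3"])
print(table)
#     hate  love  pasta  pizza
# d1     0     2      0      1
# d2     1     0      0      1
# d3     0     1      1      0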
Core definitions and formulas
- Term Frequency (TF): number of times a term t appears in document d. Often raw count; sometimes normalized by document length.
- Document Frequency (DF): number of documents that contain term t.
- Inverse Document Frequency (IDF): a common smoothed version is idf(t) = log((N + 1) / (DF(t) + 1)) + 1, where N is the number of documents.
- TF-IDF: tfidf(t, d) = TF(t, d) * IDF(t), usually followed by L2 normalization per document. A short Python sketch of these formulas follows this list.
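A minimal sketch of these formulas in plain Python, using the smoothed IDF and natural logarithm defined above:
import math
def smoothed_idf(df, n_docs):
    # idf(t) = log((N + 1) / (DF(t) + 1)) + 1, with the natural log
    return math.log((n_docs + 1) / (df + 1)) + 1
def tfidf(tf, df, n_docs):
    # raw count times smoothed IDF; L2 normalization is applied per document afterwards if desired
    return tf * smoothed_idf(df, n_docs)
# Example: a term appearing twice in one document and present in 2 of 3 documents
print(round(tfidf(tf=2, df=2, n_docs=3), 4))  # 2.5754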
Preprocessing choices that matter
- Tokenization: words, subwords, or character n-grams.
- Casing: lowercasing often helps; keep case if it signals meaning (e.g., product codes).
- Stopwords: removing common words can help, but evaluate on a dev set.
- N-grams: include bigrams/trigrams for short texts or phrases (e.g., "credit card").
- Vocabulary pruning: set min_df/max_df or top-k features to control size and noise.
- Normalization: L2 normalization improves cosine-based similarity and many linear models. The sketch after this list shows how each choice maps to a vectorizer parameter.
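A minimal configuration sketch with scikit-learn's TfidfVectorizer; the specific values are illustrative starting points, not universal recommendations:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    lowercase=True,        # casing: fold to lowercase
    stop_words="english",  # stopwords: built-in English list; evaluate on a dev set
    ngram_range=(1, 2),    # n-grams: unigrams + bigrams (use analyzer="char_wb" for character n-grams)
    min_df=2,              # pruning: drop terms appearing in fewer than 2 documents
    max_df=0.9,            # pruning: drop terms appearing in more than 90% of documents
    max_features=20_000,   # cap the vocabulary at the 20k most frequent terms
    norm="l2",             # normalization: unit-length document vectors
)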
Worked examples
Example 1: Tiny corpus by hand
Corpus (3 docs):
- "love pizza love"
- "hate pizza"
- "love pasta"
Vocabulary: love, pizza, hate, pasta
Step-by-step TF and DF
- TF(d1): love=2, pizza=1, hate=0, pasta=0
- TF(d2): love=0, pizza=1, hate=1, pasta=0
- TF(d3): love=1, pizza=0, hate=0, pasta=1
- DF: love=2 (d1,d3), pizza=2 (d1,d2), hate=1 (d2), pasta=1 (d3)
- N=3; IDF(t)=log((3+1)/(DF+1))+1, using the natural logarithm
- IDF(love)=log(4/3)+1≈1.2877
- IDF(pizza)=log(4/3)+1≈1.2877
- IDF(hate)=log(4/2)+1≈1.6931
- IDF(pasta)=log(4/2)+1≈1.6931
TF-IDF vectors (before L2-normalization)
- d1: [love=2*1.2877, pizza=1*1.2877, hate=0, pasta=0] ≈ [2.5754, 1.2877, 0, 0]
- d2: [0, 1*1.2877, 1*1.6931, 0] ≈ [0, 1.2877, 1.6931, 0]
- d3: [1*1.2877, 0, 0, 1*1.6931] ≈ [1.2877, 0, 0, 1.6931]
Then apply L2 normalization per document if desired.
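As a sanity check, the same numbers can be reproduced with scikit-learn, whose default smooth_idf matches the IDF formula above; norm=None skips the L2 step so the raw products stay visible. Note that scikit-learn orders the vocabulary alphabetically:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["love pizza love", "hate pizza", "love pasta"]
vec = TfidfVectorizer(norm=None, smooth_idf=True)
X = vec.fit_transform(corpus)
print(vec.get_feature_names_out())  # ['hate' 'love' 'pasta' 'pizza']
print(vec.idf_.round(4))            # [1.6931 1.2877 1.6931 1.2877]
print(X.toarray().round(4))
# [[0.     2.5754 0.     1.2877]
#  [1.6931 0.     0.     1.2877]
#  [0.     1.2877 1.6931 0.    ]]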
Example 2: Why TF-IDF helps
In support tickets, the word "the" appears everywhere and is not predictive, so TF-IDF down-weights it. Distinctive terms such as "refund" or the bigram "credit card" become comparatively strong, helping classifiers focus on signal rather than noise.
Example 3: Cosine similarity for search
Represent the query and documents as L2-normalized TF-IDF vectors. Cosine similarity is then the dot product. Documents with overlapping distinctive terms rank higher, enabling simple keyword-based search with good relevance.
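A minimal sketch of this pattern; the documents and query here are made-up examples:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
docs = [
    "refund for a duplicate credit card charge",
    "how do I reset my password",
    "credit card was charged twice, please refund",
]
query = ["credit card refund"]
vec = TfidfVectorizer(ngram_range=(1, 2))  # bigrams help match phrases like "credit card"
doc_vectors = vec.fit_transform(docs)      # L2-normalized by default (norm='l2')
query_vector = vec.transform(query)
# With L2-normalized vectors, the dot product equals cosine similarity.
scores = linear_kernel(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")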
Practical usage (Python)
CountVectorizer and TfidfVectorizer snippet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = [
    "love pizza love",
    "hate pizza",
    "love pasta",
]
# Bag-of-Words counts
cv = CountVectorizer(lowercase=True, ngram_range=(1,1), min_df=1)
X_counts = cv.fit_transform(corpus)
print("BoW shape:", X_counts.shape)
print("Vocabulary:", cv.get_feature_names_out())
# TF-IDF
tfidf = TfidfVectorizer(lowercase=True, ngram_range=(1,1), min_df=1, norm='l2')
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF shape:", X_tfidf.shape)
print("Sample TF-IDF row:", X_tfidf[0].toarray())
Notes:
- Use fit on train only; transform on validation/test to avoid leakage.
- Set max_features or min_df to control vocabulary size.
- Try ngram_range=(1,2) for short texts.
Common mistakes and self-check
- Data leakage: fitting vectorizers on full data. Self-check: Did you call fit only on training data? (See the Pipeline sketch after this list.)
- Uncontrolled vocabulary growth: memory blow-ups. Self-check: Did you set max_features/min_df?
- Ignoring normalization: cosine similarity becomes meaningless. Self-check: Is norm='l2' used for TF-IDF similarity tasks?
- Over-aggressive stopword removal: losing sentiment or domain terms (e.g., "not"). Self-check: Inspect top features; is "not" missing when it matters?
- Mismatched preprocessing between train and inference. Self-check: Is the same tokenizer/casing applied everywhere?
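A minimal sketch of the leakage-free pattern, using a Pipeline so the vectorizer only ever sees training text during fit; the tiny dataset here is a placeholder:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
# Placeholder data; substitute your own labeled texts.
train_texts = ["free prize now", "meeting at noon", "win cash fast", "lunch tomorrow"]
train_labels = [1, 0, 1, 0]
test_texts = ["claim your cash prize", "see you at lunch"]
test_labels = [1, 0]
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)  # vocabulary and IDF statistics come from training data only
print("Held-out accuracy:", model.score(test_texts, test_labels))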
Exercises
These exercises mirror the tasks below. You can do them here and then compare with the solutions.
- Exercise 1: Compute TF-IDF by hand on a tiny corpus and L2-normalize vectors. Verify you get the same order of importance as the worked example.
- Exercise 2: Use a vectorizer to compare unigrams vs. bigrams on short texts. Observe which features become most informative.
Mini challenge
You have 2000 short app reviews. Build two baselines: (a) BoW + Logistic Regression, (b) TF-IDF + Logistic Regression. Use unigrams plus bigrams and cap the vocabulary at 20k. Which setup wins on validation F1, and why? Write 3 bullet points reflecting on the most influential features.
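One way to set up the two baselines, sketched with cross-validated F1; the toy reviews below are placeholders for your 2000 labeled reviews:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
# Toy stand-in data; replace with your 2000 reviews and 0/1 sentiment labels.
reviews = ["love this app", "crashes constantly", "great design", "worst update ever"] * 50
labels = [1, 0, 1, 0] * 50
baselines = {
    "BoW + LR": make_pipeline(CountVectorizer(ngram_range=(1, 2), max_features=20_000), LogisticRegression(max_iter=1000)),
    "TF-IDF + LR": make_pipeline(TfidfVectorizer(ngram_range=(1, 2), max_features=20_000), LogisticRegression(max_iter=1000)),
}
for name, model in baselines.items():
    scores = cross_val_score(model, reviews, labels, cv=5, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")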
Practical projects
- Spam filter baseline: Train a TF-IDF + linear classifier on a labeled email dataset. Report precision/recall and top positive/negative features.
- FAQ matcher: Index FAQs with TF-IDF, then match user queries via cosine similarity. Add bigrams and compare ranking improvements.
- News topic router: Fit TF-IDF with min_df and max_df, then train a linear SVM. Inspect misclassifications and adjust preprocessing.
Learning path
- Master BoW and TF-IDF basics (this page).
- Experiment with n-grams, stopwords, and normalization.
- Feature selection: chi-square or mutual information to prune features.
- Compare classifiers: Logistic Regression, Linear SVM, Naive Bayes.
- Move to embeddings when data and task justify it, keeping BoW/TF-IDF as a strong baseline.
Next steps
- Complete the exercises and take the quick test below.
- Apply TF-IDF to a small dataset you already have and inspect top features.
- Iterate preprocessing (casing, n-grams, min_df) and note validation changes.