Why this matters
As an NLP Engineer, you often need a fast, reliable baseline to classify text (spam vs. ham, sentiment, topic routing) or to compute document similarity (search, deduplication). Bag-of-Words (BoW) and TF-IDF are the classic features that power these tasks, especially when data is small-to-medium or when deep models are overkill. They are transparent, quick to train, and easy to debug.
- Create baseline classifiers for new text problems in minutes.
- Build interpretable features for compliance-sensitive use cases.
- Enable search and ranking via cosine similarity over TF-IDF vectors.
Who this is for
Beginners to intermediate practitioners who want strong, interpretable NLP baselines and who need to understand how token-level statistics become features.
Prerequisites
- Comfort with Python basics (lists, dicts)
- Familiarity with simple math (counts, logs)
- Basic understanding of what a token/word is
Concept explained simply
BoW turns each document into a long vector of word counts. The vector has one position per vocabulary term. TF-IDF improves this by down-weighting words that appear in many documents (like "the") and up-weighting words that are distinctive (like "refund").
Mental model
Imagine a big spreadsheet:
- Columns = words (vocabulary)
- Rows = documents
- BoW cell = how many times that word appears in that document
- TF-IDF cell = how important that word is in that document relative to the entire corpus
Common words shrink; rare-but-relevant words stand out. Then you train a classifier or compute cosine similarity on these vectors.
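A minimal sketch of this spreadsheet view, using scikit-learn's CountVectorizer plus pandas for display (the three toy documents are the same ones used in the worked example below):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
docs = ["love pizza love", "hate pizza", "love pasta"]
# Rows = documents, columns = vocabulary terms, cells = raw counts (BoW).
cv = CountVectorizer()
counts = cv.fit_transform(docs)
table = pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out(), index=["d1", "d2", "d3"])
print(table)
#     hate  love  pasta  pizza
# d1     0     2      0      1
# d2     1     0      0      1
# d3     0     1      1      0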
Core definitions and formulas
- Term Frequency (TF): number of times a term t appears in document d. Often raw count; sometimes normalized by document length.
- Document Frequency (DF): number of documents that contain term t.
- Inverse Document Frequency (IDF): a common smoothed version is idf(t) = log((N + 1) / (DF(t) + 1)) + 1, where N is the number of documents.
- TF-IDF: tfidf(t, d) = TF(t, d) * IDF(t), usually followed by L2 normalization per document. A short Python sketch of these formulas follows this list.
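A minimal sketch of these formulas in plain Python, using the smoothed IDF and natural logarithm defined above:
import math
def smoothed_idf(df, n_docs):
    # idf(t) = log((N + 1) / (DF(t) + 1)) + 1, with the natural log
    return math.log((n_docs + 1) / (df + 1)) + 1
def tfidf(tf, df, n_docs):
    # raw count times smoothed IDF; L2 normalization is applied per document afterwards if desired
    return tf * smoothed_idf(df, n_docs)
# Example: a term appearing twice in one document and present in 2 of 3 documents
print(round(tfidf(tf=2, df=2, n_docs=3), 4))  # 2.5754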
Preprocessing choices that matter
- Tokenization: words, subwords, or character n-grams.
- Casing: lowercasing often helps; keep case if it signals meaning (e.g., product codes).
- Stopwords: removing common words can help, but evaluate on a dev set.
- N-grams: include bigrams/trigrams for short texts or phrases (e.g., "credit card").
- Vocabulary pruning: set min_df/max_df or top-k features to control size and noise.
- Normalization: L2 normalization improves cosine-based similarity and many linear models. The sketch after this list shows how each choice maps to a vectorizer parameter.
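A minimal configuration sketch with scikit-learn's TfidfVectorizer; the specific values are illustrative starting points, not universal recommendations:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    lowercase=True,        # casing: fold to lowercase
    stop_words="english",  # stopwords: built-in English list; evaluate on a dev set
    ngram_range=(1, 2),    # n-grams: unigrams + bigrams (use analyzer="char_wb" for character n-grams)
    min_df=2,              # pruning: drop terms appearing in fewer than 2 documents
    max_df=0.9,            # pruning: drop terms appearing in more than 90% of documents
    max_features=20_000,   # cap the vocabulary at the 20k most frequent terms
    norm="l2",             # normalization: unit-length document vectors
)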
Worked examples
Example 1: Tiny corpus by hand
Corpus (3 docs):
- "love pizza love"
- "hate pizza"
- "love pasta"
Vocabulary: love, pizza, hate, pasta
Step-by-step TF and DF
- TF(d1): love=2, pizza=1, hate=0, pasta=0
- TF(d2): love=0, pizza=1, hate=1, pasta=0
- TF(d3): love=1, pizza=0, hate=0, pasta=1
- DF: love=2 (d1,d3), pizza=2 (d1,d2), hate=1 (d2), pasta=1 (d3)
- N=3; IDF(t)=log((3+1)/(DF+1))+1, using the natural logarithm
- IDF(love)=log(4/3)+1≈1.2877
- IDF(pizza)=log(4/3)+1≈1.2877
- IDF(hate)=log(4/2)+1≈1.6931
- IDF(pasta)=log(4/2)+1≈1.6931
TF-IDF vectors (before L2-normalization)
- d1: [love=2*1.2877, pizza=1*1.2877, hate=0, pasta=0] ≈ [2.5754, 1.2877, 0, 0]
- d2: [0, 1*1.2877, 1*1.6931, 0] ≈ [0, 1.2877, 1.6931, 0]
- d3: [1*1.2877, 0, 0, 1*1.6931] ≈ [1.2877, 0, 0, 1.6931]
Then apply L2 normalization per document if desired.
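As a sanity check, the same numbers can be reproduced with scikit-learn, whose default smooth_idf matches the IDF formula above; norm=None skips the L2 step so the raw products stay visible. Note that scikit-learn orders the vocabulary alphabetically:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["love pizza love", "hate pizza", "love pasta"]
vec = TfidfVectorizer(norm=None, smooth_idf=True)
X = vec.fit_transform(corpus)
print(vec.get_feature_names_out())  # ['hate' 'love' 'pasta' 'pizza']
print(vec.idf_.round(4))            # [1.6931 1.2877 1.6931 1.2877]
print(X.toarray().round(4))
# [[0.     2.5754 0.     1.2877]
#  [1.6931 0.     0.     1.2877]
#  [0.     1.2877 1.6931 0.    ]]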
Example 2: Why TF-IDF helps
In support tickets, the word "the" appears everywhere and is not predictive, so TF-IDF down-weights it. Distinctive terms such as "refund" or the bigram "credit card" become comparatively strong, helping classifiers focus on signal rather than noise.
Example 3: Cosine similarity for search
Represent the query and documents as L2-normalized TF-IDF vectors. Cosine similarity is then the dot product. Documents with overlapping distinctive terms rank higher, enabling simple keyword-based search with good relevance.
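A minimal sketch of this pattern; the documents and query here are made-up examples:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
docs = [
    "refund for a duplicate credit card charge",
    "how do I reset my password",
    "credit card was charged twice, please refund",
]
query = ["credit card refund"]
vec = TfidfVectorizer(ngram_range=(1, 2))  # bigrams help match phrases like "credit card"
doc_vectors = vec.fit_transform(docs)      # L2-normalized by default (norm='l2')
query_vector = vec.transform(query)
# With L2-normalized vectors, the dot product equals cosine similarity.
scores = linear_kernel(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")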
Practical usage (Python)
CountVectorizer and TfidfVectorizer snippet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = [
    "love pizza love",
    "hate pizza",
    "love pasta",
]
# Bag-of-Words counts
cv = CountVectorizer(lowercase=True, ngram_range=(1,1), min_df=1)
X_counts = cv.fit_transform(corpus)
print("BoW shape:", X_counts.shape)
print("Vocabulary:", cv.get_feature_names_out())
# TF-IDF
tfidf = TfidfVectorizer(lowercase=True, ngram_range=(1,1), min_df=1, norm='l2')
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF shape:", X_tfidf.shape)
print("Sample TF-IDF row:", X_tfidf[0].toarray())
Notes:
- Use fit on train only; transform on validation/test to avoid leakage.
- Set max_features or min_df to control vocabulary size.
- Try ngram_range=(1,2) for short texts.
Common mistakes and self-check
- Data leakage: fitting vectorizers on full data. Self-check: Did you call fit only on training data? (See the Pipeline sketch after this list.)
- Uncontrolled vocabulary growth: memory blow-ups. Self-check: Did you set max_features/min_df?
- Ignoring normalization: cosine similarity becomes meaningless. Self-check: Is norm='l2' used for TF-IDF similarity tasks?
- Over-aggressive stopword removal: losing sentiment or domain terms (e.g., "not"). Self-check: Inspect top features; is "not" missing when it matters?
- Mismatched preprocessing between train and inference. Self-check: Is the same tokenizer/casing applied everywhere?
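A minimal sketch of the leakage-free pattern, using a Pipeline so the vectorizer only ever sees training text during fit; the tiny dataset here is a placeholder:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
# Placeholder data; substitute your own labeled texts.
train_texts = ["free prize now", "meeting at noon", "win cash fast", "lunch tomorrow"]
train_labels = [1, 0, 1, 0]
test_texts = ["claim your cash prize", "see you at lunch"]
test_labels = [1, 0]
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)  # vocabulary and IDF statistics come from training data only
print("Held-out accuracy:", model.score(test_texts, test_labels))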
Exercises
These exercises mirror the tasks below. You can do them here and then compare with the solutions.
- Exercise 1: Compute TF-IDF by hand on a tiny corpus and L2-normalize vectors. Verify you get the same order of importance as the worked example.
- Exercise 2: Use a vectorizer to compare unigrams vs. bigrams on short texts. Observe which features become most informative.
Mini challenge
You have 2000 short app reviews. Build two baselines: (a) BoW + Logistic Regression, (b) TF-IDF + Logistic Regression. Use unigrams plus bigrams and cap the vocabulary at 20k. Which setup wins on validation F1, and why? Write 3 bullet points reflecting on the most influential features.
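One way to set up the two baselines, sketched with cross-validated F1; the toy reviews below are placeholders for your 2000 labeled reviews:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
# Toy stand-in data; replace with your 2000 reviews and 0/1 sentiment labels.
reviews = ["love this app", "crashes constantly", "great design", "worst update ever"] * 50
labels = [1, 0, 1, 0] * 50
baselines = {
    "BoW + LR": make_pipeline(CountVectorizer(ngram_range=(1, 2), max_features=20_000), LogisticRegression(max_iter=1000)),
    "TF-IDF + LR": make_pipeline(TfidfVectorizer(ngram_range=(1, 2), max_features=20_000), LogisticRegression(max_iter=1000)),
}
for name, model in baselines.items():
    scores = cross_val_score(model, reviews, labels, cv=5, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")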
Practical projects
- Spam filter baseline: Train a TF-IDF + linear classifier on a labeled email dataset. Report precision/recall and top positive/negative features.
- FAQ matcher: Index FAQs with TF-IDF, then match user queries via cosine similarity. Add bigrams and compare ranking improvements.
- News topic router: Fit TF-IDF with min_df and max_df, then train a linear SVM. Inspect misclassifications and adjust preprocessing.
Learning path
- Master BoW and TF-IDF basics (this page).
- Experiment with n-grams, stopwords, and normalization.
- Feature selection: chi-square or mutual information to prune features.
- Compare classifiers: Logistic Regression, Linear SVM, Naive Bayes.
- Move to embeddings when data and task justify it, keeping BoW/TF-IDF as a strong baseline.
Next steps
- Complete the exercises and take the quick test below.
- Apply TF-IDF to a small dataset you already have and inspect top features.
- Iterate preprocessing (casing, n-grams, min_df) and note validation changes.