
Bag of Words and TF-IDF

Learn Bag of Words and TF-IDF for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

As an NLP Engineer, you often need a fast, reliable baseline to classify text (spam vs. ham, sentiment, topic routing) or to compute document similarity (search, deduplication). Bag-of-Words (BoW) and TF-IDF are the classic features that power these tasks, especially when data is small-to-medium or when deep models are overkill. They are transparent, quick to train, and easy to debug.

  • Create baseline classifiers for new text problems in minutes.
  • Build interpretable features for compliance-sensitive use cases.
  • Enable search and ranking via cosine similarity over TF-IDF vectors.

Who this is for

Beginners to intermediate practitioners who want strong, interpretable NLP baselines and who need to understand how token-level statistics become features.

Prerequisites

  • Comfort with Python basics (lists, dicts)
  • Familiarity with simple math (counts, logs)
  • Basic understanding of what a token/word is

Concept explained simply

BoW turns each document into a long vector of word counts. The vector has one position per vocabulary term. TF-IDF improves this by down-weighting words that appear in many documents (like "the") and up-weighting words that are distinctive (like "refund").

Mental model

Imagine a big spreadsheet:

  • Columns = words (vocabulary)
  • Rows = documents
  • BoW cell = how many times that word appears in that document
  • TF-IDF cell = how important that word is in that document relative to the entire corpus

Common words shrink; rare-but-relevant words stand out. Then you train a classifier or compute cosine similarity on these vectors.
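
To make the spreadsheet picture concrete, here is a minimal Bag-of-Words counting sketch in plain Python (standard library only); the three-document corpus is the same toy corpus used in the worked example below:

from collections import Counter

# Rows = documents, columns = vocabulary terms, cells = raw counts.
docs = ["love pizza love", "hate pizza", "love pasta"]

tokenized = [doc.lower().split() for doc in docs]         # simple whitespace tokenizer
vocab = sorted({tok for doc in tokenized for tok in doc})

rows = []
for doc in tokenized:
    counts = Counter(doc)
    rows.append([counts.get(term, 0) for term in vocab])

print(vocab)        # ['hate', 'love', 'pasta', 'pizza']
for row in rows:
    print(row)      # first document -> [0, 2, 0, 1]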

Core definitions and formulas

  • Term Frequency (TF): number of times a term t appears in document d. Often raw count; sometimes normalized by document length.
  • Document Frequency (DF): number of documents that contain term t.
  • Inverse Document Frequency (IDF): a common smoothed version is idf(t) = log((N + 1) / (DF(t) + 1)) + 1, where N is the number of documents.
  • TF-IDF: tfidf(t, d) = TF(t, d) * IDF(t), usually followed by L2 normalization per document.
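
A minimal sketch of these definitions, assuming raw counts for TF and the smoothed IDF above:

import math

def idf(df_t, n_docs):
    # Smoothed IDF from the definition above: log((N + 1) / (DF + 1)) + 1
    return math.log((n_docs + 1) / (df_t + 1)) + 1

def tfidf(tf_td, df_t, n_docs):
    # TF-IDF for one term in one document, before any L2 normalization
    return tf_td * idf(df_t, n_docs)

# Example: a term appearing twice in a document and present in 2 of 3 documents
print(round(tfidf(2, 2, 3), 4))   # ≈ 2.5754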

Preprocessing choices that matter

  • Tokenization: words, subwords, or character n-grams.
  • Casing: lowercasing often helps; keep case if it signals meaning (e.g., product codes).
  • Stopwords: removing common words can help, but evaluate on a dev set.
  • N-grams: include bigrams/trigrams for short texts or phrases (e.g., "credit card").
  • Vocabulary pruning: set min_df/max_df or top-k features to control size and noise.
  • Normalization: L2 normalization improves cosine-based similarity and many linear models.
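
Most of these choices map directly onto vectorizer parameters. A sketch using scikit-learn's TfidfVectorizer; the specific values below are illustrative, not recommendations:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    lowercase=True,          # casing
    stop_words="english",    # stopword removal; evaluate on a dev set
    ngram_range=(1, 2),      # unigrams + bigrams
    min_df=2,                # drop terms appearing in fewer than 2 documents
    max_df=0.9,              # drop terms appearing in more than 90% of documents
    max_features=20000,      # cap vocabulary size
    norm="l2",               # per-document L2 normalization
)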

Worked examples

Example 1: Tiny corpus by hand

Corpus (3 docs):

  1. "love pizza love"
  2. "hate pizza"
  3. "love pasta"

Vocabulary: love, pizza, hate, pasta

Step-by-step TF and DF
  • TF(d1): love=2, pizza=1, hate=0, pasta=0
  • TF(d2): love=0, pizza=1, hate=1, pasta=0
  • TF(d3): love=1, pizza=0, hate=0, pasta=1
  • DF: love=2 (d1,d3), pizza=2 (d1,d2), hate=1 (d2), pasta=1 (d3)
  • N=3; IDF(t) = log((3+1)/(DF(t)+1)) + 1, where log is the natural logarithm
  • IDF(love)=log(4/3)+1≈1.2877
  • IDF(pizza)=log(4/3)+1≈1.2877
  • IDF(hate)=log(4/2)+1≈1.6931
  • IDF(pasta)=log(4/2)+1≈1.6931

TF-IDF vectors (before L2 normalization)
  • d1: [love=2*1.2877, pizza=1*1.2877, hate=0, pasta=0] ≈ [2.5754, 1.2877, 0, 0]
  • d2: [0, 1*1.2877, 1*1.6931, 0] ≈ [0, 1.2877, 1.6931, 0]
  • d3: [1*1.2877, 0, 0, 1*1.6931] ≈ [1.2877, 0, 0, 1.6931]

Then apply L2 normalization per document if desired.
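
To check the arithmetic, scikit-learn's TfidfVectorizer with smooth_idf=True uses the same smoothed formula (with the natural logarithm), and norm=None skips the per-document normalization so the raw values can be compared directly:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["love pizza love", "hate pizza", "love pasta"]

vec = TfidfVectorizer(norm=None, smooth_idf=True)
X = vec.fit_transform(corpus)

print(vec.get_feature_names_out())   # ['hate' 'love' 'pasta' 'pizza']
print(vec.idf_.round(4))             # [1.6931 1.2877 1.6931 1.2877]
print(X.toarray().round(4))          # first row: [0, 2.5754, 0, 1.2877]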

Example 2: Why TF-IDF helps

In support tickets, the word "the" appears everywhere and is not predictive, so TF-IDF down-weights it. A word like "refund" or a bigram like "credit card" becomes comparatively strong, helping classifiers focus on signal, not noise.

Example 3: Search with cosine similarity

Represent the query and documents as L2-normalized TF-IDF vectors; cosine similarity is then just their dot product. Documents that share distinctive terms with the query rank higher, enabling simple keyword-based search with good relevance.
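
A small retrieval sketch along these lines; the documents and query here are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

docs = [
    "refund for a duplicate credit card charge",
    "how do I reset my password",
    "credit card payment failed at checkout",
]
query = "credit card refund"

vec = TfidfVectorizer(norm="l2", ngram_range=(1, 2))
D = vec.fit_transform(docs)           # index the documents
q = vec.transform([query])            # same vocabulary and IDF weights for the query

# With L2-normalized vectors, cosine similarity is just the dot product.
scores = linear_kernel(q, D).ravel()
for i in scores.argsort()[::-1]:
    print(round(float(scores[i]), 3), docs[i])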

Practical usage (Python)

CountVectorizer and TfidfVectorizer snippet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "love pizza love",
    "hate pizza",
    "love pasta"
]

# Bag-of-Words counts
cv = CountVectorizer(lowercase=True, ngram_range=(1,1), min_df=1)
X_counts = cv.fit_transform(corpus)
print("BoW shape:", X_counts.shape)
print("Vocabulary:", cv.get_feature_names_out())

# TF-IDF
tfidf = TfidfVectorizer(lowercase=True, ngram_range=(1,1), min_df=1, norm='l2')
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF shape:", X_tfidf.shape)
print("Sample TF-IDF row:", X_tfidf[0].toarray())

Notes:

  • Use fit on train only; transform on validation/test to avoid leakage.
  • Set max_features or min_df to control vocabulary size.
  • Try ngram_range=(1,2) for short texts.
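
A minimal sketch of the fit-on-train, transform-on-test pattern from the first note; the split here is a toy stand-in:

from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["love pizza love", "hate pizza"]
test_texts = ["love pasta"]

tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X_train = tfidf.fit_transform(train_texts)   # learn vocabulary and IDF on train only
X_test = tfidf.transform(test_texts)         # reuse them on held-out data

print(X_train.shape, X_test.shape)           # same number of columns in both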

Common mistakes and self-check

  • Data leakage: fitting vectorizers on full data. Self-check: Did you call fit only on training data?
  • Uncontrolled vocabulary growth: memory blow-ups. Self-check: Did you set max_features/min_df?
  • Ignoring normalization: cosine similarity becomes meaningless. Self-check: Is norm='l2' used for TF-IDF similarity tasks?
  • Over-aggressive stopword removal: losing sentiment or domain terms (e.g., "not"). Self-check: Inspect top features; is "not" missing when it matters?
  • Mismatched preprocessing between train and inference. Self-check: Is the same tokenizer/casing applied everywhere?
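
One way to avoid leakage and train/inference mismatches at the same time is to bundle the vectorizer and the classifier in a scikit-learn Pipeline, so identical preprocessing is applied at fit and predict time. A brief sketch with made-up data and labels:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data for illustration only.
texts = ["refund my order please", "great product, love it",
         "want a refund now", "love the fast delivery"]
labels = [1, 0, 1, 0]   # 1 = refund request, 0 = other (made-up labels)

model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)                    # vectorizer is fit inside the pipeline
print(model.predict(["please refund me"]))  # the same preprocessing is reused here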

Exercises

These exercises mirror the tasks below. You can do them here and then compare with the solutions.

  1. Exercise 1: Compute TF-IDF by hand on a tiny corpus and L2-normalize vectors. Verify you get the same order of importance as the worked example.
  2. Exercise 2: Use a vectorizer to compare unigrams vs. bigrams on short texts. Observe which features become most informative.

Mini challenge

You have 2000 short app reviews. Build two baselines: (a) BoW + Logistic Regression, (b) TF-IDF + Logistic Regression. Use unigrams plus bigrams and cap the vocabulary at 20k. Which setup wins on validation F1, and why? Write 3 bullet points reflecting on the most influential features.
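
One possible scaffold for this comparison; the texts, labels, and split below are toy stand-ins to be replaced with your own reviews and a real train/validation split:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score

train_texts = ["love this app", "crashes constantly", "great design", "terrible update"]
train_labels = [1, 0, 1, 0]
val_texts = ["love the new design", "update crashes a lot"]
val_labels = [1, 0]

def validation_f1(vectorizer):
    model = Pipeline([("vec", vectorizer),
                      ("clf", LogisticRegression(max_iter=1000))])
    model.fit(train_texts, train_labels)
    return f1_score(val_labels, model.predict(val_texts))

bow = CountVectorizer(ngram_range=(1, 2), max_features=20000)
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=20000)
print("BoW F1:   ", validation_f1(bow))
print("TF-IDF F1:", validation_f1(tfidf))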

Practical projects

  • Spam filter baseline: Train a TF-IDF + linear classifier on a labeled email dataset. Report precision/recall and top positive/negative features.
  • FAQ matcher: Index FAQs with TF-IDF, then match user queries via cosine similarity. Add bigrams and compare ranking improvements.
  • News topic router: Fit TF-IDF with min_df and max_df, then train a linear SVM. Inspect misclassifications and adjust preprocessing.

Learning path

  1. Master BoW and TF-IDF basics (this page).
  2. Experiment with n-grams, stopwords, and normalization.
  3. Feature selection: chi-square or mutual information to prune features.
  4. Compare classifiers: Logistic Regression, Linear SVM, Naive Bayes.
  5. Move to embeddings when data and task justify it, keeping BoW/TF-IDF as a strong baseline.

Next steps

  • Complete the exercises and take the quick test below.
  • Apply TF-IDF to a small dataset you already have and inspect top features.
  • Iterate preprocessing (casing, n-grams, min_df) and note validation changes.

Practice Exercises

2 exercises to complete

Instructions

Corpus (3 docs):

  1. "love pizza love"
  2. "hate pizza"
  3. "love pasta"

Tasks:

  • Build the vocabulary and compute TF and DF for each term.
  • Compute smoothed IDF: idf(t) = log((N+1)/(DF(t)+1)) + 1 with N=3.
  • Compute TF-IDF vectors for each doc and then L2-normalize each row.
  • Which term is most distinctive overall?

Expected Output

Correct DF, IDF approximations (love≈1.2877, pizza≈1.2877, hate≈1.6931, pasta≈1.6931), TF-IDF vectors before/after L2 norm, and identification of hate/pasta as more distinctive.

Bag of Words and TF-IDF — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.
