Why this matters
TF-IDF converts text into numeric vectors that highlight words that are distinctive for a document. As a Data Scientist, you will use TF-IDF to:
- Build baseline text classifiers (e.g., spam detection, sentiment, topic tags).
- Rank search results by relevance to a query.
- Detect near-duplicate content using cosine similarity of TF-IDF vectors.
- Create lightweight features for clustering documents.
TF-IDF is simple, fast, and often competitive with more complex methods on small/medium datasets.
Concept explained simply
Idea: A word is important in a document if it occurs often in that document (TF) but not in many documents overall (IDF).
- TF (term frequency): how often a term appears in a document.
- IDF (inverse document frequency): how rare the term is across the corpus.
- TF-IDF = TF * IDF.
Mental model
Imagine a spotlight. TF turns up the brightness on terms repeated within a document. IDF dims terms that are common everywhere (like "the") and brightens rare but informative terms (like "mitochondria"). The final TF-IDF vector is what your model sees.
Common formula variants you should know
- Raw TF: tf(t, d) = count of t in d
- Binary TF: tf(t, d) = 1 if present else 0
- Log TF: tf(t, d) = 1 + log(count) when count > 0
- IDF (smoothed): idf(t) = ln((1 + N) / (1 + df(t))) + 1, where N is number of documents and df(t) is document frequency
- Normalization: L2-normalize each document vector so cosine similarity is meaningful
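To make the variants concrete, here is a minimal sketch of the TF and smoothed-IDF formulas in plain Python. The function names are illustrative, not taken from any particular library:

```python
import math
from collections import Counter

def tf_raw(term, doc_tokens):
    # Raw TF: count of the term in the document.
    return Counter(doc_tokens)[term]

def tf_log(term, doc_tokens):
    # Sublinear (log) TF: 1 + log(count) when the count is positive, else 0.
    count = Counter(doc_tokens)[term]
    return 1 + math.log(count) if count > 0 else 0.0

def idf_smoothed(term, corpus_tokens):
    # Smoothed IDF: ln((1 + N) / (1 + df(t))) + 1.
    n = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log((1 + n) / (1 + df)) + 1
```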
What decisions you make when using TF-IDF
- Tokenization: lowercasing, punctuation handling, simple whitespace split vs. smarter tokenizers.
- Stopwords: remove very common words explicitly, or set max_df so terms that appear in most documents are dropped.
- N-grams: unigrams (single words) vs. bigrams/trigrams to capture phrases.
- Vocabulary curation: min_df (ignore too-rare terms), max_df (ignore too-common terms), max_features (cap feature count).
- TF choice: raw counts vs. binary vs. log-scaled (sublinear).
- Normalization: typically L2 per document.
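If you use scikit-learn, most of these decisions map directly onto TfidfVectorizer parameters. The values below are illustrative starting points, not recommendations:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    lowercase=True,        # tokenization: lowercase before splitting
    stop_words="english",  # stopwords: or leave as None and rely on max_df
    ngram_range=(1, 2),    # n-grams: unigrams and bigrams
    min_df=2,              # ignore terms in fewer than 2 documents
    max_df=0.9,            # ignore terms in more than 90% of documents
    max_features=50_000,   # cap the vocabulary size
    sublinear_tf=True,     # TF choice: 1 + log(count) instead of raw counts
    norm="l2",             # normalization: unit-length document vectors
)
# X = vectorizer.fit_transform(list_of_training_documents)
```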
Worked examples
Example 1: Manual TF-IDF on a tiny corpus
Corpus (N=3):
- D1: "the cat sat on the mat"
- D2: "the dog sat on the log"
- D3: "the cat chased the dog"
Use smoothed IDF: idf(t) = ln((1+N)/(1+df(t))) + 1 with N=3.
- df values: the=3, cat=2, sat=2, on=2, mat=1, dog=2, log=1, chased=1
- idf: the=1.0000, cat=1.2877, sat=1.2877, on=1.2877, mat=1.6931, dog=1.2877, log=1.6931, chased=1.6931 (rounded)
For D2 "the dog sat on the log": raw TF = the:2, dog:1, sat:1, on:1, log:1.
Unnormalized TF-IDF (tf * idf):
- the: 2.0000
- dog: 1.2877
- sat: 1.2877
- on: 1.2877
- log: 1.6931
L2 norm = sqrt(2.0000^2 + 3 × 1.2877^2 + 1.6931^2) ≈ 3.4411.
L2-normalized weights:
- the: 0.5812
- dog: 0.3742
- sat: 0.3742
- on: 0.3742
- log: 0.4920
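You can reproduce these numbers with scikit-learn, whose TfidfVectorizer defaults (raw term counts, smoothed IDF, L2 normalization) match the scheme used above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Defaults: raw TF, smoothed IDF, L2 normalization -- the same scheme as the hand calculation.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Show the weights for D2 ("the dog sat on the log").
terms = vectorizer.get_feature_names_out()
d2 = X[1].toarray().ravel()
for term, weight in sorted(zip(terms, d2), key=lambda pair: -pair[1]):
    if weight > 0:
        print(f"{term}: {weight:.4f}")
# Should print (approximately): the 0.5812, log 0.4920, then dog/sat/on at 0.3742 each.
```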
Example 2: Why n-grams can help
Text: "new york is big". Unigrams miss the phrase. With bigrams, the feature "new york" gets its own weight, capturing the entity rather than two separate words.
- Unigrams: new, york, is, big
- Bigrams: new york, york is, is big
- TF-IDF can assign a strong weight to "new york" if it's informative across the corpus.
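A small sketch of the same text vectorized with and without bigrams, using scikit-learn's ngram_range parameter:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["new york is big"]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(text)
print(unigrams.get_feature_names_out())
# ['big' 'is' 'new' 'york']

uni_and_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(text)
print(uni_and_bi.get_feature_names_out())
# ['big' 'is' 'is big' 'new' 'new york' 'york' 'york is']
```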
Example 3: Using cosine similarity for relevance
Query: "red apple"
Docs:
- D1: "green apple salad"
- D2: "red apple pie"
- D3: "ripe banana"
After TF-IDF + L2, the cosine similarity between the query vector and D2 will be highest, because of the shared informative terms "red" and "apple". This is a simple baseline for search ranking.
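A minimal sketch of this ranking, assuming scikit-learn is available:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["green apple salad", "red apple pie", "ripe banana"]
query = "red apple"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # fit vocabulary and IDF on the documents
query_vector = vectorizer.transform([query])   # reuse the same vocabulary and IDF

scores = cosine_similarity(query_vector, doc_vectors).ravel()
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {docs[i]}")
# "red apple pie" should come out on top.
```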
Step-by-step: implementing TF-IDF on a small dataset
- Collect a small corpus of documents and lowercase them.
- Tokenize: split on whitespace and strip punctuation.
- Filter: optionally remove stopwords; set min_df and max_df.
- Build vocabulary (unigrams and optionally bigrams).
- Count term frequencies per document (raw counts or log-scaled).
- Compute IDF with smoothing: idf(t) = ln((1+N)/(1+df(t))) + 1.
- Compute TF-IDF: multiply TF by IDF.
- Normalize each document vector (L2).
- Evaluate with a simple model (logistic regression/SVM) or use cosine similarity for search-like tasks.
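The steps above, condensed into a dependency-free sketch; the tokenizer and filtering thresholds are deliberately simple placeholders:

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on runs of non-letters (a deliberately simple tokenizer).
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]

def fit_tfidf(raw_docs, min_df=1, max_df=1.0):
    docs = [tokenize(d) for d in raw_docs]
    n = len(docs)

    # Document frequency per term, then vocabulary filtering by min_df / max_df.
    df = Counter(t for doc in docs for t in set(doc))
    vocab = {t for t, f in df.items() if f >= min_df and f / n <= max_df}

    # Smoothed IDF for the surviving vocabulary.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

    # TF-IDF with raw counts, then L2 normalization per document.
    vectors = []
    for doc in docs:
        counts = Counter(t for t in doc if t in vocab)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors, idf

# Example: vectors, idf = fit_tfidf(["the cat sat on the mat", "the dog sat on the log"])
```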
Mini tasks
- Toggle stopword removal and note how "the" weight changes.
- Try unigrams vs. unigrams+bigrams and compare validation accuracy.
- Cap max_features (e.g., 5,000) and check speed vs. performance.
Exercises (you can do these offline)
These mirror the exercises below. Do them first, then open the solutions.
- Exercise 1: Compute smoothed IDF and TF-IDF (with and without L2 normalization) for a tiny corpus by hand.
- Exercise 2: Apply min_df, max_df, and 1–2 grams to decide which features remain; compute normalized TF-IDF for one document.
- Checklist to self-verify:
- You used smoothed IDF and showed df counts.
- You reported both raw and L2-normalized vectors where asked.
- You clearly listed the surviving vocabulary for Exercise 2.
Common mistakes and how to self-check
- Forgetting normalization: Without L2, cosine similarity and linear models can be skewed by document length. Self-check: compute norms; they should be 1.0 after normalization.
- Dropping useful rare phrases by mistake: An aggressive min_df can remove signal. Self-check: inspect top features per class and ensure domain terms remain.
- Leaking test data into IDF: Fit the vocabulary and IDF on the training split only (see the sketch after this list). Self-check: confirm that the vectorizer is fit only on the train split.
- Overusing stopword lists: Some common words are informative in certain domains. Self-check: compare accuracy with max_df vs. stopwords removal.
- Ignoring n-grams where phrases matter: If your domain relies on phrases (e.g., named entities), include bigrams. Self-check: run A/B with bigrams and evaluate.
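A minimal sketch of the correct fit/transform split, using scikit-learn's TfidfVectorizer on a placeholder corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus; substitute your own documents.
all_texts = ["free prize now", "meeting at noon", "win a free phone", "lunch tomorrow?"]

train_texts, test_texts = train_test_split(all_texts, test_size=0.25, random_state=0)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # vocabulary and IDF learned from train only
X_test = vectorizer.transform(test_texts)        # test docs are transformed, never fitted
```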
Practical projects
- Spam vs. ham classifier: TF-IDF with 1–2 grams + logistic regression. Report F1 and the most informative features (a starter sketch follows this list).
- Simple search engine: Build TF-IDF vectors for a set of articles; rank articles by cosine similarity to user queries.
- News clustering: TF-IDF + KMeans. Inspect cluster keywords to label clusters.
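For the spam vs. ham project, here is a hedged starter sketch; the four-document dataset, labels, and parameter values are placeholders to show the shape of the pipeline, not a real benchmark:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset; replace with a real spam/ham corpus.
texts = ["win a free prize now", "free money, click here", "lunch at noon?", "see you at the meeting"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Most informative n-grams: largest positive (spam-leaning) and negative (ham-leaning) coefficients.
vectorizer = model.named_steps["tfidfvectorizer"]
classifier = model.named_steps["logisticregression"]
terms = vectorizer.get_feature_names_out()
weights = classifier.coef_.ravel()
print("spam-leaning:", sorted(zip(weights, terms), reverse=True)[:5])
print("ham-leaning: ", sorted(zip(weights, terms))[:5])
```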
Who this is for
- Beginner to intermediate Data Scientists who need a strong, fast text baseline.
- Engineers moving into ML who want interpretable text features.
Prerequisites
- Basic Python or similar (for implementation).
- Understanding of vectors, dot product, and cosine similarity.
- Familiarity with train/validation/test splits.
Learning path
- Start here: TF-IDF basics and hands-on practice.
- Then: Regularization and linear models for text (logistic regression/SVM).
- Next: More advanced text features (character n-grams, hashing trick).
- Later: Neural approaches (embeddings, transformers) for complex tasks.
Next steps
- Try your TF-IDF pipeline on a real dataset (reviews, tickets, emails).
- Tune min_df, max_df, n-grams, and sublinear TF; keep notes on what moves metrics.
- Take the Quick Test at the end to check your understanding.
Mini challenge
Given short product reviews, create a TF-IDF model with 1–2 grams and L2 normalization, train a logistic regression classifier for positive/negative sentiment, and report:
- Validation accuracy and F1.
- Top 10 positive and negative n-grams by weight.
- One change that improved performance the most (and why).