
Text Features: TF-IDF Basics

Learn TF-IDF text features for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

TF-IDF converts text into numeric vectors that highlight words that are distinctive for a document. As a Data Scientist, you will use TF-IDF to:

  • Build baseline text classifiers (e.g., spam detection, sentiment, topic tags).
  • Rank search results by relevance to a query.
  • Detect near-duplicate content using cosine similarity of TF-IDF vectors.
  • Create lightweight features for clustering documents.

TF-IDF is simple, fast, and often competitive with more complex methods on small/medium datasets.

Concept explained simply

Idea: A word is important in a document if it occurs often in that document (TF) but not in many documents overall (IDF).

TF (term frequency): how often a term appears in a document.
IDF (inverse document frequency): how rare the term is across the corpus.
TF-IDF = TF * IDF

Mental model

Imagine a spotlight. TF turns up the brightness on terms repeated within a document. IDF dims terms that are common everywhere (like "the") and brightens rare but informative terms (like "mitochondria"). The final TF-IDF vector is what your model sees.

Common formula variants you should know
  • Raw TF: tf(t, d) = count of t in d
  • Binary TF: tf(t, d) = 1 if present else 0
  • Log TF: tf(t, d) = 1 + log(count) when count > 0
  • IDF (smoothed): idf(t) = ln((1 + N) / (1 + df(t))) + 1, where N is number of documents and df(t) is document frequency
  • Normalization: L2-normalize each document vector so cosine similarity is meaningful
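
If you want to see these variants side by side, here is a minimal plain-Python sketch; the function names are illustrative, not from any particular library.

    import math

    def tf_raw(count):
        # Raw TF: the count itself
        return count

    def tf_binary(count):
        # Binary TF: 1 if the term is present, else 0
        return 1 if count > 0 else 0

    def tf_log(count):
        # Log (sublinear) TF: dampens very frequent terms
        return 1 + math.log(count) if count > 0 else 0

    def idf_smoothed(df, n_docs):
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
        return math.log((1 + n_docs) / (1 + df)) + 1

    # A term appearing 3 times in one document, present in 2 of 10 documents
    print(tf_raw(3) * idf_smoothed(2, 10))   # raw TF-IDF
    print(tf_log(3) * idf_smoothed(2, 10))   # sublinear TF-IDF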

What decisions you make when using TF-IDF

  • Tokenization: lowercasing, punctuation handling, simple whitespace split vs. smarter tokenizers.
  • Stopwords: remove very common words or use max_df to down-weight them.
  • N-grams: unigrams (single words) vs. bigrams/trigrams to capture phrases.
  • Vocabulary curation: min_df (ignore too-rare terms), max_df (ignore too-common terms), max_features (cap feature count).
  • TF choice: raw counts vs. binary vs. log-scaled (sublinear).
  • Normalization: typically L2 per document.
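
These choices map almost one-to-one onto vectorizer settings. The sketch below assumes scikit-learn's TfidfVectorizer; the specific values are illustrative starting points, not recommendations.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Illustrative configuration; tune these values on your own data
    vectorizer = TfidfVectorizer(
        lowercase=True,          # tokenization: lowercase before splitting
        stop_words="english",    # or leave as None and rely on max_df
        ngram_range=(1, 2),      # unigrams + bigrams
        min_df=2,                # ignore terms in fewer than 2 documents
        max_df=0.95,             # ignore terms in more than 95% of documents
        max_features=50_000,     # cap vocabulary size
        sublinear_tf=True,       # log-scaled TF instead of raw counts
        norm="l2",               # L2-normalize each document vector
    )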

Worked examples

Example 1: Manual TF-IDF on a tiny corpus

Corpus (N=3):

  1. D1: "the cat sat on the mat"
  2. D2: "the dog sat on the log"
  3. D3: "the cat chased the dog"

Use smoothed IDF: idf(t) = ln((1+N)/(1+df(t))) + 1 with N=3.

  • df values: the=3, cat=2, sat=2, on=2, mat=1, dog=2, log=1, chased=1
  • idf: the=1.0000, cat=1.2877, sat=1.2877, on=1.2877, mat=1.6931, dog=1.2877, log=1.6931, chased=1.6931 (rounded)

For D2 "the dog sat on the log": raw TF = the:2, dog:1, sat:1, on:1, log:1.

Unnormalized TF-IDF (tf * idf):

  • the: 2.0000
  • dog: 1.2877
  • sat: 1.2877
  • on: 1.2877
  • log: 1.6931

L2 norm ≈ sqrt(4 + 3 × 1.2877^2 + 1.6931^2) ≈ 3.4411.

L2-normalized weights:

  • the: 0.5812
  • dog: 0.3742
  • sat: 0.3742
  • on: 0.3742
  • log: 0.4920
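
You can reproduce these numbers with scikit-learn, whose TfidfVectorizer defaults (raw-count TF, smooth_idf=True, norm="l2") match the formulas used here. A minimal check:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog",
    ]

    vectorizer = TfidfVectorizer()          # defaults: raw TF, smoothed IDF, L2 norm
    X = vectorizer.fit_transform(corpus)

    # Inspect D2 (row index 1)
    terms = vectorizer.get_feature_names_out()
    for term, weight in zip(terms, X[1].toarray().ravel()):
        if weight > 0:
            print(f"{term}: {weight:.4f}")
    # Expected, approximately: dog 0.3742, log 0.4920, on 0.3742, sat 0.3742, the 0.5812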

Example 2: Why n-grams can help

Text: "new york is big". Unigrams miss the phrase. With bigrams, the feature "new york" gets its own weight, capturing the entity rather than two separate words.

  • Unigrams: new, york, is, big
  • Bigrams: new york, york is, is big
  • TF-IDF can assign a strong weight to "new york" if the phrase is frequent within a document but rare across the corpus.
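
A quick way to see exactly which features an n-gram setting produces is to call the vectorizer's analyzer directly (a scikit-learn sketch):

    from sklearn.feature_extraction.text import TfidfVectorizer

    analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()
    print(analyzer("new york is big"))
    # Unigrams followed by bigrams: ['new', 'york', 'is', 'big', 'new york', 'york is', 'is big']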

Example 3: Using cosine similarity for relevance

Query: "red apple"

Docs:

  1. D1: "green apple salad"
  2. D2: "red apple pie"
  3. D3: "ripe banana"

After TF-IDF + L2, the cosine similarity between the query vector and D2 will be highest, because of the shared informative terms "red" and "apple". This is a simple baseline for search ranking.
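
A minimal ranking sketch with scikit-learn, assuming the query is transformed by the same vectorizer that was fit on the documents:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["green apple salad", "red apple pie", "ripe banana"]
    query = "red apple"

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)    # fit vocabulary and IDF on the documents
    query_vector = vectorizer.transform([query])    # reuse the same vocabulary/IDF

    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
        print(f"{score:.3f}  {doc}")
    # "red apple pie" should rank first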

Step-by-step: implementing TF-IDF on a small dataset

  1. Collect a small corpus of documents and lowercase them.
  2. Tokenize: split on whitespace and strip punctuation.
  3. Filter: optionally remove stopwords; set min_df and max_df.
  4. Build vocabulary (unigrams and optionally bigrams).
  5. Count term frequencies per document (raw counts or log-scaled).
  6. Compute IDF with smoothing: idf(t) = ln((1+N)/(1+df(t))) + 1.
  7. Compute TF-IDF: multiply TF by IDF.
  8. Normalize each document vector (L2).
  9. Evaluate with a simple model (logistic regression/SVM) or use cosine similarity for search-like tasks.
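
The sketch below follows these steps literally in plain Python (whitespace tokenization, smoothed IDF, L2 normalization) and skips the optional filtering steps for brevity.

    import math
    import re
    from collections import Counter

    def tokenize(text):
        # Steps 1-2: lowercase, strip punctuation, split on whitespace
        return re.sub(r"[^\w\s]", " ", text.lower()).split()

    def tfidf(corpus):
        docs = [Counter(tokenize(doc)) for doc in corpus]
        n = len(docs)
        # Step 6: document frequency and smoothed IDF
        df = Counter(term for doc in docs for term in doc)
        idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
        vectors = []
        for doc in docs:
            # Step 7: multiply TF (raw counts) by IDF
            weights = {t: count * idf[t] for t, count in doc.items()}
            # Step 8: L2-normalize (guard against empty documents)
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            vectors.append({t: w / norm for t, w in weights.items()})
        return vectors

    vectors = tfidf([
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog",
    ])
    print({t: round(w, 4) for t, w in vectors[1].items()})  # D2; matches Example 1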

Mini tasks
  • Toggle stopword removal and note how "the" weight changes.
  • Try unigrams vs. unigrams+bigrams and compare validation accuracy.
  • Cap max_features (e.g., 5,000) and check speed vs. performance.

Exercises (you can do these offline)

These mirror the exercises below. Do them first, then open the solutions.

  • Exercise 1: Compute smoothed IDF and TF-IDF (with and without L2 normalization) for a tiny corpus by hand.
  • Exercise 2: Apply min_df, max_df, and 1–2 grams to decide which features remain; compute normalized TF-IDF for one document.
  • Checklist to self-verify:
    • You used smoothed IDF and showed df counts.
    • You reported both raw and L2-normalized vectors where asked.
    • You clearly listed the surviving vocabulary for Exercise 2.

Common mistakes and how to self-check

  • Forgetting normalization: Without L2, cosine similarity and linear models can be skewed by document length. Self-check: compute norms; they should be 1.0 after normalization.
  • Dropping useful rare phrases by mistake: An aggressive min_df can remove signal. Self-check: inspect top features per class and ensure domain terms remain.
  • Leaking test data into IDF: Fit vocabulary/IDF on training only. Self-check: confirm that the vectorizer is fit only on the train split (see the sketch after this list).
  • Overusing stopword lists: Some common words are informative in certain domains. Self-check: compare accuracy with max_df vs. stopwords removal.
  • Ignoring n-grams where phrases matter: If your domain relies on phrases (e.g., named entities), include bigrams. Self-check: run A/B with bigrams and evaluate.
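
For the leakage point in particular, the safe pattern is to fit the vectorizer on the training split only and reuse it to transform everything else. A minimal sketch with illustrative toy data:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    texts = ["spam spam buy now", "meeting at noon", "win a free prize", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.5, random_state=0
    )

    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)  # vocabulary and IDF from train only
    X_test_vec = vectorizer.transform(X_test)        # no fitting on test data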

Practical projects

  • Spam vs. ham classifier: TF-IDF (1–2 grams) + logistic regression. Report F1 and most informative features.
  • Simple search engine: Build TF-IDF vectors for a set of articles; rank articles by cosine similarity to user queries.
  • News clustering: TF-IDF + KMeans. Inspect cluster keywords to label clusters.
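
For the clustering project, the cluster keywords come from the largest centroid weights. A minimal sketch, assuming scikit-learn's KMeans and a toy corpus in place of real articles:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus; replace with your articles
    articles = [
        "stocks fell as markets reacted to rate hikes",
        "the central bank raised interest rates again",
        "the team won the championship final",
        "injury forces star player out of the match",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(articles)

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # Top terms per cluster = largest centroid weights
    terms = vectorizer.get_feature_names_out()
    for i, center in enumerate(kmeans.cluster_centers_):
        top = center.argsort()[::-1][:3]
        print(f"cluster {i}:", [terms[j] for j in top])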

Who this is for

  • Beginner to intermediate Data Scientists who need a strong, fast text baseline.
  • Engineers moving into ML who want interpretable text features.

Prerequisites

  • Basic Python or similar (for implementation).
  • Understanding of vectors, dot product, and cosine similarity.
  • Familiarity with train/validation/test splits.

Learning path

  • Start here: TF-IDF basics and hands-on practice.
  • Then: Regularization and linear models for text (logistic regression/SVM).
  • Next: More advanced text features (character n-grams, hashing trick).
  • Later: Neural approaches (embeddings, transformers) for complex tasks.

Next steps

  • Try your TF-IDF pipeline on a real dataset (reviews, tickets, emails).
  • Tune min_df, max_df, n-grams, and sublinear TF; keep notes on what moves metrics.
  • Scroll to the Quick Test at the end. The test is available to everyone; progress saving is available if you are logged in.

Mini challenge

Given short product reviews, create a 1–2 gram TF-IDF model with L2 normalization, train a logistic regression classifier for positive/negative sentiment, and report:

  • Validation accuracy and F1.
  • Top 10 positive and negative n-grams by weight.
  • One change that improved performance the most (and why).
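
The part that most often needs a hint is reporting the top n-grams by weight: with a linear model, the learned coefficients line up with the vectorizer's feature names. A minimal sketch on toy data (replace the reviews with your dataset):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    reviews = ["great product, love it", "terrible quality, broke fast",
               "love the battery life", "awful support, never again"]
    labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), norm="l2")
    X = vectorizer.fit_transform(reviews)

    model = LogisticRegression().fit(X, labels)

    # Coefficients align with vectorizer features; largest = most positive
    terms = vectorizer.get_feature_names_out()
    order = np.argsort(model.coef_[0])
    print("most negative:", [terms[i] for i in order[:3]])
    print("most positive:", [terms[i] for i in order[-3:][::-1]])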

Practice Exercises

2 exercises to complete

Instructions

Corpus (N=3):

  1. D1: "the cat sat on the mat"
  2. D2: "the dog sat on the log"
  3. D3: "the cat chased the dog"

Tasks:

  • Compute df for each unique term.
  • Use smoothed IDF: idf(t) = ln((1+N)/(1+df(t))) + 1 with N=3.
  • For D2, compute unnormalized TF-IDF using raw TF.
  • L2-normalize D2's vector and report weights (rounded to 4 decimals).

Expected Output
A table or list showing df, idf for each term; D2 unnormalized TF-IDF values; and D2 L2-normalized weights approximately: the=0.5812, dog=0.3742, sat=0.3742, on=0.3742, log=0.4920.
