Who this is for
You work with classical NLP features (bag-of-words, TF-IDF, n-grams) and need to compare texts for search, deduplication, clustering, or recommendation. Ideal for NLP Engineers, Data Scientists, and ML practitioners building text pipelines without heavy neural models.
Prerequisites
- Comfort with basic Python or pseudo-code (optional but helpful)
- Understanding of tokenization and n-grams
- Vectors 101: dot product, magnitude, normalization
- TF and TF-IDF intuition
Why this matters
- Search ranking: compare a query vector to document vectors.
- Duplicate detection: find near-identical titles or questions.
- Clustering: group similar support tickets or reviews.
- Recommenders: show similar articles or products based on descriptions.
Getting the similarity measure right can boost relevance and reduce false matches with just a few lines of feature engineering.
Concept explained simply
A similarity measure converts two text representations into a single score: higher means more alike. The representation could be word counts, TF-IDF, or character n-grams. The measure you choose affects how length, rare words, and typos influence the score.
Mental model
- Cosine similarity: compares angles between vectors; ignores overall length. Great for TF-IDF search.
- Jaccard: intersection over union of sets; counts each unique token once; strict for short texts, where a single missing token drops the score sharply.
- Dice (Sørensen–Dice): like Jaccard but more forgiving; common for character n-grams and fuzzy matching.
- Euclidean/Manhattan distance: raw distance in space; sensitive to length and scale; use after normalization.
Core formulas quickly
- Cosine similarity: dot(A,B) divided by (||A|| times ||B||). Range: -1 to 1 in general; with non-negative TF/TF-IDF it is 0 to 1.
- Jaccard similarity (sets): size of intersection divided by size of union.
- Dice coefficient (sets): 2 times intersection size divided by (size of set A plus size of set B). For multisets, use sum of minimum counts.
- Euclidean distance: square root of the sum of squared differences. Convert it to a similarity such as 1 divided by (1 plus distance) when a bounded score is needed.
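To make the formulas concrete, here is a minimal plain-Python sketch of all four measures; the zero-vector guards are a defensive choice (see the common mistakes below), not part of the formulas themselves:

```python
import math

def cosine(a, b):
    # dot(A, B) / (||A|| * ||B||); returns 0.0 if either vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| on sets; 0.0 when both sets are empty
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    # 2 * |A ∩ B| / (|A| + |B|) on sets
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def euclidean_similarity(a, b):
    # 1 / (1 + Euclidean distance): identical vectors score 1.0
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1 / (1 + dist)
```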
Worked examples
Example 1: Jaccard on word sets
Text A: "nlp similarity measures basics" → {nlp, similarity, measures, basics}
Text B: "similarity basics for nlp" → {similarity, basics, for, nlp}
- Intersection: {nlp, similarity, basics} → size 3
- Union: {nlp, similarity, basics, measures, for} → size 5
- Jaccard = 3 / 5 = 0.60
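The same computation in a few lines of Python:

```python
def word_set(text):
    # lowercase and split on whitespace, keeping unique tokens
    return set(text.lower().split())

a = word_set("nlp similarity measures basics")
b = word_set("similarity basics for nlp")
print(len(a & b) / len(a | b))  # 3 / 5 = 0.6
```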
Example 2: Cosine with simple term counts
Vocabulary: [nlp, similarity, measures, basics]
A counts: [1,1,1,1], B counts: [1,1,0,1]
- dot(A,B) = 1+1+0+1 = 3
- ||A|| = sqrt(1+1+1+1) = 2
- ||B|| = sqrt(1+1+0+1) = sqrt(3) ≈ 1.732
- Cosine ≈ 3 / (2 × 1.732) ≈ 0.866
Note: Cosine scores higher than Jaccard here because the one missing term only slightly shrinks the dot product, while Jaccard counts it fully against the union.
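And the same arithmetic in Python:

```python
import math

# vocabulary: [nlp, similarity, measures, basics]
a = [1, 1, 1, 1]
b = [1, 1, 0, 1]

dot = sum(x * y for x, y in zip(a, b))      # 3
norm_a = math.sqrt(sum(x * x for x in a))   # 2.0
norm_b = math.sqrt(sum(x * x for x in b))   # ≈ 1.732
print(round(dot / (norm_a * norm_b), 3))    # 0.866
```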
Example 3: Dice on character bigrams (fuzzy matching)
Strings: "color" vs "colour"
Bigrams (with simple sliding):
- color → co, ol, lo, or
- colour → co, ol, lo, ou, ur
- Intersection size = 3 (co, ol, lo)
- Dice = 2 × 3 / (4 + 5) = 6 / 9 = 0.667
Good for spelling variants.
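A small sketch of the bigram version; this uses a plain sliding window without the boundary padding some implementations add:

```python
def bigrams(s):
    # sliding window of width 2 over the characters, kept as a set
    return {s[i:i + 2] for i in range(len(s) - 1)}

a, b = bigrams("color"), bigrams("colour")
dice = 2 * len(a & b) / (len(a) + len(b))
print(round(dice, 3))  # 2 * 3 / (4 + 5) ≈ 0.667
```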
Example 4: Euclidean vs Cosine
A = [2, 0], B = [4, 0]
- Euclidean distance = sqrt((2-4)^2 + 0) = 2 → if converted to similarity 1/(1+2)=0.333
- Cosine = (2×4 + 0) / (||A|| × ||B||) = 8 / (2 × 4) = 1
Cosine says "same direction" (same term proportions, different overall length) → maximally similar. Euclidean penalizes the difference in scale.
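Both computations side by side:

```python
import math

a, b = [2, 0], [4, 0]

# Euclidean distance, converted to a similarity with 1 / (1 + distance)
dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(round(1 / (1 + dist), 3))  # 0.333

# cosine: same direction, so the score is exactly 1.0
dot = sum(x * y for x, y in zip(a, b))
norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(dot / norms)  # 1.0
```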
Choosing the right measure
- Search with TF-IDF: Cosine similarity.
- Short texts with typos or variants: Character n-gram Dice.
- Strict overlap on tags/keywords: Jaccard on sets.
- Numeric embeddings with standardized scale: Cosine for direction; Euclidean if magnitude matters.
Quick decision checklist
- Are vectors non-negative and sparse? Prefer cosine.
- Need exact overlap behavior? Jaccard.
- Expect spelling noise? Character n-grams + Dice.
- Very short strings? N-grams often beat word overlap.
Thresholding and evaluation
- Pick a metric, compute scores on labeled pairs, and draw a simple table of thresholds versus precision/recall.
- Start with thresholds: cosine 0.7–0.9, Jaccard 0.3–0.6 for short texts, Dice 0.6–0.8 for char n-grams. Tune per data.
- For ranking, you may not need a hard threshold—sort by score.
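A minimal sketch of such a threshold sweep; the (score, label) pairs below are made-up placeholders for your own labeled data:

```python
# (similarity score, true duplicate label) pairs — hypothetical values
labeled = [(0.92, True), (0.85, True), (0.74, False), (0.61, True),
           (0.55, False), (0.40, False), (0.33, True), (0.20, False)]

for threshold in (0.3, 0.5, 0.7, 0.9):
    predicted = [label for score, label in labeled if score >= threshold]
    tp = sum(predicted)                               # true positives
    fp = len(predicted) - tp                          # false positives
    fn = sum(label for _, label in labeled) - tp      # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"t={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```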
Exercises
Complete the exercises below, then check your work.
- Exercise 1: Compute cosine similarity between two short sentences using bag-of-words after lowercasing and removing stopwords.
- Exercise 2: Compute Jaccard similarity between two keyword sets and decide if they are duplicates at threshold 0.5.
- Exercise 3: Compute character trigram Dice similarity for two product names with minor spelling differences.
Exercise checklist
- Preprocessed consistently (lowercase, same tokenization)
- Vectors built from the same vocabulary
- Handled empty/zero vectors safely
- Rounded results to 3 decimals
Common mistakes and self-check
- Mixing vector spaces: comparing raw TF counts to TF-IDF vectors. Self-check: confirm both vectors use the same weighting.
- No normalization for Euclidean: long texts dominate. Self-check: compare a document to a scaled copy of itself; cosine should be near 1, and after length normalization a Euclidean-based similarity should no longer penalize the length difference.
- Stopwords inflate similarity: Remove or downweight them; re-run and check if unrelated pairs drop in score.
- Short text bias with Jaccard: Low overlap even when related. Try character n-grams or Dice and compare.
- Zero vectors: If both texts become empty after filtering, define similarity as 0 or skip; do not divide by zero.
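The scaled-duplicate self-check from the list above takes only a few lines (the counts here are arbitrary):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0  # zero-vector guard

doc = [3, 1, 0, 2]             # arbitrary term counts
scaled = [x * 5 for x in doc]  # "the same text, five times as long"

print(cosine(doc, scaled))     # ≈ 1.0: direction is unchanged
dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(doc, scaled)))
print(round(1 / (1 + dist), 3))  # ≈ 0.063: raw Euclidean punishes the length gap
```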
Practical projects
- Duplicate question finder: TF-IDF + cosine, tune threshold with a small labeled set.
- Fuzzy product dedup: character trigrams + Dice to catch spelling variants.
- Mini search engine: index titles with TF-IDF, rank by cosine, add tie-breaker rules.
- Ticket clustering: convert to TF-IDF, compute pairwise cosine, cluster with a simple algorithm and inspect clusters.
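As a starting point for the duplicate finder, a minimal sketch assuming scikit-learn is available; the titles and threshold are placeholders to replace with your own data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# placeholder titles; swap in your own data
titles = [
    "usb-c charging cable 2m",
    "usb c charging cable, 2 m",
    "wireless ergonomic mouse",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(titles)
scores = cosine_similarity(matrix)  # dense pairwise similarity matrix

threshold = 0.8  # starting point only; tune against labeled pairs
for i in range(len(titles)):
    for j in range(i + 1, len(titles)):
        if scores[i, j] >= threshold:
            print(f"possible duplicate ({scores[i, j]:.2f}): {titles[i]!r} ~ {titles[j]!r}")
```

The double loop is fine for a few thousand titles; past that, the approximate nearest neighbor techniques in the learning path below become worthwhile.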
Learning path
- Before: Tokenization, stopword handling, stemming/lemmatization, TF/TF-IDF.
- Now: Similarity measures (this lesson) with bag-of-words and n-grams.
- Next: Dimensionality reduction (e.g., truncated SVD for sparse vectors), approximate nearest neighbors for scaling, and evaluation techniques for ranking.
Next steps
- Implement cosine and Jaccard on a small text set.
- Collect 20 positive and 20 negative pairs; pick a threshold based on F1.
- Try replacing words with character trigrams and compare Dice vs Jaccard.
Mini challenge
You have 1,000 product titles and want to flag near-duplicates. Time budget: 1 hour. Choose a representation and a similarity measure, justify your choice in two sentences, and propose an initial threshold. Then outline 3 steps to validate and tune it. Keep it practical and simple.