Who this is for
You work with classical NLP features (bag-of-words, TF-IDF, n-grams) and need to compare texts for search, deduplication, clustering, or recommendation. Ideal for NLP Engineers, Data Scientists, and ML practitioners building text pipelines without heavy neural models.
Prerequisites
- Comfort with basic Python or pseudo-code (optional but helpful)
- Understanding of tokenization and n-grams
- Vectors 101: dot product, magnitude, normalization
- TF and TF-IDF intuition
Why this matters
- Search ranking: compare a query vector to document vectors.
- Duplicate detection: find near-identical titles or questions.
- Clustering: group similar support tickets or reviews.
- Recommenders: show similar articles or products based on descriptions.
Getting the similarity measure right can boost relevance and reduce false matches with just a few lines of feature engineering.
Concept explained simply
A similarity measure converts two text representations into a single score: higher means more alike. The representation could be word counts, TF-IDF, or character n-grams. The measure you choose affects how length, rare words, and typos influence the score.
Mental model
- Cosine similarity: compares angles between vectors; ignores overall length. Great for TF-IDF search.
- Jaccard: intersection over union of sets; counts each unique token once; strict for short texts, where a single missing token drops the score sharply.
- Dice (Sørensen–Dice): like Jaccard but more forgiving; common for character n-grams and fuzzy matching.
- Euclidean/Manhattan distance: raw distance in space; sensitive to length and scale; use after normalization.
Core formulas quickly
- Cosine similarity: dot(A,B) divided by (||A|| times ||B||). Range: -1 to 1 in general; with non-negative TF/TF-IDF it is 0 to 1.
- Jaccard similarity (sets): size of intersection divided by size of union.
- Dice coefficient (sets): 2 times intersection size divided by (size of set A plus size of set B). For multisets, use sum of minimum counts.
- Euclidean distance: square root of the sum of squared differences. Convert it to a similarity such as 1 divided by (1 plus distance) when a bounded score is needed.
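To make the formulas concrete, here is a minimal plain-Python sketch of all four measures; the zero-vector guards are a defensive choice (see the common mistakes below), not part of the formulas themselves:

```python
import math

def cosine(a, b):
    # dot(A, B) / (||A|| * ||B||); returns 0.0 if either vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| on sets; 0.0 when both sets are empty
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    # 2 * |A ∩ B| / (|A| + |B|) on sets
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def euclidean_similarity(a, b):
    # 1 / (1 + Euclidean distance): identical vectors score 1.0
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1 / (1 + dist)
```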
Worked examples
Example 1: Jaccard on word sets
Text A: "nlp similarity measures basics" → {nlp, similarity, measures, basics}
Text B: "similarity basics for nlp" → {similarity, basics, for, nlp}
- Intersection: {nlp, similarity, basics} → size 3
- Union: {nlp, similarity, basics, measures, for} → size 5
- Jaccard = 3 / 5 = 0.60
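The same computation in a few lines of Python:

```python
def word_set(text):
    # lowercase and split on whitespace, keeping unique tokens
    return set(text.lower().split())

a = word_set("nlp similarity measures basics")
b = word_set("similarity basics for nlp")
print(len(a & b) / len(a | b))  # 3 / 5 = 0.6
```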
Example 2: Cosine with simple term counts
Vocabulary: [nlp, similarity, measures, basics]
A counts: [1,1,1,1], B counts: [1,1,0,1]
- dot(A,B) = 1+1+0+1 = 3
- ||A|| = sqrt(1+1+1+1) = 2
- ||B|| = sqrt(1+1+0+1) = sqrt(3) ≈ 1.732
- Cosine ≈ 3 / (2 × 1.732) ≈ 0.866
Note: Cosine scores higher than Jaccard here because the one missing term only slightly shrinks the dot product, while Jaccard counts it fully against the union.
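And the same arithmetic in Python:

```python
import math

# vocabulary: [nlp, similarity, measures, basics]
a = [1, 1, 1, 1]
b = [1, 1, 0, 1]

dot = sum(x * y for x, y in zip(a, b))      # 3
norm_a = math.sqrt(sum(x * x for x in a))   # 2.0
norm_b = math.sqrt(sum(x * x for x in b))   # ≈ 1.732
print(round(dot / (norm_a * norm_b), 3))    # 0.866
```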
Example 3: Dice on character bigrams (fuzzy matching)
Strings: "color" vs "colour"
Bigrams (with simple sliding):
- color → co, ol, lo, or
- colour → co, ol, lo, ou, ur
- Intersection size = 3 (co, ol, lo)
- Dice = 2 × 3 / (4 + 5) = 6 / 9 = 0.667
Good for spelling variants.
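A small sketch of the bigram version; this uses a plain sliding window without the boundary padding some implementations add:

```python
def bigrams(s):
    # sliding window of width 2 over the characters, kept as a set
    return {s[i:i + 2] for i in range(len(s) - 1)}

a, b = bigrams("color"), bigrams("colour")
dice = 2 * len(a & b) / (len(a) + len(b))
print(round(dice, 3))  # 2 * 3 / (4 + 5) ≈ 0.667
```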
Example 4: Euclidean vs Cosine
A = [2, 0], B = [4, 0]
- Euclidean distance = sqrt((2-4)^2 + 0) = 2 → if converted to similarity 1/(1+2)=0.333
- Cosine = (2×4 + 0) / (||A|| × ||B||) = 8 / (2 × 4) = 1
Cosine says "same direction" (same term proportions, different overall length) → maximally similar. Euclidean penalizes the difference in scale.
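Both computations side by side:

```python
import math

a, b = [2, 0], [4, 0]

# Euclidean distance, converted to a similarity with 1 / (1 + distance)
dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(round(1 / (1 + dist), 3))  # 0.333

# cosine: same direction, so the score is exactly 1.0
dot = sum(x * y for x, y in zip(a, b))
norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(dot / norms)  # 1.0
```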
Choosing the right measure
- Search with TF-IDF: Cosine similarity.
- Short texts with typos or variants: Character n-gram Dice.
- Strict overlap on tags/keywords: Jaccard on sets.
- Numeric embeddings with standardized scale: Cosine for direction; Euclidean if magnitude matters.
Quick decision checklist
- Are vectors non-negative and sparse? Prefer cosine.
- Need exact overlap behavior? Jaccard.
- Expect spelling noise? Character n-grams + Dice.
- Very short strings? N-grams often beat word overlap.
Thresholding and evaluation
- Pick a metric, compute scores on labeled pairs, and draw a simple table of thresholds versus precision/recall.
- Start with thresholds: cosine 0.7–0.9, Jaccard 0.3–0.6 for short texts, Dice 0.6–0.8 for char n-grams. Tune per data.
- For ranking, you may not need a hard threshold—sort by score.
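A minimal sketch of such a threshold sweep; the (score, label) pairs below are made-up placeholders for your own labeled data:

```python
# (similarity score, true duplicate label) pairs — hypothetical values
labeled = [(0.92, True), (0.85, True), (0.74, False), (0.61, True),
           (0.55, False), (0.40, False), (0.33, True), (0.20, False)]

for threshold in (0.3, 0.5, 0.7, 0.9):
    predicted = [label for score, label in labeled if score >= threshold]
    tp = sum(predicted)                               # true positives
    fp = len(predicted) - tp                          # false positives
    fn = sum(label for _, label in labeled) - tp      # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"t={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```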
Exercises
Complete the exercises below, then check your work.
- Exercise 1: Compute cosine similarity between two short sentences using bag-of-words after lowercasing and removing stopwords.
- Exercise 2: Compute Jaccard similarity between two keyword sets and decide if they are duplicates at threshold 0.5.
- Exercise 3: Compute character trigram Dice similarity for two product names with minor spelling differences.
Exercise checklist
- Preprocessed consistently (lowercase, same tokenization)
- Vectors built from the same vocabulary
- Handled empty/zero vectors safely
- Rounded results to 3 decimals
Common mistakes and self-check
- Mixing vector spaces: comparing raw TF counts to TF-IDF vectors. Self-check: confirm both vectors use the same weighting.
- No normalization for Euclidean: long texts dominate. Self-check: compare a document to a scaled copy of itself; cosine should be near 1, and after length normalization a Euclidean-based similarity should no longer penalize the length difference.
- Stopwords inflate similarity: Remove or downweight them; re-run and check if unrelated pairs drop in score.
- Short text bias with Jaccard: Low overlap even when related. Try character n-grams or Dice and compare.
- Zero vectors: If both texts become empty after filtering, define similarity as 0 or skip; do not divide by zero.
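The scaled-duplicate self-check from the list above takes only a few lines (the counts here are arbitrary):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0  # zero-vector guard

doc = [3, 1, 0, 2]             # arbitrary term counts
scaled = [x * 5 for x in doc]  # "the same text, five times as long"

print(cosine(doc, scaled))     # ≈ 1.0: direction is unchanged
dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(doc, scaled)))
print(round(1 / (1 + dist), 3))  # ≈ 0.063: raw Euclidean punishes the length gap
```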
Practical projects
- Duplicate question finder: TF-IDF + cosine, tune threshold with a small labeled set.
- Fuzzy product dedup: character trigrams + Dice to catch spelling variants.
- Mini search engine: index titles with TF-IDF, rank by cosine, add tie-breaker rules.
- Ticket clustering: convert to TF-IDF, compute pairwise cosine, cluster with a simple algorithm and inspect clusters.
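As a starting point for the duplicate finder, a minimal sketch assuming scikit-learn is available; the titles and threshold are placeholders to replace with your own data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# placeholder titles; swap in your own data
titles = [
    "usb-c charging cable 2m",
    "usb c charging cable, 2 m",
    "wireless ergonomic mouse",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(titles)
scores = cosine_similarity(matrix)  # dense pairwise similarity matrix

threshold = 0.8  # starting point only; tune against labeled pairs
for i in range(len(titles)):
    for j in range(i + 1, len(titles)):
        if scores[i, j] >= threshold:
            print(f"possible duplicate ({scores[i, j]:.2f}): {titles[i]!r} ~ {titles[j]!r}")
```

The double loop is fine for a few thousand titles; past that, the approximate nearest neighbor techniques in the learning path below become worthwhile.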
Learning path
- Before: Tokenization, stopword handling, stemming/lemmatization, TF/TF-IDF.
- Now: Similarity measures (this lesson) with bag-of-words and n-grams.
- Next: Dimensionality reduction (e.g., truncated SVD for sparse vectors), approximate nearest neighbors for scaling, and evaluation techniques for ranking.
Next steps
- Implement cosine and Jaccard on a small text set.
- Collect 20 positive and 20 negative pairs; pick a threshold based on F1.
- Try replacing words with character trigrams and compare Dice vs Jaccard.
Mini challenge
You have 1,000 product titles and want to flag near-duplicates. Time budget: 1 hour. Choose a representation and a similarity measure, justify your choice in two sentences, and propose an initial threshold. Then outline 3 steps to validate and tune it. Keep it practical and simple.