
Similarity Measures

Learn Similarity Measures for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Who this is for

You work with classical NLP features (bag-of-words, TF-IDF, n-grams) and need to compare texts for search, deduplication, clustering, or recommendation. Ideal for NLP Engineers, Data Scientists, and ML practitioners building text pipelines without heavy neural models.

Prerequisites

  • Basic Python or pseudo-code comfort (optional but helpful)
  • Understanding of tokenization and n-grams
  • Vectors 101: dot product, magnitude, normalization
  • TF and TF-IDF intuition

Why this matters

  • Search ranking: compare a query vector to document vectors.
  • Duplicate detection: find near-identical titles or questions.
  • Clustering: group similar support tickets or reviews.
  • Recommenders: show similar articles or products based on descriptions.

Getting the similarity measure right can boost relevance and reduce false matches with just a few lines of feature engineering.

Concept explained simply

A similarity measure converts two text representations into a single score: higher means more alike. The representation could be word counts, TF-IDF, or character n-grams. The measure you choose affects how length, rare words, and typos influence the score.

Mental model

  • Cosine similarity: compares angles between vectors; ignores overall length. Great for TF-IDF search.
  • Jaccard: intersection over union of sets; only distinct shared terms count; strict for short texts.
  • Dice (Sørensen–Dice): like Jaccard but more forgiving; common for character n-grams and fuzzy matching.
  • Euclidean/Manhattan distance: raw distance in space; sensitive to length and scale; use after normalization.

Core formulas quickly

  • Cosine similarity: dot(A,B) divided by (||A|| times ||B||). Range: -1 to 1 in general; with non-negative TF/TF-IDF it is 0 to 1.
  • Jaccard similarity (sets): size of intersection divided by size of union.
  • Dice coefficient (sets): 2 times intersection size divided by (size of set A plus size of set B). For multisets, use sum of minimum counts.
  • Euclidean distance: square root of the sum of squared differences. Often convert to a similarity like 1 divided by (1 plus distance) if needed.
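
These formulas translate directly into a few lines of Python. A minimal sketch, assuming tokens and counts have already been extracted (the function names are our own):

  import math
  from collections import Counter

  def cosine(a: Counter, b: Counter) -> float:
      # dot(A, B) / (||A|| * ||B||); returns 0.0 for empty vectors
      dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
      norm_a = math.sqrt(sum(v * v for v in a.values()))
      norm_b = math.sqrt(sum(v * v for v in b.values()))
      return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

  def jaccard(a: set, b: set) -> float:
      # |intersection| / |union|
      return len(a & b) / len(a | b) if (a or b) else 0.0

  def dice(a: set, b: set) -> float:
      # 2 * |intersection| / (|A| + |B|)
      return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

  def euclidean_similarity(a: Counter, b: Counter) -> float:
      # 1 / (1 + Euclidean distance), computed over the combined vocabulary
      vocab = a.keys() | b.keys()
      dist = math.sqrt(sum((a[t] - b[t]) ** 2 for t in vocab))
      return 1 / (1 + dist)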

Worked examples

Example 1: Jaccard on word sets

Text A: "nlp similarity measures basics" → {nlp, similarity, measures, basics}

Text B: "similarity basics for nlp" → {similarity, basics, for, nlp}

  • Intersection: {nlp, similarity, basics} → size 3
  • Union: {nlp, similarity, basics, measures, for} → size 5
  • Jaccard = 3 / 5 = 0.60
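
The same computation in a few lines of Python (whitespace tokenization assumed):

  a = set("nlp similarity measures basics".split())
  b = set("similarity basics for nlp".split())

  print(len(a & b) / len(a | b))  # 0.6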

Example 2: Cosine with simple term counts

Vocabulary: [nlp, similarity, measures, basics]

A counts: [1,1,1,1], B counts: [1,1,0,1]

  • dot(A,B) = 1+1+0+1 = 3
  • ||A|| = sqrt(1+1+1+1) = 2
  • ||B|| = sqrt(1+1+0+1) = sqrt(3) ≈ 1.732
  • Cosine ≈ 3 / (2 × 1.732) ≈ 0.866
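
A quick check of the arithmetic in plain Python:

  import math

  a = [1, 1, 1, 1]  # counts over [nlp, similarity, measures, basics]
  b = [1, 1, 0, 1]

  dot = sum(x * y for x, y in zip(a, b))     # 3
  norm_a = math.sqrt(sum(x * x for x in a))  # 2.0
  norm_b = math.sqrt(sum(y * y for y in b))  # ~1.732
  print(round(dot / (norm_a * norm_b), 3))   # 0.866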

Note: Cosine scores higher here than Jaccard did in Example 1 because a single missing term is penalized less.

Example 3: Dice on character bigrams (fuzzy matching)

Strings: "color" vs "colour"

Bigrams (with simple sliding):

  • color → co, ol, lo, or
  • colour → co, ol, lo, ou, ur
  • Intersection size = 3 (co, ol, lo)
  • Dice = 2 × 3 / (4 + 5) = 6 / 9 = 0.667

Good for spelling variants.
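
A small sketch of the bigram version (set-based, as in the example; a multiset variant would use minimum counts instead):

  def char_bigrams(text: str) -> set:
      # sliding window of width 2, no padding
      return {text[i:i + 2] for i in range(len(text) - 1)}

  a, b = char_bigrams("color"), char_bigrams("colour")
  print(round(2 * len(a & b) / (len(a) + len(b)), 3))  # 0.667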

Example 4: Euclidean vs Cosine

A = [2, 0], B = [4, 0]

  • Euclidean distance = sqrt((2-4)^2 + 0) = 2 → if converted to similarity 1/(1+2)=0.333
  • Cosine = (2×4 + 0) / (||A|| × ||B||) = 8 / (2 × 4) = 1

Cosine says "same direction" (identical terms, different length) → very similar. Euclidean penalizes scale.
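
The contrast is easy to reproduce:

  import math

  a, b = [2, 0], [4, 0]

  dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))  # 2.0
  cos = sum(x * y for x, y in zip(a, b)) / (
      math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
  )
  print(round(1 / (1 + dist), 3), round(cos, 3))  # 0.333 1.0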

Choosing the right measure

  • Search with TF-IDF: Cosine similarity.
  • Short texts with typos or variants: Character n-gram Dice.
  • Strict overlap on tags/keywords: Jaccard on sets.
  • Numeric embeddings with standardized scale: Cosine for direction; Euclidean if magnitude matters.

Quick decision checklist

  • Are vectors non-negative and sparse? Prefer cosine.
  • Need exact overlap behavior? Jaccard.
  • Expect spelling noise? Character n-grams + Dice.
  • Very short strings? N-grams often beat word overlap.

Thresholding and evaluation

  • Pick a metric, compute scores on labeled pairs, and draw a simple table of thresholds versus precision/recall.
  • Start with thresholds: cosine 0.7–0.9, Jaccard 0.3–0.6 for short texts, Dice 0.6–0.8 for char n-grams. Tune per data.
  • For ranking, you may not need a hard threshold—sort by score.
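
A minimal sketch of that threshold sweep; the labeled pairs below are made up for illustration (each entry is a similarity score and a duplicate/not-duplicate label):

  def precision_recall(pairs, threshold):
      # pairs: iterable of (score, is_duplicate)
      tp = sum(1 for score, label in pairs if score >= threshold and label)
      fp = sum(1 for score, label in pairs if score >= threshold and not label)
      fn = sum(1 for score, label in pairs if score < threshold and label)
      precision = tp / (tp + fp) if (tp + fp) else 0.0
      recall = tp / (tp + fn) if (tp + fn) else 0.0
      return precision, recall

  labeled = [(0.91, True), (0.85, True), (0.72, False), (0.66, True), (0.40, False)]
  for t in (0.6, 0.7, 0.8, 0.9):
      p, r = precision_recall(labeled, t)
      print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")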

Exercises

Complete the exercises below, then check your work. The quick test is available to everyone; log in to save your progress.

  1. Exercise 1: Compute cosine similarity between two short sentences using bag-of-words after lowercasing and removing stopwords.
  2. Exercise 2: Compute Jaccard similarity between two keyword sets and decide if they are duplicates at threshold 0.5.
  3. Exercise 3: Compute character trigram Dice similarity for two product names with minor spelling differences.

Exercise checklist

  • Preprocessed consistently (lowercase, same tokenization)
  • Vectors built from the same vocabulary
  • Handled empty/zero vectors safely
  • Rounded results to 3 decimals

Common mistakes and self-check

  • Mixing representations: Comparing raw TF counts to TF-IDF vectors. Self-check: confirm both vectors use the same weighting.
  • No normalization for Euclidean: Long texts dominate the distance. Self-check: compare a document against a scaled copy of itself; cosine should stay high, and Euclidean-based similarity should stop penalizing length once vectors are normalized.
  • Stopwords inflate similarity: Remove or downweight them; re-run and check if unrelated pairs drop in score.
  • Short text bias with Jaccard: Low overlap even when related. Try character n-grams or Dice and compare.
  • Zero vectors: If both texts become empty after filtering, define similarity as 0 or skip; do not divide by zero.
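
A small guard for the zero-vector case above, combined with consistent preprocessing (the stopword list is illustrative, not a recommendation):

  import math
  from collections import Counter

  STOPWORDS = {"the", "a", "an", "for", "with", "of"}  # illustrative only

  def to_counts(text: str) -> Counter:
      # consistent preprocessing: lowercase, whitespace tokenization, stopword removal
      return Counter(t for t in text.lower().split() if t not in STOPWORDS)

  def safe_cosine(a: Counter, b: Counter) -> float:
      if not a or not b:  # empty after filtering: define similarity as 0, never divide by zero
          return 0.0
      dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
      norm_a = math.sqrt(sum(v * v for v in a.values()))
      norm_b = math.sqrt(sum(v * v for v in b.values()))
      return dot / (norm_a * norm_b)

  print(safe_cosine(to_counts("the of a"), to_counts("for with an")))  # 0.0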

Practical projects

  • Duplicate question finder: TF-IDF + cosine, tune threshold with a small labeled set (see the sketch after this list).
  • Fuzzy product dedup: character trigrams + Dice to catch spelling variants.
  • Mini search engine: index titles with TF-IDF, rank by cosine, add tie-breaker rules.
  • Ticket clustering: convert to TF-IDF, compute pairwise cosine, cluster with a simple algorithm and inspect clusters.
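
For the duplicate question finder, a hedged starting point using scikit-learn; the example titles and the 0.8 threshold are placeholders to tune against labeled pairs:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  titles = [
      "How do I reset my password?",
      "Password reset is not working",
      "What are the shipping times for international orders?",
  ]

  matrix = TfidfVectorizer(lowercase=True, stop_words="english").fit_transform(titles)
  scores = cosine_similarity(matrix)  # pairwise similarity matrix

  THRESHOLD = 0.8  # starting point; tune with a small labeled set
  for i in range(len(titles)):
      for j in range(i + 1, len(titles)):
          if scores[i, j] >= THRESHOLD:
              print(f"possible duplicate ({scores[i, j]:.2f}): {titles[i]!r} ~ {titles[j]!r}")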

Learning path

  • Before: Tokenization, stopword handling, stemming/lemmatization, TF/TF-IDF.
  • Now: Similarity measures (this lesson) with bag-of-words and n-grams.
  • Next: Dimensionality reduction (e.g., truncated SVD for sparse vectors), approximate nearest neighbors for scaling, and evaluation techniques for ranking.

Next steps

  • Implement cosine and Jaccard on a small text set.
  • Collect 20 positive and 20 negative pairs; pick a threshold based on F1.
  • Try replacing words with character trigrams and compare Dice vs Jaccard.

Mini challenge

You have 1,000 product titles and want to flag near-duplicates. Time budget: 1 hour. Choose a representation and a similarity measure, justify your choice in two sentences, and propose an initial threshold. Then outline 3 steps to validate and tune it. Keep it practical and simple.


Practice Exercises

3 exercises to complete

Instructions

Compute cosine similarity for the sentences:

  • A: "Machine learning for text similarity"
  • B: "Text similarity with machine learning"

Steps:

  • Lowercase and remove stopwords: for, with
  • Tokenize by spaces
  • Build a shared vocabulary and term-count vectors
  • Compute cosine similarity; round to 3 decimals

Expected Output
With the stopwords above removed, both sentences produce the same bag of words, so the cosine is exactly 1.000 (a different stopword list can lower it slightly, but it typically stays ≥ 0.9).
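
One way to check your answer, using the two-word stopword list from the steps:

  import math
  from collections import Counter

  STOPWORDS = {"for", "with"}

  def counts(text: str) -> Counter:
      return Counter(t for t in text.lower().split() if t not in STOPWORDS)

  a = counts("Machine learning for text similarity")
  b = counts("Text similarity with machine learning")

  dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
  cos = dot / (math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values())))
  print(round(cos, 3))  # 1.0: both sentences reduce to the same bag of words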

Similarity Measures — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

