Why this matters
Embeddings turn words, sentences, and documents into vectors so that meaning becomes searchable, comparable, and clusterable. As an NLP Engineer, you will use embeddings to:
- Build semantic search and retrieval for RAG systems.
- Detect duplicates and near-duplicates in user content.
- Cluster support tickets or reviews by topic.
- Do zero-shot/label-embedding classification and intent detection.
- Create content-based recommendations when no user history exists.
Concept explained simply
An embedding is a list of numbers (a vector) that captures the meaning of text. Texts with similar meaning have vectors that point in similar directions. We compare them using cosine similarity, which measures how aligned two vectors are; a higher cosine means more similar meaning.
Mental model
Imagine a map of meaning. Every text gets an address on that map. Similar texts live close; unrelated texts are far. Searching for similar meaning is just finding the nearest addresses.
Core ideas you’ll use daily
- Vector length (norm): how long the arrow is. Often we normalize to length 1.
- Cosine similarity: the dot product of two unit vectors. Range: -1 to 1. Higher is more similar.
- Distance vs similarity: pick one metric and be consistent.
- Dimensionality: typical sizes are a few hundred to a few thousand. Higher is not always better.
How cosine similarity works (no heavy math)
Cosine similarity measures the angle between two vectors. If both vectors are unit length (normalized), cosine similarity equals their dot product. If they point the same way, similarity is 1; if opposite, -1; if orthogonal (perpendicular), exactly 0.
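Here is a minimal sketch in Python (assuming NumPy is installed); the vectors are toy 2-D examples, not real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between a and b: dot product divided by both norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([3.0, 4.0]), np.array([4.0, 3.0])))   # 0.96: nearly aligned
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0: opposite
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0: orthogonal
```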
Practical toolbox
- Normalize vectors: divide by their length so comparisons are fair (see the sketch after this list).
- Choose a metric: cosine similarity on unit vectors is a good default.
- Index vectors: store vectors with their item IDs; you’ll later fetch nearest neighbors for search and clustering.
- Aggregate: for paragraphs/docs, average sentence embeddings or use a provided document embedding. Keep it consistent.
- Thresholds: pick a similarity cutoff by sampling pairs and inspecting matches. Start around 0.75–0.85 for short English texts, then tune.
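A sketch of the normalize-index-threshold workflow; `embed` here is a hypothetical stand-in that returns random vectors, not a real model:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical placeholder: swap in a real embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.normal(size=384)

def normalize(v: np.ndarray) -> np.ndarray:
    # Unit length makes cosine similarity a plain dot product.
    return v / np.linalg.norm(v)

# Index: store (id, unit vector) pairs so results map back to items.
texts = {"t1": "cannot log in", "t2": "login fails", "t3": "refund request"}
index = {item_id: normalize(embed(t)) for item_id, t in texts.items()}

THRESHOLD = 0.8  # starting cutoff; tune on sampled pairs
sim = float(np.dot(index["t1"], index["t2"]))
print("match" if sim >= THRESHOLD else "no match", round(sim, 3))
```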
Choosing and evaluating embeddings
- General vs domain-specific: domain models can capture jargon better. If coverage is poor, general models may win.
- Multilingual needs: use multilingual embeddings if you expect cross-language similarity.
- Zero-shot labels: embed class labels or short label descriptions; compare each text to every label vector; pick the top similarity (sketched after this list).
- Evaluate quickly:
  - Create a small set of positive pairs (same meaning) and negative pairs (different meaning).
  - Compute similarities, plot the two distributions, and pick a threshold that separates positives from negatives.
  - Track precision@k for search and cluster purity for grouping tasks.
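A zero-shot classification sketch under the same caveat: `embed` is a placeholder for whatever embedding model you actually use.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder returning unit vectors; replace with a real embedding model.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

# Embed short label descriptions once, up front.
labels = {
    "refund": "customer wants their money back",
    "login": "customer cannot sign in to their account",
}
label_vecs = {name: embed(desc) for name, desc in labels.items()}

def classify(message: str) -> str:
    # Pick the label whose description vector is most similar to the message.
    q = embed(message)
    return max(label_vecs, key=lambda name: float(np.dot(q, label_vecs[name])))

print(classify("I was charged twice, please refund me"))
```

With the random placeholder vectors the printed label is arbitrary; with a real model, semantically close messages land on the right label.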
Worked examples
Example 1 — Cosine similarity by hand
Vectors: A = [3, 4], B = [4, 3].
- ||A|| = 5 → Â = [0.6, 0.8]
- ||B|| = 5 → B̂ = [0.8, 0.6]
- cos(A, B) = 0.6×0.8 + 0.8×0.6 = 0.96
Interpretation: very similar meaning.
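The same computation in code, for checking your arithmetic:

```python
import numpy as np

A = np.array([3.0, 4.0])
B = np.array([4.0, 3.0])

A_hat = A / np.linalg.norm(A)  # [0.6, 0.8]
B_hat = B / np.linalg.norm(B)  # [0.8, 0.6]

print(np.dot(A_hat, B_hat))    # 0.96
```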
Example 2 — Semantic search ranking
Query vector q = [0, 1] (unit). Documents:
- d1 = [0.8, 0.6] → cos(q, d1) = 0.6
- d2 = [1, 0] → cos(q, d2) = 0.0
- d3 = [0, 1] → cos(q, d3) = 1.0
Ranking: d3 > d1 > d2.
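As a sketch, the same ranking in code (all vectors here are already unit length):

```python
import numpy as np

q = np.array([0.0, 1.0])
docs = {
    "d1": np.array([0.8, 0.6]),
    "d2": np.array([1.0, 0.0]),
    "d3": np.array([0.0, 1.0]),
}

# With unit vectors, cosine similarity is just the dot product.
ranked = sorted(docs, key=lambda d: float(np.dot(q, docs[d])), reverse=True)
print(ranked)  # ['d3', 'd1', 'd2']
```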
Example 3 — Clustering support tickets
- Embed all ticket titles.
- Normalize vectors and compute pairwise similarities.
- Apply a simple linking rule: connect tickets with similarity ≥ 0.8, then take each connected group as a cluster.
- Label clusters by the most common keywords per group.
Outcome: common issues surface (e.g., login errors vs payment failures).
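A minimal sketch of the linking rule using union-find; the ticket vectors are toy 2-D stand-ins for real embeddings:

```python
import numpy as np

# Toy unit vectors standing in for embedded ticket titles.
tickets = {
    "t1": np.array([1.0, 0.0]),    # "can't log in"
    "t2": np.array([0.96, 0.28]),  # "login error"
    "t3": np.array([0.0, 1.0]),    # "payment failed"
}
ids = list(tickets)
THRESHOLD = 0.8

# Union-find: link every pair at or above the threshold, then read off groups.
parent = {i: i for i in ids}

def find(x: str) -> str:
    while parent[x] != x:
        x = parent[x]
    return x

for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        if float(np.dot(tickets[a], tickets[b])) >= THRESHOLD:
            parent[find(b)] = find(a)

groups = {}
for t in ids:
    groups.setdefault(find(t), []).append(t)
print(list(groups.values()))  # [['t1', 't2'], ['t3']]
```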
Common mistakes and self-check
- Forgetting normalization: leads to length-biased results. Self-check: with unit vectors, the average similarity across random pairs should hover near 0 (see the check after this list).
- Mixing metrics: switching between cosine and Euclidean mid-project breaks thresholds. Self-check: document your metric choice and keep it consistent.
- Over-averaging long docs: dilutes key meaning. Self-check: compare search quality with paragraph-level vs full-doc averages.
- Too tight thresholds: miss good matches. Self-check: manually review 20 borderline pairs and adjust.
- Forgetting ID mapping: losing which vector maps to which item. Self-check: always store (id, vector) and keep a stable index.
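For the normalization self-check, a quick sanity test: random high-dimensional unit vectors average a cosine near 0 (real text embeddings often sit a bit above 0):

```python
import numpy as np

rng = np.random.default_rng(0)
pairs, dim = 1000, 384

# Draw random vectors and normalize each row to unit length.
a = rng.normal(size=(pairs, dim))
b = rng.normal(size=(pairs, dim))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)

# Average cosine over all pairs; should be close to 0.
print(np.mean(np.sum(a * b, axis=1)))
```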
Exercises
These match the graded exercises below. Do them here first, then submit in the exercise section.
Exercise 1 — Cosine warm-up (mirrors ex1)
Given vectors A = [3, 4] and B = [4, 3]:
- Compute the normalized vectors Â and B̂ (3 decimals).
- Compute cosine similarity cos(A, B) (2 decimals).
Hint
Normalize by dividing each vector by its length; dot product is sum of element-wise products.
Show solution
Â = [0.600, 0.800], B̂ = [0.800, 0.600], cosine ≈ 0.96
Exercise 2 — Rank documents (mirrors ex2)
Query q = [0, 1]. Documents: d1 = [0.8, 0.6], d2 = [1, 0], d3 = [0, 1]. Rank from most to least similar using cosine.
Hint
Cosine with [0,1] is just the second coordinate if vectors are unit length.
Show solution
Order: d3 > d1 > d2
Exercise completion checklist
- [ ] I can normalize any 2D vector and compute cosine similarity.
- [ ] I can rank vectors by similarity to a query.
- [ ] I can explain why normalization matters in one sentence.
Practical projects
- Semantic FAQ search: embed questions and answers; search by user query; show top-3 answers.
- Duplicate review detection: flag pairs above a tuned similarity threshold.
- Zero-shot intent classifier: embed intents (short label descriptions) and pick the most similar label for each message.
- Cold-start recommendations: embed products by title+description; recommend nearest neighbors.
Who this is for
- Aspiring NLP Engineers building search, RAG, and classification systems.
- Data Scientists needing semantic similarity and clustering.
- MLOps/Engineers wiring retrieval into production pipelines.
Prerequisites
- Basic linear algebra (vectors, dot product, norms).
- Comfort with Python or another language for arrays/vectors.
- Familiarity with tokenization and text preprocessing.
Learning path
- Tokenization and text normalization basics.
- Embeddings concepts (this page) and cosine similarity.
- Building a semantic search index; tuning thresholds.
- Clustering and topic grouping with embeddings.
- Zero-shot classification using label embeddings.
- Evaluation: precision@k, recall, clustering quality.
Next steps
- Complete the Exercises and the Quick Test below.
- Pick one Practical project and implement an MVP.
- Record your thresholds and metrics to build intuition.
Mini challenge
You have three short product titles and a user query: "wireless earbuds". Titles:
- T1: "Bluetooth in-ear headphones"
- T2: "Over-ear wired studio headset"
- T3: "True wireless earphones with case"
Without computing actual vectors, decide which two should rank top for semantic search and why. State your reasoning in one sentence per choice.
Sample reasoning
Top 1: T3 mentions "true wireless" and "earphones", the closest match to "wireless earbuds". Top 2: T1 mentions "Bluetooth" and "in-ear", both strongly related. T2 is wired and over-ear, so it should rank last.