Why this matters
Sentence embeddings turn text into vectors that capture meaning. As an NLP Engineer, you will use them to:
- Build semantic search and retrieval for RAG (retrieve relevant passages for a model to answer questions).
- Cluster and deduplicate documents or FAQs by meaning, not just keywords.
- Route intents in support tickets, emails, or chat conversations.
- Find similar items: product recommendations, similar news, or duplicate bug reports.
Concept explained simply
A sentence embedding is a fixed-length numeric vector that represents the meaning of a sentence. Similar sentences have vectors that point in similar directions.
Mental model
Imagine a giant semantic map. Each sentence is a point. Sentences about the same idea are near each other; unrelated ones are far apart. When you normalize vectors, they lie on the surface of a unit sphere, and comparing angles (cosine similarity) tells you how related they are.
Cosine vs dot product — quick intuition
- Cosine similarity compares direction only. If you L2-normalize your vectors, cosine similarity is simply their dot product, which keeps scoring and indexing simple.
- Dot product includes magnitude. Use it if your model was trained with dot-product scoring or you intentionally want to keep magnitude information.
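As a quick illustration, here is a minimal NumPy sketch (the vectors are made up for the example) showing that the raw dot product reflects magnitude, while cosine compares direction only and equals the dot product once both vectors are L2-normalized.

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])   # toy embedding (made-up values)
b = np.array([2.0, 4.0, 0.5])   # similar direction, larger magnitude

def cosine(u, v):
    # Cosine compares direction only: dot product of the unit vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print("dot product:", np.dot(a, b))   # includes magnitude -> 10.0
print("cosine     :", cosine(a, b))   # direction only -> ~0.99

# After L2 normalization, dot product and cosine give the same score.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print("dot of normalized:", np.dot(a_n, b_n))  # matches cosine(a, b)
```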
Key pieces you should know
- Pooling: Turn token-level outputs into one sentence vector. Common choices: mean pooling (with the attention mask), CLS pooling (use the [CLS] token's output), or max pooling.
- Normalization: L2-normalize vectors if you plan to use cosine similarity. This makes scoring stable across sentences.
- Dimensions: Typical sizes range ~256–1024. Higher dimensions can improve expressiveness but increase memory and compute.
- Domain: General-purpose models work well for broad semantic tasks; domain-specific models (e.g., legal, medical) can perform better on specialized text.
- Consistency: Use the same embedding model and preprocessing for queries and documents.
Workflow: creating and using sentence embeddings
- Choose a model: Pick a model tuned for sentence similarity and your domain/language.
- Preprocess: Keep casing, punctuation handling, and tokenization consistent. Chunk long documents into passages (e.g., 200–400 tokens) with small overlaps.
- Encode: Get token-level outputs and apply pooling (mean pooling with the attention mask is a strong default).
- Normalize: L2-normalize embeddings if using cosine similarity.
- Index: Store vectors in a vector index (FAISS, HNSW, or a built-in engine). Keep IDs to map back to text.
- Retrieve: For a query, compute its embedding, then return the top-k nearest neighbors (and optionally apply a similarity threshold).
- Evaluate: Use labeled data to compute Recall@k, MRR, or NDCG. Manually spot-check edge cases.
- Optimize: Tune chunk size/overlap, switch similarity (cosine/dot) to match training, or try a domain model. Consider dimensionality reduction if memory is tight.
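The sketch below strings these steps together. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint are available (both are assumptions for illustration, not requirements), and the example passages, the threshold of 0.3, and k=2 are made-up values to tune on your own data. A plain NumPy matrix searched by inner product stands in for FAISS or HNSW.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

# 1) Choose a model (any sentence-similarity model with the same interface works).
model = SentenceTransformer("all-MiniLM-L6-v2")

# 2) Preprocess: here the "documents" are already short passages.
passages = [
    "Reset your password from the account settings page.",
    "Shipping usually takes three to five business days.",
    "You can cancel a subscription at any time from billing.",
]

# 3) Encode and 4) L2-normalize, so cosine similarity equals the dot product.
doc_vecs = model.encode(passages, normalize_embeddings=True)

# 5) Index: a plain matrix; row positions map back to the passages.
index = np.asarray(doc_vecs)

# 6) Retrieve: embed the query the same way, score by inner product.
query_vec = model.encode(["how long does delivery take"], normalize_embeddings=True)[0]
scores = index @ query_vec

top_k = 2
threshold = 0.3  # assumed starting point; tune on validation queries
for i in np.argsort(-scores)[:top_k]:
    if scores[i] >= threshold:
        print(f"{scores[i]:.3f}  {passages[i]}")
```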
Worked examples
Example 1 — Mean pooling with attention mask
Suppose a token-level encoder produces 4 token vectors for a sentence, but the 4th is padding:
- Token vectors: t1=(1, 2, 0), t2=(0, 1, 1), t3=(1, 1, 0), t4=(0, 0, 0)
- Attention mask: [1, 1, 1, 0]
Mean pooling over real tokens only:
mean = ((1+0+1)/3, (2+1+1)/3, (0+1+0)/3) = (0.6667, 1.3333, 0.3333)
L2-normalize (optional but recommended for cosine):
||v|| ≈ sqrt(0.4444 + 1.7778 + 0.1111) ≈ 1.5275; normalized ≈ (0.436, 0.873, 0.218)
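A minimal NumPy version of the same calculation, using the toy token vectors and mask from above:

```python
import numpy as np

# Token-level outputs for one sentence; the 4th token is padding.
tokens = np.array([[1.0, 2.0, 0.0],
                   [0.0, 1.0, 1.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0]])
mask = np.array([1.0, 1.0, 1.0, 0.0])  # attention mask: 1 = real token, 0 = padding

# Mean pooling over real tokens only: weight by the mask, divide by the mask sum.
pooled = (tokens * mask[:, None]).sum(axis=0) / mask.sum()
print(pooled)       # ≈ [0.667, 1.333, 0.333]

# Optional L2 normalization for cosine similarity.
normalized = pooled / np.linalg.norm(pooled)
print(normalized)   # ≈ [0.436, 0.873, 0.218]
```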
Example 2 — Cosine similarity by hand
Embeddings (already normalized for clarity):
- a = (0.577, 0.577, 0.577)
- b = (0.436, 0.873, 0.218)
cos(a, b) = a · b ≈ 0.577*0.436 + 0.577*0.873 + 0.577*0.218 ≈ 0.252 + 0.504 + 0.126 ≈ 0.88
Interpretation: strong semantic similarity.
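The same check in a few lines of NumPy, with the vectors copied from above:

```python
import numpy as np

a = np.array([0.577, 0.577, 0.577])  # already ~unit length
b = np.array([0.436, 0.873, 0.218])

# Both vectors are (approximately) normalized, so the dot product is the cosine.
print(np.dot(a, b))  # ≈ 0.88
```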
Example 3 — Retrieval with top-k and threshold
Query q (unit length): (0.8, 0.6)
Docs (unit length): d1=(0.7071,0.7071), d2=(1,0), d3=(0,1), d4=(0.6,0.8)
- cos(q,d1)≈0.99
- cos(q,d2)=0.8
- cos(q,d3)=0.6
- cos(q,d4)=0.96
Top-2 → [d1, d4]. With the 0.7 threshold alone, the candidate set would be [d1, d4, d2] (d3 at 0.6 falls below the cutoff); applying k=2 on top of that keeps [d1, d4].
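Here is the same retrieval step in NumPy, combining top-k selection with a score threshold:

```python
import numpy as np

q = np.array([0.8, 0.6])                # unit-length query
docs = np.array([[0.7071, 0.7071],      # d1
                 [1.0, 0.0],            # d2
                 [0.0, 1.0],            # d3
                 [0.6, 0.8]])           # d4
names = ["d1", "d2", "d3", "d4"]

scores = docs @ q                       # cosine, since everything is unit length
order = np.argsort(-scores)             # best first
k, threshold = 2, 0.7

results = [(names[i], float(scores[i]))
           for i in order[:k] if scores[i] >= threshold]
print(results)  # ≈ [('d1', 0.99), ('d4', 0.96)]
```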
Exercises (hands-on)
Work through the exercises below, then check your answers. Use the checklist to self-verify your process.
- [ ] Exercise 1: Pooling and cosine similarity for two sentences (mean-pool, normalize, compute similarity).
- [ ] Exercise 2: Rank four document embeddings for a query using cosine similarity; return top-2 with a threshold.
Self-check checklist
- [ ] Used attention mask during pooling (ignored padding).
- [ ] L2-normalized embeddings before cosine similarity.
- [ ] Applied both top-k and threshold correctly (top-k after scoring; threshold as a cutoff).
- [ ] Wrote down numeric steps, not just the final answer.
Common mistakes and how to catch them
- Forgetting normalization: Cosine similarity itself is scale-invariant, but most pipelines score with a dot product and treat it as cosine; without L2 normalization, vector magnitudes distort the ranking. Fix: Always normalize documents and queries when you intend cosine scoring.
- Pooling over padding: Including padded tokens skews the mean. Fix: Use attention masks in pooling.
- Mixing models: Using one model for documents and another for queries hurts similarity. Fix: Keep the same model and preprocessing.
- Overly large chunks: Long passages dilute meaning. Fix: Aim for a manageable chunk size with overlap to preserve context continuity (see the chunking sketch after this list).
- Threshold too high: You get empty results. Fix: Tune with validation queries; start around 0.6–0.8 for cosine and adjust.
- Confusing cosine with dot product: Choose what the model was trained for, and be consistent.
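As a rough illustration of the chunking fix, here is a minimal sketch of a hypothetical chunk_words helper that splits text into overlapping passages. It counts whitespace-separated words as a stand-in for the embedding model's tokenizer, and the 200/50 values are assumptions to tune, not prescriptions.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-based passages.

    chunk_size and overlap are in words here (a rough proxy for tokens);
    in practice, count tokens with the embedding model's tokenizer.
    """
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

# Example: a 450-word document becomes passages sharing a 50-word overlap.
doc = "word " * 450
print([len(c.split()) for c in chunk_words(doc)])  # [200, 200, 150]
```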
Quick self-audit
- Do your positive pairs rank above negatives in a small labeled set?
- Are scores stable if you shuffle document order? (If not, check indexing logic.)
- Do near-duplicate sentences actually have high similarity? If not, check normalization and pooling.
Practical projects
- Semantic FAQ search: Encode FAQs and user questions; return top-3 answers with similarity scores and show supporting passages.
- Ticket intent routing: Map incoming emails to predefined intents using nearest-neighbor lookup over intent labels.
- Duplicate detection: Cluster product reviews or bug reports by embedding similarity and flag near-duplicates.
- RAG for internal docs: Chunk documents, index embeddings, retrieve top-k passages, and feed them into a downstream model for answer synthesis.
Mini challenge
You have 100 short product descriptions and 20 user queries. Build a quick evaluation: for each query, mark 1–3 relevant products, then measure Recall@3. Try two pooling strategies (mean vs CLS) and report which yields higher Recall@3. What changed and why?
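A minimal Recall@k helper you could use for this challenge; recall_at_k is a hypothetical function and the IDs below are made-up placeholders.

```python
def recall_at_k(ranked_ids, relevant_ids, k=3):
    """Fraction of the relevant items that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Toy example: two relevant products, one retrieved in the top 3 -> 0.5.
print(recall_at_k(["p7", "p2", "p9", "p4"], ["p2", "p4"], k=3))

# Average this over all 20 queries, once per pooling strategy (mean vs CLS),
# and compare the two averages.
```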
Who this is for
- NLP Engineers and Data Scientists building search, RAG, or clustering pipelines.
- ML Engineers integrating semantic matching into products.
- Analysts prototyping meaning-based retrieval without full model training.
Prerequisites
- Comfort with vectors and basic linear algebra (dot product, norms).
- Familiarity with tokenization and transformer outputs.
- Basic understanding of nearest-neighbor search concepts.
Learning path
- Before: Text preprocessing and tokenization → Transformer basics → Vector similarity metrics.
- Now: Creating sentence embeddings (this lesson) → Indexing and approximate nearest neighbor search.
- After: Retrieval-augmented generation (RAG) → Evaluation and optimization → Domain adaptation or fine-tuning.
Next steps
- Run a small experiment comparing pooling strategies on your data.
- Establish a validation set of queries and relevant documents for ongoing tuning.
- Integrate embeddings into your retrieval stack and measure Recall@k regularly.
Quick Test
Take the Quick Test below to check your understanding.