
Creating Sentence Embeddings

Learn to create sentence embeddings for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

Sentence embeddings turn text into vectors that capture meaning. As an NLP Engineer, you will use them to:

  • Build semantic search and retrieval for RAG (retrieve relevant passages for a model to answer questions).
  • Cluster and deduplicate documents or FAQs by meaning, not just keywords.
  • Route intents in support tickets, emails, or chat conversations.
  • Find similar items: product recommendations, similar news, or duplicate bug reports.

Concept explained simply

A sentence embedding is a fixed-length numeric vector that represents the meaning of a sentence. Similar sentences have vectors that point in similar directions.

Mental model

Imagine a giant semantic map. Each sentence is a point. Sentences about the same idea are near each other; unrelated ones are far apart. When you normalize vectors, they lie on the surface of a unit sphere, and comparing angles (cosine similarity) tells you how related they are.

Cosine vs dot product — quick intuition
  • Cosine similarity compares direction only (magnitude is ignored); L2-normalizing vectors lets you compute it as a plain dot product.
  • Dot product includes magnitude. Use it if your model was trained with dot-product scoring or you intentionally keep magnitude information.
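
A minimal numpy sketch of the difference, using toy vectors rather than real embeddings:

import numpy as np

# Toy vectors: same direction, different magnitudes (illustrative values only).
a = np.array([1.0, 2.0, 2.0])   # norm 3
b = np.array([2.0, 4.0, 4.0])   # norm 6, same direction as a

dot = a @ b                                          # 18.0 (grows with magnitude)
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 (direction only)

# After L2-normalization the dot product equals the cosine similarity.
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(dot, cos, a_unit @ b_unit)                     # 18.0 1.0 1.0 (up to rounding)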

Key pieces you should know

  • Pooling: Turn token-level outputs into one sentence vector. Common choices: mean pooling (with attention mask), CLS pooling (use the [CLS] token), or max pooling. A short code sketch of mean pooling follows this list.
  • Normalization: L2-normalize vectors if you plan to use cosine similarity. This makes scoring stable across sentences.
  • Dimensions: Typical sizes range ~256–1024. Higher dimensions can improve expressiveness but increase memory and compute.
  • Domain: General-purpose models work well for broad semantic tasks; domain-specific models (e.g., legal, medical) can perform better on specialized text.
  • Consistency: Use the same embedding model and preprocessing for queries and documents.
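
Here is the sketch referenced in the Pooling bullet: masked mean pooling plus L2 normalization, applied to made-up token vectors (a real encoder would produce them):

import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring padded positions (mask value 0)."""
    mask = attention_mask[:, None].astype(float)      # shape (tokens, 1)
    return (token_vectors * mask).sum(axis=0) / mask.sum()

def l2_normalize(v):
    """Scale a vector to unit length so cosine similarity becomes a plain dot product."""
    return v / np.linalg.norm(v)

# Made-up token-level outputs (3-dimensional) with one padded position.
tokens = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 0.0]])
mask = np.array([1, 1, 0])
print(l2_normalize(mean_pool(tokens, mask)))          # ≈ [0.577 0.577 0.577]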

Workflow: creating and using sentence embeddings

  1. Choose a model
    Pick a model tuned for sentence similarity and your domain/language.
  2. Preprocess
    Consistent casing, punctuation handling, and tokenization. Chunk long documents into passages (e.g., 200–400 tokens) with small overlaps.
  3. Encode
    Get token-level outputs and apply pooling (mean with attention mask is a strong default).
  4. Normalize
    L2-normalize embeddings if using cosine similarity.
  5. Index
    Store vectors in a vector index (FAISS, HNSW, or built-in engine). Keep IDs to map back to text.
  6. Retrieve
    For a query, compute its embedding, then return top-k nearest neighbors (and optionally apply a similarity threshold). An end-to-end code sketch of these steps follows the list.
  7. Evaluate
    Use labeled data to compute Recall@k, MRR, or NDCG. Manually spot-check edge cases.
  8. Optimize
    Tune chunk size/overlap, switch similarity (cosine/dot) to match training, or try a domain model. Consider dimension reduction if memory is tight.
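
A compact end-to-end sketch of the steps above, using the sentence-transformers and faiss libraries. The model name, passages, threshold, and k are placeholders; your stack and settings may differ.

# pip install sentence-transformers faiss-cpu  (assumed dependencies)
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# 1) Choose a model (placeholder choice; pick one suited to your domain/language).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# 2)-4) Encode passages and L2-normalize so inner product equals cosine similarity.
passages = [
    "How do I reset my password?",
    "Shipping takes 3-5 business days.",
    "Refunds are issued within 7 days of return.",
]
doc_vecs = model.encode(passages, normalize_embeddings=True)

# 5) Index: exact inner-product search; list positions serve as IDs back into passages.
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(np.asarray(doc_vecs, dtype="float32"))

# 6) Retrieve: embed the query, take top-k, optionally filter by a similarity threshold.
query_vec = np.asarray(model.encode(["I forgot my password"], normalize_embeddings=True), dtype="float32")
scores, ids = index.search(query_vec, 2)
for score, i in zip(scores[0], ids[0]):
    if score >= 0.5:  # threshold is an assumption to tune on validation queries
        print(f"{score:.2f}  {passages[i]}")

Because the vectors are normalized, the inner-product scores are cosine similarities, which keeps the evaluation and threshold tuning in steps 7 and 8 easy to interpret.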

Worked examples

Example 1 — Mean pooling with attention mask

Suppose a token-level encoder produces 4 token vectors for a sentence, but the 4th is padding:

  • Token vectors: t1=(1, 2, 0), t2=(0, 1, 1), t3=(1, 1, 0), t4=(0, 0, 0)
  • Attention mask: [1, 1, 1, 0]

Mean pooling over real tokens only:

mean = ( (1+0+1)/3, (2+1+1)/3, (0+1+0)/3 ) = (0.6667, 1.3333, 0.3333)

L2-normalize (optional but recommended for cosine):

||v|| ≈ sqrt(0.4444 + 1.7778 + 0.1111) ≈ 1.5275; normalized ≈ (0.436, 0.873, 0.218)

Example 2 — Cosine similarity by hand

Embeddings (already normalized for clarity):

  • a = (0.577, 0.577, 0.577)
  • b = (0.436, 0.873, 0.218)

cos(a, b) = a · b ≈ 0.577*0.436 + 0.577*0.873 + 0.577*0.218 ≈ 0.88

Interpretation: strong semantic similarity.
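
Both worked examples can be reproduced in a few lines of numpy:

import numpy as np

# Example 1: masked mean pooling, then L2 normalization.
tokens = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
mask = np.array([1.0, 1.0, 1.0, 0.0])
mean = (tokens * mask[:, None]).sum(axis=0) / mask.sum()   # [0.667 1.333 0.333]
b = mean / np.linalg.norm(mean)                            # [0.436 0.873 0.218]

# Example 2: for unit vectors, cosine similarity is just the dot product.
a = np.array([0.577, 0.577, 0.577])
print(round(float(a @ b), 2))                              # 0.88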

Example 3 — Retrieval with top-k and threshold

Query q (unit length): (0.8, 0.6)

Docs (unit length): d1=(0.7071,0.7071), d2=(1,0), d3=(0,1), d4=(0.6,0.8)

  • cos(q,d1)≈0.99
  • cos(q,d2)=0.8
  • cos(q,d3)=0.6
  • cos(q,d4)=0.96

Top-2 → [d1, d4]. With threshold=0.7, the documents that pass are d1, d4, and d2 (d3 falls below); combined with k=2, we keep [d1, d4].
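
The same top-k-plus-threshold selection, sketched in numpy with the vectors from this example:

import numpy as np

q = np.array([0.8, 0.6])
docs = np.array([[0.7071, 0.7071], [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])

scores = docs @ q                      # cosine similarities, since all vectors are unit length
top_k = np.argsort(-scores)[:2]        # rank by descending score, keep k = 2
kept = [i for i in top_k if scores[i] >= 0.7]   # then apply the threshold as a cutoff
print([f"d{i + 1}" for i in kept], scores[top_k].round(2))  # ['d1', 'd4'] [0.99 0.96]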

Exercises (hands-on)

Work through the exercises below, then check your answers. Use the checklist to self-verify your process.

  • [ ] Exercise 1: Pooling and cosine similarity for two sentences (mean-pool, normalize, compute similarity).
  • [ ] Exercise 2: Rank four document embeddings for a query using cosine similarity; return top-2 with a threshold.
Self-check checklist
  • [ ] Used attention mask during pooling (ignored padding).
  • [ ] L2-normalized embeddings before cosine similarity.
  • [ ] Applied both top-k and threshold correctly (top-k after scoring; threshold as a cutoff).
  • [ ] Wrote down numeric steps, not just the final answer.

Common mistakes and how to catch them

  • Forgetting normalization: Cosine scores drift if you skip L2-normalization (when the model expects it). Fix: Always normalize for cosine.
  • Pooling over padding: Including padded tokens skews the mean. Fix: Use attention masks in pooling.
  • Mixing models: Using one model for documents and another for queries hurts similarity. Fix: Keep the same model and preprocessing.
  • Overly large chunks: Long passages dilute meaning. Fix: Aim for passage-sized chunks (e.g., 200–400 tokens) with small overlaps to preserve context continuity.
  • Threshold too high: You get empty results. Fix: Tune with validation queries; start around 0.6–0.8 for cosine and adjust.
  • Confusing cosine with dot product: Choose what the model was trained for, and be consistent.
Quick self-audit
  • Do your positive pairs rank above negatives in a small labeled set?
  • Are scores stable if you shuffle document order? (If not, check indexing logic.)
  • Do near-duplicate sentences actually have high similarity? If not, check normalization and pooling.

Practical projects

  • Semantic FAQ search: Encode FAQs and user questions; return top-3 answers with similarity scores and show supporting passages.
  • Ticket intent routing: Map incoming emails to predefined intents using nearest-neighbor lookup over intent labels.
  • Duplicate detection: Cluster product reviews or bug reports by embedding similarity and flag near-duplicates.
  • RAG for internal docs: Chunk documents, index embeddings, retrieve top-k passages, and feed them into a downstream model for answer synthesis.

Mini challenge

You have 100 short product descriptions and 20 user queries. Build a quick evaluation: for each query, mark 1–3 relevant products, then measure Recall@3. Try two pooling strategies (mean vs CLS) and report which yields higher Recall@3. What changed and why?
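
A minimal Recall@3 harness for this challenge might look like the sketch below; the query-to-results mapping and relevance labels are placeholder data you would replace with your own.

def recall_at_k(retrieved, relevant, k=3):
    """Average, over queries, of the fraction of relevant items found in the top-k results."""
    per_query = []
    for query, ranked_ids in retrieved.items():
        gold = relevant[query]
        hits = len(gold & set(ranked_ids[:k]))
        per_query.append(hits / len(gold))
    return sum(per_query) / len(per_query)

# Placeholder results: run once with mean pooling, once with CLS pooling, and compare.
retrieved = {"waterproof jacket": ["p7", "p2", "p9"], "wireless earbuds": ["p4", "p1", "p8"]}
relevant = {"waterproof jacket": {"p7", "p3"}, "wireless earbuds": {"p4"}}
print(recall_at_k(retrieved, relevant, k=3))   # 0.75 for this toy labeling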

Who this is for

  • NLP Engineers and Data Scientists building search, RAG, or clustering pipelines.
  • ML Engineers integrating semantic matching into products.
  • Analysts prototyping meaning-based retrieval without full model training.

Prerequisites

  • Comfort with vectors and basic linear algebra (dot product, norms).
  • Familiarity with tokenization and transformer outputs.
  • Basic understanding of nearest-neighbor search concepts.

Learning path

  • Before: Text preprocessing and tokenization → Transformer basics → Vector similarity metrics.
  • Now: Creating sentence embeddings (this lesson) → Indexing and approximate nearest neighbor search.
  • After: Retrieval-augmented generation (RAG) → Evaluation and optimization → Domain adaptation or fine-tuning.

Next steps

  • Run a small experiment comparing pooling strategies on your data.
  • Establish a validation set of queries and relevant documents for ongoing tuning.
  • Integrate embeddings into your retrieval stack and measure Recall@k regularly.

Quick Test

Take the Quick Test below to check your understanding.

Practice Exercises

2 exercises to complete

Instructions

You have two sentences, A and B. Their token embeddings (3D) and attention masks are:

  • A tokens: t1=(1,2,0), t2=(0,1,1), t3=(1,1,0), t4=(0,0,0); mask=[1,1,1,0]
  • B tokens: u1=(1,0,1), u2=(0,1,0); mask=[1,1]
  1. Compute mean-pooled embeddings for A and B using the masks (ignore padding).
  2. L2-normalize both sentence vectors.
  3. Compute cosine similarity between the normalized vectors.
Expected Output
A_normalized ≈ (0.436, 0.873, 0.218), B_normalized ≈ (0.577, 0.577, 0.577), cosine ≈ 0.88

Creating Sentence Embeddings — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

