
Embeddings And Retrieval

Learn Embeddings and Retrieval as an NLP Engineer for free: roadmap, examples, subskills, and a skill exam.

Published: January 5, 2026 | Updated: January 5, 2026

Why Embeddings and Retrieval matter for an NLP Engineer

Embeddings convert text into vectors so similar meanings sit near each other in vector space. Retrieval uses those vectors (and sometimes sparse text signals) to quickly find relevant content. As an NLP Engineer, this skill unlocks:

  • Search and recommendations that understand meaning, not just keywords.
  • Retrieval-Augmented Generation (RAG) to ground LLM answers in your data.
  • De-duplication, clustering, and semantic similarity detection.
  • Production-grade indexing, latency control, and relevance evaluation.

Latency vs accuracy: quick guidance
  • Fastest: a small embedding model with a Flat index (small corpora) or HNSW (larger corpora) and no reranking; relevance is limited by the model.
  • Balanced: ANN retrieval (HNSW/IVF) plus cross-encoder reranking of the top-k.
  • Highest quality: hybrid dense + sparse retrieval plus reranking; expect higher latency.

Who this is for

  • NLP Engineers building semantic search, RAG, or QA systems.
  • Data Scientists optimizing retrieval quality and evaluation.
  • ML Engineers deploying vector databases and ANN indexes.

Prerequisites

  • Python basics (lists, dicts, virtual envs)
  • Linear algebra essentials (vectors, dot product, cosine)
  • Familiarity with NumPy and scikit-learn
  • Basic understanding of train/validation split and metrics

Learning path

  1. Create and normalize sentence embeddings — generate vectors, L2-normalize, and compute cosine similarity.
    Mini task

    Embed 20 sentences. Find the top-3 most similar for any query. Inspect false positives and consider why they appeared.

  2. Indexing options — understand Flat vs ANN (HNSW/IVF), PQ for compression, and memory trade-offs.
    Mini task

    Index 10k vectors with Flat and HNSW. Compare latency and Recall@10 on the same queries.

  3. Similarity search and reranking — retrieve top-k with ANN; optionally rerank with a cross-encoder.
    Mini task

    Retrieve top-50, then rerank to top-5 with a cross-encoder. Measure MRR before vs after.

  4. Chunking strategies — split long docs into chunks with overlap; attach metadata for filtering.
    Mini task

    Test chunk sizes 200, 400, 800 words with 10–20% overlap. Plot Recall@5 vs average latency.

  5. Hybrid search — combine dense (embeddings) and sparse (TF-IDF) scores for robust retrieval.
    Mini task

    Blend scores: 0.7*dense + 0.3*sparse. Grid-search the weights and pick the best nDCG@10.

  6. Build evaluation sets and measure metrics — create labeled data; compute Recall@k, MRR, and nDCG.
    Mini task

    Collect 50 query–relevant pairs with graded relevance (0–3). Compute metrics for 3 model variants.

Worked examples

Example 1 — Create sentence embeddings and compute cosine similarity

from sentence_transformers import SentenceTransformer
import numpy as np

# 1) Load a compact embedding model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = [
    "How do I reset my account password?",
    "Password reset instructions",
    "Schedule a meeting for next Tuesday",
    "Troubleshooting login issues"
]

# 2) Encode and L2-normalize for cosine via dot product
emb = model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)

# 3) Query and compute similarities
query = "I forgot my password"
q_emb = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
scores = (emb @ q_emb.T).ravel()  # cosine similarity because normalized

for i in np.argsort(-scores):
    print(f"{scores[i]:.3f} :: {texts[i]}")

Example 2 — Build a FAISS index and search top-k

import faiss
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = [f"Document {i} about product support and onboarding" for i in range(10000)]
X = model.encode(corpus, convert_to_numpy=True, normalize_embeddings=True)

# Inner product index (with normalized vectors == cosine)
d = X.shape[1]
index = faiss.IndexFlatIP(d)
index.add(X)

query = "onboarding steps for new users"
q = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
scores, idx = index.search(q, 5)

print("Top-5:")
for rank, (score, j) in enumerate(zip(scores[0], idx[0]), 1):
    print(rank, f"score={score:.3f}", corpus[j])

Example 3 — Chunk documents with overlap and metadata

def chunk_text(text, chunk_size=300, overlap=60):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk_words = words[start:end]
        chunks.append({
            "text": " ".join(chunk_words),
            "start_word": start,
            "end_word": end,
        })
        if end == len(words):
            break
        start = max(end - overlap, 0)
    return chunks

long_doc = """Welcome to the user guide... (imagine many paragraphs here) ..."""
chunks = chunk_text(long_doc, chunk_size=250, overlap=50)
print("Created", len(chunks), "chunks")

Example 4 — Hybrid search: dense + sparse TF-IDF

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
import numpy as np

corpus = [
  "Reset password with the email link.",
  "Meeting scheduler integration for calendars.",
  "Account security and login troubleshooting.",
  "Password policy requirements and length."
]

# Dense
m = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
E = m.encode(corpus, convert_to_numpy=True, normalize_embeddings=True)
q = m.encode(["forgot password"], convert_to_numpy=True, normalize_embeddings=True)
S_dense = (E @ q.T).ravel()

# Sparse TF-IDF
vec = TfidfVectorizer()
Tf = vec.fit_transform(corpus)
q_tf = vec.transform(["forgot password"])
Tf = normalize(Tf)
q_tf = normalize(q_tf)
S_sparse = (Tf @ q_tf.T).toarray().ravel()

# Blend
alpha = 0.7
S = alpha*S_dense + (1-alpha)*S_sparse
for i in np.argsort(-S):
    print(f"{S[i]:.3f} :: {corpus[i]}")

Example 5 — Rerank with a cross-encoder

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

corpus = [
  "To reset your password, click the link sent to your email.",
  "We support meeting scheduling for teams.",
  "Two-factor authentication improves account security.",
  "Password length must be at least 12 characters."
]

bi = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
E = bi.encode(corpus, convert_to_numpy=True, normalize_embeddings=True)
query = "I forgot my password"
q = bi.encode([query], convert_to_numpy=True, normalize_embeddings=True)

# Retrieve top-3
scores = (E @ q.T).ravel()
topk_idx = np.argsort(-scores)[:3]
candidates = [corpus[i] for i in topk_idx]

# Rerank with cross-encoder (slower but more precise)
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, c) for c in candidates]
rerank_scores = ce.predict(pairs)
for s, c in sorted(zip(rerank_scores, candidates), key=lambda x: -x[0]):
    print(f"{s:.3f} :: {c}")

Tip: choosing k for reranking

Common pattern: retrieve 50–200 candidates with ANN, rerank top 10–50 with a cross-encoder. Tune on validation metrics to balance latency and quality.

Drills and exercises

  • Normalize embeddings and confirm cosine == dot product numerically (see the check after this list).
  • Compare top-10 from Flat vs HNSW; compute Recall@10 difference.
  • Run hybrid search with 5 alpha values; report best nDCG@10.
  • Evaluate chunk sizes (200/400/800 words); pick best by MRR@10.
  • Add metadata filtering (e.g., doc_type) and verify it improves precision.
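
For the first drill, a quick numerical check needs no model at all (a minimal sketch in plain NumPy):

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

# L2-normalize both vectors
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
print(np.isclose(cosine, dot))  # True: for unit vectors, cosine similarity equals the dot product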

Common mistakes and debugging tips

  • Unnormalized vectors: Cosine similarity breaks when vectors are not L2-normalized. Normalize after every encode.
  • Mismatch of similarity vs distance: Some indexes expect inner product; others expect L2. Align your index metric with how you compute scores.
  • Overly large chunks: Long chunks dilute relevance and increase latency. Start with 200–400 words and small overlaps.
  • Ignoring domain language: Generic models may miss domain terms. Try domain-tuned or multilingual variants where appropriate.
  • No evaluation set: Tuning without metrics leads to regressions. Build a small but representative labeled set early.
  • Reranking everything: Cross-encoders are slow. Retrieve many, rerank few.

Debugging checklist
  • Print nearest neighbors of a known query and manually judge quality.
  • Plot score histograms for positive vs negative examples.
  • A/B test alpha in hybrid scoring and log metrics.
  • Verify index recall by comparing ANN results to exact kNN on a small subset.

Building retrieval evaluation sets

  • Collect queries from real user intents or logs (anonymized/aggregated).
  • For each query, assign relevant documents with graded labels (0=not relevant, 1=somewhat, 2=relevant, 3=highly relevant); see the record sketch after this list.
  • Include hard negatives (similar surface text, wrong meaning) to stress-test.
  • Split into train/validation; keep a hidden test set for final checks.
  • Document annotation guidelines to keep labels consistent.
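
One possible record format for such a set (the field names are illustrative, not a required schema); grouping judgments by query makes it easy to compute the metrics described below:

# Illustrative qrels-style records: one graded judgment per (query, document) pair
judgments = [
    {"query_id": "q001", "query": "reset my password", "doc_id": "kb_0042#chunk3", "grade": 3},
    {"query_id": "q001", "query": "reset my password", "doc_id": "kb_0007#chunk1", "grade": 1},
    {"query_id": "q002", "query": "invite a teammate", "doc_id": "kb_0042#chunk3", "grade": 0},
]

# Group into {query_id: {doc_id: grade}} for metric computation
qrels = {}
for j in judgments:
    qrels.setdefault(j["query_id"], {})[j["doc_id"]] = j["grade"]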

Metrics: Recall, MRR, nDCG

  • Recall@k: Fraction of queries where at least one relevant item appears in the top-k.
  • MRR@k: Mean of 1/rank of the first relevant item (0 if none in top-k). Rewards putting a relevant result at rank 1.
  • nDCG@k: Accounts for graded relevance and position. Higher for highly relevant items at top ranks.

Quick manual calculation example

Suppose a query's top-5 results have graded labels [3,0,2,0,1]. DCG@5 = 3/log2(2) + 2/log2(4) + 1/log2(6). Normalize by ideal DCG (sorted labels) to get nDCG@5.
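
A minimal sketch of these metrics for a single query, operating on the graded labels of the ranked results (average over all queries in practice). It reproduces the manual calculation above, giving nDCG@5 ≈ 0.92:

import math

def dcg(grades):
    # Position p (1-based) contributes grade / log2(p + 1)
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(grades, k):
    ideal = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal if ideal > 0 else 0.0

def recall_at_k(grades, k, threshold=1):
    # "At least one relevant item in the top-k", as defined above
    return 1.0 if any(g >= threshold for g in grades[:k]) else 0.0

def mrr_at_k(grades, k, threshold=1):
    for rank, g in enumerate(grades[:k], start=1):
        if g >= threshold:
            return 1.0 / rank
    return 0.0

grades = [3, 0, 2, 0, 1]  # the top-5 labels from the example above
print(f"nDCG@5   = {ndcg_at_k(grades, 5):.3f}")
print(f"Recall@5 = {recall_at_k(grades, 5):.1f}")
print(f"MRR@5    = {mrr_at_k(grades, 5):.2f}")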

Mini project: RAG-ready Q&A retriever

  1. Prepare 50–200 knowledge-base articles; chunk into 200–400 word segments with 10–20% overlap. Store metadata (title, section, url_stub).
  2. Encode chunks with a sentence embedding model; L2-normalize vectors.
  3. Build an ANN index (e.g., HNSW or IVF) and persist it to disk (see the sketch after the stretch goals).
  4. Implement dense retrieval for top-50, optional sparse TF-IDF, and hybrid scoring with a tunable alpha.
  5. Rerank top-20 with a cross-encoder; return top-5 with snippets and metadata.
  6. Create an eval set of 30 queries with graded labels; compute Recall@5, MRR@10, nDCG@10. Try 3 configs and pick the best.

Stretch goals
  • Add simple filters: product=Pro, language=en.
  • Cache cross-encoder scores to reduce repeated latency.
  • Add a confidence score and fallback to keyword-only when low.
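
For step 3 of the mini project ("persist it to disk"), a minimal sketch assuming faiss; the dimensions, parameters, and file name are placeholders:

import faiss
import numpy as np

d = 384                                         # e.g., all-MiniLM-L6-v2 output dimension
X = np.random.rand(1000, d).astype("float32")   # stand-in for your normalized chunk embeddings
faiss.normalize_L2(X)

index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.hnsw.efConstruction = 200                 # build-time quality knob
index.add(X)

faiss.write_index(index, "kb_chunks_hnsw.faiss")   # persist the index
index = faiss.read_index("kb_chunks_hnsw.faiss")   # reload at serving time
index.hnsw.efSearch = 64                        # query-time recall/latency knob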

Practical projects

  • Semantic duplicate detector: flag near-duplicate FAQs using cosine similarity thresholds.
  • Topic explorer: cluster embeddings (e.g., k-means) and label clusters with representative terms (see the sketch after this list).
  • Multilingual retrieval: evaluate an English–Spanish corpus with a multilingual embedding model and compare to monolingual baselines.
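
For the topic explorer, one possible sketch: cluster sentence embeddings with k-means and label each cluster by its highest-scoring TF-IDF terms. The corpus, model name, and cluster count below are illustrative:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = [
    "Reset your password from the login page.",
    "Password reset emails may land in spam.",
    "Invite teammates to a shared workspace.",
    "Workspace roles control who can edit.",
    "Export reports as CSV or PDF.",
    "Schedule automatic report exports.",
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
E = model.encode(docs, convert_to_numpy=True, normalize_embeddings=True)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(E)

# Label each cluster with its top TF-IDF terms
vec = TfidfVectorizer(stop_words="english")
T = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())
for c in range(3):
    mask = labels == c
    mean_tfidf = np.asarray(T[mask].mean(axis=0)).ravel()
    top_terms = terms[np.argsort(-mean_tfidf)[:3]]
    print(f"cluster {c}: {', '.join(top_terms)}  ({int(mask.sum())} docs)")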

Next steps

  • Implement the mini project and record baseline metrics.
  • Tune chunk size, index parameters, and hybrid weights; rerun metrics.
  • Integrate with your application and monitor real-user queries for continuous improvement.

Embeddings And Retrieval — Skill Exam

This exam checks your understanding of embeddings, indexing, hybrid search, reranking, and evaluation. You can take it for free. Your score and progress are saved if you are logged in; otherwise you can still complete the exam, but progress will not be saved. Rules: choose the best answer(s); some questions are multiple-select. Passing score: 70%.

12 questions | 70% to pass
