Why this skill matters for Computer Vision Engineers
Feature extraction and embeddings turn raw pixels into compact vectors that capture visual meaning. This skill powers visual search, recommendation, near-duplicate detection, clustering large image libraries, and cross-modal retrieval (e.g., text-to-image). In production, strong embeddings cut serving costs (fewer heavy model calls per query), scale to millions of images with vector indexes, and enable rapid iteration via re-ranking and evaluation.
Who this is for
- Computer Vision Engineers moving from classification/detection toward retrieval, search, and large-scale image understanding.
- ML engineers building content discovery, deduplication, or recommendation features.
- Researchers and practitioners interested in metric learning and representation learning.
Prerequisites
- Comfortable with Python and NumPy.
- Basic PyTorch or TensorFlow for inference/training.
- Understanding of convolutional or transformer-based vision backbones.
- Familiarity with cosine similarity, L2 normalization, and train/validation splits.
Quick refresher: embeddings vs. features
Features are intermediate representations from a model. Embeddings are usually the final, fixed-length vectors used for similarity, indexing, and clustering. Good embeddings place similar items close together and dissimilar items far apart under a chosen distance metric (commonly cosine or Euclidean distance).
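A quick check: on L2-normalized vectors the two common metrics rank neighbors identically, because squared Euclidean distance equals two minus twice the cosine similarity. A minimal NumPy sketch of the identity:
import numpy as np
a = np.random.randn(512); a /= np.linalg.norm(a)
b = np.random.randn(512); b /= np.linalg.norm(b)
cos = float(a @ b)
eucl_sq = float(np.sum((a - b) ** 2))
# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(eucl_sq, 2 - 2 * cos)  # the two values should match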
Learning path
- Week 1: Build image embeddings using a pre-trained backbone. Normalize vectors. Run basic similarity search with a small dataset.
- Week 2: Scale to a vector index (Flat, IVF, HNSW). Measure recall@K and latency. Detect near-duplicates via thresholds.
- Week 3: Introduce hard negative mining and fine-tuning. Add a re-ranker to boost precision at top-K.
- Week 4: Evaluate with retrieval metrics (mAP, NDCG, P@K). Cluster the collection. Try cross-modal embeddings for text-to-image retrieval.
Practical roadmap and milestones
Milestone 1 — Solid baseline embeddings
- Use a pre-trained ResNet or ViT to extract 512–2048D vectors.
- L2-normalize vectors; prefer cosine similarity for retrieval.
- Verify basic nearest neighbor results on a small labeled set.
Milestone 2 — Fast search at scale
- Index 100k–1M vectors with IVF/HNSW. Tune nlist/nprobe or efConstruction/efSearch.
- Target recall@10 ≥ 0.95 relative to exact search, with a 3–10× speedup.
Milestone 3 — Quality boost with mining and re-ranking
- Mine hard negatives within mini-batches or from the index.
- Re-rank top 50–200 candidates using a heavier model or local feature matching.
Milestone 4 — Evaluation and maintenance
- Track mAP, NDCG, Precision@K, and pairwise duplicate precision/recall.
- Set up drift checks: embedding norm distribution, nearest-neighbor distance histogram.
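A minimal sketch of such drift checks, assuming embeddings are saved to a snapshot file (the file name and baseline handling are placeholders):
import numpy as np
vecs = np.load('todays_vectors.npy')  # (N, D) float32; hypothetical snapshot file
norms = np.linalg.norm(vecs, axis=1)
print('norm mean/std:', norms.mean(), norms.std())  # expect ~1.0 / ~0.0 if normalized
# Nearest-neighbor similarity histogram on a random sample
sample = vecs[np.random.choice(len(vecs), size=min(1000, len(vecs)), replace=False)]
sims = sample @ sample.T
np.fill_diagonal(sims, -1.0)  # exclude self-matches
nn_sims = sims.max(axis=1)
hist, _ = np.histogram(nn_sims, bins=20, range=(-1.0, 1.0))
print('NN similarity histogram:', hist)  # compare against a stored baseline to flag drift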
Worked examples
Example 1 — Build image embeddings with PyTorch
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
import numpy as np
# Load a pre-trained backbone and remove the classifier head
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()
transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
def embed_image(path):
    img = Image.open(path).convert('RGB')
    x = transform(img).unsqueeze(0)
    with torch.no_grad():
        v = backbone(x).squeeze(0).numpy()
    # L2-normalize for cosine similarity
    v = v / (np.linalg.norm(v) + 1e-12)
    return v  # 2048-d normalized vector
v1 = embed_image('cat1.jpg')
v2 = embed_image('cat2.jpg')
cos_sim = float(np.dot(v1, v2))
print('Cosine similarity:', cos_sim)
Tip: Always normalize embeddings if you use cosine or inner product search.
Example 2 — Similarity search with a vector index (FAISS, FlatIP)
import faiss
import numpy as np
# Suppose we have N normalized vectors of dim D
D = 2048
index = faiss.IndexFlatIP(D) # inner product on normalized vectors ~ cosine
# Build an index
db = np.load('db_vectors.npy').astype('float32') # shape (N, D), already normalized
index.add(db)
# Query
q = np.load('query_vectors.npy').astype('float32') # shape (Q, D), normalized
k = 5
sims, ids = index.search(q, k)
print('Top-5 ids per query:', ids)
print('Similarities:', sims)
Switch to IVF or HNSW for speed on large datasets, and tune the search parameters until recall closely matches the exact Flat index.
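For instance, with FAISS IVF, a hedged sketch (the nlist and nprobe values are starting points to tune, not recommendations):
import faiss
import numpy as np
db = np.load('db_vectors.npy').astype('float32')  # (N, D), normalized
q = np.load('query_vectors.npy').astype('float32')
D = db.shape[1]
quantizer = faiss.IndexFlatIP(D)
ivf = faiss.IndexIVFFlat(quantizer, D, 1024, faiss.METRIC_INNER_PRODUCT)  # nlist=1024
ivf.train(db)    # IVF requires a training pass before add()
ivf.add(db)
ivf.nprobe = 16  # clusters scanned per query; raise to trade speed for recall
# Measure recall@10 against the exact Flat index
flat = faiss.IndexFlatIP(D)
flat.add(db)
_, exact_ids = flat.search(q, 10)
_, approx_ids = ivf.search(q, 10)
recall10 = np.mean([len(set(a) & set(e)) / 10.0 for a, e in zip(approx_ids, exact_ids)])
print('recall@10 vs Flat:', recall10)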
Example 3 — Near-duplicate detection via thresholds
import numpy as np
# db: (N, D) normalized vectors
# Choose a conservative duplicate threshold
DUP_THR = 0.97 # adjust per dataset
# For a small batch, check pairwise duplicates
def find_near_duplicates(batch):
    # batch: (B, D) normalized
    sims = batch @ batch.T
    pairs = []
    B = batch.shape[0]
    for i in range(B):
        for j in range(i + 1, B):
            if sims[i, j] >= DUP_THR:
                pairs.append((i, j, float(sims[i, j])))
    return pairs
Start with a high threshold to avoid false positives, then tune based on labeled duplicate pairs.
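One way to tune it, sketched under the assumption that labeled pairs are available as (vector_a, vector_b, is_duplicate) triples (the data layout is illustrative):
import numpy as np
def best_threshold(labeled_pairs, candidates=np.arange(0.80, 1.00, 0.01)):
    # labeled_pairs: list of (va, vb, is_dup) with normalized vectors
    sims = np.array([float(va @ vb) for va, vb, _ in labeled_pairs])
    labels = np.array([bool(is_dup) for _, _, is_dup in labeled_pairs])
    best_thr, best_f1 = None, -1.0
    for thr in candidates:
        preds = sims >= thr
        tp = np.sum(preds & labels)
        fp = np.sum(preds & ~labels)
        fn = np.sum(~preds & labels)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0  # balances duplicate precision and recall
        if f1 > best_f1:
            best_thr, best_f1 = float(thr), f1
    return best_thr, best_f1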
Example 4 — Hard negative mining (in-batch)
import torch
# embeddings: (B, D) normalized image embeddings
# pos_index[i] gives the index of the positive for anchor i in the batch
def hardest_negatives(embeddings, pos_index):
    with torch.no_grad():
        sims = embeddings @ embeddings.T  # cosine similarity (vectors are normalized)
    B = embeddings.size(0)
    hard_neg_idx = []
    for i in range(B):
        sims_i = sims[i].clone()  # clone so masking does not mutate sims
        sims_i[i] = -1.0  # ignore self
        sims_i[pos_index[i]] = -1.0  # ignore positive
        hard_neg = torch.argmax(sims_i).item()  # most similar wrong item
        hard_neg_idx.append(hard_neg)
    return hard_neg_idx
Use mined negatives to train with contrastive/triplet losses and improve retrieval quality.
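As one option, the mined indices plug directly into PyTorch's built-in triplet loss; a minimal sketch reusing embeddings, pos_index, and hard_neg_idx from above:
import torch.nn.functional as F
def triplet_step(embeddings, pos_index, hard_neg_idx, margin=0.2):
    anchors = embeddings
    positives = embeddings[pos_index]    # fancy indexing with a list of ints
    negatives = embeddings[hard_neg_idx]
    # Pull anchors toward positives, push them away from mined hard negatives
    return F.triplet_margin_loss(anchors, positives, negatives, margin=margin)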
Example 5 — Cross-modal retrieval with a CLIP-like model
# Pseudocode to illustrate the flow
# image_encoder, text_encoder produce normalized embeddings in the same space
def text_to_image_search(text_queries, image_vecs):
    tq = text_encoder(text_queries)  # (T, D), normalized
    sims = tq @ image_vecs.T  # (T, N)
    topk = np.argsort(-sims, axis=1)[:, :10]
    return topk
Cross-modal embeddings enable text-to-image and image-to-text search with the same index.
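For a concrete starting point, here is a hedged sketch using the Hugging Face transformers CLIP wrapper (one of several options; open_clip works similarly, and the model name is just a common default):
import torch
from transformers import CLIPModel, CLIPProcessor
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
model.eval()
inputs = processor(text=['a red sneaker', 'a leather handbag'], return_tensors='pt', padding=True)
with torch.no_grad():
    text_vecs = model.get_text_features(**inputs)
text_vecs = torch.nn.functional.normalize(text_vecs, dim=-1)  # L2-normalize before indexing
# Images go through processor(images=...) and model.get_image_features(...) analogously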
Example 6 — Re-ranking with local feature matching (ORB)
import cv2
import numpy as np
# After coarse retrieval by embeddings, re-rank top-50 using keypoint matches
def orb_match_score(imgA, imgB):
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(imgA, None)
    kp2, des2 = orb.detectAndCompute(imgB, None)
    if des1 is None or des2 is None:
        return 0
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(des1, des2)
    matches = sorted(matches, key=lambda m: m.distance)
    # Higher score = better: negate the mean distance of the best matches
    return -np.mean([m.distance for m in matches[:30]]) if matches else 0
Local matching boosts precision for near-duplicates or products with small differences.
Drills and exercises
- Extract embeddings for 1k images; plot the norm distribution and confirm normalization.
- Compare cosine vs Euclidean distances on the same normalized vectors.
- Build a Flat index and measure latency/recall vs brute-force NumPy search.
- Switch to IVF or HNSW; tune parameters to reach ≥95% recall@10.
- Label 100 duplicate/non-duplicate pairs; find a working similarity threshold.
- Implement in-batch hard negative mining and confirm loss decreases.
- Add a re-ranker and measure change in Precision@5.
- Compute mAP and NDCG for your validation split.
- Cluster embeddings (k-means); inspect 10 random clusters qualitatively (see the sketch after this list).
- Run cross-modal retrieval: 20 text queries → top-10 images; review failures.
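For the clustering drill, a minimal k-means sketch with scikit-learn (the cluster count and file name are placeholders):
import numpy as np
from sklearn.cluster import KMeans
vecs = np.load('db_vectors.npy').astype('float32')  # (N, D), normalized
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(vecs)
# On normalized vectors, Euclidean k-means behaves like clustering by cosine similarity
for c in np.random.choice(50, size=10, replace=False):
    members = np.where(km.labels_ == c)[0]
    print('cluster', c, 'size', len(members), 'sample ids', members[:5])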
Common mistakes and debugging tips
- Not normalizing embeddings when using cosine/inner product. Fix: L2-normalize at both index build and query time.
- Comparing metrics across different candidate set sizes. Fix: Keep K and candidate pools consistent.
- Too-aggressive ANN settings causing recall collapse. Fix: Increase nprobe (IVF) or efSearch (HNSW) and re-measure.
- Mismatched preprocessing between index and query pipelines. Fix: Share the exact transform code and versions.
- Thresholds copied from another dataset. Fix: Calibrate on your labels; start conservative; iterate.
- Ignoring class imbalance in evaluation. Fix: Use mAP/NDCG and report metrics per category when relevant.
- Overfitting during fine-tuning with only easy negatives. Fix: Mine hard negatives or use semi-hard strategies.
Mini project: Visual search and dedup for a product catalog
- Collect 5k–20k product images; label 200 duplicate pairs and 200 non-duplicate similar pairs.
- Extract normalized embeddings with a pre-trained model; save as float32 vectors.
- Build an IVF or HNSW index; target 95% recall@10 vs Flat.
- Implement near-duplicate detection with a tuned cosine threshold.
- Add re-ranking for top-50 candidates using local ORB matches or a small binary classifier.
- Evaluate: Precision@5, Recall@50, mAP, duplicate precision/recall. Track latency.
- Optionally fine-tune with contrastive loss using mined hard negatives; re-evaluate.
Evaluation: retrieval metrics to track
- Precision@K and Recall@K per query.
- mAP and NDCG on labeled query-gallery splits.
- Duplicate detection precision/recall at a fixed threshold.
- Latency: P50/P95 for build and query; index size on disk.
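To make Precision@K and mAP concrete, a minimal sketch over a binary relevance list for one query (mAP is then the mean over queries; NDCG follows the same pattern with graded gains):
import numpy as np
def precision_at_k(ranked_rel, k):
    # ranked_rel: 0/1 relevance of retrieved items, in rank order
    return float(np.mean(ranked_rel[:k]))
def average_precision(ranked_rel):
    rel = np.asarray(ranked_rel, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # precision at each rank
    return float((precisions * rel).sum() / rel.sum())       # averaged over relevant hits
print(average_precision([1, 0, 1, 0, 0]))  # (1/1 + 2/3) / 2 ≈ 0.833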
Practical projects
- Personal photo search: retrieve similar photos and cluster events.
- Brand logo monitoring: detect duplicates and near-duplicates from social images.
- Fashion recommendation: image-to-image retrieval with re-ranking by keypoint matches.
Subskills
- Building Image Embeddings — Extract compact, normalized vectors from images using modern backbones.
- Similarity Search For Images — Use vector indexes for fast nearest neighbors at scale.
- Vector Index Concepts — Understand Flat, IVF, HNSW trade-offs and tuning.
- Hard Negative Mining Basics — Improve training by selecting challenging negatives.
- Reranking And Retrieval Evaluation — Boost precision with re-rankers and measure with mAP/NDCG.
- Near Duplicate Detection — Calibrate thresholds for robust duplicate finding.
- Clustering Image Collections — Group images to explore and deduplicate libraries.
- Cross Modal Embeddings Basics — Align text and image spaces for cross-modal search.
Next steps
- Pick one project above and complete it end-to-end with metrics and a short report.
- Try two different backbones and compare retrieval quality vs latency.
- Automate evaluation and drift checks; schedule periodic re-indexing.