Why this skill matters for Computer Vision Engineers
Feature extraction and embeddings turn raw pixels into compact vectors that capture visual meaning. This skill powers visual search, recommendation, near-duplicate detection, clustering large image libraries, and cross-modal retrieval (e.g., text-to-image). In production, strong embeddings cut serving costs (fewer heavy model calls per query), scale to millions of images with vector indexes, and enable rapid iteration via re-ranking and evaluation.
Who this is for
- Computer Vision Engineers moving from classification/detection toward retrieval, search, and large-scale image understanding.
- ML engineers building content discovery, deduplication, or recommendation features.
- Researchers and practitioners interested in metric learning and representation learning.
Prerequisites
- Comfortable with Python and NumPy.
- Basic PyTorch or TensorFlow for inference/training.
- Understanding of convolutional or transformer-based vision backbones.
- Familiarity with cosine similarity, L2 normalization, and train/validation splits.
Quick refresher: embeddings vs. features
Features are intermediate representations from a model. Embeddings are usually the final, fixed-length vectors used for similarity, indexing, and clustering. Good embeddings place similar items close together and dissimilar items far apart under a chosen distance metric (commonly cosine or Euclidean distance).
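A quick check: on L2-normalized vectors the two common metrics rank neighbors identically, because squared Euclidean distance equals two minus twice the cosine similarity. A minimal NumPy sketch of the identity:
import numpy as np
a = np.random.randn(512); a /= np.linalg.norm(a)
b = np.random.randn(512); b /= np.linalg.norm(b)
cos = float(a @ b)
eucl_sq = float(np.sum((a - b) ** 2))
# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(eucl_sq, 2 - 2 * cos)  # the two values should match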
Learning path
- Week 1: Build image embeddings using a pre-trained backbone. Normalize vectors. Run basic similarity search with a small dataset.
- Week 2: Scale to a vector index (Flat, IVF, HNSW). Measure recall@K and latency. Detect near-duplicates via thresholds.
- Week 3: Introduce hard negative mining and fine-tuning. Add a re-ranker to boost precision at top-K.
- Week 4: Evaluate with retrieval metrics (mAP, NDCG, P@K). Cluster the collection. Try cross-modal embeddings for text-to-image retrieval.
Practical roadmap and milestones
Milestone 1 — Solid baseline embeddings
- Use a pre-trained ResNet or ViT to extract 512–2048D vectors.
- L2-normalize vectors; prefer cosine similarity for retrieval.
- Verify basic nearest neighbor results on a small labeled set.
Milestone 2 — Fast search at scale
- Index 100k–1M vectors with IVF/HNSW. Tune nlist/nprobe or efConstruction/efSearch.
- Target recall@10 ≥ 0.95 relative to exact search, with a 3–10× speedup.
Milestone 3 — Quality boost with mining and re-ranking
- Mine hard negatives within mini-batches or from the index.
- Re-rank top 50–200 candidates using a heavier model or local feature matching.
Milestone 4 — Evaluation and maintenance
- Track mAP, NDCG, Precision@K, and pairwise duplicate precision/recall.
- Set up drift checks: embedding norm distribution, nearest-neighbor distance histogram.
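A minimal sketch of such drift checks, assuming embeddings are saved to a snapshot file (the file name and baseline handling are placeholders):
import numpy as np
vecs = np.load('todays_vectors.npy')  # (N, D) float32; hypothetical snapshot file
norms = np.linalg.norm(vecs, axis=1)
print('norm mean/std:', norms.mean(), norms.std())  # expect ~1.0 / ~0.0 if normalized
# Nearest-neighbor similarity histogram on a random sample
sample = vecs[np.random.choice(len(vecs), size=min(1000, len(vecs)), replace=False)]
sims = sample @ sample.T
np.fill_diagonal(sims, -1.0)  # exclude self-matches
nn_sims = sims.max(axis=1)
hist, _ = np.histogram(nn_sims, bins=20, range=(-1.0, 1.0))
print('NN similarity histogram:', hist)  # compare against a stored baseline to flag drift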
Worked examples
Example 1 — Build image embeddings with PyTorch
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
import numpy as np
# Load a pre-trained backbone and remove the classifier head
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()
transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
def embed_image(path):
    img = Image.open(path).convert('RGB')
    x = transform(img).unsqueeze(0)
    with torch.no_grad():
        v = backbone(x).squeeze(0).numpy()
    # L2-normalize for cosine similarity
    v = v / (np.linalg.norm(v) + 1e-12)
    return v  # 2048-d normalized vector
v1 = embed_image('cat1.jpg')
v2 = embed_image('cat2.jpg')
cos_sim = float(np.dot(v1, v2))
print('Cosine similarity:', cos_sim)
Tip: Always normalize embeddings if you use cosine or inner product search.
Example 2 — Similarity search with a vector index (FAISS, FlatIP)
import faiss
import numpy as np
# Suppose we have N normalized vectors of dim D
D = 2048
index = faiss.IndexFlatIP(D) # inner product on normalized vectors ~ cosine
# Build an index
db = np.load('db_vectors.npy').astype('float32') # shape (N, D), already normalized
index.add(db)
# Query
q = np.load('query_vectors.npy').astype('float32') # shape (Q, D), normalized
k = 5
sims, ids = index.search(q, k)
print('Top-5 ids per query:', ids)
print('Similarities:', sims)
Switch to IVF or HNSW for speed on large datasets, and tune the search parameters until recall closely matches the exact Flat index.
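For instance, with FAISS IVF, a hedged sketch (the nlist and nprobe values are starting points to tune, not recommendations):
import faiss
import numpy as np
db = np.load('db_vectors.npy').astype('float32')  # (N, D), normalized
q = np.load('query_vectors.npy').astype('float32')
D = db.shape[1]
quantizer = faiss.IndexFlatIP(D)
ivf = faiss.IndexIVFFlat(quantizer, D, 1024, faiss.METRIC_INNER_PRODUCT)  # nlist=1024
ivf.train(db)    # IVF requires a training pass before add()
ivf.add(db)
ivf.nprobe = 16  # clusters scanned per query; raise to trade speed for recall
# Measure recall@10 against the exact Flat index
flat = faiss.IndexFlatIP(D)
flat.add(db)
_, exact_ids = flat.search(q, 10)
_, approx_ids = ivf.search(q, 10)
recall10 = np.mean([len(set(a) & set(e)) / 10.0 for a, e in zip(approx_ids, exact_ids)])
print('recall@10 vs Flat:', recall10)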
Example 3 — Near-duplicate detection via thresholds
import numpy as np
# db: (N, D) normalized vectors
# Choose a conservative duplicate threshold
DUP_THR = 0.97 # adjust per dataset
# For a small batch, check pairwise duplicates
def find_near_duplicates(batch):
    # batch: (B, D) normalized
    sims = batch @ batch.T
    pairs = []
    B = batch.shape[0]
    for i in range(B):
        for j in range(i + 1, B):
            if sims[i, j] >= DUP_THR:
                pairs.append((i, j, float(sims[i, j])))
    return pairs
Start with a high threshold to avoid false positives, then tune based on labeled duplicate pairs.
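One way to tune it, sketched under the assumption that labeled pairs are available as (vector_a, vector_b, is_duplicate) triples (the data layout is illustrative):
import numpy as np
def best_threshold(labeled_pairs, candidates=np.arange(0.80, 1.00, 0.01)):
    # labeled_pairs: list of (va, vb, is_dup) with normalized vectors
    sims = np.array([float(va @ vb) for va, vb, _ in labeled_pairs])
    labels = np.array([bool(is_dup) for _, _, is_dup in labeled_pairs])
    best_thr, best_f1 = None, -1.0
    for thr in candidates:
        preds = sims >= thr
        tp = np.sum(preds & labels)
        fp = np.sum(preds & ~labels)
        fn = np.sum(~preds & labels)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0  # balances duplicate precision and recall
        if f1 > best_f1:
            best_thr, best_f1 = float(thr), f1
    return best_thr, best_f1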
Example 4 — Hard negative mining (in-batch)
import torch
# embeddings: (B, D) normalized image embeddings
# pos_index[i] gives the index of the positive for anchor i in the batch
def hardest_negatives(embeddings, pos_index):
    with torch.no_grad():
        sims = embeddings @ embeddings.T  # cosine similarity (vectors are normalized)
    B = embeddings.size(0)
    hard_neg_idx = []
    for i in range(B):
        sims_i = sims[i].clone()  # clone so masking does not mutate sims
        sims_i[i] = -1.0  # ignore self
        sims_i[pos_index[i]] = -1.0  # ignore positive
        hard_neg = torch.argmax(sims_i).item()  # most similar wrong item
        hard_neg_idx.append(hard_neg)
    return hard_neg_idx
Use mined negatives to train with contrastive/triplet losses and improve retrieval quality.
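As one option, the mined indices plug directly into PyTorch's built-in triplet loss; a minimal sketch reusing embeddings, pos_index, and hard_neg_idx from above:
import torch.nn.functional as F
def triplet_step(embeddings, pos_index, hard_neg_idx, margin=0.2):
    anchors = embeddings
    positives = embeddings[pos_index]    # fancy indexing with a list of ints
    negatives = embeddings[hard_neg_idx]
    # Pull anchors toward positives, push them away from mined hard negatives
    return F.triplet_margin_loss(anchors, positives, negatives, margin=margin)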
Example 5 — Cross-modal retrieval with a CLIP-like model
# Pseudocode to illustrate the flow
# image_encoder, text_encoder produce normalized embeddings in the same space
def text_to_image_search(text_queries, image_vecs):
    tq = text_encoder(text_queries)  # (T, D), normalized
    sims = tq @ image_vecs.T  # (T, N)
    topk = np.argsort(-sims, axis=1)[:, :10]
    return topk
Cross-modal embeddings enable text-to-image and image-to-text search with the same index.
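For a concrete starting point, here is a hedged sketch using the Hugging Face transformers CLIP wrapper (one of several options; open_clip works similarly, and the model name is just a common default):
import torch
from transformers import CLIPModel, CLIPProcessor
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
model.eval()
inputs = processor(text=['a red sneaker', 'a leather handbag'], return_tensors='pt', padding=True)
with torch.no_grad():
    text_vecs = model.get_text_features(**inputs)
text_vecs = torch.nn.functional.normalize(text_vecs, dim=-1)  # L2-normalize before indexing
# Images go through processor(images=...) and model.get_image_features(...) analogously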
Example 6 — Re-ranking with local feature matching (ORB)
import cv2
import numpy as np
# After coarse retrieval by embeddings, re-rank top-50 using keypoint matches
def orb_match_score(imgA, imgB):
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(imgA, None)
    kp2, des2 = orb.detectAndCompute(imgB, None)
    if des1 is None or des2 is None:
        return 0
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(des1, des2)
    matches = sorted(matches, key=lambda m: m.distance)
    # Higher score = better: negate the mean distance of the best matches
    return -np.mean([m.distance for m in matches[:30]]) if matches else 0
Local matching boosts precision for near-duplicates or products with small differences.
Drills and exercises
- Extract embeddings for 1k images; plot the norm distribution and confirm normalization.
- Compare cosine vs Euclidean distances on the same normalized vectors.
- Build a Flat index and measure latency/recall vs brute-force NumPy search.
- Switch to IVF or HNSW; tune parameters to reach ≥95% recall@10.
- Label 100 duplicate/non-duplicate pairs; find a working similarity threshold.
- Implement in-batch hard negative mining and confirm loss decreases.
- Add a re-ranker and measure change in Precision@5.
- Compute mAP and NDCG for your validation split.
- Cluster embeddings (k-means); inspect 10 random clusters qualitatively (see the sketch after this list).
- Run cross-modal retrieval: 20 text queries → top-10 images; review failures.
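For the clustering drill, a minimal k-means sketch with scikit-learn (the cluster count and file name are placeholders):
import numpy as np
from sklearn.cluster import KMeans
vecs = np.load('db_vectors.npy').astype('float32')  # (N, D), normalized
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(vecs)
# On normalized vectors, Euclidean k-means behaves like clustering by cosine similarity
for c in np.random.choice(50, size=10, replace=False):
    members = np.where(km.labels_ == c)[0]
    print('cluster', c, 'size', len(members), 'sample ids', members[:5])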
Common mistakes and debugging tips
- Not normalizing embeddings when using cosine/inner product. Fix: L2-normalize at both index build and query time.
- Comparing metrics across different candidate set sizes. Fix: Keep K and candidate pools consistent.
- Too-aggressive ANN settings causing recall collapse. Fix: Increase nprobe (IVF) or efSearch (HNSW) and re-measure.
- Mismatched preprocessing between index and query pipelines. Fix: Share the exact transform code and versions.
- Thresholds copied from another dataset. Fix: Calibrate on your labels; start conservative; iterate.
- Ignoring class imbalance in evaluation. Fix: Use mAP/NDCG and report metrics per category when relevant.
- Overfitting during fine-tuning with only easy negatives. Fix: Mine hard negatives or use semi-hard strategies.
Mini project: Visual search and dedup for a product catalog
- Collect 5k–20k product images; label 200 duplicate pairs and 200 non-duplicate similar pairs.
- Extract normalized embeddings with a pre-trained model; save as float32 vectors.
- Build an IVF or HNSW index; target 95% recall@10 vs Flat.
- Implement near-duplicate detection with a tuned cosine threshold.
- Add re-ranking for top-50 candidates using local ORB matches or a small binary classifier.
- Evaluate: Precision@5, Recall@50, mAP, duplicate precision/recall. Track latency.
- Optionally fine-tune with contrastive loss using mined hard negatives; re-evaluate.
Evaluation: retrieval metrics to track
- Precision@K and Recall@K per query.
- mAP and NDCG on labeled query-gallery splits.
- Duplicate detection precision/recall at a fixed threshold.
- Latency: P50/P95 for build and query; index size on disk.
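To make Precision@K and mAP concrete, a minimal sketch over a binary relevance list for one query (mAP is then the mean over queries; NDCG follows the same pattern with graded gains):
import numpy as np
def precision_at_k(ranked_rel, k):
    # ranked_rel: 0/1 relevance of retrieved items, in rank order
    return float(np.mean(ranked_rel[:k]))
def average_precision(ranked_rel):
    rel = np.asarray(ranked_rel, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # precision at each rank
    return float((precisions * rel).sum() / rel.sum())       # averaged over relevant hits
print(average_precision([1, 0, 1, 0, 0]))  # (1/1 + 2/3) / 2 ≈ 0.833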
Practical projects
- Personal photo search: retrieve similar photos and cluster events.
- Brand logo monitoring: detect duplicates and near-duplicates from social images.
- Fashion recommendation: image-to-image retrieval with re-ranking by keypoint matches.
Subskills
- Building Image Embeddings — Extract compact, normalized vectors from images using modern backbones.
- Similarity Search For Images — Use vector indexes for fast nearest neighbors at scale.
- Vector Index Concepts — Understand Flat, IVF, HNSW trade-offs and tuning.
- Hard Negative Mining Basics — Improve training by selecting challenging negatives.
- Reranking And Retrieval Evaluation — Boost precision with re-rankers and measure with mAP/NDCG.
- Near Duplicate Detection — Calibrate thresholds for robust duplicate finding.
- Clustering Image Collections — Group images to explore and deduplicate libraries.
- Cross Modal Embeddings Basics — Align text and image spaces for cross-modal search.
Next steps
- Pick one project above and complete it end-to-end with metrics and a short report.
- Try two different backbones and compare retrieval quality vs latency.
- Automate evaluation and drift checks; schedule periodic re-indexing.