Why this matters
Metric learning teaches models to map images to a space where similar items are close and different items are far. As a Computer Vision Engineer, you use it for:
- Face verification and de-duplication (is this the same person?).
- Product image search and near-duplicate detection in catalogs.
- Person re-identification across cameras.
- Image retrieval for landmarks, logos, and species identification.
- Patch matching for tracking and local feature matching.
Concept explained simply
Metric learning trains an embedding function f(x) that converts an image x into a vector. We then compare vectors with a distance or similarity measure. Training adjusts f so that vectors from the same identity or class end up close together and vectors from different identities end up far apart.
Mental model
Imagine placing each image as a dot on a unit sphere. The model learns to place dots from the same identity next to each other and dots from different identities far apart. Distance is typically cosine distance (derived from the angle between vectors) or Euclidean distance on L2-normalized vectors.
Siamese vs Triplet
- Siamese/Contrastive: use pairs (x1, x2) with a same/different label. The loss pulls "same" pairs together and pushes "different" pairs apart by at least a margin.
- Triplet: use (anchor a, positive p, negative n) and enforce distance(a, p) + margin < distance(a, n). The margin is a small constant such as 0.2.
Triplet loss intuition
For every anchor a, we want the positive p to be closer than any negative n by at least a margin m. If that is already true, the loss is zero. If not, the gradient nudges the embedding space to satisfy it.
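To make this concrete, here is a minimal triplet-loss sketch in PyTorch (the shapes and random inputs are illustrative stand-ins for real embeddings from f(x)):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge on d(a, p) + margin - d(a, n); zero once the triplet is satisfied.
    d_ap = F.pairwise_distance(anchor, positive)  # per-row Euclidean distance
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()

# Toy usage: 4 triplets of 128-D L2-normalized embeddings (random stand-ins).
a, p, n = (F.normalize(torch.randn(4, 128), dim=1) for _ in range(3))
print(triplet_loss(a, p, n))
```

PyTorch also ships torch.nn.TripletMarginLoss, which implements the same hinge.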
Distances and normalization
- L2-normalize embeddings: e = f(x) / ||f(x)||.
- Cosine similarity sim = e1 · e2. Cosine distance = 1 − sim.
- Squared Euclidean distance on L2-normalized vectors is a monotone function of cosine distance: ||e1 − e2||² = 2(1 − sim), so the two rank neighbors identically (see the check after this list).
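A quick numeric check of that identity, as a minimal NumPy sketch (the vectors are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=128), rng.normal(size=128)
e1, e2 = e1 / np.linalg.norm(e1), e2 / np.linalg.norm(e2)  # L2-normalize

sim = e1 @ e2                       # cosine similarity
sq_euclid = np.sum((e1 - e2) ** 2)  # squared Euclidean distance

# On the unit sphere: ||e1 - e2||^2 = 2 * (1 - sim)
print(np.isclose(sq_euclid, 2.0 * (1.0 - sim)))  # True
```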
Architecture essentials
- Backbone: a ResNet, MobileNet, or ViT that takes images and produces features.
- Projection head: a small MLP or linear layer to the desired embedding size (e.g., 128 or 512).
- Normalization: L2-normalize the final embedding vector (a model sketch follows this list).
- Loss: contrastive loss (pairs) or triplet loss (triplets). As a starting point, use triplet loss with a margin of 0.2–0.5.
- Optimization: Adam or SGD, moderate learning rate, weight decay to stabilize.
- Evaluation: k-NN search in embedding space, metrics like Recall@K or verification accuracy.
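Putting the pieces above together, one way to write a minimal embedding model in PyTorch (the ResNet-18 backbone, 128-D output, and torchvision dependency are example choices, not requirements):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class EmbeddingNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)  # example choice; load pretrained weights in practice
        backbone.fc = nn.Identity()               # drop the classification head
        self.backbone = backbone
        self.head = nn.Linear(512, dim)           # projection to the embedding size

    def forward(self, x):
        z = self.head(self.backbone(x))
        return F.normalize(z, dim=1)              # L2-normalize the final embedding

model = EmbeddingNet(dim=128)
emb = model(torch.randn(2, 3, 224, 224))
print(emb.shape, emb.norm(dim=1))                 # (2, 128), norms ~1.0
```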
Data and sampling
Triplet methods rely on good sampling; easy pairs/triplets produce no learning signal.
- PK sampling: pick P classes per batch and K images each (e.g., P=8, K=4 → 32 images). This yields many positives and negatives within a single batch (see the sampler sketch after this list).
- Online mining: form hard or semi-hard pairs/triplets from the current batch distances.
- Semi-hard negatives: negatives farther than the positive but within the margin window; safer than the hardest negatives early in training.
- Augmentations: flips, color jitter, crop/resize; keep identity intact.
- Label noise: noisy labels can destroy metric learning; clean data helps a lot.
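A minimal PK sampler sketch in plain Python (the flat labels list and toy class sizes are illustrative):

```python
import random
from collections import defaultdict

def pk_batch(labels, P=8, K=4):
    # Sample P classes, then K image indices per class, from a flat label list.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    eligible = [y for y, idxs in by_class.items() if len(idxs) >= K]
    batch = []
    for y in random.sample(eligible, P):
        batch.extend(random.sample(by_class[y], K))
    return batch  # P*K indices; each sampled class contributes K positives

labels = [i // 10 for i in range(200)]   # 20 toy classes, 10 images each
print(len(pk_batch(labels, P=8, K=4)))   # 32
```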
Batch construction recipe
- Randomly choose P identities/classes.
- Sample K images per identity.
- Run a forward pass on all P·K images to get embeddings.
- Build positives within each identity and negatives across identities.
- Mine semi-hard negatives per anchor (see the mining sketch after this list).
- Compute loss and backpropagate.
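A sketch of the mining step in PyTorch, assuming L2-normalized embeddings and a per-image labels vector (the double loop is written for clarity; production code would vectorize it):

```python
import torch
import torch.nn.functional as F

def semi_hard_triplets(emb, labels, margin=0.3):
    # For each (anchor, positive) pair, pick one random semi-hard negative:
    # d(a, p) < d(a, n) < d(a, p) + margin.
    dist = torch.cdist(emb, emb)                # (N, N) pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]   # True where identities match
    triplets = []
    for a in range(emb.size(0)):
        for p in same[a].nonzero(as_tuple=True)[0].tolist():
            if p == a:
                continue
            d_ap = dist[a, p]
            # Semi-hard: farther than the positive, but inside the margin window.
            ok = (~same[a]) & (dist[a] > d_ap) & (dist[a] < d_ap + margin)
            negs = ok.nonzero(as_tuple=True)[0]
            if len(negs) > 0:
                n = negs[torch.randint(len(negs), (1,))].item()
                triplets.append((a, p, n))
    return triplets

emb = F.normalize(torch.randn(12, 128), dim=1)  # toy batch: P=3 identities, K=4 images
labels = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
print(len(semi_hard_triplets(emb, labels)))
```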
Worked examples
Example 1: Face verification with triplet loss
- Backbone: lightweight ResNet.
- Embedding: 128-D with L2 normalization.
- Batch: P=16 identities, K=4 images.
- Mining: semi-hard within batch, margin=0.3.
- Eval: verification accuracy on held-out pairs at a cosine threshold (sketched after this example).
- Tip: freeze BatchNorm statistics in the early layers for a few epochs to stabilize training.
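A minimal sketch of the verification step (the pair arrays and the 0.5 threshold are illustrative; in practice the threshold is tuned on a validation split):

```python
import numpy as np

def verification_accuracy(emb1, emb2, same, threshold=0.5):
    # emb1, emb2: (N, D) L2-normalized pair embeddings; same: (N,) bool ground truth.
    sims = np.sum(emb1 * emb2, axis=1)  # cosine similarity per pair
    pred = sims >= threshold            # declare "same person" above the threshold
    return np.mean(pred == same)
```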
Example 2: Product image retrieval with contrastive loss
- Pairs from same SKU as positives; different SKUs as negatives.
- Use strong color/lighting augmentations to improve invariance.
- Contrastive loss with margin=1.0 on Euclidean distance (a minimal sketch follows this example).
- Eval: Recall@1, Recall@10 on a retrieval benchmark split.
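For reference, a minimal sketch of the pairwise contrastive loss used in this example (here `same` is assumed to be a float tensor: 1.0 for same-SKU pairs, 0.0 otherwise):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(e1, e2, same, margin=1.0):
    # Positives are pulled together; negatives are pushed out past the margin.
    d = F.pairwise_distance(e1, e2)
    pos = same * d.pow(2)                           # penalize distance for positives
    neg = (1.0 - same) * F.relu(margin - d).pow(2)  # penalize closeness for negatives
    return 0.5 * (pos + neg).mean()
```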
Example 3: Person re-identification with PK-sampling
- Backbone: ResNet-50 with the stride of the last stage reduced to preserve spatial resolution.
- Sampling: P=8, K=8 to boost in-batch positives.
- Use random erasing and crop jitter; keep aspect ratio consistent.
- Eval: CMC (Rank-k) and mAP using cosine similarity.
Example 4: Patch matching for tracking
- Construct positives by sampling patches of the same object across frames.
- Negatives from different objects or background patches.
- Small 64-D embeddings are often enough; L2-normalize.
Build your first metric-learning model (step-by-step)
- Choose backbone and embedding size (e.g., ResNet-18 → 128-D).
- Add projection head and L2 normalization.
- Implement PK batch sampler.
- Compute all pairwise distances in the batch.
- Mine semi-hard triplets per anchor.
- Apply triplet loss with margin 0.2–0.5.
- Train with Adam and monitor Recall@1 on a small validation set (an evaluation sketch follows this list).
- When stable, try harder mining or increase P and K.
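A minimal Recall@1 sketch using cosine similarity (NumPy; random toy data stands in for a real validation set, and embeddings are assumed L2-normalized):

```python
import numpy as np

def recall_at_1(emb, labels):
    # Fraction of queries whose nearest neighbor (self excluded) shares their label.
    sims = emb @ emb.T                 # cosine similarity (embeddings L2-normalized)
    np.fill_diagonal(sims, -np.inf)    # never match a query to itself
    nn = sims.argmax(axis=1)           # index of each query's nearest neighbor
    return float(np.mean(labels[nn] == labels))

emb = np.random.randn(30, 128)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
labels = np.repeat(np.arange(5), 6)    # 5 toy identities, 6 images each
print(recall_at_1(emb, labels))
```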
Quick sanity checks during training
- Embedding norms are ~1.0 (post L2-norm).
- Within-class distances decrease over epochs.
- Recall@1 improves and then plateaus; reduce the learning rate when it does.
Exercises
Exercise 1: Normalize embeddings and compute distances
- Create a tiny dataset: 3 identities (A,B,C), 3 images each (total 9).
- Mock embeddings (e.g., random but cluster A near [1,0,0], B near [0,1,0], C near [0,0,1]).
- L2-normalize; compute cosine similarities and distances.
- For each anchor, identify the closest positive and the closest negative.
- Report whether the closest positive is nearer than the closest negative for at least 80% of anchors (a starter scaffold is sketched below).
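A starter scaffold for this exercise (the cluster centers, noise level, and seed are only suggestions):

```python
import numpy as np

rng = np.random.default_rng(0)
centers = np.eye(3)                    # A -> [1,0,0], B -> [0,1,0], C -> [0,0,1]
labels = np.repeat([0, 1, 2], 3)       # 3 identities, 3 images each
emb = centers[labels] + 0.1 * rng.normal(size=(9, 3))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize

sims = emb @ emb.T                     # cosine similarities
dists = 1.0 - sims                     # cosine distances
np.fill_diagonal(dists, np.inf)        # exclude self-matches

hits = 0
for a in range(9):
    pos = dists[a][labels == labels[a]].min()  # closest positive (self is inf)
    neg = dists[a][labels != labels[a]].min()  # closest negative
    hits += pos < neg
print(f"closest positive beats closest negative for {hits}/9 anchors")
```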
Exercise 2: PK sampling and semi-hard triplets
- Simulate a batch with P=3 identities, K=3 images (9 total).
- Compute the distance matrix.
- For each anchor, choose a positive (same identity) and a semi-hard negative: one farther from the anchor than the positive, but by less than margin m=0.3, i.e., d(a,p) < d(a,n) < d(a,p) + m.
- List all valid triplets and count them.
Exercise checklist
- Embeddings are L2-normalized.
- Distance matrix is symmetric with zeros on the diagonal.
- Positives are closer than negatives for most anchors.
- Semi-hard negatives satisfy distance(a,p) < distance(a,n) < distance(a,p)+m.
Common mistakes and self-check
- Forgetting L2 normalization before measuring cosine distance.
- A margin that is too large or too small, causing either no learning signal or embedding collapse.
- Sampling only easy pairs/triplets: loss quickly goes to zero.
- Insufficient positives per class in a batch (low K).
- Label noise producing contradictory constraints.
- Evaluating on training identities, which overestimates performance.
Self-check routine
- Plot histograms of positive vs negative distances; they should separate over time (a plotting sketch follows this list).
- Track Recall@1 and ROC AUC on a clean validation split.
- Inspect mined negatives visually to ensure they are true negatives.
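A quick sketch of the histogram check (matplotlib assumed; emb and labels as in the Recall@1 sketch above):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_distance_histograms(emb, labels):
    dists = 1.0 - emb @ emb.T                    # cosine distances (emb L2-normalized)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # drop self-pairs
    plt.hist(dists[same & off_diag], bins=30, alpha=0.5, label="positive pairs")
    plt.hist(dists[~same], bins=30, alpha=0.5, label="negative pairs")
    plt.xlabel("cosine distance")
    plt.legend()
    plt.show()
```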
Practical projects
- Build a small face verification demo with triplet loss and a cosine threshold decision.
- Create a product image search tool: index embeddings and query with nearest neighbors.
- Implement person re-identification baseline and report Rank-1 and mAP on a small dataset split.
Mini challenge
Train a 128-D embedding model on a small multi-identity dataset. Target at least 85% Recall@1 on a held-out validation set using cosine similarity.
Acceptance criteria
- Uses PK sampling with P≥8, K≥4.
- L2 normalization enabled.
- Triplet loss with margin between 0.2 and 0.5.
- Validation code that computes Recall@1.
Who this is for
- Computer Vision Engineers building verification, retrieval, or de-duplication features.
- ML practitioners wanting robust embeddings over many classes.
Prerequisites
- Basic CNNs or vision backbones knowledge.
- Understanding of train/val splits and overfitting.
- Comfort with vector norms, dot product, cosine similarity.
Learning path
- Start: Contrastive vs Triplet loss and L2 normalization.
- Next: Mining strategies (hard, semi-hard, distance-weighted).
- Then: Proxy-based and margin losses (e.g., ArcFace-style), classification-to-embedding bridges.
- Deployment: ANN indexes, batch inference, and drift monitoring.
Next steps
- Complete the exercises above, then take the Quick Test below.
- Apply the mini challenge to a small dataset you can access.