Why this matters for Computer Vision Engineers
Vision model architectures are the backbone of real-world visual systems. Choosing and adapting the right architecture determines speed, accuracy, robustness, and feasibility on target hardware. Mastering CNNs, Vision Transformers, detection/segmentation heads, keypoint models, metric learning, and multi-task designs lets you ship reliable solutions across classification, detection, segmentation, pose, retrieval, and more.
Who this is for
- Engineers building production CV systems (edge or cloud)
- ML practitioners moving from tabular/NLP into vision
- Students preparing for CV-focused roles
Prerequisites
- Python basics and PyTorch or Keras/TensorFlow familiarity
- Linear algebra and probability fundamentals
- Comfort with training loops, losses, and GPU usage
Learning path
Worked examples
Example 1: Minimal CNN backbone and head (PyTorch)
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block: two 3x3 conv-BN layers with an identity skip connection."""
    def __init__(self, c):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, bias=False),
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, bias=False),
            nn.BatchNorm2d(c)
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.net(x) + x)  # add the input back in (residual connection)

class TinyResNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Stem: downsample 2x and lift RGB to 32 channels
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True)
        )
        self.stage1 = nn.Sequential(BasicBlock(32), BasicBlock(32))
        # Strided conv downsamples again and widens to 64 channels
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(), BasicBlock(64))
        self.pool = nn.AdaptiveAvgPool2d(1)     # global average pooling
        self.head = nn.Linear(64, num_classes)  # classifier head

    def forward(self, x):
        x = self.stem(x)
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.pool(x).flatten(1)
        return self.head(x)

x = torch.randn(4, 3, 128, 128)
model = TinyResNet(num_classes=5)
logits = model(x)
print(logits.shape)  # torch.Size([4, 5])
Why it matters: Understand how feature hierarchies form and how a head attaches to a backbone.
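To make the hierarchy visible, run the stages of the TinyResNet above one at a time and print the intermediate shapes; a minimal sketch reusing the model and x from the example:

with torch.no_grad():
    f0 = model.stem(x)     # torch.Size([4, 32, 64, 64]): stem halves the resolution
    f1 = model.stage1(f0)  # torch.Size([4, 32, 64, 64]): residual blocks keep it fixed
    f2 = model.stage2(f1)  # torch.Size([4, 64, 32, 32]): downsampled again, wider features
print(f0.shape, f1.shape, f2.shape)

Each stage trades spatial resolution for channel capacity; the global average pool then collapses what remains into a single feature vector for the head.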
Example 2: Transfer learning with a pretrained backbone (PyTorch)
import torch
import torch.nn as nn
from torchvision.models import resnet18
num_classes = 3
m = resnet18(weights='IMAGENET1K_V1') # if unavailable, set weights=None
for p in m.layer1.parameters():
    p.requires_grad = False  # freeze shallow layers when data is small
m.fc = nn.Linear(m.fc.in_features, num_classes) # replace classifier head
# training loop sketch
opt = torch.optim.AdamW(filter(lambda p: p.requires_grad, m.parameters()), lr=3e-4)
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, num_classes, (8,))
logits = m(x)
loss = criterion(logits, y)
loss.backward(); opt.step(); opt.zero_grad()
Tip: Start by freezing most layers, then gradually unfreeze if you see underfitting.
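One way to unfreeze gradually is to re-enable gradients stage by stage and give pretrained layers a smaller learning rate than the fresh head. A minimal sketch, assuming you start from a fully frozen backbone (layer names match torchvision's ResNet):

import torch
import torch.nn as nn
from torchvision.models import resnet18

m = resnet18(weights='IMAGENET1K_V1')    # if unavailable, set weights=None
for p in m.parameters():
    p.requires_grad = False              # start fully frozen
m.fc = nn.Linear(m.fc.in_features, 3)    # new head is trainable by default

# If the model underfits, unfreeze the deepest stage with a gentler LR than the head
for p in m.layer4.parameters():
    p.requires_grad = True
opt = torch.optim.AdamW([
    {"params": m.layer4.parameters(), "lr": 1e-4},  # pretrained weights: small updates
    {"params": m.fc.parameters(), "lr": 3e-4},      # randomly initialized head: larger LR
])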
Example 3: Vision Transformer patch embedding
import torch
import torch.nn as nn
class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=384):
        super().__init__()
        # A patch-sized strided conv cuts the image into non-overlapping patches and projects each to `dim`
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))                         # learnable [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, (img_size//patch)**2 + 1, dim))  # learnable positional embeddings

    def forward(self, x):
        x = self.proj(x)                  # B, dim, H/patch, W/patch
        x = x.flatten(2).transpose(1, 2)  # B, N, dim
        B = x.size(0)
        cls = self.cls.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1)    # prepend [CLS] token
        return x + self.pos
B = 2
x = torch.randn(B, 3, 224, 224)
pe = PatchEmbed()
seq = pe(x)
print(seq.shape) # torch.Size([2, 197, 384])
Key concept: Images become token sequences; attention mixes information globally.
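To see attention mix those tokens, feed the patch sequence through a standard encoder layer; a minimal sketch using PyTorch's built-in nn.TransformerEncoderLayer (a real ViT stacks many of these, plus a final norm and classification head):

import torch
import torch.nn as nn

block = nn.TransformerEncoderLayer(d_model=384, nhead=6, dim_feedforward=1536,
                                   batch_first=True, norm_first=True)
seq = torch.randn(2, 197, 384)  # (B, tokens, dim), e.g. the PatchEmbed output above
out = block(seq)                # self-attention lets every token attend to every other token
print(out.shape)                # torch.Size([2, 197, 384])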
Example 4: Simple detection head on FPN features
import torch
import torch.nn as nn
class BoxHead(nn.Module):
    def __init__(self, in_c, num_anchors=3, num_classes=4):
        super().__init__()
        self.cls = nn.Conv2d(in_c, num_anchors * num_classes, 3, padding=1)  # per-anchor class logits
        self.reg = nn.Conv2d(in_c, num_anchors * 4, 3, padding=1)            # per-anchor box deltas

    def forward(self, f):
        return self.cls(f), self.reg(f)
# fake FPN feature map (one level)
f = torch.randn(2, 256, 32, 32)
head = BoxHead(256)
cls_logits, box_deltas = head(f)
print(cls_logits.shape, box_deltas.shape) # ([2, A*C, H, W], [2, A*4, H, W])
Idea: at each spatial location, a set of anchors predicts class scores and box offsets; add NMS during inference to remove duplicate detections.
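A minimal sketch of that NMS step using torchvision.ops.nms on made-up boxes and scores (the coordinates, scores, and IoU threshold here are illustrative):

import torch
from torchvision.ops import nms

# Boxes in (x1, y1, x2, y2) format, one confidence score each (made-up values)
boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],      # heavy overlap with the first box
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)  # indices of surviving boxes, sorted by score
print(keep)                                   # tensor([0, 2]): the overlapping duplicate is dropped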
Example 5: U-Net style segmentation with skip connections
import torch
import torch.nn as nn
class ConvBNReLU(nn.Sequential):
    def __init__(self, cin, cout):
        super().__init__(nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class UNetTiny(nn.Module):
    def __init__(self, classes=2):
        super().__init__()
        # Encoder
        self.e1 = nn.Sequential(ConvBNReLU(3, 32), ConvBNReLU(32, 32))
        self.p1 = nn.MaxPool2d(2)
        self.e2 = nn.Sequential(ConvBNReLU(32, 64), ConvBNReLU(64, 64))
        self.p2 = nn.MaxPool2d(2)
        # Bottleneck
        self.b = nn.Sequential(ConvBNReLU(64, 128), ConvBNReLU(128, 128))
        # Decoder with skip connections from the encoder
        self.u2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.d2 = nn.Sequential(ConvBNReLU(128, 64), ConvBNReLU(64, 64))
        self.u1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.d1 = nn.Sequential(ConvBNReLU(64, 32), ConvBNReLU(32, 32))
        self.out = nn.Conv2d(32, classes, 1)  # 1x1 conv to per-pixel class logits

    def forward(self, x):
        e1 = self.e1(x); x = self.p1(e1)
        e2 = self.e2(x); x = self.p2(e2)
        x = self.b(x)
        x = self.u2(x); x = torch.cat([x, e2], dim=1); x = self.d2(x)  # skip from e2
        x = self.u1(x); x = torch.cat([x, e1], dim=1); x = self.d1(x)  # skip from e1
        return self.out(x)
x = torch.randn(1, 3, 128, 128)
print(UNetTiny()(x).shape) # torch.Size([1, 2, 128, 128])
Skip connections fuse fine details from the encoder with semantic context from the decoder.
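Segmentation models like this are commonly trained with a Dice term added to cross-entropy (as the drills below suggest). A minimal sketch of a soft multi-class Dice loss, reusing the UNetTiny defined above; the 0.5/0.5 weighting is illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    # logits: (B, C, H, W); target: (B, H, W) with integer class ids
    num_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
    return 1 - ((2 * inter + eps) / (union + eps)).mean()

logits = UNetTiny(classes=2)(torch.randn(2, 3, 128, 128))
target = torch.randint(0, 2, (2, 128, 128))
loss = 0.5 * nn.CrossEntropyLoss()(logits, target) + 0.5 * dice_loss(logits, target)
print(loss.item())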
Drills and exercises
- Implement a residual block and verify identical input/output shapes.
- Take a pretrained backbone, freeze all layers, train 1 epoch on a toy dataset; then unfreeze the last stage and compare loss curves.
- Build a tiny ViT: patchify, add positional embeddings, and run attention on random inputs.
- Create a detection head that outputs K anchors per location; verify tensor dimensions.
- Modify a U-Net to output 4 classes and train with Dice + CrossEntropy combined loss.
- Implement triplet loss; test its behavior when the negative is too easy vs. hard (see the sketch after this list).
- Add a second task head (e.g., depth regression) and balance losses with a simple weighted sum.
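For the triplet-loss drill, a minimal sketch on random L2-normalized embeddings; the noise scales and margin are illustrative, chosen so that "easy" negatives sit far from the anchor and "hard" negatives sit only slightly farther than the positive:

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull anchor toward positive, push it at least `margin` farther from negative
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

a = F.normalize(torch.randn(8, 128), dim=1)
p = F.normalize(a + 0.02 * torch.randn(8, 128), dim=1)       # positives: close to the anchor
easy_n = F.normalize(torch.randn(8, 128), dim=1)             # easy negatives: random, far away
hard_n = F.normalize(a + 0.03 * torch.randn(8, 128), dim=1)  # hard negatives: barely farther than positives

print(triplet_loss(a, p, easy_n))  # usually ~0: easy negatives give little gradient signal
print(triplet_loss(a, p, hard_n))  # positive: hard negatives keep the loss informative

In practice the embeddings come from a shared (Siamese) network and hard negatives are mined from the batch.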
Common mistakes and debugging tips
Frequent pitfalls
- Mismatched tensor shapes between backbone and heads. Tip: print shapes at every stage.
- Freezing too much or too little during transfer learning. Start conservative; unfreeze gradually.
- Ignoring normalization differences (e.g., ImageNet mean/std) when using pretrained weights (see the preprocessing sketch after this list).
- Overly aggressive augmentation for segmentation/pose, which lets masks or keypoints drift out of sync with the transformed image. Validate on a small subset first.
- Unstable training with ViTs on small datasets. Use strong regularization or distillation; consider CNN backbones.
- Detection targets misaligned with anchors/strides. Verify coordinate systems and scale factors.
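For the normalization pitfall, the preprocessing must match what the pretrained backbone was trained with. A minimal sketch with torchvision transforms using the standard ImageNet statistics (the resize/crop sizes are common defaults, not requirements):

from torchvision import transforms

# ImageNet mean/std used by most torchvision pretrained backbones
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # HWC uint8 in [0, 255] -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
# img_tensor = preprocess(pil_image)  # apply to a PIL image before calling the model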
Debugging checklist
- Overfit a tiny subset (e.g., 50 images). If loss doesn’t go near zero, check labels, learning rate, and architecture wiring.
- Plot intermediate feature map statistics (mean/std) to catch dead activations (a forward-hook sketch follows this checklist).
- For detection, visualize anchors, matched boxes, and NMS outputs.
- For segmentation, overlay predicted masks with inputs; check class channel order.
- Log per-task losses in multi-task setups; rebalance if one dominates.
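For the feature-statistics check, forward hooks let you log each module's output without touching the model code; a minimal sketch on the TinyResNet from Example 1:

import torch

model = TinyResNet(num_classes=5)
stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        stats[name] = (output.mean().item(), output.std().item())
    return hook

for name, module in model.named_children():  # one hook per top-level stage
    module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.randn(4, 3, 128, 128))
for name, (mean, std) in stats.items():
    print(f"{name}: mean={mean:.3f} std={std:.3f}")  # std near 0 can indicate dead activations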
Mini project: Multi-head scene understanding
Goal: Build a shared backbone with three heads on a small custom dataset: classification (scene type), detection (count specific objects), and segmentation (foreground/background).
- Backbone: pretrained ResNet or MobileNet.
- Heads: softmax classifier, anchor-based detector, 2-class U-Net-style decoder.
- Loss: L = 0.5*CE_class + 1.0*DetLoss + 0.5*(Dice + CE_seg) (see the sketch after this list).
- Metrics: top-1 accuracy, mAP@0.5, mIoU.
- Deliverables: training script, config file, validation plots, and qualitative visualizations.
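A minimal sketch of the weighted multi-task loss above, with placeholder scalars standing in for the three heads' losses; treat the listed weights as starting points to tune:

import torch

def total_loss(ce_class, det_loss, dice_seg, ce_seg,
               w_cls=0.5, w_det=1.0, w_seg=0.5):
    # L = 0.5*CE_class + 1.0*DetLoss + 0.5*(Dice + CE_seg)
    return w_cls * ce_class + w_det * det_loss + w_seg * (dice_seg + ce_seg)

# Placeholder values; in training these come from the classification, detection, and segmentation heads
L = total_loss(torch.tensor(1.2), torch.tensor(0.8), torch.tensor(0.4), torch.tensor(0.6))
print(L)  # tensor(1.9000)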
Practical project ideas
- Product detection and segmentation on shelf images (edge-speed optimized backbone).
- Pose estimation for fitness reps counting using heatmaps and temporal smoothing.
- Image retrieval with triplet-trained embeddings for near-duplicate search.
Next steps
- Profile inference time across backbones and input sizes; select targets per deployment device (a timing sketch follows this list).
- Experiment with loss weighting schemes (uncertainty weighting or dynamic scaling) for multi-task setups.
- Add quantization-aware training or pruning after baseline accuracy is stable.
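For the profiling step, a minimal CPU/GPU timing sketch comparing two torchvision backbones at one input size (the models, batch size, and iteration counts are illustrative):

import time
import torch
from torchvision.models import resnet18, mobilenet_v3_small

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 3, 224, 224, device=device)

for name, build in [("resnet18", resnet18), ("mobilenet_v3_small", mobilenet_v3_small)]:
    model = build(weights=None).eval().to(device)
    with torch.no_grad():
        for _ in range(5):                      # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(20):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    print(f"{name}: {(time.perf_counter() - start) / 20 * 1000:.1f} ms/image")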
Subskills
- CNN Backbones Basics: Understand convolutional blocks, residuals, pooling, and global average pooling; wire a classifier head. Estimated time: 60–90 min.
- Transfer Learning for Vision: Load pretrained weights, freeze/unfreeze strategies, replace heads, and tune LR schedules. Estimated time: 60–120 min.
- Vision Transformers Basics: Patch embeddings, positional encodings, attention, and training tips for small data. Estimated time: 75–120 min.
- Object Detection Architectures (YOLO, Faster R-CNN): One-stage vs. two-stage, anchors, FPN, losses, and NMS. Estimated time: 90–150 min.
- Segmentation Architectures (U-Net, DeepLab): Encoder-decoder design, skip connections, dilated convs, and ASPP. Estimated time: 75–120 min.
- Keypoint and Pose Models Basics: Heatmap heads, decoding, and evaluation basics. Estimated time: 60–90 min.
- Metric Learning Basics (Siamese, Triplet): Contrastive/triplet losses, sampling, and retrieval evaluation. Estimated time: 60–120 min.
- Multi-Task Vision Basics: Shared backbones, multi-head losses, and balancing strategies. Estimated time: 60–120 min.