Who this is for
- Computer Vision beginners who know basic Python and want practical CNN intuition.
- Engineers choosing a backbone for classification, detection, or segmentation.
- Students preparing to fine-tune pre-trained CNNs for real projects.
Prerequisites
- Comfort with tensors, channels, and basic linear algebra.
- Basics of convolution, ReLU, and pooling.
- Know how to read layer shapes (N, C, H, W) and compute output sizes.
Why this matters
In real Computer Vision work, you will:
- Pick a backbone that balances accuracy, speed, and memory for your device.
- Estimate output shapes to wire backbones into heads (e.g., classification head, FPN, detection head).
- Decide what to freeze and fine-tune when adapting pre-trained models.
- Debug latency issues by identifying the heaviest stages (resolution vs width vs depth).
Concept explained simply
A CNN backbone is the feature extractor of a vision model. It converts an image into a stack of informative feature maps at progressively lower resolution and higher semantic richness. Heads (classifier, detector, segmenter) use these features to make predictions.
Mental model
Picture a camera zooming out: early layers capture edges and colors (fine detail), later layers capture textures and object parts (coarse, semantic detail). Each stage downsamples the image to keep compute manageable while growing channels to store richer features.
Key ideas
Convolution, stride, padding, and output size
Output size per spatial dimension: out = floor((in + 2*pad - kernel) / stride) + 1. Stride > 1 downsamples; padding keeps border pixels in play (with stride 1, pad = kernel // 2 preserves spatial size for odd kernels).
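A small helper makes the formula easy to sanity-check (a minimal sketch; the layer settings below are illustrative):

```python
import math

def conv_out(in_size: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    """Output size per spatial dimension: floor((in + 2*pad - kernel) / stride) + 1."""
    return math.floor((in_size + 2 * pad - kernel) / stride) + 1

print(conv_out(224, kernel=7, stride=2, pad=3))  # 112
print(conv_out(112, kernel=3, stride=2, pad=1))  # 56
```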
Receptive field
The region of the input that influences one output feature. Deeper stacks, larger kernels, and strided layers increase it, enabling recognition of larger structures.
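One way to see receptive-field growth is to accumulate it layer by layer with the standard recurrence (a sketch; the layer list below is an illustrative ResNet-style stem):

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, input to output.
    Recurrence: rf += (kernel - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# 7x7/s2 conv, 3x3/s2 pool, then two 3x3/s1 convs
print(receptive_field([(7, 2), (3, 2), (3, 1), (3, 1)]))  # 27
```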
Blocks and stages
Backbones are built from repeated blocks (e.g., residual blocks) grouped into stages. Each stage often halves resolution and increases channels.
Normalization and activation
BatchNorm (or LayerNorm) stabilizes training; ReLU/SiLU add nonlinearity. These choices affect accuracy/latency.
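To make blocks, stages, BatchNorm, and ReLU concrete, here is a minimal residual block and a two-block stage (a sketch in PyTorch, assuming torch is installed; not a drop-in for any specific backbone):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """3x3 conv -> BN -> ReLU -> 3x3 conv -> BN, plus a skip connection.
    A 1x1 projection matches shapes when the block downsamples or widens."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)
        self.proj = None
        if stride != 1 or c_in != c_out:
            self.proj = nn.Sequential(
                nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(c_out),
            )

    def forward(self, x):
        identity = x if self.proj is None else self.proj(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

# A "stage": the first block halves resolution and doubles channels, the rest keep shape.
stage = nn.Sequential(ResidualBlock(64, 128, stride=2), ResidualBlock(128, 128))
x = torch.randn(1, 64, 56, 56)
print(stage(x).shape)  # torch.Size([1, 128, 28, 28])
```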
Popular backbones at a glance
- VGG: Simple stacks of 3x3 convs. Easy to read; heavy compute and parameters.
- ResNet: Residual connections ease training of deep nets. Strong general baseline.
- DenseNet: Dense connections reuse features; efficient feature propagation; can be memory-heavy.
- MobileNet (V1/V2/V3): Depthwise separable convs for mobile efficiency; use width/resolution multipliers.
- EfficientNet: Scales depth/width/resolution using compound scaling for strong accuracy/efficiency.
- ConvNeXt: Modernized conv design with strong accuracy, ViT-inspired choices.
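Several of these come pre-built in torchvision, which makes quick comparisons easy (a sketch, assuming torchvision is installed; the weights enums require torchvision >= 0.13):

```python
from torchvision import models

# Instantiate a few backbones with ImageNet-pretrained weights.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
mobilenet = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT)
convnext = models.convnext_tiny(weights=models.ConvNeXt_Tiny_Weights.DEFAULT)

# Rough size comparison: parameter counts.
for name, m in [("resnet50", resnet), ("mobilenet_v3_small", mobilenet),
                ("convnext_tiny", convnext)]:
    n_params = sum(p.numel() for p in m.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M params")
```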
Worked examples
Example 1 — Output shapes through layers
Input: 224x224x3.
- Conv 7x7, stride 2, pad 3, 64 channels → spatial: floor((224+6-7)/2)+1 = 112 → 112x112x64.
- MaxPool 3x3, stride 2, pad 1 → 56x56x64.
- Conv 3x3, stride 1, pad 1, 64 → 56x56x64 (same spatial).
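You can verify these shapes mechanically (a sketch in PyTorch, assuming torch is installed):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # (N, C, H, W)
conv7 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)

x = conv7(x); print(x.shape)  # torch.Size([1, 64, 112, 112])
x = pool(x);  print(x.shape)  # torch.Size([1, 64, 56, 56])
x = conv3(x); print(x.shape)  # torch.Size([1, 64, 56, 56])
```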
Example 2 — Compute and resolution
Halving the input size from 224 to 112 reduces FLOPs roughly 4x in early layers, because compute scales with the spatial area HxW. Scaling all channel widths by a factor w scales compute roughly quadratically (~w^2, since both input and output channels grow), while depth scales compute roughly linearly with the number of layers.
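A back-of-envelope multiply-accumulate count per conv layer makes these scalings visible (a sketch; it counts MACs and ignores bias and normalization):

```python
def conv_macs(h_out, w_out, c_in, c_out, kernel):
    """Multiply-accumulates for one conv layer: H_out * W_out * C_out * C_in * k * k."""
    return h_out * w_out * c_out * c_in * kernel * kernel

base = conv_macs(112, 112, 64, 64, 3)
half_res = conv_macs(56, 56, 64, 64, 3)          # halve H and W -> ~4x fewer MACs
double_width = conv_macs(112, 112, 128, 128, 3)  # double both channel counts -> ~4x more MACs
print(half_res / base, double_width / base)      # 0.25 4.0
```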
Example 3 — Picking a backbone
Task: real-time mobile classification at 30 FPS. Choose MobileNetV2/V3 or EfficientNet-B0/Lite over ResNet-50 due to lower latency. If accuracy is slightly low, try a width multiplier of 1.4x or a modest resolution bump while staying within the latency budget.
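Before committing to a backbone, it helps to time a forward pass at your deployment resolution (a rough sketch; real numbers must come from the target device, not a desktop):

```python
import time
import torch
from torchvision import models

model = models.mobilenet_v3_small(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(10):          # warm-up iterations
        model(x)
    t0 = time.perf_counter()
    for _ in range(50):
        model(x)
    ms = (time.perf_counter() - t0) / 50 * 1000
print(f"~{ms:.1f} ms per image on this machine")
```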
Try it — quick checks
Q1: What happens to the receptive field as depth increases?
A1: It increases, allowing features to capture larger patterns and object parts.
Q2: Why use residual connections?
A2: They ease gradient flow, stabilize training, and allow deeper networks.
Q3: When would you freeze early layers?
A3: When the dataset is small or similar to the pretraining data; freezing reduces overfitting and speeds up training.
Exercises
Do these before the quick test. You can take the test without login; only logged-in users get saved progress.
- Exercise 1: Compute output shapes and parameter counts for a tiny backbone.
- Exercise 2: Choose a backbone under constraints and justify your pick.
Exercise 1 — Tiny CNN shapes and params
Input: 128x128x3
- Conv 3x3, stride 2, pad 1, 32 channels
- Conv 3x3, stride 1, pad 1, 32 channels
- MaxPool 2x2, stride 2
- Conv 3x3, stride 2, pad 1, 64 channels
Find the spatial size and tensor shape after each step, and the total parameter count (assume each conv has a bias term).
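As a hint rather than a full solution, the parameter count for one conv layer with bias is C_out * (C_in * k * k + 1); a small helper (a sketch) lets you check your arithmetic:

```python
def conv_params(c_in, c_out, kernel, bias=True):
    """Parameters in one conv layer: c_out*c_in*k*k weights, plus c_out biases."""
    return c_out * (c_in * kernel * kernel + (1 if bias else 0))

print(conv_params(3, 32, 3))  # first layer of the exercise: 896
```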
Exercise 2 — Backbone choice
Scenario: Mobile app needs 30 FPS classification at 224x224 with competitive accuracy. Choose one: ResNet-50, MobileNetV2 (width 1.0), EfficientNet-B0. Explain your choice and name one tuning change you would try if accuracy is slightly low but the latency budget is tight.
Common mistakes and self-check
- Mistake: Ignoring input resolution when estimating latency. Self-check: Recompute FLOPs after changing HxW.
- Mistake: Overfitting small data by unfreezing everything. Self-check: Compare validation curves with early layers frozen vs unfrozen.
- Mistake: Mismatched shapes into heads. Self-check: Print tensor shapes at stage outputs (e.g., C3/C4/C5); see the shape-printing sketch after this list.
- Mistake: Using heavy backbones on edge devices. Self-check: Profile inference time on the target device, not just a desktop.
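For the shape self-check, torchvision's feature-extraction utility can expose stage outputs by name (a sketch, assuming torchvision >= 0.11; the node names below are ResNet-specific):

```python
import torch
from torchvision import models
from torchvision.models.feature_extraction import create_feature_extractor

backbone = models.resnet50(weights=None)
# layer2/3/4 outputs correspond to the C3/C4/C5 feature maps used by many necks.
extractor = create_feature_extractor(
    backbone, return_nodes={"layer2": "C3", "layer3": "C4", "layer4": "C5"}
)
feats = extractor(torch.randn(1, 3, 224, 224))
for name, t in feats.items():
    print(name, tuple(t.shape))  # C3 (1, 512, 28, 28), C4 (1, 1024, 14, 14), C5 (1, 2048, 7, 7)
```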
Practical projects
- Transfer-learn a MobileNetV2 on a 5–10 class custom dataset. Start with a frozen backbone, then unfreeze the top 1–2 stages (see the freezing sketch after this list).
- Swap backbones in a detection framework (e.g., from ResNet-50 to MobileNetV3) and compare mAP vs latency at 320 and 640 resolution.
- Build a lightweight feature extractor using depthwise separable convs and compare accuracy/compute against a standard ResNet block.
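For the first project, freezing and selectively unfreezing might look like this (a sketch, assuming torchvision; slicing MobileNetV2's `features` module is used here to approximate "top stages"):

```python
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)

# 1) Freeze the whole backbone.
for p in model.features.parameters():
    p.requires_grad = False

# 2) Replace the classifier head for your number of classes (e.g., 7).
model.classifier[1] = nn.Linear(model.last_channel, 7)

# 3) Later, unfreeze the top blocks of the backbone for fine-tuning.
for p in model.features[-3:].parameters():
    p.requires_grad = True
```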
Learning path
- Backbone basics (this lesson): shapes, stages, compute trade-offs.
- Modern CNN blocks: residual, bottleneck, depthwise separable, inverted residual.
- Multi-scale features: feature pyramids and necks.
- Heads for tasks: classification, detection, segmentation.
- Efficient training and fine-tuning strategies.
Mini challenge
You must deploy a segmentation model on an embedded device with 1 GB RAM and strict latency. Which two levers do you try first and why?
Show one possible approach
(1) Reduce input resolution to cut FLOPs quadratically; (2) switch to a mobile-efficient backbone (MobileNet/EfficientNet-Lite) and use a width multiplier < 1.0. Then profile.
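In code, those two levers are small changes (a sketch; note that pretrained weights typically exist only for width 1.0, so a reduced-width model usually needs retraining):

```python
from torchvision import models

# Lever 1: lower input resolution (e.g., 320 -> 224) in your data pipeline.
# Lever 2: a slimmer backbone via a width multiplier < 1.0.
slim = models.mobilenet_v2(width_mult=0.75, weights=None)
print(sum(p.numel() for p in slim.parameters()) / 1e6, "M params")
```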
Next steps
- Complete the exercises above.
- Take the quick test to check understanding (available to everyone; only logged-in users get saved progress).
- Start a small transfer learning project and track both accuracy and latency.