Why this matters
As a Computer Vision Engineer, you will turn raw pixels into signals a model can use. Convolution filters detect edges, corners, textures, and shapes. These low-level features stack into higher-level patterns like eyes, wheels, and road markings. Real tasks include:
- Improving defect detection by tuning early convolution layers to highlight scratches or dents.
- Stabilizing object detection by controlling stride/padding to preserve small objects.
- Speed-accuracy trade-offs: choosing kernel sizes and strides to meet latency budgets.
Concept explained simply
Convolution slides a small matrix (a kernel/filter) across an image and computes a weighted sum at each position. The result at each location is a feature response. Stack many filters, and you get a feature map per filter.
Mental model
Imagine a transparent stencil with numbers on it (the kernel). You lay it on the image, multiply each overlapped pixel by the stencil number, and add them up. Shift the stencil by one step and repeat. Certain stencils “light up” vertical edges, others highlight corners or textures.
Core pieces to know
- Kernel size (k×k): Controls the local pattern size you can detect (e.g., 3×3 for edges).
- Stride (s): Step size when sliding; s=2 roughly halves the spatial size.
- Padding (p): Zeros added around borders. SAME padding preserves size (with s=1), VALID uses no padding and shrinks.
- Dilation: Spreads kernel taps to grow the receptive field without more parameters.
- Channels: For RGB inputs, each filter has weights for every input channel; outputs have as many channels as filters.
- Feature map: The output per filter; indicates where the pattern exists.
- Pooling or strided conv: Downsamples features to reduce size and add invariance.
Worked examples
Example 1: 1D intuition
Signal: [3, 4, 5, 6], Kernel: [1, 0, -1], stride=1, no padding. Responses:
- Position 1 on [3,4,5]: 3*1 + 4*0 + 5*(-1) = -2
- Position 2 on [4,5,6]: 4*1 + 5*0 + 6*(-1) = -2
Output: [-2, -2]. The negative responses mark places where the signal increases to the right (a simple rising edge).
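A quick NumPy check of this example; a minimal sketch assuming only NumPy is available. Note that np.correlate slides the kernel as-is (cross-correlation, what deep learning layers compute), while np.convolve flips it first.

import numpy as np

signal = np.array([3, 4, 5, 6])
kernel = np.array([1, 0, -1])

# Cross-correlation: kernel applied without flipping
print(np.correlate(signal, kernel, mode='valid'))  # [-2 -2]

# True convolution: kernel flipped before sliding
print(np.convolve(signal, kernel, mode='valid'))   # [2 2]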
Example 2: 2D edge detector (valid, stride 1)
Input I (4×4):
[[1,1,2,2], [1,1,2,2], [3,3,4,4], [3,3,4,4]]
Kernel K (3×3 vertical):
[[-1,0,1], [-1,0,1], [-1,0,1]]
Output size: 2×2. Each 3×3 patch, multiplied elementwise by K and summed (cross-correlation), gives 3. Output:
[[3,3], [3,3]]
Example 3: SAME padding vs VALID
Input 28×28, kernel 3×3, stride 1.
- SAME (p=1): output 28×28 (keeps size).
- VALID (p=0): output 26×26 (28 - 3 + 1).
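To make these size calculations mechanical, here is a minimal helper; a sketch assuming square kernels, symmetric zero padding, and (for SAME with odd k and stride 1) p = (k - 1) // 2.

def out_size(n, k, s=1, p=0):
    # Hout = floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

print(out_size(28, 3, s=1, p=1))  # 28 -> SAME keeps size
print(out_size(28, 3, s=1, p=0))  # 26 -> VALID shrinks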
Example 4: Multi-channel convolution
Input: 32×32×3 (RGB). Each 3×3 filter spans all 3 input channels. Parameters per filter: 3×3×3 + 1 bias = 28. With 64 filters: 64×28 = 1792 parameters.
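The same count in code, as a quick arithmetic check (the function name is illustrative):

def conv_params(k, c_in, c_out, bias=True):
    # Each filter holds k*k*c_in weights, plus one bias if used
    per_filter = k * k * c_in + (1 if bias else 0)
    return per_filter * c_out

print(conv_params(3, 3, 64))  # 1792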
How it looks in code (minimal)
# 2D cross-correlation (single-channel)
# I: HxW image, K: kxk kernel, stride s, padding p
import numpy as np

def conv2d(I, K, s=1, p=0):
    H, W = I.shape
    k = K.shape[0]  # square kernel assumed
    Ipad = np.pad(I, ((p, p), (p, p)), mode='constant')  # zero padding
    out_h = (H + 2*p - k) // s + 1
    out_w = (W + 2*p - k) // s + 1
    O = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = Ipad[i*s:i*s+k, j*s:j*s+k]
            O[i, j] = np.sum(patch * K)  # no kernel flip: cross-correlation
    return O
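Running this on Example 2 reproduces the 2×2 output (assuming NumPy is installed):

I = np.array([[1, 1, 2, 2],
              [1, 1, 2, 2],
              [3, 3, 4, 4],
              [3, 3, 4, 4]])
K = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]])
print(conv2d(I, K))  # [[3. 3.]
                     #  [3. 3.]]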
Exercises you can do now
These mirror the graded exercises below. Do them on paper or in a notebook, then check your work.
- ex1: Compute a 2D cross-correlation (valid, stride 1) on I = [[1,1,2,2],[1,1,2,2],[3,3,4,4],[3,3,4,4]] with K = [[-1,0,1],[-1,0,1],[-1,0,1]]. What is the 2×2 output?
- ex2: Input 64×64×3. Conv layer: 32 filters of 5×5, stride 2, SAME padding. Then MaxPool 2×2, stride 2. What are the spatial sizes after conv and after pool? How many conv parameters (include one bias per filter)?
Before checking your answers, confirm:
- [ ] I wrote down stride, padding, and kernel size before computing.
- [ ] I verified output shapes using formulas, not just intuition.
- [ ] I separated parameter-count logic from output-size logic.
Common mistakes and self-check
- Confusing convolution with cross-correlation: most deep learning libraries compute cross-correlation (no kernel flip). Self-check: if your manual calculation matches the library default, you computed cross-correlation, not true convolution.
- Forgetting bias terms: Each filter commonly has one bias. Self-check: Parameters = (k×k×Cin + 1)×Cout (if bias used).
- Wrong padding assumptions: SAME with stride 1 preserves size; VALID shrinks. Self-check: Re-derive using formula Hout = floor((H + 2p − k)/s) + 1.
- Channel mismatch: Filter depth must equal input channels. Self-check: Kernel shape k×k×Cin.
- Over-downsampling early: Large stride/pool too soon may erase small objects. Self-check: Visualize feature map resolution vs object size.
Practical projects
- Build a Sobel edge detector from scratch; compare horizontal vs vertical responses and visualize gradient magnitudes (a starter sketch follows this list).
- Train a tiny CNN on digit images. Experiment with stride 1 vs stride 2 in the first layer and report accuracy vs inference time.
- Feature visualization: Capture and plot the first-layer feature maps for a few images; identify which filters act like edge or texture detectors.
- Receptive field exploration: Stack 3×3 layers and compute the effective receptive field after each layer. Confirm by probing with synthetic patterns.
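A starter sketch for the Sobel project, reusing the conv2d function from the code section above; the kernel values are the standard Sobel operators, and the image here is a random stand-in:

# Standard Sobel kernels: respond to vertical / horizontal intensity edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

img = np.random.rand(8, 8)      # replace with a real grayscale image
gx = conv2d(img, sobel_x, p=1)  # horizontal-gradient response, SAME-size
gy = conv2d(img, sobel_y, p=1)  # vertical-gradient response
magnitude = np.sqrt(gx**2 + gy**2)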
Who this is for
- Beginners who know arrays and want to understand CNN building blocks.
- Engineers moving from classical image processing to deep learning.
- Practitioners optimizing early layers for detection/segmentation tasks.
Prerequisites
- Comfort with basic linear algebra (vectors, matrices) and arithmetic.
- Basics of images as arrays (height, width, channels).
- Optional: Some Python/NumPy to prototype calculations.
Learning path
- Master convolution mechanics: kernel/stride/padding, output shape, parameter count.
- Connect to feature extraction: edges, corners, textures; visualize feature maps.
- Downsampling decisions: pooling vs strided conv; when and why.
- Scale up: multi-channel, multi-filter layers; receptive field growth.
- Validate with small projects and the quick test.
Mini challenge
Given input 32×32×3 → Conv (16 filters, 3×3, stride 1, SAME) → MaxPool (2×2, stride 2) → Conv (32 filters, 3×3, stride 1, SAME).
- What are the spatial sizes after each stage?
- How many parameters are in the first and last conv layers (include bias)?
Reveal answer
- After first conv: 32×32×16 (SAME, s=1).
- After pool: 16×16×16.
- After last conv: 16×16×32.
- Params first conv: (3×3×3 + 1)×16 = (27+1)×16 = 448.
- Params last conv: (3×3×16 + 1)×32 = (144+1)×32 = 4640.
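A quick arithmetic check of these counts, as plain Python:

# (k*k*c_in + 1 bias) * number_of_filters
print((3*3*3 + 1) * 16)   # 448
print((3*3*16 + 1) * 32)  # 4640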
Next steps
- Repeat the exercises without notes until you can compute shapes and parameters in seconds.
- Implement a tiny CNN and print intermediate feature maps to build intuition.
- Proceed to the quick test below to check mastery.
Quick Test