Why this matters
Vision Transformers (ViTs) are a core family of modern vision models. You will encounter them when:
- Fine-tuning a pretrained ViT for image classification, defect detection, or quality inspection.
- Choosing patch size and resolution to balance accuracy vs. speed/memory for production.
- Adapting positional embeddings when training at one resolution and serving at another.
- Estimating compute/memory needs due to attention’s quadratic cost in token count.
Who this is for
- Computer Vision Engineers who know CNNs and want a modern alternative.
- ML practitioners deploying image models under latency/memory constraints.
- Researchers or students needing a clear entry point into ViTs.
Prerequisites
- Comfort with tensors, channels, height/width, and batching.
- Basic understanding of attention and the Transformer encoder block.
- Familiarity with image classification pipelines and evaluation.
Concept explained simply
Vision Transformers treat an image like a sentence. Instead of words, we use image patches. Each patch becomes a token. A Transformer encoder then uses self-attention to let any patch attend to any other patch, capturing global relationships early.
Mental model (20 seconds)
- Image → grid of small tiles (patches).
- Each tile → a vector (token) via a linear projection.
- Add positional info so the model knows where each tile came from.
- Self-attention mixes information across all tiles.
- A special [CLS] token summarizes the image for classification.
Core components at a glance
- Patch embedding: split the image into non-overlapping P×P patches, flatten each, and project with a linear layer.
- Positional encoding: learned or sinusoidal; tells the model patch locations. Often learned 2D embeddings with interpolation for new resolutions.
- Transformer encoder blocks: Multi-Head Self-Attention (MHSA) + MLP, with residual connections and LayerNorm.
- [CLS] token: prepended token used to aggregate the final representation for classification.
- Complexity: MHSA scales roughly as O(N²) in the number of tokens N (patches + [CLS]); patch size is the main lever controlling N.
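To make these components concrete, here is a minimal PyTorch sketch of the input side of a ViT. It is purely illustrative: the class name PatchEmbed, the ViT-Base-style defaults (224×224 input, Patch 16, 768-dim embedding), and the variable names are assumptions for this lesson, not a specific library's API.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and project each to a D-dim token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A stride-P conv is equivalent to "flatten each patch + shared linear projection".
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) with N = (H/P) * (W/P)

# Prepend [CLS] and add learned positional embeddings (illustrative setup).
B, D = 2, 768
patch_embed = PatchEmbed()
cls_token = nn.Parameter(torch.zeros(1, 1, D))
pos_embed = nn.Parameter(torch.zeros(1, patch_embed.num_patches + 1, D))

imgs = torch.randn(B, 3, 224, 224)
tokens = patch_embed(imgs)                                         # (2, 196, 768)
tokens = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1)   # (2, 197, 768)
tokens = tokens + pos_embed                                        # ready for the encoder blocks
```

The stride-P convolution is simply a compact way to implement "flatten each P×P patch and apply a shared linear projection" in a single operation.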
How attention “sees” patches
Each attention head computes similarity between all pairs of tokens. Small patch size → many tokens → rich detail but heavy compute. Larger patch size → fewer tokens → faster, but potentially coarser features.
Worked examples
Example 1: Counting tokens from image size and patch size
- Image: 224×224, Patch: 16 → grid is 224/16 = 14 patches per side → 14×14 = 196 tokens.
- Add [CLS] token → total tokens = 196 + 1 = 197.
- If Patch: 32 → 224/32 = 7 per side → 49 + 1 = 50 tokens (much cheaper).
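A few lines of Python reproduce these counts (the helper name num_tokens is ours, used only for illustration):

```python
def num_tokens(h, w, patch, with_cls=True):
    """Token count for an h x w image split into patch x patch tiles."""
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    return (h // patch) * (w // patch) + (1 if with_cls else 0)

print(num_tokens(224, 224, 16))  # 197
print(num_tokens(224, 224, 32))  # 50
```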
Example 2: Positional encodings across resolutions
- Trained at 224×224 with Patch 16 (14×14 grid). Serving at 384×384 with the same patch size gives a 24×24 grid.
- Learned 2D positional embeddings (14×14) can be resized via bicubic interpolation to 24×24.
- Keep the [CLS] embedding aside (it is not interpolated), prepend it to the resized grid, and add the result to the token embeddings as usual.
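Here is a hedged PyTorch sketch of that recipe (shapes and the function name resize_pos_embed are assumptions; check how your checkpoint actually stores positional embeddings before reusing this):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """pos_embed: (1, 1 + old_grid*old_grid, D), with the [CLS] embedding first."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pe.shape[-1]
    # (1, N, D) -> (1, D, old_grid, old_grid) so we can use 2D interpolation.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    # Back to (1, new_grid*new_grid, D), then re-attach the untouched [CLS] embedding.
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pe, patch_pe], dim=1)

old_pe = torch.randn(1, 1 + 14 * 14, 768)   # trained at 224x224, Patch 16
new_pe = resize_pos_embed(old_pe)           # (1, 1 + 24*24, 768) for 384x384
```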
Example 3: Estimating attention cost
- With 197 tokens (224×224, Patch 16 with [CLS]): pairs ≈ 197² = 38,809 interactions per head per layer.
- With 785 tokens (224×224, Patch 8 with [CLS]): pairs ≈ 785² = 616,225.
- Changing Patch 16 → 8 increases attention pairs by ~15.9× (616,225 / 38,809).
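The same back-of-the-envelope estimate as a short sketch (the helper attn_pairs is illustrative only):

```python
def attn_pairs(img=224, patch=16):
    n = (img // patch) ** 2 + 1      # tokens including [CLS]
    return n * n                     # pairwise interactions per head per layer

p16, p8 = attn_pairs(patch=16), attn_pairs(patch=8)
print(p16, p8, round(p8 / p16, 1))   # 38809 616225 15.9
```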
Exercises
Do these before the quick test. Solutions are collapsible.
Exercise 1 — Count tokens and estimate cost
Given an image 224×224:
- a) With Patch 16, how many tokens including [CLS]?
- b) With Patch 8, how many tokens including [CLS]?
- c) Roughly how many times more pairwise attention interactions does Patch 8 require than Patch 16?
Hint
- Tokens = (H/P × W/P) + 1.
- Relative cost ∝ N². Compare N² for both settings.
Show solution
a) 14×14 + 1 = 197 tokens.
b) 28×28 + 1 = 785 tokens.
c) 785² / 197² ≈ 616,225 / 38,809 ≈ 15.9×.
Exercise 2 — Positional embeddings at new resolution
Trained at 224×224 with learned 2D positional embeddings and Patch 16. Now you must serve at 384×384 (still Patch 16). Write a short plan (3–5 steps) to adapt the positional embeddings safely.
Hint
- Figure out old and new grid sizes.
- Interpolate the 2D pos-embed grid; keep [CLS] separate.
Show solution
- Old grid: 14×14; new grid: 24×24.
- Separate [CLS] embedding from the 2D patch grid.
- Reshape the learned grid to (14, 14, D) and bicubic-interpolate to (24, 24, D).
- Flatten back to (24×24, D) and prepend the [CLS] embedding.
- Add to token embeddings and run the encoder.
Alternative: use architectures with relative position bias that generalize more naturally to new sizes.
Self-check checklist
- I can compute token counts from image and patch sizes.
- I know why attention scales quadratically in token count.
- I can adapt positional embeddings when resolution changes.
- I can reason about the speed–accuracy trade-off when choosing patch sizes.
Common mistakes and how to self-check
- Forgetting the [CLS] token when counting tokens. Self-check: always add +1.
- Mismatched positional embeddings after changing resolution. Self-check: confirm the 2D grid was resized and the [CLS] handled separately.
- Choosing tiny patches without considering memory. Self-check: estimate N and N² before training.
- Assuming learned positional embeddings generalize automatically. Self-check: verify interpolation or use relative bias variants.
- Ignoring normalization/residual order. Self-check: confirm the block order (LayerNorm → MHSA/MLP → residual).
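For the last point, here is a minimal pre-norm encoder block sketch (LayerNorm → MHSA/MLP → residual). It uses torch.nn.MultiheadAttention as a stand-in for a full MHSA implementation, and the ViT-Base-style sizes are illustrative defaults, not prescriptions.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm ViT block: x + MHSA(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):                                     # x: (B, N, D)
        h = self.norm1(x)                                     # normalize first ...
        x = x + self.attn(h, h, h, need_weights=False)[0]     # ... then attention, then residual
        return x + self.mlp(self.norm2(x))                    # same order around the MLP
```

Note the order inside forward: normalization first, then attention or MLP, then the residual add, which is exactly the block order named in the checklist item above.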
Practical projects
- Patch-size trade-off experiment: Train small ViT variants with Patch 16 vs. Patch 8 on a subset of a dataset; compare accuracy, latency, and memory.
- Resolution scaling: Train at 224, serve at 384 with interpolated positional embeddings; measure performance changes.
- Transfer learning: Fine-tune a pretrained ViT on a domain-specific dataset (e.g., defects) with frozen early layers; track convergence speed.
Note: Report metrics alongside a clear compute budget; accuracy vs. latency/memory is the practical trade-off that matters here. If you reference any costs, treat them as rough ranges, since they vary by country and company.
Learning path
- Next: Hybrid CNN–Transformer encoders and relative position bias.
- Then: Hierarchical Transformers (e.g., shifted windows) for better scaling.
- Then: Data-efficient training tricks (distillation, strong augmentation).
- Finally: Detection/segmentation with Transformer backbones and end-to-end design ideas.
Next steps
- Memorize the token count formula and complexity intuition.
- Try one project above and document results.
- Attempt the quick test to confirm understanding.
Mini challenge
Your GPU budget halves overnight. You must keep accuracy stable. Propose one change to patch size, image resolution, or fine-tuning strategy to reduce attention cost while minimizing accuracy loss. Justify in 3–5 sentences.
Quick Test