
Vision Transformers Basics

Learn Vision Transformers Basics for free with explanations, exercises, and a quick test (for Computer Vision Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

Vision Transformers (ViTs) are a core family of modern vision models. You will encounter them when:

  • Fine-tuning a pretrained ViT for image classification, defect detection, or quality inspection.
  • Choosing patch size and resolution to balance accuracy vs. speed/memory for production.
  • Adapting positional embeddings when training at one resolution and serving at another.
  • Estimating compute/memory needs due to attention’s quadratic cost in token count.

Who this is for

  • Computer Vision Engineers who know CNNs and want a modern alternative.
  • ML practitioners deploying image models under latency/memory constraints.
  • Researchers or students needing a clear entry point into ViTs.

Prerequisites

  • Comfort with tensors, channels, height/width, and batching.
  • Basic understanding of attention and the Transformer encoder block.
  • Familiarity with image classification pipelines and evaluation.

Concept explained simply

Vision Transformers treat an image like a sentence. Instead of words, we use image patches. Each patch becomes a token. A Transformer encoder then uses self-attention to let any patch attend to any other patch, capturing global relationships early.

Mental model (20 seconds)
  • Image → grid of small tiles (patches).
  • Each tile → a vector (token) via a linear projection.
  • Add positional info so the model knows where each tile came from.
  • Self-attention mixes information across all tiles.
  • A special [CLS] token summarizes the image for classification.
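
To make this pipeline concrete, here is a minimal sketch assuming PyTorch; the image size, patch size, and embedding width below are illustrative choices, not fixed requirements:

import torch
import torch.nn as nn

# Illustrative sizes (assumptions): 224x224 RGB image, 16x16 patches, 768-dim tokens.
image = torch.randn(1, 3, 224, 224)           # (batch, channels, height, width)
patch_size, embed_dim = 16, 768

# Patch embedding: a strided conv is equivalent to "split into P x P patches,
# flatten each patch, and apply one shared linear projection".
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = patch_embed(image)                   # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)    # (1, 196, 768): one token per patch

# Prepend a learnable [CLS] token and add learnable positional embeddings.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, 196 + 1, embed_dim))
tokens = torch.cat([cls_token.expand(tokens.shape[0], -1, -1), tokens], dim=1)
tokens = tokens + pos_embed                   # (1, 197, 768), ready for the encoder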

Core components at a glance

  • Patch embedding: split image into P×P patches, flatten, project with a linear layer.
  • Positional encoding: learned or sinusoidal; tells the model patch locations. Often learned 2D embeddings with interpolation for new resolutions.
  • Transformer encoder blocks: Multi-Head Self-Attention (MHSA) + MLP, with residual connections and LayerNorm.
  • [CLS] token: prepended token used to aggregate the final representation for classification.
  • Complexity: MHSA scales roughly O(N^2) in the number of tokens N (patches + [CLS]). Patch size controls N strongly.
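
One encoder block can be sketched in a few lines, again assuming PyTorch; the dimension, head count, and MLP ratio are illustrative defaults, not prescriptions:

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT-style block: pre-norm MHSA + MLP, each with a residual connection."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                                    # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # attention mixes all tokens
        x = x + self.mlp(self.norm2(x))
        return x

block = EncoderBlock()
out = block(torch.randn(1, 197, 768))                        # 197 = 14*14 patches + [CLS]
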
How attention “sees” patches

Each attention head computes similarity between all pairs of tokens. Small patch size → many tokens → rich detail but heavy compute. Larger patch size → fewer tokens → faster, but potentially coarser features.

Worked examples

Example 1: Counting tokens from image size and patch size

  • Image: 224×224, Patch: 16 → grid is 224/16 = 14 patches per side → 14×14 = 196 tokens.
  • Add [CLS] token → total tokens = 196 + 1 = 197.
  • If Patch: 32 → 224/32 = 7 per side → 49 + 1 = 50 tokens (much cheaper).
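
The same arithmetic as a small helper in plain Python (the function name is just for illustration):

def vit_token_count(image_size: int, patch_size: int, with_cls: bool = True) -> int:
    """Number of tokens a square image yields: (H/P) * (W/P), plus 1 for [CLS]."""
    per_side = image_size // patch_size
    return per_side * per_side + (1 if with_cls else 0)

print(vit_token_count(224, 16))   # 197
print(vit_token_count(224, 32))   # 50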

Example 2: Positional encodings across resolutions

  • Trained at 224×224 (14×14 grid). Serving at 384×384 with Patch 16 gives a 24×24 grid.
  • Learned 2D positional embeddings (14×14) can be resized via bicubic interpolation to 24×24.
  • Keep the [CLS] embedding separate, re-attach it after interpolation, and add the result to the tokens as usual.
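
A rough sketch of that interpolation, assuming PyTorch and a learned positional-embedding table stored as (1, 1 + 14×14, D) with the [CLS] row first (layouts differ between implementations):

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid=14, new_grid=24) -> torch.Tensor:
    """Resize learned positional embeddings from old_grid^2 to new_grid^2 patches.

    pos_embed: (1, 1 + old_grid*old_grid, D), with the [CLS] embedding first.
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pe.shape[-1]
    # (1, N, D) -> (1, D, old_grid, old_grid) so F.interpolate can treat it as an image.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    # Back to (1, new_grid*new_grid, D) and re-attach the [CLS] embedding.
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

new_pe = resize_pos_embed(torch.randn(1, 197, 768))   # -> (1, 577, 768) for 24×24 + [CLS]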

Example 3: Estimating attention cost

  • With 197 tokens (224×224, Patch 16 with [CLS]): pairs ≈ 197² = 38,809 interactions per head per layer.
  • With 785 tokens (224×224, Patch 8 with [CLS]): pairs ≈ 785² = 616,225.
  • Changing Patch 16 → 8 increases attention pairs by ~15.9× (616,225 / 38,809).
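
You can verify the ratio in a few lines of plain Python:

tokens_p16 = 14 * 14 + 1            # 197
tokens_p8 = 28 * 28 + 1             # 785

pairs_p16 = tokens_p16 ** 2         # 38,809 pairwise interactions
pairs_p8 = tokens_p8 ** 2           # 616,225
print(pairs_p8 / pairs_p16)         # ~15.88 -> roughly 15.9x more attention work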

Exercises

Do these before the quick test. Solutions are collapsible.

Exercise 1 — Count tokens and estimate cost

Given an image 224×224:

  • a) With Patch 16, how many tokens including [CLS]?
  • b) With Patch 8, how many tokens including [CLS]?
  • c) Roughly how many times more pairwise attention interactions does Patch 8 require than Patch 16?
Hint
  • Tokens = (H/P × W/P) + 1.
  • Relative cost ∝ N². Compare N² for both settings.
Show solution

a) 14×14 + 1 = 197 tokens.

b) 28×28 + 1 = 785 tokens.

c) 785² / 197² ≈ 616,225 / 38,809 ≈ 15.9×.

Exercise 2 — Positional embeddings at new resolution

Trained at 224×224 with learned 2D positional embeddings and Patch 16. Now you must serve at 384×384 (still Patch 16). Write a short plan (3–5 steps) to adapt the positional embeddings safely.

Hint
  • Figure out old and new grid sizes.
  • Interpolate the 2D pos-embed grid; keep [CLS] separate.
Show solution
  1. Old grid: 14×14; new grid: 24×24.
  2. Separate [CLS] embedding from the 2D patch grid.
  3. Reshape the learned grid to (14, 14, D) and bicubic-interpolate to (24, 24, D).
  4. Flatten back to (24×24, D) and prepend the [CLS] embedding.
  5. Add to token embeddings and run the encoder.

Alternative: use architectures with relative position bias that generalize more naturally to new sizes.

Self-check checklist

  • I can compute token counts from image and patch sizes.
  • I know why attention scales quadratically in token count.
  • I can adapt positional embeddings when resolution changes.
  • I can reason about the speed–accuracy trade-off when choosing patch sizes.

Common mistakes and how to self-check

  • Forgetting the [CLS] token when counting tokens. Self-check: always add +1.
  • Mismatched positional embeddings after changing resolution. Self-check: confirm the 2D grid was resized and the [CLS] handled separately.
  • Choosing tiny patches without considering memory. Self-check: estimate N and N² before training.
  • Assuming learned positional embeddings generalize automatically. Self-check: verify interpolation or use relative bias variants.
  • Ignoring normalization/residual order. Self-check: confirm the block order (LayerNorm → MHSA/MLP → residual).
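
A small sanity check in plain Python that catches the first two mistakes before training; it assumes one [CLS] token plus a square patch grid, as used throughout this lesson:

def check_vit_shapes(image_size: int, patch_size: int, num_pos_embeddings: int) -> None:
    """Fail loudly if the positional-embedding table does not match the token count."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    grid = image_size // patch_size
    expected_tokens = grid * grid + 1            # +1 for [CLS]
    assert num_pos_embeddings == expected_tokens, (
        f"pos-embed table has {num_pos_embeddings} rows, "
        f"but a {grid}x{grid} grid + [CLS] needs {expected_tokens}; "
        "interpolate the embeddings before serving at this resolution."
    )

check_vit_shapes(image_size=224, patch_size=16, num_pos_embeddings=197)   # passes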

Practical projects

  • Patch-size trade-off experiment: Train small ViT variants with Patch 16 vs. Patch 8 on a dataset subset; compare accuracy, latency, and memory.
  • Resolution scaling: Train at 224, serve at 384 with interpolated positional embeddings; measure performance changes.
  • Transfer learning: Fine-tune a pretrained ViT on a domain-specific dataset (e.g., defects) with frozen early layers; track convergence speed.

Note: Report metrics alongside clear compute budgets; accuracy vs. latency/memory is a practical trade-off. If you reference hardware or cloud costs, treat them as rough ranges, since they vary by provider and region.

Learning path

  • Next: Hybrid CNN–Transformer encoders and relative position bias.
  • Then: Hierarchical Transformers (e.g., shifted windows) for better scaling.
  • Then: Data-efficient training tricks (distillation, strong augmentation).
  • Finally: Detection/segmentation with Transformer backbones and end-to-end design ideas.

Next steps

  • Memorize the token count formula and complexity intuition.
  • Try one project above and document results.
  • Attempt the quick test to confirm understanding.

Mini challenge

Your GPU budget halves overnight. You must keep accuracy stable. Propose one change to patch size, image resolution, or fine-tuning strategy to reduce attention cost while minimizing accuracy loss. Justify in 3–5 sentences.

Vision Transformers Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
