Why this matters
Many CV models expect fixed input sizes (e.g., 640×640). If you stretch images to fit, you distort objects and harm detection, keypoints, and segmentation accuracy. Preserving aspect ratio and padding correctly keeps geometry consistent, making training and inference reliable.
- Detection: Letterbox without distorting boxes; adjust box coordinates with scale and pad offsets.
- Segmentation: Pad images and masks identically to keep pixel alignment.
- Classification: Resize and crop strategies affect what the model learns; pick deliberately.
- Deployment: Align to model stride (e.g., 32) for clean downsampling and stable performance.
Who this is for
- Beginners who know basic image tensors and want robust preprocessing.
- Practitioners integrating detectors/segmenters that require fixed-size inputs.
- Engineers building reproducible, production-ready pipelines.
Prerequisites
- Basic image concepts: width, height, channels; pixel coordinate systems (x to the right, y down).
- Familiarity with bounding boxes (x1, y1, x2, y2), masks, and keypoints.
- Basic transforms: resize, crop, pad.
Concept explained simply
Goal: fit any image into a fixed-size input without changing its original shape. We scale uniformly until one side fits, then pad the shortfall with pixels.
Letterbox recipe (square target S):
- Compute scale r = min(S / W, S / H).
- Resize to (W' = round(W × r), H' = round(H × r)).
- Pad to reach S × S. If symmetric, split padding between both sides.
Adjust labels after transform:
- Boxes: x' = x × r + pad_left; y' = y × r + pad_top (apply to both corners).
- Masks: resize mask with nearest-neighbor, then pad exactly like the image.
- Keypoints: scale and then add padding offsets.
Mental model
Think of projecting the original image onto a canvas. First, shrink or enlarge uniformly (no stretching). Then center it on a larger canvas of the target size and fill the empty borders. Any label is just transported by the same projection: scale, then shift.
Key techniques you will use
- Letterbox padding (constant, replicate, reflect). Reflect is often safest for segmentation boundaries; constant is common for detection.
- Center-crop after resizing the short side for classification pipelines.
- Stride alignment: pad final dimensions to multiples of model stride (e.g., 32).
- Interpolation: bilinear/bicubic for images; nearest for masks/labels.
- Rounding: floor/round resized dims, then pad the remainder to hit the exact target.
When to crop vs. pad?
- Detection/Segmentation: Prefer letterbox to avoid cutting objects.
- Classification: Often resize short side then (random) crop; small aspect deviations are acceptable.
- Tiny objects near edges: Avoid aggressive crop; prefer pad.
Worked examples
Example 1 — Letterbox 800×600 to 640×640 with a bounding box
- Input: W=800, H=600, S=640
- Scale: r = min(640/800=0.80, 640/600≈1.07) = 0.80
- Resized: W'=640, H'=480
- Padding needed: width=0, height=640-480=160 ⇒ top=80, bottom=80
- Box original: (x1=100, y1=150, x2=500, y2=450)
- Scale box: (80, 120, 400, 360)
- Add pad: (80, 200, 400, 440)
Example 2 — 1920×1080 to 640×640
- r = min(640/1920≈0.333, 640/1080≈0.593) = 0.333
- W'≈640, H'≈360
- Pad: height=280 ⇒ top=140, bottom=140
- Left/Right pad: 0 (already 640 wide)
Example 3 — Classification: resize-shortest-side then center-crop
- Input: 500×300, target 224
- Resize shortest side (300→224): r=224/300≈0.7467 ⇒ W'≈373, H'=224
- Center-crop to 224×224: crop width 373→224 ⇒ remove 149 px total ⇒ left 74, right 75
- No padding; slight composition change but no aspect distortion.
Step-by-step letterbox (to 640×640)
r = min(640/W, 640/H)
W' = round(W × r), H' = round(H × r)
dw = 640 - W'; dh = 640 - H'; pad_left = floor(dw/2); pad_right = dw - pad_left; pad_top = floor(dh/2); pad_bottom = dh - pad_top
Boxes: scale then add (pad_left, pad_top); Masks: nearest-neighbor resize + identical pad.
If needed, adjust pads so final size is divisible by model stride (e.g., 32).
Common mistakes and self-check
- Stretching images to fit: Look for elongated objects. Fix by letterbox.
- Forgetting to transform labels: Visualize a few samples to verify boxes and masks align.
- Using bilinear for masks: Causes soft edges. Use nearest-neighbor for discrete labels.
- Padding color artifacts: Constant black/white can bias the model. Prefer mid-gray for detection or reflect for segmentation.
- Off-by-one rounding: Ensure final size matches exactly; recompute pads from the actual rounded resized dims.
- Ignoring stride: If model stride is 32, ensure final dims are multiples of 32.
Self-check mini-audit
- Pick 16 random images with labels. After preprocessing, overlay labels and inspect corners/edges.
- Confirm final sizes are correct and divisible by stride where required.
- Check padding color choice vs. your normalization scheme.
Practice exercises
These mirror the exercises below. Try on paper or in a notebook, then compare.
- Exercise 1: Letterbox 800×600 to 640×640. Use r=min(640/800, 640/600). Compute resized dims, pad top/bottom/left/right. Transform the box (100,150,500,450).
- Exercise 2: Stride-only padding. Keep image size but pad to multiples of 32. For 1023×615, compute minimal symmetric pads to reach 1024×640 using constant value 114 (no resizing). What are top, bottom, left, right pads?
- Checklist:
- Did you compute r from the original image?
- Did you base pad calculation on the rounded resized dims?
- Did you apply both scaling and offsets to box coordinates?
- Are final sizes divisible by model stride when needed?
Practical projects
- Implement a letterbox function that returns the transformed image plus (r, pad_left, pad_top) and a helper to update boxes, masks, and keypoints.
- Add stride-aware padding to your dataloader; toggle between constant and reflect padding and visualize effects.
- Audit pipeline: randomly sample 50 images, draw boxes/masks after preprocessing, and export a contact sheet to validate alignment.
Learning path
- Start here: aspect ratio, padding types, label transforms.
- Next: geometric augmentations (flip/rotate) while keeping labels consistent.
- Then: photometric augmentations (brightness/contrast) that don’t affect geometry.
- Finally: batching and mixed-size training with stride-aligned padding.
Next steps
- Integrate letterbox into your training and inference pipelines and verify label alignment visually.
- Experiment with padding colors and reflect padding; measure impact on validation metrics.
- Proceed to the quick test below to lock in the concepts.
Quick test
Available to everyone; only logged-in users get saved progress.
Mini challenge
You must support both 1280×720 and 1024×768 inputs for a detector with 32-stride and 640 input. Design one preprocessing function that: (1) preserves aspect ratio, (2) outputs 640×640 divisible by 32, (3) adjusts boxes and masks, and (4) lets you switch among constant mid-gray and reflect padding via a parameter. Sketch the function signature, the order of operations, and how you will unit-test label alignment.