Why this matters
Keypoint detection and pose estimation power features like workout form feedback, gesture control, AR effects, sports analytics, and ergonomics checks. As a Computer Vision Engineer, you will:
- Detect body joints (keypoints) and connect them into skeletons.
- Design pipelines: person detection, pose estimation, post-processing.
- Optimize for speed/accuracy on edge and server.
- Evaluate and debug predictions with metrics like OKS/mAP and PCK.
Who this is for
- Engineers and students building human- or object-keypoint features.
- Practitioners moving from object detection to pose estimation.
- Teams shipping pose on mobile, embedded, or cloud.
Prerequisites
- Comfort with convolutional networks and image tensors.
- Basics of object detection (bboxes, NMS).
- Python/NumPy and matrix transforms (scaling, translation, affine).
Concept explained simply
Keypoint detection finds specific landmark coordinates (e.g., nose, elbows). A pose model predicts multiple keypoints and connects them into a skeleton. Two common pipelines:
- Top-down: Detect persons first, crop each box, and run a single-person pose model per crop. Pros: accurate per person. Cons: slower with many people.
- Bottom-up: Predict all keypoints at once, plus how they connect (e.g., Part Affinity Fields). Pros: scales better to many people. Cons: grouping keypoints into individuals can be complex.
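To make the top-down flow concrete, here is a minimal sketch in Python. The `detect_persons`, `estimate_pose`, and `top_down_pipeline` names are hypothetical stand-ins for illustration, not a specific library's API; resizing the crop to the model's input size (and mapping back through that scale) is omitted here and covered in Example 2 below.

```python
import numpy as np

def detect_persons(image):
    # Hypothetical person detector: returns (x0, y0, w, h) boxes.
    # A real pipeline would run a trained object detector here.
    return [(320, 120, 320, 320)]  # one dummy box for illustration

def estimate_pose(crop):
    # Hypothetical single-person pose model: keypoints in crop coordinates.
    return np.array([[180.0, 100.0]])  # one dummy keypoint

def top_down_pipeline(image):
    poses = []
    for (x0, y0, w, h) in detect_persons(image):
        crop = image[y0:y0 + h, x0:x0 + w]          # crop the person box
        keypoints = estimate_pose(crop)             # pose in crop space
        keypoints = keypoints + np.array([x0, y0])  # map back to image space
        poses.append(keypoints)
    return poses

image = np.zeros((720, 1280, 3), dtype=np.uint8)    # dummy 1280x720 image
print(top_down_pipeline(image))  # one pose with a keypoint at (500.0, 220.0)
```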
Two common output heads:
- Heatmap-based: For each keypoint type, the model outputs a 2D heatmap where peaks indicate likely locations. You extract coordinates via argmax or soft-argmax. Often higher accuracy.
- Direct regression: The model predicts coordinates directly. Often faster and simpler, but typically less spatially precise than heatmap decoding.
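The simplest heatmap decode is a hard argmax per keypoint channel. A minimal NumPy sketch (the `stride` convention, i.e., the input-to-heatmap downsampling factor, is an assumption and varies by model):

```python
import numpy as np

def decode_heatmaps_argmax(heatmaps, stride):
    """Decode (K, H, W) heatmaps to K (x, y) coordinates via hard argmax."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    flat_idx = flat.argmax(axis=1)                 # peak index per keypoint
    ys, xs = np.unravel_index(flat_idx, (H, W))
    scores = flat.max(axis=1)                      # peak confidence
    coords = np.stack([xs, ys], axis=1) * stride   # back to input resolution
    return coords, scores

# Toy example: one 3x3 heatmap peaking at (x=1, y=1), 192x192 input.
hm = np.array([[[0, 1, 0], [0, 3, 0], [0, 0, 0]]], dtype=float)
print(decode_heatmaps_argmax(hm, stride=64))  # coords [[64 64]], score [3.]
```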
Key terms
- Heatmap: per-keypoint 2D probability-like map.
- Soft-argmax: the expected coordinate computed over softmax-normalized heatmap values.
- PAF (Part Affinity Field): 2D vector fields indicating limb directions to connect pairs of keypoints (bottom-up).
- OKS (Object Keypoint Similarity): metric similar to IoU but for keypoints, normalizing errors by object scale and per-keypoint tolerance.
- PCK/PCKh: Percentage of Correct Keypoints within a distance threshold (often normalized by head size or person size).
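For reference, the full OKS used in COCO-style evaluation averages the per-keypoint similarity over labeled keypoints; this is the standard definition:

```latex
\mathrm{OKS} = \frac{\sum_i \exp\!\left(-d_i^2 / (2 s^2 k_i^2)\right)\,\mathbf{1}[v_i > 0]}{\sum_i \mathbf{1}[v_i > 0]}
```

Here d_i is the distance between predicted and ground-truth keypoint i, s is the object scale (square root of the object area in COCO), k_i is the per-keypoint tolerance constant, and v_i is the visibility flag. Example 3 below works through the single-keypoint case.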
Mental model
Imagine a set of glowing spots (heatmaps) for each joint type over the image. The brightest spots are likely joint centers. In top-down, you zoom into each detected person and find their spots; in bottom-up, you see all spots at once and then draw lines (PAFs) to assemble full skeletons.
Worked examples
Example 1: From heatmap to coordinate (soft-argmax)
Given a 3x3 grid of heatmap logits (zero-based indices):
[ [0, 1, 0], [0, 3, 0], [0, 0, 0] ]
Softmax over all 9 cells puts a strong peak at (x=1, y=1). The expected coordinate via soft-argmax is approximately (x=1.00, y=0.94). If this heatmap corresponds to a 192x192 input (scale factor 192/3 = 64), the image-space coordinate ≈ (64.0, 60.3).
Why soft-argmax?
It provides sub-pixel precision and remains differentiable, enabling end-to-end training without non-differentiable argmax.
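A minimal NumPy implementation that reproduces these numbers (a sketch, not a particular library's API):

```python
import numpy as np

def soft_argmax_2d(logits):
    """Differentiable expected (x, y) from a 2D heatmap of logits."""
    H, W = logits.shape
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    xs, ys = np.arange(W), np.arange(H)
    exp_x = (probs.sum(axis=0) * xs).sum()  # marginal over columns
    exp_y = (probs.sum(axis=1) * ys).sum()  # marginal over rows
    return exp_x, exp_y

logits = np.array([[0, 1, 0],
                   [0, 3, 0],
                   [0, 0, 0]], dtype=float)
x, y = soft_argmax_2d(logits)
print(round(x, 2), round(y, 2))            # 1.0 0.94
print(round(x * 64, 1), round(y * 64, 1))  # 64.0 60.3
```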
Example 2: Mapping from crop back to original image
Original image: 1280x720. Person bbox: (x0=320, y0=120, w=320, h=320). Model input: 256x256. Predicted keypoint in crop: (180, 100). Scale = w/256 = 1.25. Mapped back: (x, y) = (320 + 180*1.25, 120 + 100*1.25) = (545, 245).
Tip
Keep transforms in one place. Store the forward affine (image->crop) and invert it for consistent back-projection.
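For instance, a single 3x3 affine kept per crop makes forward mapping and back-projection trivial. A sketch assuming a square box and uniform scaling, as in Example 2:

```python
import numpy as np

def crop_affine(x0, y0, w, h, out_size):
    """Forward affine (image -> crop) as a 3x3 matrix: translate, then scale.

    Assumes a square box resized uniformly (w == h), as in Example 2.
    """
    s = out_size / w
    return np.array([[s, 0, -x0 * s],
                     [0, s, -y0 * s],
                     [0, 0, 1.0]])

# Example 2 numbers: 320x320 box at (320, 120), 256x256 model input.
A = crop_affine(320, 120, 320, 320, 256)
A_inv = np.linalg.inv(A)                 # crop -> image, for back-projection

kp_crop = np.array([180.0, 100.0, 1.0])  # keypoint in crop space (homogeneous)
kp_img = A_inv @ kp_crop
print(kp_img[:2])                        # [545. 245.]
```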
Example 3: OKS intuition
Suppose a single keypoint has error d=5 px, object scale s=100 (simplified), keypoint sigma k=0.1. OKS = exp(-(d^2)/(2 s^2 k^2)) = exp(-25 / 200) ≈ exp(-0.125) ≈ 0.88. Larger people (bigger s) tolerate more pixel error.
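The same calculation in NumPy, generalized to several keypoints with visibility flags (a sketch of the standard formula, not a specific library's implementation):

```python
import numpy as np

def oks(pred, gt, vis, s, k):
    """COCO-style OKS averaged over visible keypoints.

    pred, gt: (K, 2) coordinate arrays; vis: (K,) visibility flags;
    s: object scale; k: (K,) per-keypoint tolerance constants.
    """
    d2 = ((pred - gt) ** 2).sum(axis=1)          # squared pixel errors
    sims = np.exp(-d2 / (2 * s**2 * k**2))       # per-keypoint similarity
    return sims[vis > 0].mean()                  # average over labeled points

# Single keypoint from Example 3: d=5 px, s=100, k=0.1.
pred = np.array([[105.0, 100.0]])
gt = np.array([[100.0, 100.0]])
print(round(oks(pred, gt, np.array([1]), s=100, k=np.array([0.1])), 2))  # 0.88
```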
Example 4: Top-down vs bottom-up trade-off
- Scene with 1 person: top-down runs 1 detector + 1 pose head; bottom-up runs 1 network once.
- Scene with 10 people: top-down runs detector + 10 pose heads (cost grows with people); bottom-up still runs once, then groups.
Rule of thumb
Top-down is usually best when person count is small and precision matters per instance. Bottom-up shines in crowded scenes.
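A back-of-the-envelope latency model makes the crossover concrete. The millisecond figures below are purely illustrative assumptions, not benchmarks:

```python
# Illustrative per-pass costs in milliseconds (assumptions, not benchmarks).
DETECTOR_MS = 20.0   # person detector, one pass
POSE_CROP_MS = 8.0   # single-person pose model, per crop
BOTTOM_UP_MS = 35.0  # bottom-up network, one pass (grouping included)

def top_down_latency(num_people):
    # Cost grows linearly with the number of detected people.
    return DETECTOR_MS + POSE_CROP_MS * num_people

def bottom_up_latency(num_people):
    # Roughly constant in the number of people.
    return BOTTOM_UP_MS

for n in (1, 10):
    print(n, top_down_latency(n), bottom_up_latency(n))
# 1 person:  top-down 28 ms vs bottom-up 35 ms -> top-down wins
# 10 people: top-down 100 ms vs bottom-up 35 ms -> bottom-up wins
```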