A computer vision task that localizes human body keypoints (joints) to estimate body pose from images or video. Pose estimation enables action understanding, gesture recognition, and human-centric content indexing in multimodal systems.
Pose estimation models detect anatomical keypoints (head, shoulders, elbows, wrists, hips, knees, ankles) and connect them into a skeleton representation of the human body. Top-down approaches first detect each person, then estimate that person's pose; bottom-up approaches detect all keypoints in the image first, then group them into individuals.
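The skeleton representation is usually defined by a fixed keypoint ordering plus an edge list of joint pairs. A minimal sketch below uses the 17-keypoint COCO ordering; the edge list is one common convention for drawing limbs, and `skeleton_segments` and its `min_conf` threshold are illustrative names, not from any particular library:

```python
# The 17 COCO keypoints in their standard order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Pairs of keypoint indices that form limbs (one common drawing convention).
SKELETON_EDGES = [
    (5, 7), (7, 9),                   # left arm
    (6, 8), (8, 10),                  # right arm
    (5, 6), (11, 12),                 # shoulders, hips
    (5, 11), (6, 12),                 # torso
    (11, 13), (13, 15),               # left leg
    (12, 14), (14, 16),               # right leg
    (0, 1), (0, 2), (1, 3), (2, 4),   # head
]

def skeleton_segments(keypoints, min_conf=0.3):
    """Turn per-joint (x, y, confidence) tuples into drawable line
    segments, skipping limbs where either endpoint is low-confidence.
    (Illustrative helper; the threshold 0.3 is an arbitrary choice.)"""
    segments = []
    for a, b in SKELETON_EDGES:
        (xa, ya, ca), (xb, yb, cb) = keypoints[a], keypoints[b]
        if ca >= min_conf and cb >= min_conf:
            segments.append(((xa, ya), (xb, yb)))
    return segments
```

Filtering by per-joint confidence avoids drawing limbs through occluded or out-of-frame joints, which is why most visualizers gate edges on both endpoints rather than on individual keypoints.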
Modern architectures include HRNet (which maintains high-resolution feature representations throughout the network), ViTPose (vision transformer based), and lightweight models like MoveNet for real-time applications. Output is typically 17 to 25 keypoints, each an (x, y, confidence) triple. COCO Keypoints and MPII Human Pose are standard benchmarks. Performance is measured with Object Keypoint Similarity (OKS) and Percentage of Correct Keypoints (PCK).
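OKS plays the role that IoU plays in object detection: each predicted joint contributes a Gaussian similarity that falls off with its distance from the ground truth, scaled by the object's area and a per-joint constant (loose joints like hips tolerate more error than eyes). A sketch of the computation, using the per-keypoint sigma values from the COCO keypoints evaluation (the function name `compute_oks` is illustrative):

```python
import math

# Per-keypoint sigmas from the COCO keypoints evaluation protocol,
# in the standard 17-keypoint order (nose, eyes, ears, shoulders, ...).
COCO_SIGMAS = [
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
]

def compute_oks(pred, gt, visible, area, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity between predicted and ground-truth
    keypoints. pred/gt are (x, y) pairs, visible flags which ground-truth
    joints are labeled, area is the person's segment area in pixels^2."""
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, s in zip(pred, gt, visible, sigmas):
        if not v:
            continue  # only labeled joints count toward OKS
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        # Gaussian falloff: exp(-d^2 / (2 * area * k^2)) with k = 2*sigma
        total += math.exp(-d2 / (2.0 * area * (2.0 * s) ** 2))
        count += 1
    return total / count if count else 0.0
```

A perfect prediction scores 1.0 and a wildly wrong one approaches 0.0; COCO average precision is then computed by thresholding OKS (e.g. AP at OKS = 0.50 : 0.95), mirroring IoU thresholds in detection.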