A computer vision task that localizes human body keypoints (joints) to estimate body pose from images or video. Pose estimation enables action understanding, gesture recognition, and human-centric content indexing in multimodal systems.
Pose estimation models detect anatomical keypoints (head, shoulders, elbows, wrists, hips, knees, ankles) and connect them to form a skeleton representation of the human body. Top-down approaches first detect each person and then estimate the pose inside that person's bounding box, while bottom-up approaches detect all keypoints in the image first and then group them into individuals.
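As a rough illustration of the skeleton representation described above, the sketch below stores one person's keypoints and connects confident pairs into limb segments. It assumes the 17-keypoint COCO ordering; the `Keypoint` type, the `LIMBS` list (an illustrative subset of connections, not an official skeleton definition), the `build_skeleton` helper, and the 0.3 confidence cutoff are all hypothetical choices made for this example.

```python
from typing import List, NamedTuple, Tuple


class Keypoint(NamedTuple):
    x: float
    y: float
    confidence: float


# Keypoint order used by the COCO Keypoints annotation format.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Illustrative limb connections (an assumption for this sketch).
LIMBS = [
    ("left_shoulder", "right_shoulder"),
    ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
    ("left_shoulder", "left_hip"), ("right_shoulder", "right_hip"),
    ("left_hip", "right_hip"),
    ("left_hip", "left_knee"), ("left_knee", "left_ankle"),
    ("right_hip", "right_knee"), ("right_knee", "right_ankle"),
]

Point = Tuple[float, float]


def build_skeleton(
    keypoints: List[Keypoint], min_confidence: float = 0.3
) -> List[Tuple[Point, Point]]:
    """Return line segments for limbs whose two endpoints are both confident."""
    by_name = dict(zip(COCO_KEYPOINT_NAMES, keypoints))
    segments = []
    for start_name, end_name in LIMBS:
        start, end = by_name[start_name], by_name[end_name]
        if start.confidence >= min_confidence and end.confidence >= min_confidence:
            segments.append(((start.x, start.y), (end.x, end.y)))
    return segments
```

In a top-down pipeline a helper like this would run once per detected person; in a bottom-up pipeline the grouping step would first have to decide which keypoints belong to the same person.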
Modern architectures include HRNet (high-resolution representations), ViTPose (vision transformer based), and lightweight models like MoveNet for real-time applications. Output is typically 17 to 25 keypoints per person, depending on the keypoint layout (COCO defines 17), each given as (x, y, confidence). COCO Keypoints and MPII Human Pose are standard benchmarks. Performance is measured using Object Keypoint Similarity (OKS), the basis of the COCO average precision metric, and Percentage of Correct Keypoints (PCK), which MPII reports in its head-normalized form, PCKh.
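To make the two metrics concrete, here is a minimal sketch of both for a single person, written in plain Python. The per-keypoint constants are the ones commonly used for COCO evaluation; the function names, argument layout, and the default `alpha=0.5` threshold are assumptions for illustration rather than any library's API.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

# Per-keypoint falloff constants for the 17 COCO keypoints.
COCO_SIGMAS = [
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
]


def oks(
    pred: Sequence[Point],
    gt: Sequence[Point],
    visible: Sequence[bool],
    area: float,
    sigmas: Sequence[float] = COCO_SIGMAS,
) -> float:
    """Object Keypoint Similarity, averaged over labeled keypoints.

    `area` is the ground-truth object area in pixels, used as the scale term.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), is_visible, sigma in zip(pred, gt, visible, sigmas):
        if not is_visible:
            continue
        squared_distance = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma
        # Small epsilon guards against a zero-area annotation.
        total += math.exp(-squared_distance / (2.0 * (area + 1e-12) * k ** 2))
        count += 1
    return total / count if count else 0.0


def pck(
    pred: Sequence[Point],
    gt: Sequence[Point],
    visible: Sequence[bool],
    reference_length: float,
    alpha: float = 0.5,
) -> float:
    """Fraction of labeled keypoints within alpha * reference_length of the truth.

    With the head segment length as `reference_length`, this corresponds to PCKh.
    """
    threshold = alpha * reference_length
    correct, count = 0, 0
    for (px, py), (gx, gy), is_visible in zip(pred, gt, visible):
        if not is_visible:
            continue
        count += 1
        if math.hypot(px - gx, py - gy) <= threshold:
            correct += 1
    return correct / count if count else 0.0
```

OKS gives a soft score between 0 and 1 that tolerates larger errors on large people and on keypoints that are harder to localize (hips more than eyes), whereas PCK applies a hard distance threshold to every keypoint.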