A computer vision task that identifies and localizes objects in images by predicting bounding boxes and class labels. Object detection is a core building block in multimodal pipelines for indexing visual content and enabling structured search.
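Object detection models process an input image and output a set of bounding boxes, each with a class label and confidence score. Modern architectures are commonly grouped into two categories: two-stage detectors (like Faster R-CNN) that first propose candidate regions and then classify and refine them, and single-stage detectors (like YOLO) that predict boxes and classes in one pass for faster inference. DETR is often grouped with single-stage detectors because it also skips region proposals, but it frames detection as direct set prediction with a transformer rather than dense per-location prediction.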
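The output format described above can be sketched as a small data structure. This is a minimal illustration, not the API of any particular library: the `Detection` class, the `filter_by_score` helper, the corner-based box layout, and the 0.5 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical container, not a library type)."""
    box: Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max) in pixels
    label: str                              # predicted class name
    score: float                            # confidence in [0, 1]


def filter_by_score(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Keep detections with confidence >= min_score, sorted highest first."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Made-up detections for a single image.
raw = [
    Detection((10.0, 20.0, 110.0, 220.0), "person", 0.92),
    Detection((15.0, 25.0, 105.0, 210.0), "person", 0.31),
    Detection((200.0, 50.0, 320.0, 140.0), "dog", 0.77),
]
confident = filter_by_score(raw, min_score=0.5)
```

Real detectors differ in box convention (corner coordinates versus center plus width and height, absolute versus normalized), so the layout should be checked against the model being used before indexing or searching on these fields.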
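Recent high-performing models use transformer-based architectures (DETR, DINO) or efficient CNN backbones (YOLOv8) with feature pyramid networks for multi-scale detection. Most CNN-based detectors apply non-maximum suppression (NMS) as a post-processing step to remove duplicate detections of the same object; DETR-style models are trained with one-to-one matching between predictions and ground-truth objects and are designed to work without it. Performance is measured using mean Average Precision (mAP), which averages per-class Average Precision at one or more Intersection over Union (IoU) thresholds; the primary COCO metric averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05. Models are typically pretrained on COCO (80 classes) or Objects365 (365 classes) and fine-tuned for specific domains.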
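IoU and NMS can be illustrated with a short pure-Python sketch. It assumes boxes in (x_min, y_min, x_max, y_max) form, uses a greedy, class-agnostic variant of NMS, and picks 0.5 as the IoU threshold; the function names `iou` and `nms` are chosen for this example. Production pipelines usually run NMS per class and use a vectorized library implementation.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)  # -> [0, 2]
```

The first two boxes overlap with IoU 81/119 (about 0.68), which exceeds the threshold, so the lower-scoring one is suppressed. The same `iou` function is what an mAP evaluation uses to decide whether a prediction matches a ground-truth box at a given threshold.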