A state-of-the-art object detection model that can locate and identify objects in images based on arbitrary text descriptions, without being limited to a fixed set of predefined categories.
Grounding DINO combines a text encoder with an image encoder and a detection head to perform open-vocabulary object detection. Given an image and a text prompt describing what to find (e.g., 'red car' or 'person wearing a helmet'), the model returns a bounding box and a confidence score for each matching object. Unlike traditional object detectors, which are limited to the categories seen during training, Grounding DINO can detect any object described in natural language, making it highly flexible for diverse applications.
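The image-plus-prompt interface described above can be sketched with a small pure-Python mock. The `Detection` type, `build_prompt`, `filter_by_confidence`, and the 0.35 threshold are illustrative assumptions, not the actual library API; they only mirror the input/output contract described in this section:

```python
from dataclasses import dataclass

def build_prompt(phrases):
    # Hypothetical helper: Grounding DINO-style text prompts are commonly
    # written as lowercase phrases, each terminated by a period.
    return " ".join(p.strip().lower().rstrip(".") + "." for p in phrases)

@dataclass
class Detection:
    # Illustrative container mirroring the outputs described above.
    box: tuple    # (x_min, y_min, x_max, y_max) in pixel coordinates
    phrase: str   # text phrase the box was matched to
    score: float  # confidence in [0, 1]

def filter_by_confidence(detections, threshold=0.35):
    # Discard low-confidence boxes; 0.35 is an illustrative default,
    # tuned per application in practice.
    return [d for d in detections if d.score >= threshold]
```

A real pipeline would obtain the `Detection` values from the model's post-processed outputs and then threshold them the same way.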
The architecture extends the DINO (DETR with Improved deNoising anchOr boxes) detection framework with language-guided feature fusion. Text prompts are encoded with a transformer language model, and the resulting text features are fused with image features from a vision backbone through cross-attention layers. For each detected object, the detection head produces a bounding box, the matched text phrase, and a confidence score. The model supports multiple objects per prompt and handles multi-phrase queries as period-separated categories. Grounding DINO is particularly useful in Mixpeek pipelines for extracting object-level features from images and video frames.
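The cross-attention fusion step can be sketched as single-head attention in which image tokens attend to text tokens. This is a simplified stand-in under stated assumptions: the real model uses multi-head attention with learned query/key/value projections across several fusion layers, which this sketch omits:

```python
import numpy as np

def cross_attend(image_feats, text_feats):
    """Single-head cross-attention: image tokens attend to text tokens.

    image_feats: (n_image_tokens, d) array, used as queries.
    text_feats:  (n_text_tokens, d) array, used as keys and values.
    Returns text-conditioned image features of shape (n_image_tokens, d).
    """
    d = image_feats.shape[-1]
    scores = image_feats @ text_feats.T / np.sqrt(d)  # (n_img, n_text)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over text tokens
    return weights @ text_feats                       # language-conditioned features
```

In the full model, fusion of this kind runs in both directions (text-to-image and image-to-text), so each modality's features are conditioned on the other before the detection head predicts boxes.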