A multimodal task that localizes objects or regions in an image based on natural language descriptions. Visual grounding bridges the gap between text and vision, enabling text-based visual search and question answering over images.
Visual grounding takes a text expression like 'the red car on the left' and an image, then outputs a bounding box or segmentation mask for the object the expression refers to. Models encode both the text and the image, compute cross-modal attention to find correspondences between words and image regions, and predict the region that best matches the description. This requires understanding spatial relationships, attributes, and object references.
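As an illustration of the matching step, the following is a minimal sketch in PyTorch: random tensors stand in for the outputs of a text encoder and an image encoder, each candidate region attends over the text tokens, and the region most similar to its text-conditioned context is selected. The dimensions, encoders, and scoring rule are simplified assumptions, not the mechanism of any particular model.

```python
import torch
import torch.nn.functional as F

# Toy cross-modal matching: score candidate image regions against a text query.
# Random features stand in for real text/image encoder outputs.
torch.manual_seed(0)
d_model = 256
num_regions = 5          # candidate boxes proposed from the image
num_tokens = 6           # tokens in "the red car on the left"

text_tokens = torch.randn(num_tokens, d_model)    # stand-in for text encoder output
region_feats = torch.randn(num_regions, d_model)  # stand-in for image encoder output
boxes = torch.rand(num_regions, 4)                # (x1, y1, x2, y2), normalized

# Cross-modal attention: each region attends over the text tokens,
# producing a text-conditioned representation of that region.
attn_logits = region_feats @ text_tokens.T / d_model ** 0.5   # (regions, tokens)
attn = F.softmax(attn_logits, dim=-1)
text_context = attn @ text_tokens                             # (regions, d_model)

# Grounding score: similarity between each region and its attended text context.
scores = F.cosine_similarity(region_feats, text_context, dim=-1)
best = scores.argmax()
print(f"best region: {best.item()}, box: {boxes[best].tolist()}")
```

In a trained model the attention and scoring weights are learned end to end, so regions that match the described attributes and spatial relations receive the highest scores.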
Modern approaches such as MDETR (Modulated Detection Transformer), GLIP, and Grounding DINO unify object detection and grounding in a single model. These models are trained on referring expression datasets (e.g., RefCOCO) and phrase grounding datasets (e.g., Flickr30K Entities). The architecture typically combines a text encoder, an image encoder, and a cross-modal fusion module. Open-vocabulary grounding models can localize objects described by arbitrary text phrases rather than a fixed label set.
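As a usage sketch, an open-vocabulary model such as Grounding DINO can be queried through the Hugging Face transformers zero-shot object detection interface. The checkpoint name, example image URL, and post-processing arguments below follow the library's documented example but are assumptions; exact argument names can differ across transformers versions.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Checkpoint and thresholds are taken from the transformers documentation
# for Grounding DINO; treat them as illustrative defaults.
model_id = "IDEA-Research/grounding-dino-tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Grounding DINO expects lowercase phrases, each terminated with a period.
text = "a cat. a remote control."

inputs = processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(f"{label}: {score:.2f} at {[round(v, 1) for v in box.tolist()]}")
```

Because the text prompt is free-form, the same code localizes any phrase the model can ground, which is what distinguishes open-vocabulary grounding from closed-set detection.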