A multimodal task that localizes objects or regions in an image based on natural language descriptions. Visual grounding bridges the gap between text and vision, enabling text-based visual search and question answering over images.
Visual grounding takes a text expression like 'the red car on the left' and an image, then outputs a bounding box or segmentation mask for the referred object. Models encode both the text and image, compute cross-modal attention to find correspondences, and predict the region that best matches the description. This requires understanding spatial relationships, attributes, and object references.
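The matching step can be illustrated with a toy sketch. It assumes the text and each candidate region have already been encoded into the same vector space; the three-dimensional features (`is_red`, `is_car`, `is_on_left`), the boxes, and the `ground` helper are hypothetical stand-ins for learned embeddings and a real model, not any library's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ground(text_vec, regions):
    """Return (box, score) for the region whose feature best matches the text."""
    scored = [(cosine(text_vec, feat), box) for box, feat in regions]
    score, box = max(scored, key=lambda pair: pair[0])
    return box, score

# Hypothetical hand-made features: [is_red, is_car, is_on_left].
# Boxes are (x_min, y_min, x_max, y_max) in pixels.
regions = [
    ((10, 40, 120, 110), [1.0, 1.0, 1.0]),   # red car on the left
    ((300, 45, 410, 115), [1.0, 1.0, 0.0]),  # red car on the right
    ((20, 150, 90, 220), [0.0, 0.0, 1.0]),   # non-red, non-car object on the left
]

# Stand-in embedding for the expression "the red car on the left".
query = [1.0, 1.0, 1.0]

best_box, best_score = ground(query, regions)
print(best_box, round(best_score, 3))
```

Only the first region satisfies the attribute, the object category, and the spatial cue at once, so it scores highest. A real model learns these features rather than hand-coding them and usually regresses the box coordinates directly instead of choosing from a fixed candidate list.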
Modern approaches include MDETR (Modulated Detection for End-to-End Multi-Modal Understanding), Grounding DINO, and GLIP, all of which unify detection and grounding. These models are trained on referring expression datasets such as RefCOCO and phrase grounding datasets such as Flickr30K Entities. The architecture typically combines a text encoder, an image encoder, and a cross-modal fusion module. Open-vocabulary grounding models can localize objects described by arbitrary text phrases rather than a fixed set of class labels.
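A cross-modal fusion module is often built from cross-attention, where text tokens act as queries over image patch features. The sketch below is a minimal single-head version in plain Python. It deliberately omits the learned query, key, and value projections, multiple heads, and stacked layers, and the token and patch vectors are made-up values; it is not the implementation used by any of the models named above.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_tokens, image_patches):
    """Scaled dot-product attention: text tokens are queries,
    image patches serve as both keys and values.
    Returns (fused token vectors, attention weights per token)."""
    d = len(image_patches[0])
    fused, weights = [], []
    for q in text_tokens:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in image_patches
        ]
        w = softmax(scores)
        # Weighted sum of patch vectors, one output dimension at a time.
        fused.append(
            [sum(wj * p[i] for wj, p in zip(w, image_patches)) for i in range(d)]
        )
        weights.append(w)
    return fused, weights

# Hypothetical 2-d embeddings: two text tokens and three image patches.
text_tokens = [[4.0, 0.0], [0.0, 4.0]]
image_patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

fused, weights = cross_attention(text_tokens, image_patches)
for token_weights in weights:
    print([round(w, 3) for w in token_weights])
```

Each row of `weights` shows how strongly one text token attends to each patch; the first token concentrates on the first patch and the second token on the second. In a full model, a prediction head reads the fused features to output boxes or masks.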