The ability to search and retrieve data in one modality (e.g., images) using a query in a different modality (e.g., text). Cross-modal retrieval is the defining capability of multimodal AI systems that unify search across text, images, audio, and video.
Cross-modal retrieval uses models trained to map different data types into a shared embedding space where semantically related items from different modalities are close together. A text query like 'sunset over the ocean' is encoded into the same vector space as images, allowing direct similarity comparison. This is achieved through contrastive training on paired multimodal data.
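As a sketch of how this looks in practice, the snippet below uses the Hugging Face transformers implementation of CLIP (the model name and image paths are illustrative assumptions, not part of this entry) to encode a text query and a few candidate images into the shared space and rank the images by cosine similarity.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate images (placeholder paths) and the text query.
images = [Image.open(p) for p in ["beach.jpg", "city.jpg", "forest.jpg"]]
query = "sunset over the ocean"

with torch.no_grad():
    # Encode the text query and the images into the same embedding space.
    text_emb = model.get_text_features(
        **processor(text=[query], return_tensors="pt", padding=True)
    )
    image_embs = model.get_image_features(
        **processor(images=images, return_tensors="pt")
    )

# Cosine similarity = dot product of L2-normalized embeddings.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_embs.T).squeeze(0)   # shape: (num_images,)
ranking = scores.argsort(descending=True)       # best match first
```

In a production system the image embeddings would be computed once and stored in a vector index; only the query is encoded at search time.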
Foundation models include CLIP (text-image), CLAP (text-audio), and ImageBind (six modalities). These models use dual encoders trained with contrastive loss on paired data (image-caption pairs, audio-text pairs). At retrieval time, only the query is encoded; its embedding is compared against pre-computed embeddings of the target modality. Because the encoders already align the modalities, zero-shot cross-modal retrieval works without any task-specific training data.
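The contrastive objective behind these dual encoders is compact enough to write out. The function below is a minimal sketch of a CLIP-style symmetric InfoNCE loss (the function name and temperature value are illustrative assumptions): each text embedding is pulled toward its paired image embedding and pushed away from the other images in the batch, and vice versa.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.

    text_embs, image_embs: (batch, dim) tensors where row i of each tensor
    comes from the same text-image pair.
    """
    # Normalize so dot products are cosine similarities.
    text_embs = F.normalize(text_embs, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = text_embs @ image_embs.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: text-to-image and image-to-text.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.T, targets)
    return (loss_t2i + loss_i2t) / 2
```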