The process of mapping different data modalities (text, images, audio, video) into a shared representation space where semantically related items from different modalities are close together. Multimodal alignment enables cross-modal search, retrieval, and understanding.
Multimodal alignment trains modality-specific encoders to produce embeddings in a shared vector space. Contrastive learning on paired data (image-caption pairs, audio-text pairs) pulls matching cross-modal pairs together while pushing non-matching pairs apart. After alignment, a text embedding can be compared directly with an image embedding to determine semantic similarity.
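The contrastive objective described above can be sketched in a few lines. This is a minimal illustration, not any particular model's training code: the function names, the toy 2-dimensional embeddings, and the temperature value of 0.07 are all assumptions made for the example. It computes a symmetric InfoNCE-style loss over a batch where row k of the text embeddings is paired with row k of the image embeddings.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(text_embs, image_embs):
    # sims[i][j] = cosine similarity between text i and image j.
    t = [l2_normalize(v) for v in text_embs]
    m = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(tv, iv)) for iv in m] for tv in t]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct "class" for row k is column k,
    # i.e. the matching item from the other modality.
    total = 0.0
    for k, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def symmetric_info_nce(text_embs, image_embs, temperature=0.07):
    # temperature is an illustrative assumption, not a recommended setting.
    sims = similarity_matrix(text_embs, image_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-text direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: text k and image k point in roughly the same direction.
texts = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = symmetric_info_nce(texts, images)
# Reversing the image order breaks the pairing, so the loss should rise.
shuffled_loss = symmetric_info_nce(texts, images[::-1])
```

Matching pairs sit on the diagonal of the similarity matrix, so minimizing this loss raises diagonal similarities (pulling pairs together) and lowers off-diagonal ones (pushing non-matching pairs apart). In real training the embeddings come from the modality-specific encoders and the loss is backpropagated through them.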
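CLIP aligns images and text by training on 400M image-text pairs with an InfoNCE-style contrastive loss. CLAP aligns audio and text in the same way. ImageBind extends alignment to six modalities (image, text, audio, depth, thermal, IMU) by using images as the anchor modality that every other modality is paired with. Even after alignment, embeddings from different modalities tend to occupy distinct regions of the shared space, a phenomenon known as the modality gap. Alignment quality is measured by cross-modal retrieval recall (R@1, R@5, R@10): the fraction of queries whose correct match appears among the top 1, 5, or 10 retrieved results.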
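To make the retrieval metric concrete, here is a small sketch of recall at K, assuming each query has exactly one correct candidate and that query q's correct match is candidate q. The function name `recall_at_k` and the similarity values are hypothetical, chosen only for illustration.

```python
def recall_at_k(sims, k):
    # sims[q][c] is the similarity between query q (e.g. a caption) and
    # candidate c (e.g. an image). Assumes candidate q is the one correct
    # match for query q.
    hits = 0
    for q, row in enumerate(sims):
        # Candidate indices ordered from most to least similar.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return hits / len(sims)

# Made-up text-to-image similarities for three caption/image pairs.
sims = [
    [0.9, 0.1, 0.2],  # caption 0 ranks its own image first
    [0.8, 0.3, 0.1],  # caption 1 ranks image 0 above its own image
    [0.2, 0.1, 0.7],  # caption 2 ranks its own image first
]
r_at_1 = recall_at_k(sims, 1)  # 2 of 3 captions retrieve their image at rank 1
r_at_2 = recall_at_k(sims, 2)  # all 3 captions have their image in the top 2
```

Benchmarks usually report the metric in both directions (text-to-image and image-to-text), which corresponds to calling the function on the similarity matrix and on its transpose.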