The process of mapping different data modalities (text, images, audio, video) into a shared representation space where semantically related items land close together regardless of which modality they come from. Multimodal alignment enables cross-modal search, retrieval, and understanding.
Multimodal alignment trains modality-specific encoders to produce embeddings in a shared vector space. Contrastive learning on paired data (image-caption pairs, audio-text pairs) pulls matching cross-modal pairs together while pushing non-matching pairs apart. After alignment, a text embedding can be compared directly with an image embedding, typically via cosine similarity of L2-normalized vectors, to determine semantic similarity.
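As a minimal sketch of this objective (not the implementation of any particular model or library), a symmetric InfoNCE loss over a batch of paired embeddings can be written in PyTorch as follows; the function name, the `temperature` default, and the convention that `img_emb[i]` and `txt_emb[i]` form a matching pair are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over (batch, dim) paired embeddings.

    Assumes img_emb[i] and txt_emb[i] come from the same
    image-caption pair; all other in-batch pairs are negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # (batch, batch) similarity matrix: diagonal entries are the
    # matching pairs, off-diagonal entries are in-batch negatives.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both retrieval directions pulls matching
    # pairs together and pushes non-matching pairs apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

In training, `img_emb` and `txt_emb` would come from two separate encoders (e.g., a vision backbone and a text transformer) optimized jointly under this loss, which is what gradually carves out the shared space.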
CLIP aligns images and text by training on 400M image-text pairs with an InfoNCE loss; CLAP aligns audio and text in the same way. ImageBind extends alignment to six modalities (image, text, audio, depth, thermal, IMU) by using image as the anchor modality to which the others are aligned. Even after training, a modality gap remains: embeddings from different modalities occupy slightly different regions of the shared space rather than fully overlapping. Alignment quality is typically measured by cross-modal retrieval recall (R@1, R@5, R@10), the fraction of queries whose true cross-modal match appears in the top K results.
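To make the metric concrete, here is a sketch of recall@K for cross-modal retrieval, again assuming embedding tensors where query i and gallery item i are the true pair (the function name and tensor layout are assumptions for the example):

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb: torch.Tensor,
                gallery_emb: torch.Tensor,
                ks=(1, 5, 10)) -> dict:
    """Fraction of queries whose true match ranks in the top K.

    Assumes query_emb[i] (e.g., a caption embedding) and
    gallery_emb[i] (e.g., an image embedding) are the matching pair.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    gallery_emb = F.normalize(gallery_emb, dim=-1)

    sims = query_emb @ gallery_emb.t()        # (n, n) cosine similarities
    true_scores = sims.diag().unsqueeze(1)    # score of each true pair

    # Rank of the true match = number of gallery items scoring at
    # least as high (rank 1 is best; ties count against the query).
    ranks = (sims >= true_scores).sum(dim=1)
    return {k: (ranks <= k).float().mean().item() for k in ks}
```

Running the same function in both directions (text-to-image and image-to-text) yields the two retrieval directions typically reported as R@1/R@5/R@10.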