Multimodal embeddings are dense vector representations that map data from multiple modalities -- text, images, video frames, audio segments -- into a single shared vector space where semantic similarity can be measured across them. A text description like 'golden retriever playing in snow' and a photo of that scene end up with nearby vectors, so you can search images with text queries, find similar videos from a reference image, or cluster content by theme regardless of modality.
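As a concrete illustration of text-to-image search, here is a minimal sketch using the sentence-transformers library with one of its CLIP wrappers; the model name is just one common choice, and the image file names are placeholders.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Load a CLIP-style model that embeds both text and images into one shared space.
model = SentenceTransformer("clip-ViT-B-32")

# Encode a text query and a few candidate images (placeholder file names).
query_vec = model.encode("golden retriever playing in snow")
image_vecs = model.encode([Image.open(p) for p in ["dog_snow.jpg", "cat_sofa.jpg"]])

# Cosine similarity ranks the images by semantic relevance to the text query.
scores = util.cos_sim(query_vec, image_vecs)
print(scores)  # higher score = closer match in the shared embedding space
```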
Multimodal embedding models use paired or aligned encoders to project different data types into the same vector space. During training, the model learns from large datasets of paired examples (e.g., images with captions, videos with descriptions) to place semantically similar items close together regardless of their modality. At inference time, any input -- a text query, an image, an audio clip -- is encoded into a fixed-length vector. Distance metrics like cosine similarity then measure how related any two items are, even if they are from different modalities.
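Because every input is reduced to a fixed-length vector, comparing items is just vector math. The toy values below are made up for illustration; in practice the vectors would come from the encoders described above.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors of the same length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy fixed-length vectors standing in for a text embedding and an image embedding.
text_vec = np.array([0.12, -0.48, 0.33, 0.80])
image_vec = np.array([0.10, -0.52, 0.30, 0.77])

print(cosine_similarity(text_vec, image_vec))  # close to 1.0 -> semantically related
```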