Multimodal learning is a branch of machine learning that builds models capable of processing and relating information from multiple modalities such as text, images, video, audio, and structured data. These models learn shared representations that capture cross-modal relationships, enabling tasks like visual question answering, image captioning, and cross-modal retrieval.
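To make the idea of a shared representation concrete, here is a minimal sketch of cross-modal (text-to-image) retrieval in a joint embedding space. The encoders are placeholders (fixed random projections standing in for trained vision and text models), and the dimensions and function names are illustrative assumptions; only the retrieval mechanics carry over to a real system.

```python
# Cross-modal retrieval sketch: placeholder encoders map images and text
# into one shared space, and retrieval ranks candidates by cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 64
IMG_PROJ = rng.standard_normal((8 * 8, EMBED_DIM))   # stand-in image encoder weights
TXT_TABLE = rng.standard_normal((1000, EMBED_DIM))   # stand-in token embedding table

def encode_image(image: np.ndarray) -> np.ndarray:
    """Placeholder image encoder: flatten pixels and project into the shared space."""
    return image.flatten() @ IMG_PROJ

def encode_text(token_ids: list[int]) -> np.ndarray:
    """Placeholder text encoder: mean-pool token embeddings into the shared space."""
    return TXT_TABLE[token_ids].mean(axis=0)

def retrieve(query: np.ndarray, gallery: np.ndarray, k: int = 3) -> np.ndarray:
    """Rank gallery items by cosine similarity to the query embedding."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]

# Text-to-image retrieval: embed a text query, rank a gallery of image embeddings.
gallery = np.stack([encode_image(rng.standard_normal((8, 8))) for _ in range(10)])
print("top matches:", retrieve(encode_text([12, 47, 301]), gallery))
```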
Multimodal learning systems typically use a separate encoder network for each modality (e.g., a vision encoder for images, a text encoder for language) and then fuse the resulting representations through one of several strategies. Early fusion combines raw or low-level inputs before a joint encoder, late fusion merges high-level features produced by independent encoders, and cross-attention mechanisms allow the modalities to attend to each other during processing. The fused representations are trained end-to-end on multimodal objectives such as contrastive learning, captioning, or visual grounding; the sketch below illustrates the fusion strategies.
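The following PyTorch sketch contrasts late fusion and cross-attention fusion (with early fusion noted in a comment). The small MLP encoders, feature dimensions, and module names are assumptions chosen for brevity, not the architecture of any particular model.

```python
# Fusion-strategy sketch: late fusion concatenates high-level features,
# cross-attention lets one modality's tokens attend to the other's.
import torch
import torch.nn as nn

D = 128  # shared hidden size (illustrative)

class LateFusion(nn.Module):
    """Encode each modality separately, then merge the high-level features."""
    def __init__(self):
        super().__init__()
        self.vision_enc = nn.Sequential(nn.Linear(512, D), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(300, D), nn.ReLU())
        self.head = nn.Linear(2 * D, D)

    def forward(self, image_feats, text_feats):
        v = self.vision_enc(image_feats)
        t = self.text_enc(text_feats)
        return self.head(torch.cat([v, t], dim=-1))  # late fusion: concat features

class CrossAttentionFusion(nn.Module):
    """Let text tokens attend to image patch tokens during processing."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(D)

    def forward(self, text_tokens, image_tokens):
        # Queries come from text; keys and values come from the image.
        attended, _ = self.attn(text_tokens, image_tokens, image_tokens)
        return self.norm(text_tokens + attended)

# Early fusion, by contrast, would simply concatenate projected/tokenized inputs
# before a single joint encoder, e.g. torch.cat([img_tokens, txt_tokens], dim=1).

batch = 2
late = LateFusion()(torch.randn(batch, 512), torch.randn(batch, 300))
cross = CrossAttentionFusion()(torch.randn(batch, 16, D), torch.randn(batch, 49, D))
print(late.shape, cross.shape)  # torch.Size([2, 128]) torch.Size([2, 16, 128])
```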
Common multimodal architectures include dual-encoder models (CLIP, ALIGN) that learn aligned embedding spaces through contrastive learning, encoder-decoder models (Flamingo, LLaVA) that condition language generation on visual inputs, and unified transformers (Gemini, GPT-4V) that process all modalities as tokens in a shared sequence. Training data consists of paired examples (image-caption, video-text, audio-transcript), typically at scales of millions to billions of pairs. Common loss functions include the InfoNCE contrastive loss, cross-entropy for generation, and matching losses for alignment.
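As a concrete instance of the contrastive objective, here is a minimal sketch of the symmetric InfoNCE loss used by CLIP-style dual encoders. The function assumes `img_emb` and `txt_emb` are L2-normalized embeddings of a batch of matched image-text pairs; the temperature value and tensor shapes are illustrative.

```python
# Symmetric InfoNCE (CLIP-style) contrastive loss sketch.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Cosine similarity matrix: entry (i, j) compares image i with caption j.
    logits = img_emb @ txt_emb.t() / temperature
    # Matched pairs sit on the diagonal, so the target for row i is class i.
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric loss: image-to-text plus text-to-image classification.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random, normalized embeddings for a batch of 8 pairs.
img = F.normalize(torch.randn(8, 256), dim=-1)
txt = F.normalize(torch.randn(8, 256), dim=-1)
print(clip_contrastive_loss(img, txt).item())
```

Minimizing this loss pulls each image embedding toward its own caption and pushes it away from the other captions in the batch, which is what produces the aligned embedding space that dual-encoder models rely on for retrieval.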