Multimodal foundation models are large-scale neural networks pretrained on diverse combinations of text, images, video, audio, and other data modalities. These models learn general-purpose representations that can be adapted to a wide range of downstream tasks, including visual question answering, image generation, cross-modal retrieval, and multimodal reasoning, without task-specific architecture changes.
Multimodal foundation models process inputs from different modalities through specialized tokenizers and encoders that convert raw data into a shared token or embedding space. A large transformer backbone processes these unified representations, learning cross-modal relationships during pretraining on billions of image-text pairs, video-text pairs, or interleaved multimodal documents. After pretraining, the model can be prompted, fine-tuned, or used zero-shot for tasks across any combination of its supported modalities.
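The sketch below illustrates this pattern in PyTorch: a vision encoder produces patch features, a learned projection maps them into the language model's embedding space, and the projected visual tokens are concatenated with text token embeddings into one sequence for the transformer backbone. It is a minimal, illustrative example rather than any specific model's implementation; the component names, dimensions, and the `embed_tokens`/`inputs_embeds` interface are assumptions.

```python
# Minimal sketch of a projection-based multimodal architecture (LLaVA-style).
# All names and dimensions are illustrative placeholders, not a real model's API.
import torch
import torch.nn as nn

class MultimodalLM(nn.Module):
    def __init__(self, vision_encoder, language_model, d_vision=1024, d_model=4096):
        super().__init__()
        self.vision_encoder = vision_encoder            # e.g. a (often frozen) ViT
        self.projector = nn.Linear(d_vision, d_model)   # maps patch features into the LM embedding space
        self.language_model = language_model            # a decoder-only transformer

    def forward(self, images, input_ids):
        # 1. Encode images into patch-level features: (batch, num_patches, d_vision)
        patch_feats = self.vision_encoder(images)
        # 2. Project into the language model's embedding space: (batch, num_patches, d_model)
        visual_tokens = self.projector(patch_feats)
        # 3. Embed the text tokens: (batch, seq_len, d_model)
        #    (assumes the LM exposes its token embedding layer as `embed_tokens`)
        text_tokens = self.language_model.embed_tokens(input_ids)
        # 4. Concatenate into a single interleaved sequence and run the backbone;
        #    training then applies next-token prediction on the text positions.
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.language_model(inputs_embeds=sequence)
```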
Major architectures include contrastive models (CLIP, SigLIP) that align modalities in a shared embedding space, generative vision-language models (Flamingo, LLaVA) that condition language generation on visual inputs, and natively multimodal sequence models (GPT-4V, Gemini) that tokenize all modalities into a single sequence. Pretraining objectives include contrastive learning (InfoNCE), next-token prediction, masked image/text modeling, and image-text matching. Model sizes range from hundreds of millions to over a trillion parameters, trained on datasets of billions of multimodal examples.
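As a concrete example of the contrastive objective, the following sketch implements a CLIP-style symmetric InfoNCE loss: matched image-text pairs in a batch are pulled together while mismatched pairs are pushed apart. The inputs `image_emb` and `text_emb` are assumed to come from separate image and text encoders; this is an illustrative formulation, not the training code of any particular model.

```python
# CLIP-style symmetric InfoNCE loss over a batch of image-text pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The correct match for each image (and each text) lies on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```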