The Per-Modality Problem
Most multimodal search systems today run separate embedding models for each content type. A video pipeline might use CLIP for frames, Whisper + BGE for transcripts, CLAP for audio events, and ColPali for embedded documents. Each model produces vectors in its own space with its own dimensionality, trained on its own data distribution.
This works, but it creates compounding problems:
Omnimodal embedding models eliminate these problems by projecting all modalities into a single shared vector space. A text embedding, an image embedding, an audio embedding, and a video embedding all live in the same space with the same dimensionality. Cross-modal search is just cosine similarity -- no bridging, no score normalization, no per-modality indexes.
What Makes a Model "Omnimodal"
The term "multimodal" has been stretched to cover everything from CLIP (text + image) to Whisper (audio to text). "Omnimodal" is more specific: it means a single model that produces dense vector embeddings across three or more modalities (minimally text, image, and audio; ideally also video and documents) in a shared representation space.
The key architectural requirements:
Shared Output Space
All modality encoders project into the same N-dimensional space. NVIDIA's Omni-Embed Nemotron uses 2048 dimensions; Microsoft's E5-Omni uses 3584 dimensions inherited from its Qwen2.5-Omni backbone. The critical property is that distance in this space is semantically meaningful across modalities: the vector for a spoken sentence should be close to the vector for the same sentence written as text, and close to a video frame depicting what the sentence describes.
Modality-Specific Encoders with Unified Heads
Omnimodal models don't process all modalities through the same encoder. Raw audio and raw images are fundamentally different data types -- waveforms vs. pixel grids -- and need different preprocessing. Instead, they use modality-specific input processing (audio tokenizers, vision patch embedders, text tokenizers) feeding into a shared transformer backbone that produces the final embedding.
The shared backbone is what creates the unified space. Models like E5-Omni add explicit alignment mechanisms on top:
Handling Temporal Modalities
Images and text are "static" inputs -- a single image or sentence maps to one embedding. Audio and video are temporal: a 30-second clip contains a sequence of information. Omnimodal models handle this in three ways:
1. Segment and embed. Split audio into fixed chunks (e.g., 30-second windows) and video into keyframe-based segments, embed each segment independently. This is how most production systems work because it creates manageable index units with timestamp alignment.
2. Pooled sequence encoding. Feed the entire temporal sequence through the encoder and pool the output (mean pooling, last-token pooling, attention-weighted pooling) into a single vector. E5-Omni uses last-token pooling. This captures more context but loses temporal granularity.
3. Multi-vector (ColBERT-style). Keep per-token embeddings and use late interaction scoring at retrieval time. ColQwen Omni takes this approach, producing token-level vectors for each modality. This preserves the most information but requires specialized indexes (e.g., PLAID) and more storage.
The Current Landscape
As of mid-2026, four omnimodal embedding models define the frontier:
E5-Omni (Microsoft, ~9B params)
State-of-the-art on MMEB-V2 (66.4 overall across 78 tasks). Built on Qwen2.5-Omni-7B with explicit cross-modal alignment (modality-aware temperature, controllable negative curriculum, batch whitening). Best audio retrieval among omnimodal models (37.7 Recall@1 on AudioCaps). MIT license.
When to use: Maximum retrieval quality matters more than inference cost. Research and production systems with GPU budget for a 9B model.
Omni-Embed Nemotron (NVIDIA, 4.7B params)
Built on the Thinker component of Qwen2.5-Omni-3B. 2048-dim embeddings. Best video retrieval scores among compared models (0.706 nDCG@10). Processes modalities independently (no interleaving), which simplifies batching.
When to use: Video-heavy workloads. Systems that need to mix modalities at ingest time without interleaving constraints.
ColQwen Omni (Vidore, ~3B params)
Extends the ColPali late-interaction paradigm to all modalities. Produces multi-vector representations instead of single dense vectors. Zero-shot audio retrieval without audio training data. Approximately 90 nDCG@5 on ViDoRe visual document retrieval.
When to use: Document-heavy workloads where late interaction scoring justifies the storage and index overhead. Systems that need to retrieve across documents, audio, and video with fine-grained matching.
VLM2Vec V2 (TIGER-Lab, ~2B params)
The smallest competitive model. Achieves 58.0 on MMEB-V2 -- competitive with 7B models from 2025. Built on Qwen2-VL-2B with LoRA fine-tuning. Focuses on image, video, and visual documents (no audio).
When to use: Cost-constrained deployments. High-throughput batch processing where embedding cost per item matters. Systems that don't need audio embedding.
Dense vs. Multi-Vector: The Retrieval Trade-Off
The choice between dense single-vector models (E5-Omni, Omni-Embed Nemotron) and multi-vector models (ColQwen Omni) is the most consequential architectural decision:
Dense embeddings:
Multi-vector (ColBERT-style) embeddings:
For most production systems, the pragmatic approach is a two-stage pipeline: use dense embeddings for first-stage retrieval (top 100-1000 candidates), then rerank with a multi-vector or cross-encoder model. This gets the precision of fine-grained matching without indexing overhead on the full collection.
Production Architecture
A production omnimodal embedding pipeline follows three stages:
Stage 1: Decompose
Split multimedia content into embeddable units:
Stage 2: Embed
Run the omnimodal model on all units. Because the model handles all modalities, you run a single model server instead of four:
# Before: four model calls per video
clip_vec = clip_model.encode(keyframe)
whisper_text = whisper_model.transcribe(audio)
bge_vec = bge_model.encode(whisper_text)
clap_vec = clap_model.encode(audio)
# After: one model, one embedding space
vec = omni_model.encode(segment) # works for any modality
Stage 3: Index
Store all embeddings in a single vector index. Because they share a space, you don't need per-modality indexes or score normalization:
# Single index for all modalities
index.add(vec, metadata={
"source_id": video_id,
"modality": "video_segment",
"start_time": 45.2,
"end_time": 52.8
})
At query time, the text query is encoded with the same model and compared against the entire index. Results from different modalities are directly comparable -- a 0.85 similarity to a video frame and a 0.83 similarity to an audio clip are on the same scale.
When Not to Use Omnimodal Models
Omnimodal models are not always the right choice:
The decision framework: if your retrieval pipeline crosses modality boundaries (text query to video results, audio query to document results), an omnimodal model simplifies architecture and often improves quality. If queries and results stay within the same modality, specialized models are simpler and faster.
Evaluation: MMEB-V2 and Beyond
The standard benchmark for omnimodal embeddings is MMEB-V2, introduced by the VLM2Vec team in 2025. It covers 78 datasets across three modality groups:
| Group | Tasks | Metric | Example Datasets |
| Image (36 tasks) | Classification, retrieval, VQA | Hit@1 | ImageNet, CIFAR, STS |
| Video (18 tasks) | Retrieval, moment retrieval, QA, classification | Hit@1 | MSRVTT, ActivityNet, FineVideo |
| VisDoc (24 tasks) | Document retrieval, page classification | nDCG@5 | ViDoRe, DocVQA, InfographicsVQA |
For production evaluation, MMEB-V2 scores correlate reasonably well with real-world retrieval quality, but two caveats apply:
1. Domain shift. MMEB-V2 benchmarks are academic datasets. If your content is surveillance footage, medical imaging, or niche industry documents, benchmark scores may not predict production performance. Always evaluate on a held-out set from your actual data.
2. Query distribution. Benchmarks use short, well-formed queries. Real users write messy queries ("that video from last tuesday with the graph"), use voice-to-text input with errors, or search in languages not well-represented in training data.