Omnimodal Embeddings: One Model for Text, Image, Audio, and Video Retrieval

The embedding model landscape: text-only models (E5, BGE, Cohere) for pure text, vision-language models (CLIP, SigLIP, OpenCLIP) for image plus text search, and multimodal models (Vertex 1408D, ImageBind, ONE-PEACE) when video, audio, images, and text must share one space.

The Per-Modality Problem

Most multimodal search systems today run separate embedding models for each content type. A video pipeline might use CLIP for frames, Whisper + BGE for transcripts, CLAP for audio events, and ColPali for embedded documents. Each model produces vectors in its own space with its own dimensionality, trained on its own data distribution.

This works, but it creates compounding problems:

Cross-modal search requires translation. To find a video clip by describing its audio, you need a query path that maps text to the audio embedding space. Each new cross-modal combination requires explicit bridging logic.

Index proliferation. Five modality-specific models means five separate vector indexes, five sets of embedding maintenance, five version-upgrade cycles. Storage and operational overhead scale linearly with modality count.

Score incomparability. A cosine similarity of 0.85 from CLIP and 0.72 from BGE measure different things. Fusing results from multiple modality-specific retrievers requires learned or heuristic score normalization -- a persistent source of retrieval quality bugs.

Latency multiplication. Querying across modalities means running the query encoder for each target space. A text query that should search frames, audio, and transcripts simultaneously requires three encoder forward passes.

Omnimodal embedding models eliminate these problems by projecting all modalities into a single shared vector space. A text embedding, an image embedding, an audio embedding, and a video embedding all live in the same space with the same dimensionality. Cross-modal search is just cosine similarity -- no bridging, no score normalization, no per-modality indexes.

What Makes a Model "Omnimodal"

The term "multimodal" has been stretched to cover everything from CLIP (text + image) to Whisper (audio to text). "Omnimodal" is more specific: it means a single model that produces dense vector embeddings across three or more modalities (minimally text, image, and audio; ideally also video and documents) in a shared representation space.

The key architectural requirements:

Shared Output Space

All modality encoders project into the same N-dimensional space. NVIDIA's Omni-Embed Nemotron uses 2048 dimensions; Microsoft's E5-Omni uses 3584 dimensions inherited from its Qwen2.5-Omni backbone. The critical property is that distance in this space is semantically meaningful across modalities: the vector for a spoken sentence should be close to the vector for the same sentence written as text, and close to a video frame depicting what the sentence describes.

Modality-Specific Encoders with Unified Heads

Omnimodal models don't process all modalities through the same encoder. Raw audio and raw images are fundamentally different data types -- waveforms vs. pixel grids -- and need different preprocessing. Instead, they use modality-specific input processing (audio tokenizers, vision patch embedders, text tokenizers) feeding into a shared transformer backbone that produces the final embedding.

The shared backbone is what creates the unified space. Models like E5-Omni add explicit alignment mechanisms on top:

Modality-aware temperature calibration -- different modalities have different "natural" similarity distributions. Audio-text pairs tend to cluster tighter than image-text pairs. Per-modality temperature scaling prevents one modality from dominating retrieval scores.

Cross-modal contrastive training -- the model sees positive pairs across all modality combinations (text-image, text-audio, image-video, etc.) during training, forcing the shared space to be genuinely cross-modal rather than just concatenated per-modality spaces.

Batch whitening -- decorrelates embedding dimensions and normalizes the covariance structure so that the geometry of the space is consistent regardless of input modality.

Handling Temporal Modalities

Images and text are "static" inputs -- a single image or sentence maps to one embedding. Audio and video are temporal: a 30-second clip contains a sequence of information. Omnimodal models handle this in three ways:

1. Segment and embed. Split audio into fixed chunks (e.g., 30-second windows) and video into keyframe-based segments, embed each segment independently. This is how most production systems work because it creates manageable index units with timestamp alignment.

2. Pooled sequence encoding. Feed the entire temporal sequence through the encoder and pool the output (mean pooling, last-token pooling, attention-weighted pooling) into a single vector. E5-Omni uses last-token pooling. This captures more context but loses temporal granularity.

3. Multi-vector (ColBERT-style). Keep per-token embeddings and use late interaction scoring at retrieval time. ColQwen Omni takes this approach, producing token-level vectors for each modality. This preserves the most information but requires specialized indexes (e.g., PLAID) and more storage.

The Current Landscape

As of mid-2026, four omnimodal embedding models define the frontier:

E5-Omni (Microsoft, ~9B params)

State-of-the-art on MMEB-V2 (66.4 overall across 78 tasks). Built on Qwen2.5-Omni-7B with explicit cross-modal alignment (modality-aware temperature, controllable negative curriculum, batch whitening). Best audio retrieval among omnimodal models (37.7 Recall@1 on AudioCaps). MIT license.

When to use: Maximum retrieval quality matters more than inference cost. Research and production systems with GPU budget for a 9B model.

Omni-Embed Nemotron (NVIDIA, 4.7B params)

Built on the Thinker component of Qwen2.5-Omni-3B. 2048-dim embeddings. Best video retrieval scores among compared models (0.706 nDCG@10). Processes modalities independently (no interleaving), which simplifies batching.

When to use: Video-heavy workloads. Systems that need to mix modalities at ingest time without interleaving constraints.

ColQwen Omni (Vidore, ~3B params)

Extends the ColPali late-interaction paradigm to all modalities. Produces multi-vector representations instead of single dense vectors. Zero-shot audio retrieval without audio training data. Approximately 90 nDCG@5 on ViDoRe visual document retrieval.

When to use: Document-heavy workloads where late interaction scoring justifies the storage and index overhead. Systems that need to retrieve across documents, audio, and video with fine-grained matching.

VLM2Vec V2 (TIGER-Lab, ~2B params)

The smallest competitive model. Achieves 58.0 on MMEB-V2 -- competitive with 7B models from 2025. Built on Qwen2-VL-2B with LoRA fine-tuning. Focuses on image, video, and visual documents (no audio).

When to use: Cost-constrained deployments. High-throughput batch processing where embedding cost per item matters. Systems that don't need audio embedding.

Dense vs. Multi-Vector: The Retrieval Trade-Off

The choice between dense single-vector models (E5-Omni, Omni-Embed Nemotron) and multi-vector models (ColQwen Omni) is the most consequential architectural decision:

Dense embeddings:

One vector per item (e.g., 2048 dims x 4 bytes = 8 KB per item)

Standard ANN indexes (HNSW, IVF) work out of the box

Retrieval is a single nearest-neighbor lookup -- fast and well-optimized

Score fusion across modalities is trivial (same space, same dimensionality)

Trade-off: coarse-grained matching. The model must compress everything about an image, audio clip, or document into one vector.

Multi-vector (ColBERT-style) embeddings:

N vectors per item (e.g., 200 tokens x 128 dims = 100 KB per document page)

Requires late interaction scoring: score = sum of max similarities between query tokens and document tokens

Specialized indexes (PLAID, ColBERT Engine) needed for efficient retrieval

10-50x more storage per item

Trade-off: token-level precision. The model preserves fine-grained information at the cost of storage and retrieval complexity.

For most production systems, the pragmatic approach is a two-stage pipeline: use dense embeddings for first-stage retrieval (top 100-1000 candidates), then rerank with a multi-vector or cross-encoder model. This gets the precision of fine-grained matching without indexing overhead on the full collection.

Production Architecture

A production omnimodal embedding pipeline follows three stages:

Stage 1: Decompose

Split multimedia content into embeddable units:

Video -- segments (scene detection) + keyframes + audio track

Audio -- fixed-length chunks (15-30s) with overlap

Documents -- page images (for visual doc retrieval) or extracted text chunks

Images -- passed through directly

Stage 2: Embed

Run the omnimodal model on all units. Because the model handles all modalities, you run a single model server instead of four:

# Before: four model calls per video
clip_vec = clip_model.encode(keyframe)
whisper_text = whisper_model.transcribe(audio)
bge_vec = bge_model.encode(whisper_text)
clap_vec = clap_model.encode(audio)

# After: one model, one embedding space
vec = omni_model.encode(segment)  # works for any modality

Stage 3: Index

Store all embeddings in a single vector index. Because they share a space, you don't need per-modality indexes or score normalization:

# Single index for all modalities
index.add(vec, metadata={
    "source_id": video_id,
    "modality": "video_segment",
    "start_time": 45.2,
    "end_time": 52.8
})

At query time, the text query is encoded with the same model and compared against the entire index. Results from different modalities are directly comparable -- a 0.85 similarity to a video frame and a 0.83 similarity to an audio clip are on the same scale.

When Not to Use Omnimodal Models

Omnimodal models are not always the right choice:

Single-modality workloads. If you only index text documents, a specialized text embedding model (BGE, Qwen3-Embedding) will outperform the text encoding of an omnimodal model. Specialization still wins within a modality.

Latency-critical paths. Omnimodal models are 3-9B parameters. If your p99 latency budget is under 10ms for encoding, a 110M-parameter MiniLM is a better fit for the text path.

No cross-modal queries. If users never search audio content using text queries, or never search video using audio, the unified space adds complexity without benefit. Per-modality models are simpler to operate.

Regulatory constraints. NVIDIA's Omni-Embed Nemotron is non-commercial. E5-Omni (MIT) and VLM2Vec V2 (Apache 2.0) are commercially friendly, but verify license compatibility for your deployment.

The decision framework: if your retrieval pipeline crosses modality boundaries (text query to video results, audio query to document results), an omnimodal model simplifies architecture and often improves quality. If queries and results stay within the same modality, specialized models are simpler and faster.

Evaluation: MMEB-V2 and Beyond

The standard benchmark for omnimodal embeddings is MMEB-V2, introduced by the VLM2Vec team in 2025. It covers 78 datasets across three modality groups:

Group

Tasks

Metric

Example Datasets

Image (36 tasks)	Classification, retrieval, VQA	Hit@1	ImageNet, CIFAR, STS
Video (18 tasks)	Retrieval, moment retrieval, QA, classification	Hit@1	MSRVTT, ActivityNet, FineVideo
VisDoc (24 tasks)	Document retrieval, page classification	nDCG@5	ViDoRe, DocVQA, InfographicsVQA

Audio evaluation is less standardized. AudioCaps Recall@1 is the most common metric, but coverage is thin compared to vision and text benchmarks.

For production evaluation, MMEB-V2 scores correlate reasonably well with real-world retrieval quality, but two caveats apply:

1. Domain shift. MMEB-V2 benchmarks are academic datasets. If your content is surveillance footage, medical imaging, or niche industry documents, benchmark scores may not predict production performance. Always evaluate on a held-out set from your actual data.

2. Query distribution. Benchmarks use short, well-formed queries. Real users write messy queries ("that video from last tuesday with the graph"), use voice-to-text input with errors, or search in languages not well-represented in training data.

The Per-Modality Problem

What Makes a Model "Omnimodal"

Shared Output Space

Modality-Specific Encoders with Unified Heads

Handling Temporal Modalities

The Current Landscape

E5-Omni (Microsoft, ~9B params)

Omni-Embed Nemotron (NVIDIA, 4.7B params)

ColQwen Omni (Vidore, ~3B params)

VLM2Vec V2 (TIGER-Lab, ~2B params)

Dense vs. Multi-Vector: The Retrieval Trade-Off

Production Architecture

Stage 1: Decompose

Stage 2: Embed

Stage 3: Index

When Not to Use Omnimodal Models

Evaluation: MMEB-V2 and Beyond

Further Reading

Put multimodal search to work

Already have vectors?

Run this on your own data

Related guides

Multimodal Chunking Strategies: How to Decompose Video, Audio, Images, and Documents for Search

Cross-Encoder Reranking: Why Two-Stage Retrieval Beats One-Stage Search

Audio Feature Extraction: How AI Agents Learn to Hear