    Updated 2026-05-15

    Omnimodal Embeddings: One Model for Text, Image, Audio, and Video Retrieval

    How a new generation of embedding models replaces per-modality pipelines with a single unified vector space. Covers the architecture shift, cross-modal alignment techniques, production trade-offs, and when to use omnimodal vs. specialized models.


    The Per-Modality Problem



    Most multimodal search systems today run separate embedding models for each content type. A video pipeline might use CLIP for frames, Whisper + BGE for transcripts, CLAP for audio events, and ColPali for embedded documents. Each model produces vectors in its own space with its own dimensionality, trained on its own data distribution.

    This works, but it creates compounding problems:

  1. Cross-modal search requires translation. To find a video clip by describing its audio, you need a query path that maps text to the audio embedding space. Each new cross-modal combination requires explicit bridging logic.
  2. Index proliferation. Five modality-specific models mean five separate vector indexes, five sets of embeddings to maintain, and five version-upgrade cycles. Storage and operational overhead scale linearly with modality count.
  3. Score incomparability. A cosine similarity of 0.85 from CLIP and 0.72 from BGE measure different things. Fusing results from multiple modality-specific retrievers requires learned or heuristic score normalization -- a persistent source of retrieval quality bugs.
  4. Latency multiplication. Querying across modalities means running the query encoder for each target space. A text query that should search frames, audio, and transcripts simultaneously requires three encoder forward passes.
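
    To make the score-fusion problem concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common heuristic for merging ranked lists whose raw scores are not comparable. The retriever outputs, item IDs, and the constant k=60 are assumptions chosen for illustration, not values taken from any system described here.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked ID lists using ranks only, ignoring raw scores."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            fused[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(fused, key=lambda item_id: (-fused[item_id], item_id))

# Hypothetical outputs from three modality-specific retrievers.
frame_hits = ["clip_7", "clip_2", "clip_9"]
transcript_hits = ["clip_2", "clip_7", "clip_4"]
audio_hits = ["clip_2", "clip_9", "clip_1"]

fused_ranking = reciprocal_rank_fusion([frame_hits, transcript_hits, audio_hits])
```

    Rank-based fusion sidesteps the mismatch between score scales, but it also discards how confident each retriever was, which is one reason fusion logic of this kind tends to need ongoing tuning.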


    Omnimodal embedding models eliminate these problems by projecting all modalities into a single shared vector space. A text embedding, an image embedding, an audio embedding, and a video embedding all live in the same space with the same dimensionality. Cross-modal search is just cosine similarity -- no bridging, no score normalization, no per-modality indexes.

    What Makes a Model "Omnimodal"



    The term "multimodal" has been stretched to cover everything from CLIP (text + image) to Whisper (audio to text). "Omnimodal" is more specific: it means a single model that produces dense vector embeddings across three or more modalities (minimally text, image, and audio; ideally also video and documents) in a shared representation space.

    The key architectural requirements:

    Shared Output Space



    All modality encoders project into the same N-dimensional space. NVIDIA's Omni-Embed Nemotron uses 2048 dimensions; Microsoft's E5-Omni uses 3584 dimensions inherited from its Qwen2.5-Omni backbone. The critical property is that distance in this space is semantically meaningful across modalities: the vector for a spoken sentence should be close to the vector for the same sentence written as text, and close to a video frame depicting what the sentence describes.

    Modality-Specific Encoders with Unified Heads



    Omnimodal models don't process all modalities through the same encoder. Raw audio and raw images are fundamentally different data types -- waveforms vs. pixel grids -- and need different preprocessing. Instead, they use modality-specific input processing (audio tokenizers, vision patch embedders, text tokenizers) feeding into a shared transformer backbone that produces the final embedding.

    The shared backbone is what creates the unified space. Models like E5-Omni add explicit alignment mechanisms on top:

  1. Modality-aware temperature calibration -- different modalities have different "natural" similarity distributions. Audio-text pairs tend to cluster tighter than image-text pairs. Per-modality temperature scaling prevents one modality from dominating retrieval scores.
  2. Cross-modal contrastive training -- the model sees positive pairs across all modality combinations (text-image, text-audio, image-video, etc.) during training, forcing the shared space to be genuinely cross-modal rather than just concatenated per-modality spaces.
  3. Batch whitening -- decorrelates embedding dimensions and normalizes the covariance structure so that the geometry of the space is consistent regardless of input modality.
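
    The mechanisms above can be sketched generically in NumPy. This is not E5-Omni's actual implementation: the temperature table, the symmetric InfoNCE form, and the ZCA-style whitening are assumptions chosen to illustrate the ideas, and all values are made up.

```python
import numpy as np

# Hypothetical per-pair temperatures; real values would be learned or tuned.
TEMPERATURES = {("text", "image"): 0.07, ("text", "audio"): 0.05}

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(queries, targets, temperature):
    """Symmetric InfoNCE: row i of queries is the positive for row i of targets."""
    q, t = l2_normalize(queries), l2_normalize(targets)
    logits = q @ t.T / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def whiten(embeddings, eps=1e-5):
    """ZCA-style whitening: zero mean, approximately identity covariance."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(embeddings)
    eigvals, eigvecs = np.linalg.eigh(cov)
    transform = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ transform

rng = np.random.default_rng(0)
text = rng.normal(size=(64, 8))
audio = text + 0.1 * rng.normal(size=(64, 8))  # toy "aligned" pairs
temp = TEMPERATURES[("text", "audio")]
loss_aligned = info_nce(text, audio, temp)
loss_random = info_nce(text, rng.normal(size=(64, 8)), temp)

# Dimensions with very different scales, then whitened.
whitened = whiten(rng.normal(size=(256, 8)) * np.arange(1, 9))
```

    A lower temperature sharpens the softmax for that modality pair, so choosing it per pair is one way to keep a tightly clustered pair from producing systematically different loss magnitudes than a looser one.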


    Handling Temporal Modalities



    Images and text are "static" inputs -- a single image or sentence maps to one embedding. Audio and video are temporal: a 30-second clip contains a sequence of information. Omnimodal models handle this in three ways:

    1. Segment and embed. Split audio into fixed chunks (e.g., 30-second windows) and video into keyframe-based segments, embed each segment independently. This is how most production systems work because it creates manageable index units with timestamp alignment.

    2. Pooled sequence encoding. Feed the entire temporal sequence through the encoder and pool the output (mean pooling, last-token pooling, attention-weighted pooling) into a single vector. E5-Omni uses last-token pooling. This captures more context but loses temporal granularity.

    3. Multi-vector (ColBERT-style). Keep per-token embeddings and use late interaction scoring at retrieval time. ColQwen Omni takes this approach, producing token-level vectors for each modality. This preserves the most information but requires specialized indexes (e.g., PLAID) and more storage.
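
    The difference between the pooling strategies in option 2 is easy to see on a toy batch. The sketch below assumes right-padded sequences and a 0/1 attention mask; the array shapes and values are illustrative only.

```python
import numpy as np

def mean_pool(hidden_states, mask):
    """Average the hidden states of real (non-padding) positions."""
    weights = mask[:, :, None].astype(hidden_states.dtype)
    return (hidden_states * weights).sum(axis=1) / weights.sum(axis=1)

def last_token_pool(hidden_states, mask):
    """Take the hidden state of the last real position in each sequence.

    Assumes right padding, i.e. real tokens come first.
    """
    last_index = mask.sum(axis=1) - 1
    return hidden_states[np.arange(len(hidden_states)), last_index]

# Toy batch: 2 sequences, up to 4 positions, hidden size 3.
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # 3 real positions, 1 padding
                 [1, 1, 1, 1]])  # 4 real positions

mean_vectors = mean_pool(hidden, mask)
last_vectors = last_token_pool(hidden, mask)
```

    Either way the whole clip collapses to one vector per sequence, which is why pooled encoding loses the timestamp-level granularity that segment-and-embed keeps.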

    The Current Landscape



    As of mid-2026, four omnimodal embedding models define the frontier:

    E5-Omni (Microsoft, ~9B params)



    State-of-the-art on MMEB-V2 (66.4 overall across 78 tasks). Built on Qwen2.5-Omni-7B with explicit cross-modal alignment (modality-aware temperature, controllable negative curriculum, batch whitening). Best audio retrieval among omnimodal models (37.7 Recall@1 on AudioCaps). MIT license.

    When to use: Maximum retrieval quality matters more than inference cost. Research and production systems with GPU budget for a 9B model.

    Omni-Embed Nemotron (NVIDIA, 4.7B params)



    Built on the Thinker component of Qwen2.5-Omni-3B. 2048-dim embeddings. Best video retrieval scores among compared models (0.706 nDCG@10). Processes modalities independently (no interleaving), which simplifies batching.

    When to use: Video-heavy workloads. Systems that need to mix modalities at ingest time without interleaving constraints.

    ColQwen Omni (Vidore, ~3B params)



    Extends the ColPali late-interaction paradigm to all modalities. Produces multi-vector representations instead of single dense vectors. Zero-shot audio retrieval without audio training data. Approximately 90 nDCG@5 on ViDoRe visual document retrieval.

    When to use: Document-heavy workloads where late interaction scoring justifies the storage and index overhead. Systems that need to retrieve across documents, audio, and video with fine-grained matching.

    VLM2Vec V2 (TIGER-Lab, ~2B params)



    The smallest competitive model. Achieves 58.0 on MMEB-V2 -- competitive with 7B models from 2025. Built on Qwen2-VL-2B with LoRA fine-tuning. Focuses on image, video, and visual documents (no audio).

    When to use: Cost-constrained deployments. High-throughput batch processing where embedding cost per item matters. Systems that don't need audio embedding.

    Dense vs. Multi-Vector: The Retrieval Trade-Off



    The choice between dense single-vector models (E5-Omni, Omni-Embed Nemotron) and multi-vector models (ColQwen Omni) is the most consequential architectural decision:

    Dense embeddings:
  1. One vector per item (e.g., 2048 dims x 4 bytes = 8 KB per item)
  2. Standard ANN indexes (HNSW, IVF) work out of the box
  3. Retrieval is a single nearest-neighbor lookup -- fast and well-optimized
  4. Score fusion across modalities is trivial (same space, same dimensionality)
  5. Trade-off: coarse-grained matching. The model must compress everything about an image, audio clip, or document into one vector.


    Multi-vector (ColBERT-style) embeddings:
  1. N vectors per item (e.g., 200 tokens x 128 dims x 4 bytes = 100 KB per document page)
  2. Requires late interaction scoring: for each query token, take the maximum similarity over all document tokens, then sum those maxima over the query tokens
  3. Specialized indexes (PLAID, ColBERT Engine) needed for efficient retrieval
  4. 10-50x more storage per item
  5. Trade-off: token-level precision. The model preserves fine-grained information at the cost of storage and retrieval complexity.
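
    As a rough illustration of late interaction scoring and of the storage arithmetic in the two lists above, the sketch below computes a MaxSim score on toy token vectors and the raw float32 footprint of each representation. The function names and toy vectors are assumptions made for the example.

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: for each query token, take its best-matching
    document token, then sum those maxima over the query tokens."""
    similarities = query_tokens @ doc_tokens.T  # (n_query, n_doc)
    return float(similarities.max(axis=1).sum())

def storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw float32 footprint, ignoring index overhead and compression."""
    return num_vectors * dims * bytes_per_value

# Toy unit-length token vectors in 2 dimensions.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])  # matches both query tokens
doc_b = np.array([[1.0, 0.0], [0.8, 0.6]])              # matches only the first well

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)

dense_bytes = storage_bytes(1, 2048)   # one 2048-dim vector
multi_bytes = storage_bytes(200, 128)  # 200 token vectors of 128 dims
```

    With these example figures the multi-vector page takes 12.5 times the space of the dense vector, at the low end of the 10-50x range; longer token sequences push it higher.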


    For most production systems, the pragmatic approach is a two-stage pipeline: use dense embeddings for first-stage retrieval (top 100-1000 candidates), then rerank with a multi-vector or cross-encoder model. This gets the precision of fine-grained matching without indexing overhead on the full collection.
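
    A minimal sketch of such a two-stage pipeline, assuming a brute-force cosine search in place of a real ANN index and MaxSim as the reranker. The function name, argument layout, and toy data are hypothetical.

```python
import numpy as np

def two_stage_search(query_vec, query_tokens, dense_index, token_store,
                     first_stage_k=100, final_k=10):
    """Dense top-k by dot product, then rerank the candidates with MaxSim.

    dense_index: (n_items, d) array of unit-normalized item vectors.
    token_store: list of (n_tokens_i, d_tok) arrays, one per item.
    """
    # Stage 1: brute-force search stands in for an ANN index here.
    dense_scores = dense_index @ query_vec
    k = min(first_stage_k, len(dense_scores))
    candidates = np.argsort(-dense_scores)[:k]

    # Stage 2: late-interaction rerank over the candidates only.
    reranked = sorted(
        candidates.tolist(),
        key=lambda i: -float((query_tokens @ token_store[i].T).max(axis=1).sum()),
    )
    return reranked[:final_k]

# Toy collection of four items.
dense_index = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
token_store = [
    np.array([[1.0, 0.0]]),
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.0, 1.0], [1.0, 0.0]]),
    np.array([[-1.0, 0.0]]),
]
query_vec = np.array([1.0, 0.0])
query_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])

results = two_stage_search(query_vec, query_tokens, dense_index, token_store,
                           first_stage_k=2, final_k=2)
```

    In this toy run item 2 would also score well under MaxSim but never reaches the reranker, which shows the main risk of the design: the second stage can only reorder what the first stage recalls.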

    Production Architecture



    A production omnimodal embedding pipeline follows three stages:

    Stage 1: Decompose



    Split multimedia content into embeddable units:
  1. Video -- segments (scene detection) + keyframes + audio track
  2. Audio -- fixed-length chunks (15-30s) with overlap
  3. Documents -- page images (for visual doc retrieval) or extracted text chunks
  4. Images -- passed through directly
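
    For the fixed-length audio case, the chunk boundaries can be computed as below. This is a sketch under assumed defaults (30-second windows, 5-second overlap); the function name and parameters are hypothetical, and scene-based video segmentation would need a different approach.

```python
def chunk_spans(duration_s, window_s=30.0, overlap_s=5.0):
    """Return (start, end) spans in seconds covering [0, duration_s]."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap_s must be non-negative and smaller than window_s")
    step = window_s - overlap_s
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break  # the final (possibly shorter) chunk reached the end
        start += step
    return spans

spans = chunk_spans(70.0, window_s=30.0, overlap_s=5.0)
```

    Each span maps naturally onto the start and end timestamps stored as metadata alongside the embedding, so a hit can be traced back to its position in the source file.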


    Stage 2: Embed



    Run the omnimodal model on all units. Because the model handles all modalities, you run a single model server instead of four:

    # Before: four model calls per video
    clip_vec = clip_model.encode(keyframe)
    whisper_text = whisper_model.transcribe(audio)
    bge_vec = bge_model.encode(whisper_text)
    clap_vec = clap_model.encode(audio)

    # After: one model, one embedding space
    vec = omni_model.encode(segment)  # works for any modality


    Stage 3: Index



    Store all embeddings in a single vector index. Because they share a space, you don't need per-modality indexes or score normalization:

    # Single index for all modalities
    index.add(vec, metadata={
        "source_id": video_id,
        "modality": "video_segment",
        "start_time": 45.2,
        "end_time": 52.8
    })
    


    At query time, the text query is encoded with the same model and compared against the entire index. Results from different modalities are directly comparable -- a 0.85 similarity to a video frame and a 0.83 similarity to an audio clip are on the same scale.
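
    To illustrate the query path, here is a small in-memory stand-in for the index, using brute-force cosine similarity. The class, its `search` method, the modality labels, and the toy 3-dimensional vectors are all assumptions for the example; a real deployment would use a vector database and embeddings produced by the model.

```python
import numpy as np

class InMemoryIndex:
    """Brute-force cosine index; a stand-in for a real vector database."""

    def __init__(self):
        self._vectors = []
        self._metadata = []

    def add(self, vec, metadata):
        vec = np.asarray(vec, dtype=np.float32)
        self._vectors.append(vec / np.linalg.norm(vec))
        self._metadata.append(metadata)

    def search(self, query_vec, top_k=5, modality=None):
        """Return (score, metadata) pairs, optionally filtered by modality."""
        query = np.asarray(query_vec, dtype=np.float32)
        query = query / np.linalg.norm(query)
        scores = np.stack(self._vectors) @ query
        hits = [
            (float(score), meta)
            for score, meta in zip(scores, self._metadata)
            if modality is None or meta["modality"] == modality
        ]
        return sorted(hits, key=lambda hit: -hit[0])[:top_k]

index = InMemoryIndex()
index.add([0.9, 0.1, 0.0], {"source_id": "vid_1", "modality": "video_segment",
                            "start_time": 45.2, "end_time": 52.8})
index.add([0.1, 0.9, 0.0], {"source_id": "aud_1", "modality": "audio_chunk",
                            "start_time": 0.0, "end_time": 30.0})
index.add([0.8, 0.2, 0.1], {"source_id": "img_1", "modality": "image"})

# Toy vector standing in for an encoded text query.
results = index.search([1.0, 0.0, 0.0], top_k=2)
audio_only = index.search([1.0, 0.0, 0.0], modality="audio_chunk")
```

    Keeping the modality in metadata means a filter is still available when a query should target one content type, without splitting the index.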

    When Not to Use Omnimodal Models



    Omnimodal models are not always the right choice:

  1. Single-modality workloads. If you only index text documents, a specialized text embedding model (BGE, Qwen3-Embedding) will typically outperform the text encoding of an omnimodal model. Specialization still wins within a modality.
  2. Latency-critical paths. The omnimodal models covered here range from roughly 2B to 9B parameters. If your p99 latency budget is under 10ms for encoding, a 110M-parameter MiniLM is a better fit for the text path.
  3. No cross-modal queries. If users never search audio content using text queries, or never search video using audio, the unified space adds complexity without benefit. Per-modality models are simpler to operate.
  4. Licensing constraints. NVIDIA's Omni-Embed Nemotron is released under a non-commercial license. E5-Omni (MIT) and VLM2Vec V2 (Apache 2.0) are commercially friendly, but verify license compatibility for your deployment.


    The decision framework: if your retrieval pipeline crosses modality boundaries (text query to video results, audio query to document results), an omnimodal model simplifies architecture and often improves quality. If queries and results stay within the same modality, specialized models are simpler and faster.

    Evaluation: MMEB-V2 and Beyond



    The standard benchmark for omnimodal embeddings is MMEB-V2, introduced by the VLM2Vec team in 2025. It covers 78 datasets across three modality groups:

    Group            | Task types                                       | Metric | Example datasets
    Image (36 tasks) | Classification, retrieval, VQA                   | Hit@1  | ImageNet, CIFAR, STS
    Video (18 tasks) | Retrieval, moment retrieval, QA, classification  | Hit@1  | MSRVTT, ActivityNet, FineVideo
    VisDoc (24 tasks)| Document retrieval, page classification          | nDCG@5 | ViDoRe, DocVQA, InfographicsVQA

    Audio evaluation is less standardized. AudioCaps Recall@1 is the most common metric, but coverage is thin compared to vision and text benchmarks.

    For production evaluation, MMEB-V2 scores correlate reasonably well with real-world retrieval quality, but two caveats apply:

    1. Domain shift. MMEB-V2 benchmarks are academic datasets. If your content is surveillance footage, medical imaging, or niche industry documents, benchmark scores may not predict production performance. Always evaluate on a held-out set from your actual data.

    2. Query distribution. Benchmarks use short, well-formed queries. Real users write messy queries ("that video from last tuesday with the graph"), use voice-to-text input with errors, or search in languages not well-represented in training data.
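
    For evaluating on your own held-out data, the two metric families used above are simple to compute. The sketch below implements Recall@k and a binary-relevance nDCG@k; the segment IDs and relevance labels are made up for illustration.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item_id in enumerate(ranked_ids[:k], start=1)
        if item_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Hypothetical held-out query with two relevant segments.
ranked = ["seg_4", "seg_9", "seg_1", "seg_7", "seg_3"]
relevant = ["seg_9", "seg_3"]

r1 = recall_at_k(ranked, relevant, k=1)
r5 = recall_at_k(ranked, relevant, k=5)
ndcg5 = ndcg_at_k(ranked, relevant, k=5)
```

    Averaging these per-query values over a sample of real user queries, including the messy ones, gives a number that is more predictive for your deployment than a leaderboard score.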

    Further Reading



  1. Contrastive Learning -- the training objective underlying most embedding models
  2. Multi-Stage Retrieval -- building retrieval pipelines that combine dense and reranking stages
  3. Visual Document Retrieval -- ColPali and ColQwen for document search without OCR
  4. Audio Feature Extraction -- how audio-specific encoders work before omnimodal models
  5. Models -- compare embedding models by modality, parameters, and benchmarks