    HF · Visual Embeddings · MIT

    e5-omni-7B

    by Haon-Chen

    State-of-the-art omnimodal embedding with explicit cross-modal alignment

    261 downloads/month
    ~9B params
    Identifiers
    Model ID
    Haon-Chen/e5-omni-7B
    Feature URI
    mixpeek://image_extractor@v1/haon_chen_e5_omni_7b_v1

    Overview

    E5-Omni is Microsoft's omnimodal embedding model. It achieves state-of-the-art results on the MMEB-V2 benchmark across text, image, audio, and video tasks. Built on Qwen2.5-Omni-7B, it introduces three techniques for cross-modal alignment: modality-aware temperature calibration, controllable negative curriculum learning, and batch whitening.

    On Mixpeek, E5-Omni delivers the highest-quality cross-modal embeddings available. Because of its explicit alignment techniques, similarity scores between different modalities (e.g., a text query vs. an audio clip) are more reliable than those from models trained with simple contrastive objectives.
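To make "similarity score" concrete, here is a minimal sketch of comparing a text-query embedding against embeddings of other modalities with cosine similarity. It assumes cosine similarity is the scoring function, which the page does not state. The vectors are toy 3-dimensional values rather than real E5-Omni output, and `cosineSimilarity`, `rankBySimilarity`, and the candidate IDs are hypothetical names introduced for illustration.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("dimension mismatch: " + a.length + " vs " + b.length);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Candidate {
  id: string;
  embedding: number[];
}

// Score every candidate against the query and sort best-first.
function rankBySimilarity(query: number[], candidates: Candidate[]): { id: string; score: number }[] {
  return candidates
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors standing in for a text query and two stored items (assumption).
const textQuery = [1, 0, 0];
const ranked = rankBySimilarity(textQuery, [
  { id: "video-frame", embedding: [0, 1, 0] },
  { id: "audio-clip", embedding: [0.9, 0.1, 0] },
]);
console.log(ranked.map((r) => r.id)); // audio-clip ranks ahead of video-frame
```

In a unified embedding space, the same ranking function can be applied regardless of which modality produced each vector.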

    Architecture

    Qwen2.5-Omni-7B backbone with three alignment components: (1) modality-aware temperature calibration, (2) controllable negative curriculum that progressively masks easy negatives, (3) batch whitening and covariance alignment. ~9B total parameters. Unified embedding space for all modalities.
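The page does not give a formula for the temperature calibration, so the following is only an illustrative sketch of where a temperature enters a standard contrastive (InfoNCE) loss, and what it could look like if that temperature depended on the modality. The `temperatureByModality` values and the function names are assumptions for illustration, not values or code from the paper.

```typescript
type Modality = "text" | "image" | "audio" | "video";

// Hypothetical per-modality temperatures (assumed values, not from the paper).
const temperatureByModality: Record<Modality, number> = {
  text: 0.05,
  image: 0.05,
  audio: 0.1,
  video: 0.07,
};

// Standard InfoNCE loss for one query:
//   loss = -log( exp(s_pos / tau) / sum_j exp(s_j / tau) )
// `similarities` holds the query's similarity to each candidate,
// and `positiveIndex` marks the matching candidate.
function infoNceLoss(similarities: number[], positiveIndex: number, tau: number): number {
  const logits = similarities.map((s) => s / tau);
  // Subtract the max logit before exponentiating for numerical stability.
  const maxLogit = Math.max(...logits);
  const sumExp = logits.reduce((acc, l) => acc + Math.exp(l - maxLogit), 0);
  const logSumExp = maxLogit + Math.log(sumExp);
  return logSumExp - logits[positiveIndex];
}

// Modality-aware variant: pick the temperature from the candidate modality.
function modalityAwareLoss(similarities: number[], positiveIndex: number, modality: Modality): number {
  return infoNceLoss(similarities, positiveIndex, temperatureByModality[modality]);
}

// Same similarities, different modality, different loss scale.
const sims = [0.9, 0.1];
console.log(modalityAwareLoss(sims, 0, "text"), modalityAwareLoss(sims, 0, "audio"));
```

A smaller temperature sharpens the softmax, so the same raw similarity gap yields a lower loss. This is one reason a single shared temperature can treat modalities with different similarity distributions unevenly.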

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";
    const mx = new Mixpeek({ apiKey: "API_KEY" });
    await mx.collections.ingest({
      collection_id: "research-library",
      source: { url: "https://example.com/lecture-recording.mp4" },
      feature_extractors: [{
        feature: "multimodal_embedding",
        model: "Haon-Chen/e5-omni-7B"
      }]
    });

    Capabilities

    • SOTA on MMEB-V2 benchmark (66.4 overall across 78 tasks)
    • Best audio retrieval among omnimodal models (37.7 Recall@1 on AudioCaps)
    • Unified text, image, audio, and video embeddings
    • Explicit cross-modal alignment for reliable similarity scores
    • Outperforms 3B models by 15+ points on MMEB-V2

    Use Cases on Mixpeek

    Cross-modal retrieval: find audio clips matching a text description
    Multimedia RAG: unified retrieval across all content types
    Audio-visual search: query meetings by both spoken content and visual slides
    Research libraries: embed papers, presentations, and recorded talks together

    Benchmarks

    Dataset                     Metric      Score   Source
    MMEB-V2 (78 tasks)          Overall     66.4    Chen et al., 2025 (arxiv:2601.03666)
    MMEB-V2 Image (36 tasks)    Hit@1       71.2    Chen et al., 2025 (arxiv:2601.03666)
    AudioCaps                   Recall@1    37.7    Chen et al., 2025 (arxiv:2601.03666)
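For readers unfamiliar with the Recall@1 metric reported above, here is a minimal sketch of how Recall@K is commonly computed for retrieval. It assumes exactly one relevant item per query, which is a simplification and not necessarily the evaluation protocol used in the paper. `recallAtK` and the sample IDs are hypothetical.

```typescript
// Recall@K with one relevant item per query: the fraction of queries
// whose relevant item appears among the top K retrieved results.
function recallAtK(rankedIds: string[][], relevantIds: string[], k: number): number {
  if (rankedIds.length !== relevantIds.length) {
    throw new Error("need exactly one relevant id per query");
  }
  if (rankedIds.length === 0) return 0;
  let hits = 0;
  rankedIds.forEach((ranking, i) => {
    // Only the first K results count toward a hit.
    if (ranking.slice(0, k).indexOf(relevantIds[i]) !== -1) hits++;
  });
  return hits / rankedIds.length;
}

// Three toy queries, each with a best-first list of retrieved ids.
const rankings = [
  ["a", "b"], // relevant "a" is at rank 1
  ["c", "d"], // relevant "d" is at rank 2
  ["e", "f"], // relevant "x" was not retrieved
];
const relevant = ["a", "d", "x"];

console.log(recallAtK(rankings, relevant, 1)); // 1 of 3 queries hit
console.log(recallAtK(rankings, relevant, 2)); // 2 of 3 queries hit
```

Benchmark tables usually report this fraction as a percentage, so a Recall@1 of 37.7 would correspond to the correct item ranking first for roughly 38 of every 100 queries under that convention.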

    Performance

    Input Size        Variable (text/image/audio/video)
    Embedding Dim     3584
    GPU Latency       ~35ms / item (A100)
    GPU Throughput    ~28 items/sec (A100)
    GPU Memory        ~18 GB
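The figures above translate into rough capacity estimates. The sketch below works through the arithmetic, assuming vectors are stored as uncompressed float32 values with no index overhead and that items are processed one at a time at the listed latency. Neither assumption is stated on the page, and all function names are hypothetical.

```typescript
const EMBEDDING_DIM = 3584;  // from the Performance table
const BYTES_PER_FLOAT32 = 4; // assumption: float32 storage
const LATENCY_MS = 35;       // from the Performance table (A100)

// Raw size of one stored vector, in bytes.
function bytesPerVector(dim: number, bytesPerValue: number): number {
  return dim * bytesPerValue;
}

// Raw storage for a collection, in GiB (excludes index and metadata overhead).
function indexSizeGiB(numVectors: number): number {
  return (numVectors * bytesPerVector(EMBEDDING_DIM, BYTES_PER_FLOAT32)) / Math.pow(1024, 3);
}

// Sequential throughput implied by a per-item latency.
function itemsPerSecond(latencyMs: number): number {
  return 1000 / latencyMs;
}

// Wall-clock hours to embed a corpus on one GPU at a given throughput.
function hoursToEmbed(numItems: number, itemsPerSec: number): number {
  return numItems / itemsPerSec / 3600;
}

console.log(bytesPerVector(EMBEDDING_DIM, BYTES_PER_FLOAT32)); // 14336 bytes (14 KiB)
console.log(indexSizeGiB(1000000).toFixed(2));                 // GiB for one million vectors
console.log(itemsPerSecond(LATENCY_MS).toFixed(1));            // items/sec implied by 35 ms
console.log(hoursToEmbed(1000000, itemsPerSecond(LATENCY_MS)).toFixed(1)); // hours for one million items
```

The implied 1000 / 35 ≈ 28.6 items/sec is consistent with the listed ~28 items/sec. Batching, quantization, or dimensionality reduction would change these estimates.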

    Specification

    Framework       HF
    Organization    Haon-Chen
    Feature         Visual Embeddings
    Output          768-dim vector
    Modalities      video, image
    Retriever       Vector Search
    Parameters      ~9B
    License         MIT
    Downloads/mo    261

    Research Paper

    e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings

    arxiv.org

    Build a pipeline with e5-omni-7B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
