    HF | Visual Embeddings | NVIDIA OneWay NC

    omni-embed-nemotron-3b

    by nvidia

    Unified embedding model for text, image, audio, and video retrieval in a single vector space

    Downloads: N/A per month
    Parameters: 4.7B
    Identifiers
    Model ID
    nvidia/omni-embed-nemotron-3b
    Feature URI
    mixpeek://image_extractor@v1/nvidia_omni_embed_nemotron_3b_v1

    Overview

    Omni-Embed Nemotron is NVIDIA's omnimodal embedding model that encodes text, images, audio, and video into a shared 2048-dimensional vector space. Built on the Thinker component of Qwen2.5-Omni-3B, it processes each modality independently and projects into a single retrieval-ready embedding.

    On Mixpeek, Omni-Embed Nemotron enables true cross-modal search — query with text and retrieve matching video clips, audio segments, document pages, or images from a single index. One model replaces four separate embedding pipelines.
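
To illustrate what a shared vector space makes possible, here is a minimal sketch of ranking items of different modalities against a single query embedding. It assumes cosine similarity as the scoring function, which is a common choice but is not confirmed by this page. The `IndexedItem` type, the `rankBySimilarity` helper, and the toy 4-dimensional vectors are hypothetical; real embeddings from this model have 2048 dimensions, and this is not Mixpeek's retrieval implementation.

```typescript
// Hypothetical record type for illustration only (not part of the Mixpeek SDK).
type IndexedItem = {
  id: string;
  modality: "text" | "image" | "audio" | "video";
  embedding: number[];
};

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every item against the query and keep the topK highest scores.
function rankBySimilarity(
  query: number[],
  items: IndexedItem[],
  topK: number
): { id: string; modality: string; score: number }[] {
  return items
    .map((item) => ({
      id: item.id,
      modality: item.modality,
      score: cosineSimilarity(query, item.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy 4-dim vectors standing in for 2048-dim embeddings (assumption).
const toyIndex: IndexedItem[] = [
  { id: "clip-1", modality: "video", embedding: [0.9, 0.1, 0.0, 0.1] },
  { id: "podcast-7", modality: "audio", embedding: [0.1, 0.9, 0.2, 0.0] },
  { id: "page-3", modality: "image", embedding: [0.7, 0.2, 0.1, 0.3] },
];
const toyQuery = [1.0, 0.0, 0.0, 0.1]; // stands in for an embedded text query
const topHits = rankBySimilarity(toyQuery, toyIndex, 2);
```

Because every modality lands in the same space, the ranking function never needs to branch on `modality`; in this toy data the video clip and the document page outrank the audio segment for the same query.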

    Architecture

    Transformer-based encoder derived from Qwen2.5-Omni-3B (Thinker only, no Talker). 2048-dim output embeddings. 32K max context tokens. Modality-separated encoding with independent audio and video processing paths. 4.7B parameters.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video by URL and embed it with Omni-Embed Nemotron.
    await mx.collections.ingest({
      collection_id: "media-library",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        feature: "multimodal_embedding",
        model: "nvidia/omni-embed-nemotron-3b"
      }]
    });

    Capabilities

    • Unified text, image, audio, and video embeddings in one model
    • 2048-dimensional dense vectors for cross-modal retrieval
    • 32K token context window
    • State-of-the-art video retrieval among embedding models, per NVIDIA (0.706 average nDCG@10 on LPM and FineVideo)
    • Competitive visual document retrieval (85.7 nDCG@5 on ViDoRe V1)

    Use Cases on Mixpeek

    • Cross-modal search: query with text, retrieve matching video clips or audio segments
    • Unified media index: embed an entire multimedia library into one searchable vector space
    • Podcast and meeting search: find audio moments matching visual or textual queries
    • Video library retrieval: surface relevant clips by scene description or spoken content

    Benchmarks

    Dataset                             Metric        Score   Source
    ViDoRe V1 (visual doc)              nDCG@5        85.7%   NVIDIA, 2025 (Model Card)
    MTEB text retrieval (10 tasks)      nDCG@10 avg   0.606   NVIDIA, 2025 (Model Card)
    Video retrieval (LPM + FineVideo)   nDCG@10 avg   0.706   NVIDIA, 2025 (Model Card)
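
For readers unfamiliar with the metric, the following is a minimal sketch of how nDCG@k can be computed for one query. It assumes linear gain (some implementations use 2^rel − 1 instead; the two agree for binary relevance) and assumes the ranked list contains every judged document, so the ideal ordering can be derived from it. The function names are illustrative, and this is not the evaluation code behind the reported scores.

```typescript
// Discounted cumulative gain over the first k positions, with linear gain
// and a log2 position discount (position 1 is undiscounted).
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.
// Assumption: rankedRelevances covers all judged documents for the query.
function ndcgAtK(rankedRelevances: number[], k: number): number {
  const ideal = [...rankedRelevances].sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(rankedRelevances, k) / idealDcg;
}

// Relevant documents retrieved at ranks 1 and 3 out of 5.
const exampleScore = ndcgAtK([1, 0, 1, 0, 0], 5);
```

In the example, the system DCG is 1 + 1/2 = 1.5 and the ideal DCG is 1 + 1/log2(3) ≈ 1.631, giving an nDCG@5 of roughly 0.92. Benchmark figures such as those in the table are averages of this per-query value.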

    Performance

    Input Size: Variable (text/image/audio/video)
    Embedding Dim: 2048
    GPU Latency: ~18 ms / item (A100)
    GPU Throughput: ~55 items/sec (A100)
    GPU Memory: ~9.5 GB
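
As a rough planning aid, the figures above can be turned into a back-of-the-envelope estimate of embedding time and vector storage. This sketch assumes a single A100 running at the quoted ~55 items/sec, one 2048-dimensional vector per item, and float32 storage at 4 bytes per value; none of these assumptions is stated on this page beyond the throughput and dimension figures, and `estimateIndexing` is a hypothetical helper.

```typescript
const EMBEDDING_DIM = 2048;    // from the table above
const ITEMS_PER_SECOND = 55;   // approximate A100 throughput from the table
const BYTES_PER_FLOAT32 = 4;   // assumption: vectors stored as float32

// Estimate wall-clock hours to embed itemCount items on one GPU and the
// raw storage needed for the resulting vectors (index overhead excluded).
function estimateIndexing(itemCount: number): { hours: number; gigabytes: number } {
  const seconds = itemCount / ITEMS_PER_SECOND;
  const bytes = itemCount * EMBEDDING_DIM * BYTES_PER_FLOAT32;
  return { hours: seconds / 3600, gigabytes: bytes / 1e9 };
}

const estimate = estimateIndexing(1000000);
```

For one million items this works out to about 5 hours of GPU time and about 8.2 GB of raw vectors. Long videos or audio files that are split into multiple segments would produce more than one vector each, so real totals may be higher.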

    Specification

    Framework: HF
    Organization: nvidia
    Feature: Visual Embeddings
    Output: 2048-dim vector
    Modalities: text, image, audio, video
    Retriever: Vector Search
    Parameters: 4.7B
    License: NVIDIA OneWay NC
    Downloads/mo: N/A

    Research Paper

    Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model

    arxiv.org

    Build a pipeline with omni-embed-nemotron-3b

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
