    HF · Visual Embeddings · Apache-2.0

    jina-embeddings-v5-omni-nano

    by jinaai

    Compact omni-modal embedding model for text, images, video, and audio in one vector space

    4.2K downloads/month
    239M parameters
    Identifiers
    Model ID
    jinaai/jina-embeddings-v5-omni-nano
    Feature URI
    mixpeek://image_extractor@v1/jina_embeddings_v5_omni_nano

    Overview

    Jina Embeddings v5 Omni Nano is the smallest model in the Jina v5 omni family. It places text, images, video frames, and audio in a single shared vector space. At ~239M parameters, it runs efficiently on edge devices and in high-throughput pipelines.

    The model shares its text embedding space with jina-v5-text, so existing text indexes remain compatible when multimodal content is added. This makes it a low-friction path to cross-modal search.

    Architecture

    The model is a multimodal transformer encoder with separate input projections for the text, image, video, and audio modalities. All modalities project into one shared embedding space. Matryoshka representation learning allows flexible output dimensions.
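To illustrate what flexible output dimensions mean in practice, here is a minimal sketch of the usual Matryoshka pattern: keep a prefix of the vector and re-normalize it. It assumes prefix truncation applies to this model, which the page does not spell out. The helper name `truncate_embedding` and the toy 8-dim vector are hypothetical stand-ins for a real 768-dim embedding.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length.

    Assumption: the model's Matryoshka training makes vector prefixes
    usable on their own; check the model card for supported sizes.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to normalize
    return [x / norm for x in head]

# Toy 8-dim vector standing in for a real embedding (made-up values).
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.1, 0.05, 0.0]
short = truncate_embedding(full, 4)
print(len(short), short)
```

Shorter vectors reduce storage and speed up search at some cost in retrieval quality, so the truncated size is worth validating on your own data.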

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    # Replace YOUR_KEY with a real API key.
    mx = Mixpeek(api_key="YOUR_KEY")

    # Ingest assets from S3 and embed them with the omni-nano model.
    mx.ingest(
        collection_id="media-library",
        source="s3://assets/",
        extractors=[{
            "type": "visual_embedding",
            "model": "jinaai/jina-embeddings-v5-omni-nano",
            "output_feature": "omni_embedding",
        }],
    )

    Capabilities

    • Omni-modal: text, images, video, audio in one space
    • Backwards-compatible with jina-v5-text indexes
    • ~239M parameters for edge/high-throughput deployment
    • Matryoshka dimensions for flexible storage
    • Apache 2.0 license

    Use Cases on Mixpeek

    • Cross-modal search (find images matching text queries, or vice versa)
    • High-throughput multimodal indexing where latency matters
    • Edge deployment for on-device multimodal understanding
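Cross-modal search over a shared vector space comes down to nearest-neighbor ranking. The sketch below ranks a few items against a text query by cosine similarity. The vectors are made-up 3-dim values for illustration, not real model output, and the `cosine` and `rank` helpers are hypothetical names; a production setup would use a vector index rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, items):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up embeddings for an image, an audio clip, and a video.
index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "engine.wav": [0.0, 0.2, 0.9],
    "kitten.mp4": [0.8, 0.3, 0.1],
}
text_query = [1.0, 0.0, 0.0]  # stand-in for an embedded text query

results = rank(text_query, index)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Because every modality lands in the same space, the same ranking function works whether the query is text and the items are media, or the reverse.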

    Benchmarks

    Dataset                 Metric      Score                           Source
    Cross-modal retrieval   Recall@10   Competitive with 677M variant   Jina AI, May 2026

    Performance

    Input Size: Text: 8192 tokens; Image: 224x224+; Audio: 30s clips
    GPU Latency: ~3 ms / item (A100)
    GPU Throughput: ~3000 items/sec (A100, batch 128)
    GPU Memory: ~0.5 GB

    Specification

    Framework: HF
    Organization: jinaai
    Feature: Visual Embeddings
    Output: 768-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 239M
    License: Apache-2.0
    Downloads/mo: 4.2K
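The throughput and output-dimension figures above allow a rough capacity estimate. This is a back-of-the-envelope sketch, assuming vectors are stored as float32 (4 bytes per value) and that the ~3000 items/sec figure holds for the whole job. Neither is guaranteed by the page, the 384-dim case is a hypothetical truncated size, and the function names are illustrative only.

```python
def index_size_bytes(num_vectors, dim=768, bytes_per_value=4):
    """Raw vector storage, excluding index overhead and metadata."""
    return num_vectors * dim * bytes_per_value

def embedding_seconds(num_items, items_per_sec=3000):
    """Time to embed a corpus at a sustained throughput (assumed constant)."""
    return num_items / items_per_sec

n = 1_000_000
full_gb = index_size_bytes(n) / 1e9           # 768-dim float32 vectors
half_gb = index_size_bytes(n, dim=384) / 1e9  # hypothetical truncated size
minutes = embedding_seconds(n) / 60

print(f"768-dim: {full_gb:.3f} GB, 384-dim: {half_gb:.3f} GB")
print(f"embedding time: {minutes:.1f} min")
```

Real numbers will vary with input modality, batch size, and I/O, so treat the result as an order-of-magnitude guide.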

    Research Paper

    Jina Embeddings v5 Omni: Multimodal Embeddings for Text, Image, Audio, and Video

    arxiv.org

    Build a pipeline with jina-embeddings-v5-omni-nano

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
