    HF · Visual Embeddings · CC BY-NC 4.0

    jina-embeddings-v5-omni-small

    by jinaai

    True omni-modal embeddings: text, image, audio, and video in one vector space

    16.9K downloads/month
    2B parameters
    Identifiers
    Model ID: jinaai/jina-embeddings-v5-omni-small
    Feature URI: mixpeek://image_extractor@v1/jina_embeddings_v5_omni_small

    Overview

    Jina Embeddings v5 Omni Small is a 2B-parameter embedding model that accepts text, images, audio, and video as input and produces 1024-dimensional vectors in a single shared embedding space. Because every modality maps into that same space, you can index a video and then query it with text, an image, or an audio clip.

    The model aligns with jina-embeddings-v5-text, so text-only queries remain high quality. It supports Matryoshka representation learning, allowing you to truncate embeddings to smaller dimensions (512, 256) with graceful quality degradation.
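The truncation described above can be sketched in a few lines. This is a minimal illustration, not part of the Mixpeek SDK: `truncateEmbedding` is a hypothetical helper, the input vector is synthetic stand-in data, and re-normalizing after truncation is an assumption that suits cosine or dot-product scoring.

```typescript
// Hypothetical helper (not a Mixpeek or Jina API): keep the first `dim`
// components of a Matryoshka embedding and re-normalize to unit length.
function truncateEmbedding(vec: number[], dim: number): number[] {
  if (!Number.isInteger(dim) || dim <= 0 || dim > vec.length) {
    throw new RangeError(`dim must be an integer in 1..${vec.length}, got ${dim}`);
  }
  const head = vec.slice(0, dim);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  // Leave an all-zero vector untouched rather than dividing by zero.
  return norm === 0 ? head : head.map((x) => x / norm);
}

// Synthetic stand-in for a 1024-dim model output (not real embedding values).
const full: number[] = Array.from({ length: 1024 }, (_, i) => Math.sin(i + 1));

const half = truncateEmbedding(full, 512);
const quarter = truncateEmbedding(full, 256);
console.log(half.length, quarter.length);
```

Query and index vectors should be truncated to the same dimension before they are compared; mixing a 512-dim query with a 1024-dim index would not produce meaningful scores.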

    Architecture

    The model is based on a multimodal encoder in which separate modality-specific preprocessors feed into a shared transformer backbone. It supports Matryoshka dimensions (1024, 512, 256) and is available in GGUF format for llama.cpp deployment.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your own key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and extract embeddings with this model.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        feature: "visual_embeddings",
        model: "jinaai/jina-embeddings-v5-omni-small"
      }]
    });

    Capabilities

    • Accepts text, images, audio, and video as embedding input
    • 1024-dimensional output aligned across all modalities
    • Matryoshka dimensions for size-quality tradeoff
    • Compatible with jina-embeddings-v5-text vector space
    • GGUF format available for edge deployment
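To make the size side of the Matryoshka tradeoff concrete, the arithmetic below estimates raw vector storage at each supported dimension. It is a rough sketch that assumes vectors are stored as 32-bit floats and ignores index overhead and metadata; `indexSizeBytes` is a hypothetical helper, and the one-million-vector count is an arbitrary example.

```typescript
const BYTES_PER_FLOAT32 = 4; // assumption: vectors stored as float32

// Hypothetical helper: raw storage for `numVectors` embeddings of size `dim`.
function indexSizeBytes(numVectors: number, dim: number): number {
  return numVectors * dim * BYTES_PER_FLOAT32;
}

for (const dim of [1024, 512, 256]) {
  const mib = indexSizeBytes(1e6, dim) / (1024 * 1024);
  console.log(`${dim} dims: ${mib.toFixed(1)} MiB per million vectors`);
}
```

Under these assumptions, each halving of the dimension halves raw storage, so a 256-dim index takes a quarter of the space of the full 1024-dim one.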

    Use Cases on Mixpeek

    • Cross-modal search: index videos, query with text or audio clips
    • Multimodal RAG: embed documents with mixed text, images, and audio into a single retriever
    • Content deduplication across modalities: find similar content regardless of format
    • Agent perception: give agents a unified embedding space for all sensory inputs
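
Cross-modal search over a shared space comes down to comparing a query vector against stored vectors, whatever modality produced them. The sketch below ranks items by cosine similarity. The `IndexedItem` type, the helper functions, and the 3-dimensional toy vectors are all illustrative assumptions; in practice the vectors would be model outputs and ranking would be handled by a vector search backend.

```typescript
// Illustrative types and data only; not a Mixpeek API.
type Modality = "text" | "image" | "audio" | "video";
type IndexedItem = { id: string; modality: Modality; vector: number[] };
type Scored = { id: string; modality: Modality; score: number };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every item against the query and return the topK highest-scoring ones.
function rankBySimilarity(query: number[], items: IndexedItem[], topK: number): Scored[] {
  return items
    .map((item) => ({
      id: item.id,
      modality: item.modality,
      score: cosineSimilarity(query, item.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy 3-dim vectors standing in for real 1024-dim embeddings.
const toyIndex: IndexedItem[] = [
  { id: "clip-1", modality: "video", vector: [0.9, 0.1, 0.0] },
  { id: "photo-7", modality: "image", vector: [0.0, 1.0, 0.0] },
  { id: "memo-3", modality: "audio", vector: [0.7, 0.7, 0.1] },
];
const textQueryVector = [1.0, 0.0, 0.0]; // pretend this came from a text query

const topMatches = rankBySimilarity(textQueryVector, toyIndex, 2);
console.log(topMatches.map((m) => `${m.id} (${m.modality})`));
```

The same ranking applies to the deduplication use case: items whose pairwise similarity exceeds a chosen threshold can be flagged as near-duplicates regardless of format.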

    Specification

    Framework: HF
    Organization: jinaai
    Feature: Visual Embeddings
    Output: 1024-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 2B
    License: CC BY-NC 4.0
    Downloads/mo: 16.9K

    Research Paper

    Jina Embeddings v5

    arxiv.org

    Build a pipeline with jina-embeddings-v5-omni-small

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
