    HF | Visual Embeddings | Apache-2.0

    jina-embeddings-v4

    by jinaai

    Universal multimodal multilingual embeddings with task-specific LoRA adapters

    1.5M downloads/month
    3.8B params
    Identifiers
    Model ID: jinaai/jina-embeddings-v4
    Feature URI: mixpeek://image_extractor@v1/jina_embeddings_v4

    Overview

    Jina Embeddings v4 is a 3.8B-parameter multimodal embedding model built on the Qwen2.5-VL-3B-Instruct backbone. Text and images share a single processing pathway: the vision encoder converts images into token sequences, which then pass through the same language-model backbone as text. The model supports two output modes: single-vector (2048-dim, truncatable to 128) and multi-vector (128-dim per token) for late-interaction retrieval.
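Multi-vector output is typically scored with late interaction, where every query token is matched against its best-matching document token. The sketch below shows ColBERT-style MaxSim scoring in plain Python. It assumes the token vectors are already L2-normalized, uses toy 3-dim vectors in place of the real 128-dim ones, and the function names are hypothetical rather than part of any Jina or Mixpeek API.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score: for each query token vector, take the
    highest dot product against any document token vector, then sum over
    all query tokens."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
    return total

# Toy 3-dim token vectors standing in for 128-dim multi-vector embeddings.
query = [l2_normalize([1.0, 0.0, 0.0]), l2_normalize([0.0, 1.0, 0.0])]
doc_a = [l2_normalize([1.0, 0.0, 0.0]), l2_normalize([0.0, 1.0, 0.0])]
doc_b = [l2_normalize([0.0, 0.0, 1.0]), l2_normalize([1.0, 1.0, 0.0])]

print(maxsim_score(query, doc_a))  # 2.0: every query token finds an exact match
print(maxsim_score(query, doc_b))  # about 1.414: only partial matches
```

Because scoring keeps per-token detail, a query term can latch onto one small region of a page (a chart label, a table cell) instead of being averaged away, at the cost of storing one vector per token.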

    Three task-specific LoRA adapters (60M parameters each) optimize performance for retrieval, text-matching, and code search without modifying the frozen backbone. On Mixpeek, jina-embeddings-v4 powers cross-modal search across documents with tables, charts, and mixed-media content, excelling where visual layout matters as much as text.
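LoRA keeps a layer's weight matrix W frozen and adds a trainable low-rank update, so the layer computes y = W x + (alpha / r) · B (A x). The toy sketch below assumes rank-1 adapters on a 2×2 layer; the adapter values, the dictionary keys, and the `lora_forward` helper are hypothetical and only illustrate how swapping adapters changes the output while the shared backbone weight stays untouched.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    """LoRA forward pass: y = W x + (alpha / r) * B (A x).
    W (d_out x d_in) is the frozen base weight; A (r x d_in) and
    B (d_out x r) are the small trainable adapter matrices."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Frozen backbone weight shared by every task (identity for illustration).
W = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical rank-1 adapters, one per task: (A, B) pairs.
adapters = {
    "retrieval": ([[1.0, 0.0]], [[0.5], [0.0]]),
    "text-matching": ([[0.0, 1.0]], [[0.0], [0.5]]),
    "code": ([[1.0, 1.0]], [[0.25], [0.25]]),
}

x = [2.0, 4.0]
for task, (A, B) in adapters.items():
    # Same input, same frozen W; only the selected adapter differs.
    print(task, lora_forward(W, A, B, x, alpha=1.0, r=1))
```

Since each adapter is small relative to the 3.8B-parameter backbone, all three can be kept in memory and selected per request.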

    Architecture

    Qwen2.5-VL-3B-Instruct backbone with a vision encoder that converts images into token sequences. Dual output modes: single-vector (2048-dim, via mean pooling over token representations) and multi-vector (128-dim per token, via projection layers). Three LoRA adapters (60M parameters each) for retrieval, text-matching, and code search, trained on top of the frozen backbone and selected per task.
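The two output heads can be pictured as two ways of reading the same token hidden states. The sketch below assumes a padding mask and a plain linear projection; the toy sizes (hidden size 4, projection to 2 instead of 128) and the helper names are hypothetical stand-ins, not the model's actual code.

```python
def mean_pool(hidden_states, attention_mask):
    """Single-vector mode: average the hidden states of real tokens,
    skipping padding positions (mask value 0)."""
    dim = len(hidden_states[0])
    kept = [h for h, m in zip(hidden_states, attention_mask) if m]
    return [sum(h[i] for h in kept) / len(kept) for i in range(dim)]

def project_tokens(hidden_states, attention_mask, proj):
    """Multi-vector mode: map each real token's hidden state through a
    linear projection (proj is out_dim x hidden_dim), one vector per token."""
    return [
        [sum(w * x for w, x in zip(row, h)) for row in proj]
        for h, m in zip(hidden_states, attention_mask)
        if m
    ]

# Three token positions; the last one is padding and must be ignored.
hidden = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
proj = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

print(mean_pool(hidden, mask))             # [2.0, 2.0, 2.0, 2.0]
print(project_tokens(hidden, mask, proj))  # [[1.0, 4.0], [3.0, 0.0]]
```

Pooling yields one compact vector per input, suited to standard vector indexes; projection keeps one vector per token, which late-interaction scoring needs.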

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    # Ingest mixed-media reports from S3 and embed them with jina-embeddings-v4.
    mx.ingest(
        collection_id="mixed-media-docs",
        source="s3://reports/",
        extractors=[{
            "type": "visual_embedding",
            "model": "jinaai/jina-embeddings-v4",
            "output_feature": "multimodal_embedding",
        }],
    )

    Capabilities

    • Multimodal: text and image in a shared embedding space
    • 2048-dimensional single-vector or 128-dim multi-vector output
    • Task-specific LoRA adapters for retrieval, matching, and code
    • Matryoshka dimensions (2048 down to 128)
    • Strong on visually rich documents: tables, charts, diagrams
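Matryoshka-style embeddings are trained so that a prefix of the vector is itself a usable embedding. A minimal sketch of truncation, assuming the common practice of re-normalizing the kept prefix so cosine similarity remains well defined; the helper name is hypothetical.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components of a Matryoshka embedding and
    rescale them to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Toy 4-dim vector: keeping 2 components gives a 3-4-5 triangle.
print(truncate_and_normalize([3.0, 4.0, 0.5, -0.5], 2))  # [0.6, 0.8]

# Full-size case: a 2048-dim embedding cut down to the 128-dim minimum.
embedding = [0.01 * (i % 7) for i in range(2048)]
compact = truncate_and_normalize(embedding, 128)
print(len(compact))  # 128
```

Truncating from 2048 to 128 dimensions shrinks the index 16× and speeds up search, trading away some retrieval quality.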

    Use Cases on Mixpeek

    • Cross-modal document retrieval where layout and visuals matter (charts, infographics)
    • Multilingual semantic search across mixed-media collections
    • Code search and retrieval with the dedicated code LoRA adapter

    Benchmarks

    Dataset | Metric | Score | Source
    MTEB-en (text retrieval) | nDCG@10 | 55.97 | Jina AI, 2025 (jina-embeddings-v4 paper)
    CLIP Benchmark (cross-modal) | Score | 84.11 | Jina AI, 2025 (jina-embeddings-v4 paper)
    LongEmbed | Score | 67.11 | Jina AI, 2025 (jina-embeddings-v4 paper)
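nDCG@10 rewards placing highly relevant results near the top of the ranking. The sketch below implements the linear-gain form, DCG@k = Σ rel_i / log2(i + 1) for ranks i = 1..k, normalized by the DCG of the ideal ordering. It is a simplification: the ideal ranking is built from the same result list, whereas benchmark tooling typically builds it from all judged documents for the query.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k ranks (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of the results returned for one query, in ranked order.
print(round(ndcg_at_k([3, 2, 1, 0]), 4))  # 1.0: already the ideal order
print(round(ndcg_at_k([3, 2, 0, 1]), 4))  # slightly below 1: one result misplaced
```

A reported nDCG@10 of 55.97 is this per-query value averaged over queries and datasets and expressed as a percentage.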

    Performance

    Input size: text up to 8192 tokens; images at variable resolution
    Embedding dim: 2048 (single-vector) / 128 per token (multi-vector)
    GPU latency: ~15 ms/item (A100)
    GPU throughput: ~200 items/sec (A100, batch 32)
    GPU memory: ~8.5 GB
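The figures above support some quick capacity planning. The back-of-envelope sketch below assumes weights are held in 16-bit precision (2 bytes per parameter) and that stored embeddings use float32 (4 bytes per value); the 1000-token multi-vector document is a hypothetical example, not a measured figure.

```python
PARAMS = 3.8e9
BYTES_PER_PARAM = 2  # assumption: bf16/fp16 weights

# Weights alone; the rest of the ~8.5 GB would plausibly be activations and buffers.
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"weights: ~{weights_gb:.1f} GB")  # weights: ~7.6 GB

# Wall time per batch implied by ~200 items/sec at batch size 32.
batch_seconds = 32 / 200
print(f"per batch of 32: ~{batch_seconds * 1000:.0f} ms")  # per batch of 32: ~160 ms

def bytes_per_item(dim, n_vectors=1, bytes_per_value=4):
    """Index storage for one item, assuming float32 values."""
    return dim * n_vectors * bytes_per_value

print(bytes_per_item(2048))                 # 8192: full single vector
print(bytes_per_item(128))                  # 512: Matryoshka-truncated vector
print(bytes_per_item(128, n_vectors=1000))  # 512000: hypothetical 1000-token multi-vector
```

The storage numbers show the main trade-off between modes: a multi-vector page can cost hundreds of times more index space than a single truncated vector.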

    Specification

    Framework: HF
    Organization: jinaai
    Feature: Visual Embeddings
    Output: 2048-dim vector (single-vector mode; 128-dim per token in multi-vector mode)
    Modalities: video, image
    Retriever: Vector Search
    Parameters: 3.8B
    License: Apache-2.0
    Downloads/mo: 1.5M

    Research Paper

    jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval (arxiv.org)

    Build a pipeline with jina-embeddings-v4

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
