    HF · Text Embeddings · Apache 2.0

    Qwen3-VL-Embedding-8B

    by Qwen

    Top-ranked multimodal embedding model on MMEB-V2: unified text, image, screenshot, and video retrieval

    1.6M downloads/month
    8B params

    Identifiers

    Model ID: Qwen/Qwen3-VL-Embedding-8B
    Feature URI: mixpeek://text_extractor@v1/qwen3_vl_embed_8b_v1

    Overview

    Qwen3-VL-Embedding-8B is a unified multimodal embedding model that projects text, images, screenshots, and video into a shared vector space. It achieves state-of-the-art results on MMEB-V2 (77.9 overall), a comprehensive multimodal retrieval benchmark, and scores 83.3 on visual document retrieval, which makes it a strong general-purpose choice among multimodal embedding models.

    Built on the Qwen3-VL vision-language backbone, it supports flexible Matryoshka embedding dimensions (64 to 4096), a 32K-token context window, and 30+ languages. On Mixpeek, it powers cross-modal retrieval where a text query can match images, screenshots, video frames, or documents in a single vector search pass.

    Architecture

    Qwen3-VL vision-language backbone (8B parameters) with shared projection heads for text, image, and video modalities. Uses Matryoshka Representation Learning for flexible embedding dimensions from 64 to 4096. Supports interleaved text-image input sequences up to 32K tokens.
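As a rough illustration of what Matryoshka dimensionality means in practice: embeddings trained this way are generally shortened by keeping the leading components and re-normalizing. The sketch below assumes that convention. `truncateEmbedding` is a hypothetical helper, not part of the Mixpeek SDK, and the 8-value vector is toy data standing in for a full 4096-dim output.

```typescript
// Hypothetical helper (not a Mixpeek SDK function): shorten a Matryoshka-style
// embedding by keeping its first `dim` components, then L2-normalize the result
// so cosine and dot-product comparisons stay meaningful.
function truncateEmbedding(embedding: number[], dim: number): number[] {
  if (!Number.isInteger(dim) || dim < 1 || dim > embedding.length) {
    throw new RangeError(`dim must be an integer in [1, ${embedding.length}]`);
  }
  const head = embedding.slice(0, dim);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return head; // all-zero prefix: nothing to normalize
  return head.map((x) => x / norm);
}

// Toy data: an 8-dim vector standing in for a 4096-dim model output.
const full = [3, 4, 0, 0, 1, 2, 2, 0];
const compact = truncateEmbedding(full, 2);
console.log(compact.length); // 2
```

Smaller dimensions cut storage and search cost at some loss of retrieval quality, so the right size is worth measuring on your own data.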

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a PDF and embed it with Qwen3-VL-Embedding-8B.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/slides.pdf" },
      feature_extractors: [{
        name: "image_embedding",
        version: "v1",
        params: {
          model_id: "Qwen/Qwen3-VL-Embedding-8B",
          embedding_dim: 1024 // Matryoshka dimension; the model supports 64 to 4096
        }
      }]
    });

    Capabilities

    • Unified embeddings across text, images, video, and screenshots
    • Matryoshka flexible dimensionality (64–4096)
    • 32K context window for long documents and multi-frame video
    • 30+ language support including CJK
    • #1 on MMEB-V2 multimodal retrieval benchmark

    Use Cases on Mixpeek

    • Cross-modal search: find images by text description or text by image query
    • Visual document retrieval: search PDFs, slides, and screenshots by content
    • Video retrieval: embed and search video frames alongside transcripts
    • Multilingual multimodal search across mixed-language media libraries
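Because all modalities share one vector space, the use cases above reduce to nearest-neighbor search over a single index. The following is a minimal local sketch of that idea using cosine similarity. The `Item` type, the `rank` function, and the 3-dim vectors are illustrative assumptions; it does not show how Mixpeek performs retrieval.

```typescript
// Illustrative types and toy vectors; real embeddings have 64 to 4096 dims.
type Item = { id: string; modality: "text" | "image" | "video"; vector: number[] };

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Score every item against the query and return the topK best matches,
// regardless of which modality each item came from.
function rank(query: number[], items: Item[], topK: number): { id: string; score: number }[] {
  return items
    .map((item) => ({ id: item.id, score: cosine(query, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const mixedIndex: Item[] = [
  { id: "slide-3.png", modality: "image", vector: [0.9, 0.1, 0.0] },
  { id: "clip-7-frame-12", modality: "video", vector: [0.1, 0.9, 0.2] },
  { id: "transcript-7", modality: "text", vector: [0.2, 0.8, 0.3] },
];

// Stand-in for the embedding of a text query.
const queryVector = [0.0, 1.0, 0.1];
const results = rank(queryVector, mixedIndex, 2);
console.log(results.map((r) => r.id)); // ["clip-7-frame-12", "transcript-7"]
```

A brute-force scan like this is only practical for small collections; larger indexes typically rely on an approximate nearest-neighbor structure.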

    Benchmarks

    Dataset                          Metric   Score   Source
    MMEB-V2 (overall)                Score    77.9    Qwen, 2026 (MMEB-V2 Leaderboard)
    MMEB-V2 (visual doc retrieval)   Score    83.3    Qwen, 2026 (MMEB-V2 Leaderboard)
    MTEB Multilingual                Score    70.58   Qwen, 2026 (Model Card)

    Performance

    Input Size: Variable (text, 224px–1344px images, multi-frame video)
    Embedding Dim: 64–4096 (Matryoshka)
    GPU Latency: ~45 ms / item (A100)
    GPU Throughput: ~22 items/sec (A100)
    GPU Memory: ~16 GB
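The figures above are consistent with some back-of-the-envelope arithmetic, sketched below for capacity planning. Two assumptions are not stated on this page: that items are processed one at a time on a single GPU, and that the ~16 GB figure reflects 16-bit (fp16/bf16) weights. `hoursToEmbed` and `indexSizeGB` are hypothetical helpers, and the index size assumes vectors stored as 32-bit floats.

```typescript
// Throughput implied by per-item latency, assuming sequential processing.
const latencyMs = 45;
const itemsPerSec = 1000 / latencyMs; // about 22.2, matching the ~22 items/sec figure

// Weight memory, ASSUMING 2 bytes per parameter (fp16/bf16).
const params = 8e9;
const bytesPerParam = 2;
const weightGB = (params * bytesPerParam) / 1e9; // 16

// Hypothetical helper: wall-clock hours to embed a corpus at a given throughput.
function hoursToEmbed(items: number, throughput: number): number {
  return items / throughput / 3600;
}

// Hypothetical helper: raw vector storage in GB, assuming float32 values by default.
function indexSizeGB(vectors: number, dim: number, bytesPerValue = 4): number {
  return (vectors * dim * bytesPerValue) / 1e9;
}

console.log(itemsPerSec.toFixed(1));                // "22.2"
console.log(weightGB);                              // 16
console.log(hoursToEmbed(1_000_000, 22).toFixed(1)); // "12.6"
console.log(indexSizeGB(1_000_000, 1024));          // 4.096
console.log(indexSizeGB(1_000_000, 4096));          // 16.384
```

The last two lines show why the embedding dimension matters: at one million vectors, dropping from 4096 to 1024 dimensions cuts raw vector storage by a factor of four.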

    Specification

    Framework: HF
    Organization: Qwen
    Feature: Text Embeddings
    Output: 1024-dim vector
    Modalities: document, audio
    Retriever: Text Similarity
    Parameters: 8B
    License: Apache 2.0
    Downloads/mo: 1.6M

    Research Paper

    Qwen3-Embedding: Advancing Text and Multimodal Retrieval

    arxiv.org

    Build a pipeline with Qwen3-VL-Embedding-8B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
