    HF · Text Embeddings · Apache 2.0

    Qwen3-VL-Embedding-2B

    by Qwen

    Unified multimodal embedding for text, image, video, and screenshots

    2.4M downloads/month
    2B params
    Identifiers
    Model ID
    Qwen/Qwen3-VL-Embedding-2B
    Feature URI
    mixpeek://text_extractor@v1/qwen3_vl_embed_2b_v1

    Overview

    Qwen3-VL-Embedding-2B is a multimodal embedding model built on the Qwen3-VL architecture that generates semantically rich vectors capturing both visual and textual information in a shared embedding space. It supports Matryoshka Representation Learning for flexible embedding dimensions from 64 to 2048, retaining over 92% of peak performance even at 64 dimensions.
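As an illustration of how Matryoshka-style embeddings are commonly shortened, the sketch below keeps the leading components of a vector and re-normalizes the result. The function name `truncateEmbedding` and the toy 8-dimensional input are hypothetical; this is a generic client-side technique under those assumptions, not Mixpeek's or Qwen's documented API.

```typescript
// Keep the first `dim` components of a Matryoshka-style embedding and
// L2-normalize the result so cosine or dot-product scores stay comparable.
function truncateEmbedding(vec: number[], dim: number): number[] {
  if (!Number.isInteger(dim) || dim < 1 || dim > vec.length) {
    throw new RangeError(`dim must be an integer in [1, ${vec.length}]`);
  }
  const head = vec.slice(0, dim);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return head; // a zero vector cannot be normalized
  return head.map((x) => x / norm);
}

// Toy 8-dim vector standing in for a 2048-dim model output (assumption).
const fullVector = [3, 4, 0, 0, 1, 2, 2, 0];
const shortVector = truncateEmbedding(fullVector, 2);
console.log(shortVector); // [0.6, 0.8]
```

Query and document vectors need to be truncated to the same dimension before they are compared.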

    On Mixpeek, Qwen3-VL-Embedding-2B enables true cross-modal retrieval where users can search across images, videos, screenshots, and text documents using any modality as the query. This makes it ideal for building unified search over heterogeneous content libraries.
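Because every modality lands in the same vector space, cross-modal search reduces to nearest-neighbor ranking over one set of vectors. The sketch below shows that idea with plain cosine similarity. The `IndexedItem` type, the `rankBySimilarity` helper, and the toy 3-dimensional vectors are all hypothetical, and a production system would likely use a vector index rather than a linear scan.

```typescript
type Modality = "text" | "image" | "video" | "screenshot";

interface IndexedItem {
  id: string;
  modality: Modality;
  embedding: number[]; // assumed to come from the same embedding model as the query
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every item against the query and return the topK best, whatever their modality.
function rankBySimilarity(query: number[], items: IndexedItem[], topK: number) {
  return items
    .map((item) => ({
      id: item.id,
      modality: item.modality,
      score: cosineSimilarity(query, item.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy vectors (assumption); real embeddings would have up to 2048 dimensions.
const library: IndexedItem[] = [
  { id: "clip-1", modality: "video", embedding: [0.9, 0.1, 0.0] },
  { id: "doc-7", modality: "text", embedding: [0.0, 1.0, 0.0] },
  { id: "shot-3", modality: "screenshot", embedding: [0.7, 0.7, 0.0] },
];
const textQueryVector = [1, 0, 0];
console.log(rankBySimilarity(textQueryVector, library, 2).map((r) => r.id));
// ["clip-1", "shot-3"]
```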

    Architecture

    Built on the Qwen3-VL 2B backbone with multi-stage training: large-scale contrastive pre-training followed by distillation from a reranking model. Supports Matryoshka Representation Learning for flexible output dimensions (64 to 2048). Handles inputs of up to 32K tokens, including text, images, and video.

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video and embed it with Qwen3-VL-Embedding-2B.
    await mx.collections.ingest({
      collection_id: "my-collection",
      source: { url: "https://example.com/video.mp4" },
      feature_extractors: [{
        name: "multimodal_embedding",
        version: "v1",
        params: {
          model_id: "Qwen/Qwen3-VL-Embedding-2B"
        }
      }]
    });

    Capabilities

    • Unified embeddings across text, image, video, and screenshot inputs
    • 2048-dimensional embeddings with Matryoshka flexibility (64-2048)
    • Cross-modal retrieval: search images with text, text with images
    • Retains 92%+ performance at 64 dimensions (32x compression)
    • 30+ language support inherited from Qwen3-VL
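The 32x figure follows directly from the dimension ratio (2048 / 64). A rough storage estimate makes this concrete. It assumes vectors are stored as 4-byte float32 values with no index overhead, and the one-million-vector corpus is an arbitrary example. Neither assumption comes from the model card.

```typescript
const BYTES_PER_FLOAT32 = 4; // assumption: uncompressed float32 storage

// Raw vector storage in bytes, ignoring any index or metadata overhead.
function indexSizeBytes(numVectors: number, dim: number): number {
  return numVectors * dim * BYTES_PER_FLOAT32;
}

const corpusSize = 1_000_000; // hypothetical corpus
const fullBytes = indexSizeBytes(corpusSize, 2048); // 8,192,000,000 bytes (~8.2 GB)
const smallBytes = indexSizeBytes(corpusSize, 64);  // 256,000,000 bytes (~0.26 GB)
console.log(fullBytes / smallBytes); // 32
```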

    Use Cases on Mixpeek

    • Cross-modal search across mixed media libraries with text, image, and video content
    • Visual document retrieval for screenshot and infographic search
    • Video-text matching for content discovery across large video catalogs

    Benchmarks

    Dataset | Metric | Score | Source
    MMEB-V2 | Overall Score | ~72 (2B variant) | Qwen3-VL-Embedding paper, Jan 2026
    Image-text retrieval | Recall@10 | Competitive with 8B variant | Qwen3-VL-Embedding paper, Jan 2026
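For readers unfamiliar with the Recall@10 metric, the sketch below shows one common definition: the fraction of relevant items that appear in the top K results for a query. The helper name `recallAtK` and the toy IDs are hypothetical, and the paper's exact evaluation protocol may differ (for example, benchmarks with one relevant item per query report the share of queries with a hit).

```typescript
// Recall@K for a single query: relevant items found in the top K, divided by
// the total number of relevant items. Assumes `ranked` holds unique IDs.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Toy example: "b" is retrieved within the top 3, "d" is not.
const rankedIds = ["a", "b", "c", "d"];
const relevantIds = new Set(["b", "d"]);
console.log(recallAtK(rankedIds, relevantIds, 3)); // 0.5
```

Averaging this value over all queries gives the dataset-level score.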

    Performance

    Input Size: 32K tokens (text); variable (image); multi-frame (video)
    Embedding Dim: 2048 (Matryoshka: 64-2048)
    GPU Latency: ~12 ms / item (A100)
    GPU Throughput: ~80 items/sec (A100)
    GPU Memory: ~4.2 GB
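These figures can be turned into a back-of-the-envelope ingestion estimate. The sketch assumes the quoted ~80 items/sec holds for the whole workload and that throughput scales linearly with GPU count. Both are simplifying assumptions, and `estimateEmbeddingHours` is a hypothetical helper.

```typescript
// Estimated wall-clock hours to embed `items` inputs, assuming constant
// per-GPU throughput and perfect linear scaling across GPUs.
function estimateEmbeddingHours(items: number, itemsPerSecPerGpu: number, gpus = 1): number {
  if (itemsPerSecPerGpu <= 0 || gpus < 1) throw new RangeError("invalid throughput or GPU count");
  const seconds = items / (itemsPerSecPerGpu * gpus);
  return seconds / 3600;
}

// One million items at ~80 items/sec on a single A100.
console.log(estimateEmbeddingHours(1_000_000, 80).toFixed(2)); // "3.47"
```

Real throughput depends heavily on input type and length, so video-heavy workloads may run well below this estimate.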

    Specification

    Framework: HF
    Organization: Qwen
    Feature: Text Embeddings
    Output: 2048-dim vector (Matryoshka: 64-2048)
    Modalities: text, image, video, screenshot
    Retriever: Text Similarity
    Parameters: 2B
    License: Apache 2.0
    Downloads/mo: 2.4M

    Research Paper

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for Multimodal Retrieval

    arxiv.org

    Build a pipeline with Qwen3-VL-Embedding-2B

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
