    HF · Visual Embeddings · Apache 2.0

    VLM2Vec-V2.0

    by VLM2Vec

    Compact multimodal embedding for images, videos, and visual documents

    3.9K downloads/month
    ~2B parameters
    Identifiers
    Model ID
    VLM2Vec/VLM2Vec-V2.0
    Feature URI
    mixpeek://image_extractor@v1/vlm2vec_v2_v1

    Overview

    VLM2Vec V2 is a 2B-parameter multimodal embedding model that achieves results competitive with 7B models on the MMEB-V2 benchmark (58.0 overall across 78 tasks). It is built on Qwen2-VL-2B-Instruct with LoRA fine-tuning. The accompanying paper also introduced the MMEB-V2 benchmark itself, which extends evaluation to video retrieval, moment retrieval, and video QA.

    On Mixpeek, VLM2Vec V2 is a strong choice when you need multimodal embeddings at scale without the memory overhead of larger models. At 2B parameters and about 5 GB of GPU memory, it runs on a single consumer GPU while delivering competitive cross-modal retrieval quality.

    Architecture

    The model is Qwen2-VL-2B-Instruct fine-tuned with LoRA. Embeddings use last-token pooling followed by normalization. Training used MMEB-train (2.14M samples) with batch size 1024 for 2K steps at temperature 0.02. For video input, fps and max_pixels are configurable.
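To make the pooling and training terms above concrete, here is a minimal sketch on plain arrays. It assumes the normalization is L2 and that the temperature applies to an InfoNCE-style contrastive loss; the page states only "normalization" and "temperature 0.02". The function names (`lastTokenPool`, `l2Normalize`, `dot`, `infoNceLoss`) are hypothetical and are not part of the model's code or the Mixpeek SDK.

```typescript
// Illustrative helpers only; all names here are hypothetical.

// Last-token pooling: return the hidden state at the last position that the
// attention mask marks as a real token (works for left or right padding).
function lastTokenPool(hiddenStates: number[][], attentionMask: number[]): number[] {
  const lastIndex = attentionMask.lastIndexOf(1);
  const row = lastIndex >= 0 ? hiddenStates[lastIndex] : undefined;
  if (row === undefined) {
    throw new Error("no active token found for pooling");
  }
  return row;
}

// L2 normalization (assumed): scale the vector to unit length.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Dot product; equals cosine similarity when both inputs are unit length.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// InfoNCE loss for one query (assumed objective): negative log-softmax of the
// positive candidate's similarity, with similarities divided by the temperature.
function infoNceLoss(
  query: number[],
  candidates: number[][],
  positiveIndex: number,
  temperature: number = 0.02
): number {
  const logits = candidates.map((c) => dot(query, c) / temperature);
  const positiveLogit = logits[positiveIndex];
  if (positiveLogit === undefined) {
    throw new Error("positiveIndex is out of range");
  }
  // Subtract the max logit before exponentiating for numerical stability.
  const maxLogit = Math.max(...logits);
  const sumExp = logits.reduce((sum, z) => sum + Math.exp(z - maxLogit), 0);
  return maxLogit + Math.log(sumExp) - positiveLogit;
}

// Token 1 is the last real token here, so its hidden state is pooled.
const pooledExample = l2Normalize(lastTokenPool([[1, 0], [3, 4], [9, 9]], [1, 1, 0]));
```

A low temperature such as 0.02 sharpens the softmax: a similarity gap of 1.0 between two candidates becomes a logit gap of 50, so the loss penalizes a wrong top-ranked candidate heavily.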

    Mixpeek SDK Integration

    import { Mixpeek } from "mixpeek";

    // Replace "API_KEY" with your Mixpeek API key.
    const mx = new Mixpeek({ apiKey: "API_KEY" });

    // Ingest a video into the collection and embed it with VLM2Vec-V2.0.
    await mx.collections.ingest({
      collection_id: "video-archive",
      source: { url: "https://example.com/training-video.mp4" },
      feature_extractors: [{
        feature: "multimodal_embedding",
        model: "VLM2Vec/VLM2Vec-V2.0"
      }]
    });

    Capabilities

    • Competitive with 7B models at 2B parameters
    • Image, video, and visual document embeddings
    • Video retrieval, moment retrieval, and video classification
    • Configurable video frame rate and resolution
    • 58.0 overall on MMEB-V2 (78 tasks)

    Use Cases on Mixpeek

    Cost-efficient video embedding: index large video libraries on modest hardware
    Visual document search: find pages in scanned archives by content
    Video moment retrieval: locate specific scenes within long videos
    Hybrid pipelines: lightweight embedding stage before heavier reranking
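The hybrid-pipeline use case above can be sketched as a two-stage search: a cheap similarity pass over stored embeddings, then a heavier scorer applied only to the shortlist. This is a generic illustration, not Mixpeek's retriever API. The types and functions (`IndexedItem`, `ScoredItem`, `retrieveTopK`, `rerank`) and the toy scorer are hypothetical, and the sketch assumes embeddings are already unit-normalized so a dot product acts as cosine similarity.

```typescript
// Hypothetical types for a small in-memory index; not a Mixpeek SDK type.
interface IndexedItem {
  id: string;
  embedding: number[];
}

interface ScoredItem {
  id: string;
  score: number;
}

// Dot product of two equal-length vectors.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// Stage 1: brute-force top-k by similarity to the query embedding.
function retrieveTopK(query: number[], index: IndexedItem[], k: number): ScoredItem[] {
  return index
    .map((item) => ({ id: item.id, score: dotProduct(query, item.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Stage 2: re-score only the shortlist with a caller-supplied (heavier) scorer.
function rerank(
  candidates: ScoredItem[],
  scorer: (id: string) => number,
  topN: number
): ScoredItem[] {
  return candidates
    .map((c) => ({ id: c.id, score: scorer(c.id) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}

const toyIndex: IndexedItem[] = [
  { id: "clip-a", embedding: [1, 0] },
  { id: "clip-b", embedding: [0.6, 0.8] },
  { id: "clip-c", embedding: [0, 1] },
];

// The embedding stage keeps clip-a and clip-b and drops clip-c.
const shortlist = retrieveTopK([1, 0], toyIndex, 2);

// A stand-in reranker that happens to prefer clip-b.
const finalResults = rerank(shortlist, (id) => (id === "clip-b" ? 0.9 : 0.1), 1);
```

The design point is that the expensive scorer runs on k items rather than the whole library, so the embedding stage sets the recall ceiling and the reranker sets the final ordering.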

    Benchmarks

    Dataset                      Metric     Score    Source
    MMEB-V2 (78 tasks)           Overall    58.0     TIGER-Lab, 2025, arXiv:2507.04590
    MMEB-V2 Image (36 tasks)     Hit@1      64.9     TIGER-Lab, 2025, arXiv:2507.04590
    MMEB-V2 VisDoc (24 tasks)    nDCG@5     65.4     TIGER-Lab, 2025, arXiv:2507.04590
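For readers unfamiliar with the two metrics in the table, the sketch below shows how Hit@k and nDCG@k are commonly computed for a single query. It uses the standard textbook definitions with linear gain; the benchmark's own evaluation scripts may differ in details, and the function names (`hitAtK`, `dcgAtK`, `ndcgAtK`) are hypothetical.

```typescript
// Hit@k: 1 if any relevant item appears in the top k results, else 0.
function hitAtK(rankedIds: string[], relevantIds: Set<string>, k: number): number {
  return rankedIds.slice(0, k).some((id) => relevantIds.has(id)) ? 1 : 0;
}

// DCG@k with linear gain: sum of rel_i / log2(i + 1) over 1-based ranks i.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.
// Assumes rankedRelevances holds the relevance of every retrieved item in
// ranked order, and that all relevant items appear somewhere in that list.
function ndcgAtK(rankedRelevances: number[], k: number): number {
  const ideal = rankedRelevances.slice().sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(rankedRelevances, k) / idealDcg;
}

// The only relevant page is ranked second, so Hit@1 is 0 and
// nDCG@5 is 1 / log2(3), about 0.631.
const exampleHit = hitAtK(["page-7", "page-3"], new Set(["page-3"]), 1);
const exampleNdcg = ndcgAtK([0, 1, 0, 0, 0], 5);
```

Per-query values like these are averaged over all queries in a task, and the scores in the table are reported on a 0 to 100 scale.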

    Performance

    Input Size: Variable (image/video/document)
    Embedding Dim: 1536
    GPU Latency: ~12 ms / image (A100)
    GPU Throughput: ~80 items/sec (A100)
    GPU Memory: ~5 GB
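The figures above translate into rough capacity estimates. The sketch below is back-of-the-envelope arithmetic only: it assumes embeddings are stored as 4-byte float32 values with no index overhead or compression, and that the ~80 items/sec throughput is sustained on a single A100. The function names (`embeddingStorageBytes`, `indexingHours`) are hypothetical.

```typescript
// Raw storage for dense vectors: count x dimensions x bytes per value.
// bytesPerValue defaults to 4, which assumes float32 storage.
function embeddingStorageBytes(
  numVectors: number,
  dim: number,
  bytesPerValue: number = 4
): number {
  return numVectors * dim * bytesPerValue;
}

// Wall-clock hours to embed a corpus at a sustained throughput.
function indexingHours(numItems: number, itemsPerSecond: number): number {
  if (itemsPerSecond <= 0) {
    throw new Error("itemsPerSecond must be positive");
  }
  return numItems / itemsPerSecond / 3600;
}

// One million 1536-dim float32 vectors: 6,144,000,000 bytes (about 6.1 GB).
const storageBytes = embeddingStorageBytes(1000000, 1536);

// One million images at ~80 items/sec: 12,500 seconds (about 3.5 hours).
const hours = indexingHours(1000000, 80);
```

The latency and throughput rows are also consistent with each other: ~12 ms per image corresponds to roughly 83 images per second when items are processed one at a time.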

    Specification

    Framework: Hugging Face
    Organization: VLM2Vec
    Feature: Visual Embeddings
    Output: 1536-dim vector
    Modalities: video, image
    Retriever: Vector Search
    Parameters: ~2B
    License: Apache 2.0
    Downloads/mo: 3.9K

    Research Paper

    VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

    arXiv:2507.04590

    Build a pipeline with VLM2Vec-V2.0

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
