VLM2Vec-V2.0

by VLM2Vec

Compact multimodal embedding for images, videos, and visual documents

3.9K downloads/month

~2B params


Identifiers

Model ID

VLM2Vec/VLM2Vec-V2.0

Feature URI

mixpeek://image_extractor@v1/vlm2vec_v2_v1

Overview

VLM2Vec V2 is a 2B-parameter multimodal embedding model whose results on the MMEB-V2 benchmark are competitive with 7B models. It is built on Qwen2-VL-2B-Instruct with LoRA fine-tuning. The same work introduced the MMEB-V2 benchmark, which extends evaluation to video retrieval, moment retrieval, and video QA.

On Mixpeek, VLM2Vec V2 is a strong fit when you need multimodal embeddings at scale without the memory overhead of larger models. At 2B parameters, it runs on a single consumer GPU while keeping cross-modal retrieval quality competitive with larger models.

Architecture

The backbone is Qwen2-VL-2B-Instruct with LoRA fine-tuning. Embeddings are produced by last-token pooling followed by normalization. The model was trained on MMEB-train (2.14M samples) with batch size 1024 for 2K steps at a temperature of 0.02. For video input, fps and max_pixels are configurable.
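
The pooling and temperature settings above can be illustrated with a small sketch. It assumes per-token hidden states are already available as plain arrays, that "normalization" means L2 normalization, and that the temperature divides similarity scores inside a softmax, as is common in contrastive training. The function names are hypothetical and are not part of VLM2Vec or the Mixpeek SDK.

```typescript
// Illustrative sketch only; names are hypothetical, not the VLM2Vec code.

/** Return the hidden state of the last token that the attention mask marks as real (1). */
function lastTokenPool(hiddenStates: number[][], attentionMask: number[]): number[] {
  const lastIndex = attentionMask.lastIndexOf(1);
  if (lastIndex < 0) throw new Error("attention mask contains no valid tokens");
  return hiddenStates[lastIndex].slice();
}

/** Scale a vector to unit length (assumed meaning of "normalization"). */
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

/**
 * Softmax over query-candidate similarities divided by a temperature.
 * A small temperature such as 0.02 sharpens the distribution.
 */
function contrastiveProbabilities(
  query: number[],
  candidates: number[][],
  temperature = 0.02
): number[] {
  const logits = candidates.map((c) => dot(query, c) / temperature);
  const maxLogit = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((l) => Math.exp(l - maxLogit));
  const total = exps.reduce((sum, x) => sum + x, 0);
  return exps.map((e) => e / total);
}
```

With unit-length vectors the dot product equals cosine similarity, so a temperature of 0.02 turns a similarity gap of 0.1 into a logit gap of 5.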

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// Replace "API_KEY" with your own key; avoid committing real keys to source control.
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest a video and request VLM2Vec V2 embeddings for it.
await mx.collections.ingest({
  collection_id: "video-archive",
  source: { url: "https://example.com/training-video.mp4" },
  feature_extractors: [{
    feature: "multimodal_embedding",
    model: "VLM2Vec/VLM2Vec-V2.0"
  }]
});

Capabilities

Competitive with 7B models at 2B parameters
Image, video, and visual document embeddings
Video retrieval, moment retrieval, and video classification
Configurable video frame rate and resolution
58.0 overall on MMEB-V2 (78 tasks)
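
To make the frame rate and resolution settings concrete, the sketch below shows one plausible reading of them: fps controls how many frame timestamps are sampled from a clip, and max_pixels caps the area of each frame while preserving aspect ratio. Both helpers are hypothetical illustrations. The model's actual preprocessor may sample or round dimensions differently, for example to fit its patch grid.

```typescript
// Illustrative sketch only; the real preprocessing may differ.

/** Uniformly spaced timestamps (in seconds) for a clip sampled at `fps` frames per second. */
function sampleTimestamps(durationSec: number, fps: number): number[] {
  if (durationSec <= 0 || fps <= 0) throw new Error("duration and fps must be positive");
  const frameCount = Math.max(1, Math.floor(durationSec * fps));
  const timestamps: number[] = [];
  for (let i = 0; i < frameCount; i++) {
    timestamps.push(i / fps); // computed from the index to avoid accumulated rounding error
  }
  return timestamps;
}

/** Shrink a frame, keeping aspect ratio, so that width * height stays within `maxPixels`. */
function fitToPixelBudget(
  width: number,
  height: number,
  maxPixels: number
): { width: number; height: number } {
  const pixels = width * height;
  if (pixels <= maxPixels) return { width, height };
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}
```

Lowering either setting reduces the number of visual tokens per video, which trades detail for throughput and memory.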

Use Cases on Mixpeek

Cost-efficient video embedding: index large video libraries on modest hardware

Visual document search: find pages in scanned archives by content

Video moment retrieval: locate specific scenes within long videos

Hybrid pipelines: lightweight embedding stage before heavier reranking
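
The hybrid pipeline idea can be sketched as a two-stage search: a cheap cosine-similarity pass over stored embeddings narrows the corpus to k candidates, and a more expensive scorer reorders only those. This is a minimal sketch under assumptions: embeddings are in-memory arrays with nonzero norm, and `rerankScore` is a hypothetical callback standing in for whatever heavier model is used. None of these names come from the Mixpeek SDK.

```typescript
// Illustrative sketch only; all names are hypothetical.

interface IndexedItem { id: string; embedding: number[]; }
interface ScoredItem { id: string; score: number; }

/** Cosine similarity between two nonzero vectors of equal length. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Stage 1: keep the k items whose embeddings are closest to the query. */
function embeddingTopK(query: number[], corpus: IndexedItem[], k: number): ScoredItem[] {
  return corpus
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

/** Stage 2: rescore only the stage-1 survivors with a heavier scorer. */
function retrieveThenRerank(
  query: number[],
  corpus: IndexedItem[],
  k: number,
  rerankScore: (id: string) => number
): ScoredItem[] {
  return embeddingTopK(query, corpus, k)
    .map((candidate) => ({ id: candidate.id, score: rerankScore(candidate.id) }))
    .sort((x, y) => y.score - x.score);
}
```

The reranker runs k times per query regardless of corpus size, so the embedding stage bounds the cost of the heavier model.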

Benchmarks

Dataset	Metric	Score	Source
MMEB-V2 (78 tasks)	Overall	58.0	TIGER-Lab, 2025 — arxiv:2507.04590
MMEB-V2 Image (36 tasks)	Hit@1	64.9	TIGER-Lab, 2025 — arxiv:2507.04590
MMEB-V2 VisDoc (24 tasks)	nDCG@5	65.4	TIGER-Lab, 2025 — arxiv:2507.04590
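
For readers unfamiliar with the metrics in the table, the sketch below computes Hit@1 and nDCG@k for a single query. It assumes the standard definitions with linear gain and a log2 position discount, and it assumes the relevance list covers the full ranking. The benchmark's own evaluation code may differ in detail, and the function names are hypothetical.

```typescript
// Illustrative sketch only; not the benchmark's evaluation code.

/** Hit@1: 1 if the top-ranked item is relevant, otherwise 0. */
function hitAt1(rankedIds: string[], relevantIds: string[]): number {
  if (rankedIds.length === 0) return 0;
  return relevantIds.indexOf(rankedIds[0]) >= 0 ? 1 : 0;
}

/** Discounted cumulative gain over the first k positions (linear gain). */
function dcgAtK(relevance: number[], k: number): number {
  return relevance
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

/**
 * nDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking.
 * `rankedRelevance[i]` is the relevance grade of the item at rank i.
 */
function ndcgAtK(rankedRelevance: number[], k: number): number {
  const ideal = rankedRelevance.slice().sort((x, y) => y - x);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(rankedRelevance, k) / idealDcg;
}
```

Benchmark scores such as those in the table are averages of per-query values like these across each task, typically reported as percentages.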