VLM2Vec-V2.0
by VLM2Vec
Compact multimodal embedding for images, videos, and visual documents
VLM2Vec/VLM2Vec-V2.0
mixpeek://image_extractor@v1/vlm2vec_v2_v1
Overview
VLM2Vec V2 is a 2B-parameter multimodal embedding model that achieves results competitive with 7B models on the MMEB-V2 benchmark. It is built on Qwen2-VL-2B-Instruct with LoRA fine-tuning. The same work introduced the MMEB-V2 benchmark, which extends evaluation to video retrieval, moment retrieval, and video QA.
On Mixpeek, VLM2Vec V2 is the best choice when you need multimodal embeddings at scale without the memory overhead of larger models. At 2B parameters, it runs on a single consumer GPU while delivering competitive cross-modal retrieval quality.
Architecture
The backbone is Qwen2-VL-2B-Instruct with LoRA fine-tuning. Embeddings are produced by last-token pooling followed by normalization. The model was trained on MMEB-train (2.14M samples) with a batch size of 1024 for 2K steps at a temperature of 0.02. For video input, fps and max_pixels are configurable.
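The pooling and temperature settings above can be illustrated with a minimal sketch. It uses plain arrays in place of model tensors, and every function name here is hypothetical. It also assumes that the temperature divides similarities inside an InfoNCE-style contrastive loss and that "normalization" means L2 normalization, neither of which this page spells out.

```typescript
// Illustrative only: plain number arrays stand in for model tensors,
// and all helper names are hypothetical.

/** Select the hidden state of the last real token (attentionMask: 1 = real, 0 = padding). */
function lastTokenPool(hiddenStates: number[][], attentionMask: number[]): number[] {
  const lastIndex = attentionMask.lastIndexOf(1);
  if (lastIndex < 0) throw new Error("sequence has no real tokens");
  return hiddenStates[lastIndex];
}

/** Scale a vector to unit L2 norm (assumed meaning of "normalization"). */
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

/** Dot product; equals cosine similarity when both inputs are unit vectors. */
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

/**
 * InfoNCE-style loss for one query (assumed form): negative log-softmax of the
 * positive candidate's similarity, with similarities divided by the temperature.
 */
function contrastiveLoss(
  query: number[],
  candidates: number[][],
  positiveIndex: number,
  temperature = 0.02
): number {
  const logits = candidates.map((c) => dot(query, c) / temperature);
  const maxLogit = Math.max(...logits); // subtract the max for numerical stability
  const sumExp = logits.reduce((s, z) => s + Math.exp(z - maxLogit), 0);
  return maxLogit + Math.log(sumExp) - logits[positiveIndex];
}
```

A low temperature such as 0.02 sharpens the softmax: a cosine gap of 0.1 between two candidates becomes a logit gap of 5, so near-miss negatives are penalized heavily.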
Mixpeek SDK Integration
    import { Mixpeek } from "mixpeek";

    const mx = new Mixpeek({ apiKey: "API_KEY" });

    await mx.collections.ingest({
      collection_id: "video-archive",
      source: { url: "https://example.com/training-video.mp4" },
      feature_extractors: [
        {
          feature: "multimodal_embedding",
          model: "VLM2Vec/VLM2Vec-V2.0"
        }
      ]
    });
Capabilities
- Competitive with 7B models at 2B parameters
- Image, video, and visual document embeddings
- Video retrieval, moment retrieval, and video classification
- Configurable video frame rate and resolution
- 58.0 overall on MMEB-V2 (78 tasks)
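To make the configurable frame rate and resolution concrete, here is a simplified sketch of what the fps and max_pixels settings control. Both function names are hypothetical, and the logic is an assumption about the general idea, not the model's actual preprocessing. The real preprocessor may also snap dimensions to patch-size multiples, which this sketch ignores.

```typescript
// Illustrative only: a simplified view of fps-based frame sampling and a
// max_pixels budget. Function names are hypothetical.

/** Timestamps (in seconds) of frames sampled at a target fps from a clip. */
function sampleTimestamps(durationSec: number, fps: number): number[] {
  if (durationSec <= 0 || fps <= 0) return [];
  const count = Math.max(1, Math.floor(durationSec * fps));
  return Array.from({ length: count }, (_, i) => i / fps);
}

/** Downscale (never upscale) so width * height fits the pixel budget, keeping aspect ratio. */
function fitToMaxPixels(
  width: number,
  height: number,
  maxPixels: number
): { width: number; height: number } {
  const pixels = width * height;
  if (pixels <= maxPixels) return { width, height };
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}
```

Lower fps and smaller max_pixels both reduce the number of visual tokens per video, trading retrieval detail for memory and throughput.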
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| MMEB-V2 (78 tasks) | Overall | 58.0 | TIGER-Lab, 2025, arXiv:2507.04590 |
| MMEB-V2 Image (36 tasks) | Hit@1 | 64.9 | TIGER-Lab, 2025, arXiv:2507.04590 |
| MMEB-V2 VisDoc (24 tasks) | nDCG@5 | 65.4 | TIGER-Lab, 2025, arXiv:2507.04590 |
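For readers unfamiliar with the two metrics in the table, the sketch below computes Hit@1 and nDCG@5 for a single query. Function names are hypothetical. The nDCG here uses the common exponential-gain form and derives the ideal ranking from the retrieved list alone, which assumes every relevant item appears in that list. The benchmark's own evaluation code may differ in these details.

```typescript
// Illustrative only: per-query metric definitions with hypothetical names.
// Input is the relevance grade of each retrieved item, in ranked order.

/** Hit@1: 1 if the top-ranked item is relevant, otherwise 0. */
function hitAt1(rankedRelevance: number[]): number {
  return rankedRelevance.length > 0 && rankedRelevance[0] > 0 ? 1 : 0;
}

/** Discounted cumulative gain over the top k items (exponential gain, log2 discount). */
function dcgAtK(relevance: number[], k: number): number {
  return relevance
    .slice(0, k)
    .reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}

/** nDCG@k: DCG of the actual ranking divided by DCG of the ideal ordering. */
function ndcgAtK(rankedRelevance: number[], k: number): number {
  const ideal = [...rankedRelevance].sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(rankedRelevance, k) / idealDcg;
}
```

With binary relevance, a single relevant item at rank 2 gives Hit@1 = 0 and nDCG@5 = 1 / log2(3), roughly 0.63. Benchmark scores are typically these per-query values averaged and reported on a 0 to 100 scale.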
Research Paper
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
arxiv.org
Build a pipeline with VLM2Vec-V2.0
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
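As a rough picture of what a retrieval stage does with these embeddings, the sketch below ranks stored items by cosine similarity to a query embedding. It is a brute-force, in-memory illustration with hypothetical names (`IndexedItem`, `cosineSimilarity`, `topK`). It does not use or describe the Mixpeek retrieval API.

```typescript
// Illustrative only: brute-force nearest-neighbour search over embeddings.
// All names are hypothetical and unrelated to any Mixpeek API.

interface IndexedItem {
  id: string;
  embedding: number[];
}

/** Cosine similarity between two vectors of equal length. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Return the k items most similar to the query, best first. */
function topK(
  query: number[],
  items: IndexedItem[],
  k: number
): { id: string; score: number }[] {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Because image, video, and document embeddings share one vector space, the same ranking step can serve text-to-video or image-to-document queries. A production system would usually replace the linear scan with an approximate nearest-neighbour index.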