Qwen3-VL-Embedding-2B
by Qwen
Unified multimodal embedding for text, image, video, and screenshots
Qwen/Qwen3-VL-Embedding-2B
mixpeek://text_extractor@v1/qwen3_vl_embed_2b_v1
Overview
Qwen3-VL-Embedding-2B is a multimodal embedding model built on the Qwen3-VL architecture that generates semantically rich vectors capturing both visual and textual information in a shared embedding space. It supports Matryoshka Representation Learning for flexible embedding dimensions from 64 to 2048, retaining over 92% of peak performance even at 64 dimensions.
On Mixpeek, Qwen3-VL-Embedding-2B enables true cross-modal retrieval where users can search across images, videos, screenshots, and text documents using any modality as the query. This makes it ideal for building unified search over heterogeneous content libraries.
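Because all modalities land in one embedding space, cross-modal search can be pictured as ranking stored vectors by their similarity to the query vector, whatever modality each came from. The following is a minimal sketch of that idea using cosine similarity. It assumes the embeddings have already been computed; `IndexedItem`, `cosineSimilarity`, and `rankBySimilarity` are hypothetical names for illustration and are not part of the Mixpeek SDK, and the three-dimensional vectors are toy values.

```typescript
type Modality = "text" | "image" | "video" | "screenshot";

// Hypothetical record shape: one stored embedding plus where it came from.
interface IndexedItem {
  id: string;
  modality: Modality;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embeddings must have the same dimensionality");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors match nothing
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every item against the query and return the best `topK`,
// regardless of which modality each item came from.
function rankBySimilarity(query: number[], items: IndexedItem[], topK: number) {
  return items
    .map((item) => ({
      id: item.id,
      modality: item.modality,
      score: cosineSimilarity(query, item.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy index mixing modalities; real embeddings would have 64 to 2048 dimensions.
const demoIndex: IndexedItem[] = [
  { id: "photo-1", modality: "image", embedding: [0.9, 0.1, 0.0] },
  { id: "doc-1", modality: "text", embedding: [0.0, 1.0, 0.0] },
  { id: "clip-1", modality: "video", embedding: [0.7, 0.7, 0.0] },
];

// A query embedding (for example, from a text query) ranked against all items.
console.log(rankBySimilarity([1, 0, 0], demoIndex, 2));
```

A production system would normally delegate this ranking to a vector index rather than scanning every item, but the scoring idea is the same.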
Architecture
The model is built on the Qwen3-VL 2B backbone and trained in multiple stages: large-scale contrastive pre-training followed by distillation from a reranking model. It supports Matryoshka Representation Learning for flexible output dimensions (64 to 2048) and handles inputs of up to 32K tokens, including text, images, and video.
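With Matryoshka-style embeddings, a smaller vector is commonly obtained by keeping the leading components of the full vector and re-normalizing the result. The sketch below illustrates that convention; it assumes the model follows the usual practice of packing the most information into the first dimensions, and `truncateEmbedding` is a hypothetical helper, not a function from the Mixpeek SDK or the Qwen release.

```typescript
// Keep the first `dims` components of an embedding and L2-normalize them,
// so cosine or dot-product comparisons stay well scaled after truncation.
function truncateEmbedding(vec: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims < 1 || dims > vec.length) {
    throw new RangeError(`dims must be an integer between 1 and ${vec.length}`);
  }
  const head = vec.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return head; // nothing to normalize
  return head.map((x) => x / norm);
}

// Toy example: a 4-dimensional vector cut down to 2 dimensions.
const shortened = truncateEmbedding([3, 4, 0, 0], 2);
console.log(shortened.length); // 2
```

Query and document vectors need to be truncated to the same size before they are compared.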
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Ingest a video and embed it with Qwen3-VL-Embedding-2B.
await mx.collections.ingest({
  collection_id: "my-collection",
  source: { url: "https://example.com/video.mp4" },
  feature_extractors: [
    {
      name: "multimodal_embedding",
      version: "v1",
      params: {
        model_id: "Qwen/Qwen3-VL-Embedding-2B"
      }
    }
  ]
});
Capabilities
- Unified embeddings across text, image, video, and screenshot inputs
- 2048-dimensional embeddings with Matryoshka flexibility (64-2048)
- Cross-modal retrieval: search images with text, text with images
- Retains 92%+ performance at 64 dimensions (32x compression)
- 30+ language support inherited from Qwen3-VL
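The 32x figure in the list above follows directly from the dimension ratio (2048 / 64 = 32). As a rough illustration of what that means for storage, the sketch below computes raw vector sizes under the assumption that each component is stored as a 32-bit float; the storage format is an assumption made here, not something the source specifies, and `embeddingStorageBytes` is a hypothetical helper.

```typescript
// Assumption: each embedding component is stored as a 4-byte float32.
const BYTES_PER_FLOAT32 = 4;

// Raw storage for `vectorCount` embeddings of `dims` dimensions,
// ignoring index structures and metadata.
function embeddingStorageBytes(dims: number, vectorCount: number): number {
  return dims * BYTES_PER_FLOAT32 * vectorCount;
}

const fullSizeBytes = embeddingStorageBytes(2048, 1000000); // 8,192,000,000 bytes
const compactSizeBytes = embeddingStorageBytes(64, 1000000); // 256,000,000 bytes

console.log(fullSizeBytes / compactSizeBytes); // 32
```

Under that assumption, one million full-size vectors take about 8.2 GB of raw storage, while the 64-dimensional versions take about 0.26 GB.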
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| MMEB-V2 | Overall Score | ~72 (2B variant) | Qwen3-VL-Embedding paper, Jan 2026 |
| Image-text retrieval | Recall@10 | Competitive with 8B variant | Qwen3-VL-Embedding paper, Jan 2026 |
Research Paper
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for Multimodal Retrieval
arxiv.org
Build a pipeline with Qwen3-VL-Embedding-2B
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.