Qwen3-VL-Embedding-2B

by Qwen

Unified multimodal embedding for text, image, video, and screenshots

1.3M downloads/month

434 likes

2.1B params


Identifiers

Model ID

Qwen/Qwen3-VL-Embedding-2B

Feature URI

mixpeek://text_extractor@v1/qwen3_vl_embed_2b_v1

Overview

Qwen3-VL-Embedding-2B is a multimodal embedding model built on the Qwen3-VL architecture that generates semantically rich vectors capturing both visual and textual information in a shared embedding space. It supports Matryoshka Representation Learning for flexible embedding dimensions from 64 to 2048, retaining over 92% of peak performance even at 64 dimensions.
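As a rough illustration of how Matryoshka-style embeddings are commonly used, the sketch below keeps only the leading components of a full vector and re-normalizes the result. It assumes the embedding arrives as a plain `number[]`; the helper name `truncateEmbedding` and the synthetic input vector are hypothetical and are not part of the Mixpeek SDK or the model's own tooling.

```typescript
// Truncate a Matryoshka embedding to its leading `dims` components and
// L2-normalize the result so cosine similarity remains meaningful.
// Hypothetical helper for illustration; not a Mixpeek or Qwen API.
function truncateEmbedding(embedding: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims < 1 || dims > embedding.length) {
    throw new RangeError(`dims must be an integer in [1, ${embedding.length}]`);
  }
  const head = embedding.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  // A zero vector cannot be normalized; return it unchanged.
  return norm === 0 ? head : head.map((x) => x / norm);
}

// Synthetic stand-in for a 2048-dimensional output (not real model output).
const full: number[] = Array.from({ length: 2048 }, (_, i) => Math.sin(i + 1));
const compact = truncateEmbedding(full, 64);
console.log(compact.length); // 64
```

Queries and indexed items need to be truncated to the same dimension before they are compared.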

On Mixpeek, Qwen3-VL-Embedding-2B enables true cross-modal retrieval where users can search across images, videos, screenshots, and text documents using any modality as the query. This makes it ideal for building unified search over heterogeneous content libraries.
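To make the idea of a shared embedding space concrete, here is a minimal sketch of cross-modal ranking by cosine similarity. The `IndexedItem` type, the `search` function, and the toy 3-dimensional vectors are all assumptions made for illustration; in practice the vectors would come from the model, and Mixpeek's managed retrieval may work differently.

```typescript
type Modality = "text" | "image" | "video" | "screenshot";

interface IndexedItem {
  id: string;
  modality: Modality;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Rank every item against the query, regardless of modality, and keep the top K.
function search(query: number[], items: IndexedItem[], topK: number) {
  return items
    .map((item) => ({
      id: item.id,
      modality: item.modality,
      score: cosineSimilarity(query, item.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy vectors standing in for real embeddings of mixed-modality content.
const library: IndexedItem[] = [
  { id: "doc-1", modality: "text", embedding: [0.9, 0.1, 0.0] },
  { id: "img-1", modality: "image", embedding: [0.8, 0.2, 0.1] },
  { id: "vid-1", modality: "video", embedding: [0.0, 0.1, 0.9] },
];

const textQuery = [1, 0, 0]; // stand-in for an embedded text query
const results = search(textQuery, library, 2);
console.log(results.map((r) => r.id)); // doc-1 first, then img-1
```

Because all modalities share one vector space, the same ranking function serves text, image, and video items without any per-modality branching.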

Architecture

The model is built on the Qwen3-VL 2B backbone and trained in multiple stages: large-scale contrastive pre-training followed by distillation from a reranking model. It supports Matryoshka Representation Learning for flexible output dimensions (64 to 2048) and handles inputs of up to 32K tokens, including text, images, and video.

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "multimodal_embedding",
    version: "v1",
    parameters: { model_id: "Qwen/Qwen3-VL-Embedding-2B" },
  },
});

Capabilities

Unified embeddings across text, image, video, and screenshot inputs
2048-dimensional embeddings with Matryoshka flexibility (64-2048)
Cross-modal retrieval: search images with text, text with images
Retains 92%+ performance at 64 dimensions (32x compression)
Support for 30+ languages, inherited from Qwen3-VL
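
The 32x compression figure follows directly from the dimension ratio (2048 / 64). A small sketch of the storage arithmetic, assuming vectors are stored as 32-bit floats and using a hypothetical corpus of one million vectors (index overhead and metadata are ignored):

```typescript
const BYTES_PER_FLOAT32 = 4; // assumption: vectors stored as float32

// Raw vector storage in bytes, ignoring index overhead and metadata.
function indexSizeBytes(numVectors: number, dims: number): number {
  return numVectors * dims * BYTES_PER_FLOAT32;
}

const numVectors = 1_000_000; // hypothetical corpus size
const fullSize = indexSizeBytes(numVectors, 2048); // 8,192,000,000 bytes
const compactSize = indexSizeBytes(numVectors, 64); // 256,000,000 bytes
const ratio = fullSize / compactSize;
console.log(ratio); // 32
```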

Use Cases on Mixpeek

Cross-modal search across mixed media libraries with text, image, and video content

Visual document retrieval for screenshot and infographic search

Video-text matching for content discovery across large video catalogs

Benchmarks

Dataset	Metric	Score	Source
MMEB-V2	Overall Score	~72 (2B variant)	Qwen3-VL-Embedding paper, Jan 2026
Image-text retrieval	Recall@10	Competitive with 8B variant	Qwen3-VL-Embedding paper, Jan 2026
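
For readers unfamiliar with the Recall@10 metric, the sketch below shows one common definition: the fraction of queries for which at least one relevant item appears among the top K results. This is an assumption for illustration; the cited paper may compute the metric differently, and the function name and sample data are hypothetical.

```typescript
// Recall@K under a common definition: a query counts as a hit when at least
// one of its relevant ids appears in its top-K ranked results.
function recallAtK(
  rankedIds: string[][],
  relevantIds: Set<string>[],
  k: number,
): number {
  if (rankedIds.length !== relevantIds.length) {
    throw new Error("one relevance set is required per query");
  }
  if (rankedIds.length === 0) return 0;
  let hits = 0;
  rankedIds.forEach((ranking, q) => {
    const relevant = relevantIds[q];
    if (ranking.slice(0, k).some((id) => relevant.has(id))) hits++;
  });
  return hits / rankedIds.length;
}

// Two made-up queries: the first finds its relevant item in the top 2, the second does not.
const rankings = [
  ["img-3", "img-7", "img-1"],
  ["img-2", "img-5", "img-9"],
];
const relevance = [new Set(["img-7"]), new Set(["img-9"])];
console.log(recallAtK(rankings, relevance, 2)); // 0.5
```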