Qwen3-VL-Embedding-8B

by Qwen

#1 multimodal embedding model on MMEB-V2: unified text, image, screenshot, and video retrieval

1.7M downloads/month

459 likes

8.1B params


Identifiers

Model ID

Qwen/Qwen3-VL-Embedding-8B

Feature URI

mixpeek://text_extractor@v1/qwen3_vl_embed_8b_v1

Overview

Qwen3-VL-Embedding-8B is a unified multimodal embedding model that projects text, images, screenshots, and video into a shared vector space. On MMEB-V2, a broad multimodal retrieval benchmark, it reports the top overall score (77.9) and scores 83.3 on visual document retrieval, which makes it a strong general-purpose choice for multimodal embedding.

Built on the Qwen3-VL vision-language backbone, it supports Matryoshka flexible dimensionality (64 to 4096), 32K context windows, and 30+ languages. On Mixpeek, it powers cross-modal retrieval where a text query can match images, screenshots, video frames, or documents in a single vector search pass.
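
Because every modality lands in the same vector space, cross-modal retrieval reduces to a nearest-neighbor search over one index. The following is a minimal sketch of that idea, not Mixpeek's implementation: the `IndexedItem` type, the `searchSharedSpace` function, and the toy 3-dimensional vectors are all hypothetical, and a production system would use an approximate nearest-neighbor index rather than a linear scan.

```typescript
// Hypothetical item type: any modality, embedded into the same vector space.
type Modality = "text" | "image" | "screenshot" | "video_frame";

interface IndexedItem {
  id: string;
  modality: Modality;
  vector: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("dimension mismatch");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // treat a zero vector as matching nothing
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank every item against the query embedding, regardless of modality.
function searchSharedSpace(
  query: number[],
  items: IndexedItem[],
  topK: number,
): Array<{ item: IndexedItem; score: number }> {
  return items
    .map((item) => ({ item, score: cosineSimilarity(query, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy vectors standing in for real embeddings (assumption for illustration).
const sharedIndex: IndexedItem[] = [
  { id: "img_1", modality: "image", vector: [1, 0, 0] },
  { id: "doc_1", modality: "screenshot", vector: [0, 1, 0] },
  { id: "vid_1", modality: "video_frame", vector: [0.9, 0.1, 0] },
];

// A text query embedding that points in the same direction as img_1.
const crossModalResults = searchSharedSpace([1, 0, 0], sharedIndex, 2);
console.log(crossModalResults.map((r) => `${r.item.id} (${r.item.modality})`));
```

In this toy index the image and the video frame outrank the screenshot for the same query vector, which is the single-pass behavior described above.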

Architecture

The model uses the Qwen3-VL vision-language backbone (8B parameters) with shared projection heads for the text, image, and video modalities. It is trained with Matryoshka Representation Learning, so embeddings can be used at flexible dimensions from 64 to 4096, and it supports interleaved text-image input sequences of up to 32K tokens.
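
With Matryoshka-trained embeddings, a common client-side practice is to keep the first `d` components of the full vector and re-normalize to unit length. The sketch below illustrates that practice under stated assumptions: `truncateEmbedding` is a hypothetical helper, the 4-component vector stands in for a real 4096-dimensional embedding, and whether the model or Mixpeek performs this truncation for you is not stated in this page.

```typescript
// Keep the first `dim` components of a Matryoshka embedding and re-normalize.
// Hypothetical helper for illustration; not part of any SDK.
function truncateEmbedding(full: number[], dim: number): number[] {
  if (!Number.isInteger(dim) || dim < 1 || dim > full.length) {
    throw new RangeError(`dim must be an integer in [1, ${full.length}]`);
  }
  const head = full.slice(0, dim);
  // L2 norm of the truncated prefix.
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  // Leave an all-zero prefix untouched to avoid dividing by zero.
  return norm === 0 ? head : head.map((x) => x / norm);
}

// Toy "full" embedding (assumption: real vectors would have up to 4096 components).
const fullEmbedding = [3, 4, 12, 84];
const shortEmbedding = truncateEmbedding(fullEmbedding, 2);
console.log(shortEmbedding); // unit-length vector built from the first two components
```

Queries and indexed items should be truncated to the same dimension so that their similarity scores remain comparable.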

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "image_embedding",
    version: "v1",
    parameters: { model_id: "Qwen/Qwen3-VL-Embedding-8B" },
  },
});

Capabilities

Unified embeddings across text, images, video, and screenshots
Matryoshka flexible dimensionality (64–4096)
32K context window for long documents and multi-frame video
30+ language support including CJK
#1 on MMEB-V2 multimodal retrieval benchmark
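
The 64 to 4096 dimension range is mainly a storage and latency trade-off. As a rough illustration, the sketch below estimates raw vector storage, assuming 4-byte float32 components and ignoring index overhead and metadata. Both assumptions, and the `vectorStorageBytes` helper itself, are hypothetical rather than taken from this page.

```typescript
// Raw storage for `count` vectors of `dim` components each.
// Assumes float32 (4 bytes per component) unless told otherwise.
function vectorStorageBytes(
  count: number,
  dim: number,
  bytesPerComponent: number = 4,
): number {
  return count * dim * bytesPerComponent;
}

const oneMillion = 1_000_000;
const bytesAt4096 = vectorStorageBytes(oneMillion, 4096); // 16,384,000,000 bytes
const bytesAt64 = vectorStorageBytes(oneMillion, 64); // 256,000,000 bytes

console.log(`4096-d: ${(bytesAt4096 / 1e9).toFixed(3)} GB`);
console.log(`64-d:   ${(bytesAt64 / 1e9).toFixed(3)} GB`);
console.log(`ratio:  ${bytesAt4096 / bytesAt64}x`);
```

Under these assumptions, one million vectors take about 16.4 GB at 4096 dimensions and about 0.26 GB at 64 dimensions, a 64x difference to weigh against retrieval quality.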

Use Cases on Mixpeek

Cross-modal search: find images by text description or text by image query

Visual document retrieval: search PDFs, slides, and screenshots by content

Video retrieval: embed and search video frames alongside transcripts

Multilingual multimodal search across mixed-language media libraries

Benchmarks

Dataset	Metric	Score	Source
MMEB-V2 (overall)	Score	77.9	Qwen, 2026 — MMEB-V2 Leaderboard
MMEB-V2 (visual doc retrieval)	Score	83.3	Qwen, 2026 — MMEB-V2 Leaderboard
MTEB Multilingual	Score	70.58	Qwen, 2026 — Model Card