Qwen3-VL-4B-Instruct
by Qwen
Best-in-class 4B vision-language model with 256K context and 32-language OCR
Qwen/Qwen3-VL-4B-Instruct
mixpeek://image_extractor@v1/qwen3_vl_4b_v1
Overview
Qwen3-VL-4B-Instruct is a dense 4.4B-parameter vision-language model built from three modules: a vision encoder, an MLP-based vision-language merger, and an LLM decoder. It supports a 256K-token context window that can be extended to 1M, OCR in 32 languages, and native temporal reasoning over video. On document understanding it scores 95.3% on DocVQA and 88.1% on OCRBench.
On Mixpeek, Qwen3-VL-4B powers scene captioning, visual question answering, and document understanding. At 4B parameters it suits pipelines that need both visual and text comprehension at a favorable quality-to-cost ratio.
Architecture
Dense transformer with 36 layers and 4.44B parameters, using grouped-query attention (GQA) with 32 query heads and 8 key-value heads. The design has three modules: a vision encoder, an MLP vision-language merger, and an LLM decoder. It uses Interleaved-MRoPE for video temporal reasoning, DeepStack for multi-level ViT feature fusion, and Text-Timestamp Alignment for event localization.
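To make the 32/8 grouped-query attention layout concrete, the sketch below maps each query head to the key-value head it shares. It is illustrative only: the constant and function names are hypothetical, and the contiguous grouping (query heads 0-3 share KV head 0, and so on) is an assumption about how the heads are laid out, not something stated in this page.

```typescript
// Head counts taken from the "GQA 32/8" figure in the architecture summary.
const NUM_QUERY_HEADS = 32;
const NUM_KV_HEADS = 8;

// Number of query heads that share a single key-value head.
const GROUP_SIZE = NUM_QUERY_HEADS / NUM_KV_HEADS;

// Map a query-head index to the KV head it attends with.
// Assumption: heads are grouped contiguously.
function kvHeadFor(queryHead: number): number {
  if (
    !Number.isInteger(queryHead) ||
    queryHead < 0 ||
    queryHead >= NUM_QUERY_HEADS
  ) {
    throw new RangeError(
      `query head must be an integer in [0, ${NUM_QUERY_HEADS})`
    );
  }
  return Math.floor(queryHead / GROUP_SIZE);
}

// Fraction of KV-cache entries needed compared with giving every
// query head its own key-value head.
const kvCacheRatio = NUM_KV_HEADS / NUM_QUERY_HEADS;
```

With these head counts, four query heads share each key-value head, so the key-value cache holds a quarter of the entries that one key-value head per query head would require.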
Mixpeek SDK Integration
import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "Qwen/Qwen3-VL-4B-Instruct" },
  },
});
Capabilities
- 256K-1M context window
- 32-language OCR and document understanding
- Native video temporal reasoning with timestamp alignment
- 95.3% DocVQA, 88.1% OCRBench
- Apache 2.0 license
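As a rough illustration of what the 256K context window means in practice, the sketch below checks whether an input plus a reserved output budget fits. It rests on assumptions: "256K" is taken to mean 262,144 tokens, the helper name `checkContextBudget` is hypothetical, and the caller is expected to supply token counts (text plus visual tokens) from whatever tokenizer or processor the deployment uses.

```typescript
// Assumption: "256K" is read as 256 * 1024 = 262,144 tokens.
// Confirm against the deployed model's configuration.
const NATIVE_CONTEXT_TOKENS = 256 * 1024;

interface ContextBudget {
  fits: boolean;      // true when input plus reserved output fit
  remaining: number;  // tokens left over (negative when over budget)
}

// inputTokens: text and visual tokens already counted by the caller.
// maxOutputTokens: tokens reserved for the model's response.
function checkContextBudget(
  inputTokens: number,
  maxOutputTokens: number,
  contextWindow: number = NATIVE_CONTEXT_TOKENS
): ContextBudget {
  if (inputTokens < 0 || maxOutputTokens < 0) {
    throw new RangeError("token counts must be non-negative");
  }
  const remaining = contextWindow - inputTokens - maxOutputTokens;
  return { fits: remaining >= 0, remaining };
}

// Example: a long document of 200,000 tokens with 4,096 reserved for output.
const budget = checkContextBudget(200_000, 4_096);
```

A larger `contextWindow` can be passed for deployments that run with an extended context; the default reflects only the native figure.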
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| DocVQA (test) | Accuracy | 95.3% | Qwen, 2025 — Qwen3-VL Technical Report |
| OCRBench | Score | 88.1% | Qwen, 2025 — Qwen3-VL Technical Report |
| MMBench-V1.1 | Score | 85.1% | Qwen, 2025 — Qwen3-VL Technical Report |
Research Paper
Qwen3-VL Technical Report
arxiv.org
Build a pipeline with Qwen3-VL-4B-Instruct
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.