Qwen3-VL-4B-Instruct
by Qwen
Best-in-class 4B vision-language model with 256K context and 32-language OCR
Qwen/Qwen3-VL-4B-Instruct
mixpeek://image_extractor@v1/qwen3_vl_4b_v1
Overview
Qwen3-VL-4B-Instruct is a dense 4.4B-parameter vision-language model with a three-module architecture: a vision encoder, an MLP-based vision-language merger, and an LLM decoder. It supports a 256K-token context window that can be extended to 1M, 32-language OCR, and native video temporal reasoning. It is also strong at document understanding, scoring 95.3% on DocVQA and 88.1% on OCRBench.
On Mixpeek, Qwen3-VL-4B powers scene captioning, visual question answering, and document understanding at the 4B parameter sweet spot, offering the best quality-to-cost ratio for pipelines that need both visual and text comprehension.
Architecture
The model is a dense transformer with 36 layers and grouped-query attention (GQA 32/8: 32 query heads sharing 8 key-value heads), totaling 4.44B parameters. It follows a three-module design: a vision encoder, an MLP vision-language merger, and an LLM decoder. It uses Interleaved-MRoPE for video temporal reasoning, DeepStack for multi-level ViT feature fusion, and Text-Timestamp Alignment for event localization.
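To make the GQA 32/8 figure concrete, the sketch below shows how 32 query heads map onto 8 shared key-value heads and what that could mean for KV-cache size at a 256K-token context. The layer and head counts come from the description above; `HEAD_DIM = 128` and a 2-byte (fp16/bf16) cache are assumptions that this page does not state, so treat the result as a rough estimate rather than a measured figure.

```python
# Illustrative arithmetic only. HEAD_DIM and BYTES_PER_VALUE are assumptions.
NUM_LAYERS = 36      # from the architecture description
NUM_Q_HEADS = 32     # GQA 32/8: query heads
NUM_KV_HEADS = 8     # GQA 32/8: key-value heads
HEAD_DIM = 128       # assumption, not stated on this page
BYTES_PER_VALUE = 2  # assumption: fp16/bf16 cache entries


def kv_head_for(query_head: int) -> int:
    """Return the index of the key-value head that a query head reads from."""
    group_size = NUM_Q_HEADS // NUM_KV_HEADS  # 4 query heads per KV head
    return query_head // group_size


def kv_cache_bytes(num_tokens: int, kv_heads: int = NUM_KV_HEADS) -> int:
    """Estimate cache size: keys plus values, for every layer and token."""
    per_token = 2 * NUM_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


if __name__ == "__main__":
    tokens = 256 * 1024
    gqa = kv_cache_bytes(tokens)
    mha = kv_cache_bytes(tokens, kv_heads=NUM_Q_HEADS)
    print(f"GQA cache at 256K tokens: {gqa / 2**30:.0f} GiB")
    print(f"Full multi-head cache:    {mha / 2**30:.0f} GiB")
```

Under these assumptions the estimate is 36 GiB with 8 key-value heads versus 144 GiB if all 32 heads kept their own keys and values, which is one reason a grouped design pairs well with long context.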
Mixpeek SDK Integration
    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    mx.ingest(
        collection_id="document-archive",
        source="s3://documents/",
        extractors=[
            {
                "type": "scene_caption",
                "model": "Qwen/Qwen3-VL-4B-Instruct",
                "output_feature": "caption",
            },
            {
                "type": "text_embedding",
                "model": "Qwen/Qwen3-Embedding-8B",
                "input_field": "caption",
                "output_feature": "caption_embedding",
            },
        ],
    )
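The ingest call above chains two extractors: the caption written by Qwen3-VL-4B becomes the input of the embedding step. A quick local check of that wiring might look like the sketch below. `check_feature_chain` is a hypothetical helper, not part of the Mixpeek SDK, and it assumes that every `input_field` must name an `output_feature` produced by an earlier extractor in the list.

```python
from typing import Dict, List

# Same extractor list as in the ingest call above.
EXTRACTORS: List[Dict[str, str]] = [
    {
        "type": "scene_caption",
        "model": "Qwen/Qwen3-VL-4B-Instruct",
        "output_feature": "caption",
    },
    {
        "type": "text_embedding",
        "model": "Qwen/Qwen3-Embedding-8B",
        "input_field": "caption",
        "output_feature": "caption_embedding",
    },
]


def check_feature_chain(extractors: List[Dict[str, str]]) -> List[str]:
    """Hypothetical helper: report any input_field that no earlier
    extractor has produced as an output_feature."""
    produced = set()
    problems = []
    for index, extractor in enumerate(extractors):
        needed = extractor.get("input_field")
        if needed is not None and needed not in produced:
            problems.append(
                f"extractor {index} ({extractor['type']}) reads "
                f"'{needed}' before it is produced"
            )
        produced.add(extractor["output_feature"])
    return problems


if __name__ == "__main__":
    for problem in check_feature_chain(EXTRACTORS):
        print(problem)
```

With the list in the order shown, the helper returns no problems. Swapping the two entries would flag the embedding step, since it would read `caption` before the captioning extractor has written it.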
Capabilities
- 256K-1M context window
- 32-language OCR and document understanding
- Native video temporal reasoning with timestamp alignment
- 95.3% DocVQA, 88.1% OCRBench
- Apache 2.0 license
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| DocVQA (test) | Accuracy | 95.3% | Qwen, 2025 — Qwen3-VL Technical Report |
| OCRBench | Score | 88.1% | Qwen, 2025 — Qwen3-VL Technical Report |
| MMBench-V1.1 | Score | 85.1% | Qwen, 2025 — Qwen3-VL Technical Report |
Research Paper
Qwen3-VL Technical Report
arxiv.org
Build a pipeline with Qwen3-VL-4B-Instruct
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.