Qwen3-VL-4B-Instruct

by Qwen

Best-in-class 4B vision-language model with 256K context and 32-language OCR

580K downloads/month

4.4B params


Identifiers

Model ID

Qwen/Qwen3-VL-4B-Instruct

Feature URI

mixpeek://image_extractor@v1/qwen3_vl_4b_v1
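The Feature URI above appears to follow a scheme://extractor@version/feature pattern. A minimal sketch for splitting it into parts follows; the helper `parseFeatureUri` and the field names `extractor`, `version`, and `feature` are hypothetical labels chosen for illustration, not part of the Mixpeek SDK.

```typescript
// Hypothetical helper: splits a Mixpeek-style feature URI into its parts.
// The field names below are assumptions made for this example only.
interface FeatureUri {
  extractor: string; // text between "mixpeek://" and "@"
  version: string;   // text between "@" and the first "/"
  feature: string;   // everything after the first "/"
}

function parseFeatureUri(uri: string): FeatureUri {
  const match = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (match === null) {
    throw new Error(`Not a recognized feature URI: ${uri}`);
  }
  return { extractor: match[1], version: match[2], feature: match[3] };
}

const parsed = parseFeatureUri("mixpeek://image_extractor@v1/qwen3_vl_4b_v1");
console.log(parsed.extractor, parsed.version, parsed.feature);
```

Throwing on a non-matching string keeps malformed identifiers from silently flowing into pipeline configuration.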

Overview

Qwen3-VL-4B-Instruct is a dense 4.4B-parameter vision-language model with a three-module architecture: a vision encoder, an MLP-based vision-language merger, and an LLM decoder. It supports a native 256K-token context window (extensible to 1M), OCR in 32 languages, native video temporal reasoning, and strong document understanding, scoring 95.3% on DocVQA and 88.1% on OCRBench.

On Mixpeek, Qwen3-VL-4B powers scene captioning, visual question answering, and document understanding. At roughly 4B parameters, it targets pipelines that need both visual and text comprehension at a favorable quality-to-cost ratio.

Architecture

Dense transformer (36 layers, grouped-query attention with 32 query heads and 8 key-value heads) with 4.44B parameters. The three-module design comprises a vision encoder, an MLP vision-language merger, and an LLM decoder. It uses Interleaved-MRoPE for video temporal reasoning, DeepStack for multi-level ViT feature fusion, and Text-Timestamp Alignment for event localization.
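To make the attention configuration concrete, the sketch below reads it as 32 query heads sharing 8 key-value heads, so each key-value head serves a group of 4 query heads. The contiguous grouping (query heads 0 to 3 use key-value head 0, and so on) is the common convention and is assumed here; the function name `kvHeadFor` is hypothetical.

```typescript
// Assumed reading of the architecture line: 32 query heads, 8 key-value heads.
const NUM_QUERY_HEADS = 32;
const NUM_KV_HEADS = 8;

// Number of query heads that share one key-value head.
const GROUP_SIZE = NUM_QUERY_HEADS / NUM_KV_HEADS;

// Maps a query head index to the key-value head it attends with,
// assuming contiguous groups (a common convention, not confirmed by the source).
function kvHeadFor(queryHead: number): number {
  if (!Number.isInteger(queryHead) || queryHead < 0 || queryHead >= NUM_QUERY_HEADS) {
    throw new RangeError(`queryHead must be an integer in [0, ${NUM_QUERY_HEADS})`);
  }
  return Math.floor(queryHead / GROUP_SIZE);
}

// Relative to full multi-head attention (one key-value head per query head),
// the key-value cache shrinks by the group size.
const kvCacheReduction = NUM_QUERY_HEADS / NUM_KV_HEADS;

console.log(GROUP_SIZE, kvHeadFor(0), kvHeadFor(5), kvHeadFor(31), kvCacheReduction);
```

Under these assumptions the key-value cache holds a quarter of the heads that standard multi-head attention would store, which matters most at long context lengths.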

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

// "API_KEY" is a placeholder; substitute your own Mixpeek API key
const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "scene_caption",
    version: "v1",
    parameters: { model_id: "Qwen/Qwen3-VL-4B-Instruct" },
  },
});

Capabilities

256K native context window, extensible to 1M
32-language OCR and document understanding
Native video temporal reasoning with timestamp alignment
95.3% DocVQA, 88.1% OCRBench
Apache 2.0 license
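As a rough illustration of the context-window capability listed above, the following sketch classifies a request by its total token budget. It assumes "256K" and "1M" mean 256 × 1024 and 1024 × 1024 tokens, which the source does not state, and the names `contextTier` and `ContextTier` are hypothetical. Token counts would come from whatever tokenizer the pipeline uses.

```typescript
// Assumption: the limits are binary multiples; the source only says "256K" and "1M".
const NATIVE_CONTEXT_TOKENS = 256 * 1024;
const EXTENDED_CONTEXT_TOKENS = 1024 * 1024;

type ContextTier = "native" | "extended" | "exceeds";

// Classifies a request by prompt tokens plus the tokens reserved for generation.
function contextTier(promptTokens: number, maxNewTokens: number): ContextTier {
  if (promptTokens < 0 || maxNewTokens < 0) {
    throw new RangeError("token counts must be non-negative");
  }
  const total = promptTokens + maxNewTokens;
  if (total <= NATIVE_CONTEXT_TOKENS) return "native";
  if (total <= EXTENDED_CONTEXT_TOKENS) return "extended";
  return "exceeds";
}

console.log(contextTier(200_000, 4_096)); // "native"
console.log(contextTier(300_000, 4_096)); // "extended"
```

Counting the generation budget alongside the prompt avoids requests that fit on input but overflow during decoding.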

Use Cases on Mixpeek

Document understanding and extraction (invoices, forms, contracts)

Video scene captioning with temporal event localization

Multilingual OCR across diverse document types and languages

Benchmarks

Dataset	Metric	Score	Source
DocVQA (test)	Accuracy	95.3%	Qwen, 2025 — Qwen3-VL Technical Report
OCRBench	Score	88.1%	Qwen, 2025 — Qwen3-VL Technical Report
MMBench-V1.1	Score	85.1%	Qwen, 2025 — Qwen3-VL Technical Report