    HF · Scene Captioning · Apache-2.0

    Qwen3-VL-4B-Instruct

    by Qwen

    Best-in-class 4B vision-language model with 256K context and 32-language OCR

    580K downloads/month
    4.4B params
    Identifiers
    Model ID
    Qwen/Qwen3-VL-4B-Instruct
    Feature URI
    mixpeek://image_extractor@v1/qwen3_vl_4b_v1

    Overview

    Qwen3-VL-4B-Instruct is a dense 4.4B-parameter vision-language model built from three modules: a vision encoder, an MLP-based vision-language merger, and an LLM decoder. It supports a native 256K-token context window (extendable to 1M), OCR in 32 languages, native video temporal reasoning, and strong document understanding, scoring 95.3% on DocVQA and 88.1% on OCRBench.

    On Mixpeek, Qwen3-VL-4B powers scene captioning, visual question answering, and document understanding at the 4B parameter sweet spot, offering the best quality-to-cost ratio for pipelines that need both visual and text comprehension.

    Architecture

    Qwen3-VL-4B-Instruct is a dense transformer with 4.44B parameters, 36 layers, and grouped-query attention (GQA) with 32 query heads sharing 8 key-value heads. It follows a three-module design: vision encoder, MLP vision-language merger, and LLM decoder. It uses Interleaved-MRoPE for video temporal reasoning, DeepStack for multi-level ViT feature fusion, and Text-Timestamp Alignment for event localization.
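To make the "GQA 32/8" figure concrete, here is a minimal sketch of how 32 query heads can map onto 8 shared key-value heads. It assumes the common contiguous grouping (query heads 0 to 3 read from key-value head 0, and so on); the function names are illustrative and are not taken from the Qwen or Mixpeek code.

```python
# Illustrative only: head counts come from the "GQA 32/8" figure in the text.
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8


def kv_head_for(query_head: int) -> int:
    """Return the key-value head index a query head reads from.

    Assumes contiguous grouping, a common convention for GQA.
    """
    if not 0 <= query_head < NUM_QUERY_HEADS:
        raise ValueError(f"query_head must be in [0, {NUM_QUERY_HEADS})")
    group_size = NUM_QUERY_HEADS // NUM_KV_HEADS  # 4 query heads per KV head
    return query_head // group_size


def kv_cache_ratio() -> float:
    """KV-cache size relative to full multi-head attention with 32 KV heads."""
    return NUM_KV_HEADS / NUM_QUERY_HEADS


if __name__ == "__main__":
    for q in (0, 3, 4, 31):
        print(f"query head {q:2d} -> kv head {kv_head_for(q)}")
    print(f"KV cache is {kv_cache_ratio():.0%} of the multi-head baseline")
```

With these head counts, the key-value cache holds a quarter of what it would with one key-value head per query head, which matters for long contexts.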

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    mx.ingest(
        collection_id="document-archive",
        source="s3://documents/",
        extractors=[
            # Step 1: caption each scene with Qwen3-VL-4B-Instruct.
            {
                "type": "scene_caption",
                "model": "Qwen/Qwen3-VL-4B-Instruct",
                "output_feature": "caption",
            },
            # Step 2: embed the generated caption for semantic search.
            {
                "type": "text_embedding",
                "model": "Qwen/Qwen3-Embedding-8B",
                "input_field": "caption",
                "output_feature": "caption_embedding",
            },
        ],
    )
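The two extractors above are chained: the embedding step reads the `caption` field that the captioning step writes. A small pre-flight check like the one below can catch a mismatched field name before ingestion. This is a sketch under the assumption that `input_field` always refers to an earlier extractor's `output_feature`; `check_extractor_wiring` is a hypothetical helper and not part of the Mixpeek SDK.

```python
from typing import Any


def check_extractor_wiring(extractors: list[dict[str, Any]]) -> list[str]:
    """Return a list of wiring problems; an empty list means the chain is consistent.

    Hypothetical helper: assumes each "input_field" must name an
    "output_feature" produced by an earlier extractor in the list.
    """
    problems: list[str] = []
    produced: set[str] = set()
    for index, extractor in enumerate(extractors):
        input_field = extractor.get("input_field")
        if input_field is not None and input_field not in produced:
            problems.append(
                f"extractor {index} ({extractor.get('type')}) reads "
                f"'{input_field}', which no earlier extractor produces"
            )
        output_feature = extractor.get("output_feature")
        if output_feature is not None:
            if output_feature in produced:
                problems.append(f"duplicate output_feature '{output_feature}'")
            produced.add(output_feature)
    return problems


# Same extractor list as in the ingestion example.
extractors = [
    {
        "type": "scene_caption",
        "model": "Qwen/Qwen3-VL-4B-Instruct",
        "output_feature": "caption",
    },
    {
        "type": "text_embedding",
        "model": "Qwen/Qwen3-Embedding-8B",
        "input_field": "caption",
        "output_feature": "caption_embedding",
    },
]

if __name__ == "__main__":
    print(check_extractor_wiring(extractors))  # prints []
```

Running the check on the list in reverse order reports one problem, because the embedding step would then run before any caption exists.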

    Capabilities

    • 256K-token native context window, extendable to 1M
    • 32-language OCR and document understanding
    • Native video temporal reasoning with timestamp alignment
    • 95.3% DocVQA, 88.1% OCRBench
    • Apache 2.0 license

    Use Cases on Mixpeek

    Document understanding and extraction (invoices, forms, contracts)
    Video scene captioning with temporal event localization
    Multilingual OCR across diverse document types and languages

    Benchmarks

    Dataset          Metric     Score    Source
    DocVQA (test)    Accuracy   95.3%    Qwen (2025), Qwen3-VL Technical Report
    OCRBench         Score      88.1%    Qwen (2025), Qwen3-VL Technical Report
    MMBench-V1.1     Score      85.1%    Qwen (2025), Qwen3-VL Technical Report

    Performance

    Input Size        Up to 256K tokens (text + image patches)
    GPU Latency       ~80 ms / image (A100)
    GPU Throughput    ~95 images/sec (A100, batch 8)
    GPU Memory        ~9.5 GB
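A quick back-of-the-envelope check shows how these figures relate to each other. The sketch below assumes 16-bit (bf16) weights at 2 bytes per parameter and decimal gigabytes, neither of which is stated on this page; the 1,000,000-image corpus is a made-up example size.

```python
PARAMS = 4.44e9                    # parameter count from the Architecture section
BYTES_PER_PARAM_BF16 = 2           # assumption: weights stored in bf16
THROUGHPUT_IMAGES_PER_SEC = 95.0   # quoted A100 throughput at batch 8
BATCH_SIZE = 8


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def seconds_per_batch(batch_size: int, images_per_sec: float) -> float:
    """Wall-clock time for one batch at the quoted throughput."""
    return batch_size / images_per_sec


def hours_for_corpus(num_images: int, images_per_sec: float) -> float:
    """Time to caption a corpus on a single GPU at the quoted throughput."""
    return num_images / images_per_sec / 3600


if __name__ == "__main__":
    print(f"weights: {weight_memory_gb(PARAMS, BYTES_PER_PARAM_BF16):.2f} GB")
    print(f"one batch of 8: {seconds_per_batch(BATCH_SIZE, THROUGHPUT_IMAGES_PER_SEC) * 1000:.0f} ms")
    print(f"1M images: {hours_for_corpus(1_000_000, THROUGHPUT_IMAGES_PER_SEC):.1f} hours")
```

Under these assumptions the weights take about 8.9 GB, leaving the rest of the quoted ~9.5 GB for activations and runtime overhead. A batch of 8 takes about 84 ms, close to the quoted single-image latency, and a million images would take roughly 2.9 hours on one A100.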

    Specification

    Framework       HF
    Organization    Qwen
    Feature         Scene Captioning
    Output          text
    Modalities      video, image
    Retriever       Semantic Search
    Parameters      4.4B
    License         Apache-2.0
    Downloads/mo    580K

    Research Paper

    Qwen3-VL Technical Report

    arxiv.org

    Build a pipeline with Qwen3-VL-4B-Instruct

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.
