    Qianfan-OCR

    by baidu

    4B unified document intelligence model with Layout-as-Thought reasoning

    482K downloads/month
    4B parameters
    Identifiers
    Model ID
    baidu/Qianfan-OCR
    Feature URI
    mixpeek://image_extractor@v1/baidu_qianfan_ocr_4b_v1
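
    The Feature URI above appears to follow a scheme://extractor@version/feature shape. The sketch below splits a URI of that shape into parts. The field names and their meanings are assumptions made for illustration, and parse_feature_uri is a hypothetical helper, not part of the Mixpeek SDK.

```python
from typing import NamedTuple


class FeatureURI(NamedTuple):
    scheme: str     # e.g. "mixpeek"
    extractor: str  # e.g. "image_extractor" (assumed meaning)
    version: str    # e.g. "v1" (assumed meaning)
    feature: str    # e.g. "baidu_qianfan_ocr_4b_v1" (assumed meaning)


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a URI shaped like scheme://extractor@version/feature."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"missing '://' in {uri!r}")
    head, sep, feature = rest.partition("/")
    if not sep or not feature:
        raise ValueError(f"missing feature segment in {uri!r}")
    extractor, sep, version = head.partition("@")
    if not sep or not version:
        raise ValueError(f"missing '@version' in {uri!r}")
    return FeatureURI(scheme, extractor, version, feature)


parsed = parse_feature_uri("mixpeek://image_extractor@v1/baidu_qianfan_ocr_4b_v1")
print(parsed)
```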

    Overview

    Qianfan-OCR is Baidu's 4B-parameter end-to-end model that unifies document parsing, layout analysis, and document understanding within a single vision-language architecture. Its key innovation is Layout-as-Thought: an optional thinking phase where the model generates structured layout representations (bounding boxes, element types, reading order) before producing final text output, recovering layout grounding capabilities lost by pure end-to-end approaches.
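
    To make the layout phase concrete, here is a minimal sketch of what one structured layout record could look like and how reading order would be applied to it. The LayoutElement class, its field names, and the pixel bbox convention are assumptions for illustration; the model's actual output format is not described here.

```python
from dataclasses import dataclass


@dataclass
class LayoutElement:
    """One hypothetical layout record: where it is, what it is, when to read it."""
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, assumed convention
    element_type: str                # e.g. "title", "paragraph", "table"
    reading_order: int               # position in the reading sequence
    text: str = ""                   # filled in during the final text phase


def in_reading_order(elements: list[LayoutElement]) -> list[LayoutElement]:
    """Return elements sorted by their predicted reading order."""
    return sorted(elements, key=lambda e: e.reading_order)


# A two-column page: both paragraphs start at the same y, so sorting by
# vertical position alone cannot decide between them. The explicit
# reading_order field settles it.
page = [
    LayoutElement((320, 100, 600, 400), "paragraph", 2, "Right column text."),
    LayoutElement((20, 20, 600, 80), "title", 0, "Quarterly Report"),
    LayoutElement((20, 100, 300, 400), "paragraph", 1, "Left column text."),
]

ordered_text = [e.text for e in in_reading_order(page)]
print(ordered_text)
```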

    On Mixpeek, Qianfan-OCR extracts text with layout awareness from complex documents including multi-column pages, tables, and forms, powering structured document search where reading order and spatial relationships matter.

    Architecture

    Qianfan-OCR is a three-component VLM: a Qianfan-ViT vision encoder (24 Transformer layers, AnyResolution input up to 4K, 256 visual tokens per 448x448 tile), a 2-layer MLP cross-modal adapter (1024-dim to 2560-dim, GELU activation), and a Qwen3-4B language model backbone (36 layers, grouped-query attention with 32 query heads and 8 key-value heads, 32K context extendable to 131K). Layout-as-Thought uses think tokens for structured layout prediction.
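
    The tile figures above imply a simple token budget: at 256 tokens per 448x448 tile, the 4096-token maximum quoted under Performance corresponds to 16 tiles. The sketch below works through that arithmetic. The plain ceil-division grid is an assumption; the model's real tiling scheme (for example, extra thumbnail tiles or downscaling of oversized inputs) is not described here.

```python
import math

TILE_SIZE = 448           # pixels per tile side (from the architecture description)
TOKENS_PER_TILE = 256     # visual tokens per tile (from the architecture description)
MAX_VISUAL_TOKENS = 4096  # budget quoted in the performance figures


def tile_grid(width: int, height: int) -> tuple[int, int]:
    """Columns and rows of a plain ceil-division grid (assumed tiling scheme)."""
    return math.ceil(width / TILE_SIZE), math.ceil(height / TILE_SIZE)


def visual_tokens(width: int, height: int) -> int:
    """Visual tokens needed if every grid cell becomes one tile."""
    cols, rows = tile_grid(width, height)
    return cols * rows * TOKENS_PER_TILE


max_tiles = MAX_VISUAL_TOKENS // TOKENS_PER_TILE
print(max_tiles)                  # tiles that fit in the budget
print(visual_tokens(1792, 1792))  # 4 x 4 grid
print(visual_tokens(896, 1344))   # 2 x 3 grid
```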

    Mixpeek SDK Integration

    from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    mx.ingest(
        collection_id="enterprise-docs",
        source="s3://forms-and-tables/",
        extractors=[{
            "type": "ocr",
            "model": "baidu/Qianfan-OCR",
            "output_feature": "extracted_text"
        }]
    )

    Capabilities

    • Top score on OmniDocBench v1.5 (93.12) among end-to-end models
    • Layout-as-Thought reasoning for structured document understanding
    • AnyResolution processing up to 4K images
    • OCRBench score of 880 (ahead of Qwen3-VL-4B at 873)
    • Apache 2.0 open-source

    Use Cases on Mixpeek

    • Complex document parsing where layout and reading order matter
    • Table extraction and structured data output from scanned forms
    • Enterprise document search with layout-aware text extraction
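
    For the table-extraction use case, text plus bounding boxes can be regrouped into rows downstream. The sketch below shows one simple way to do that by clustering cells on their vertical centre. The (text, bbox) tuple shape, the group_into_rows helper, and the tolerance value are all assumptions for illustration; the extractor's actual output schema is not specified here.

```python
def group_into_rows(cells, y_tolerance=10):
    """Group (text, (x0, y0, x1, y1)) cells into rows, each sorted left to right."""
    rows = []  # each row: {"y": centre of the row's first cell, "cells": [(x0, text)]}
    by_centre = sorted(cells, key=lambda c: (c[1][1] + c[1][3]) / 2)
    for text, (x0, y0, x1, y1) in by_centre:
        centre = (y0 + y1) / 2
        if rows and abs(centre - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["cells"].append((x0, text))
        else:
            rows.append({"y": centre, "cells": [(x0, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Cells arrive in arbitrary order, with slightly uneven baselines.
cells = [
    ("Total", (20, 62, 80, 78)),
    ("Name", (20, 20, 80, 36)),
    ("42", (200, 60, 230, 76)),
    ("Amount", (200, 22, 260, 38)),
]

table = group_into_rows(cells)
print(table)
```

    A fixed pixel tolerance is the simplest choice; scanned forms with skew or varying font sizes would likely need a tolerance scaled to cell height.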

    Benchmarks

    Dataset             Metric      Score    Source
    OmniDocBench v1.5   Score       93.12    Baidu, March 2026 (Qianfan-OCR paper)
    OCRBench            Score       880      Baidu, March 2026 (Qianfan-OCR paper)
    DocVQA              Accuracy    92.8%    Baidu, March 2026 (Qianfan-OCR paper)

    Performance

    Input Size: Up to 4K resolution (AnyResolution, max 4096 visual tokens)
    GPU Latency: ~1 s/page (A100, W8A8 quantized)
    GPU Throughput: ~1 page/sec (A100, vLLM W8A8)
    GPU Memory: ~9 GB
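
    The throughput figure can be turned into a rough capacity estimate. The sketch below assumes the quoted ~1 page/sec holds across the whole batch and that throughput scales linearly with GPU count; neither is guaranteed in practice, and estimated_seconds is a hypothetical helper.

```python
PAGES_PER_SECOND = 1.0  # approximate single-A100 throughput quoted above


def estimated_seconds(num_pages: int, gpus: int = 1,
                      pages_per_second: float = PAGES_PER_SECOND) -> float:
    """Rough wall-clock time, assuming linear scaling across GPUs."""
    if num_pages < 0 or gpus < 1 or pages_per_second <= 0:
        raise ValueError("num_pages must be >= 0, gpus >= 1, throughput > 0")
    return num_pages / (pages_per_second * gpus)


one_gpu = estimated_seconds(10_000)
four_gpus = estimated_seconds(10_000, gpus=4)
print(f"{one_gpu / 3600:.2f} h on 1 GPU, {four_gpus / 3600:.2f} h on 4 GPUs")
```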

    Specification

    Framework: HF
    Organization: baidu
    Feature: OCR
    Output: text + bbox
    Modalities: video, image, document
    Retriever: Text-in-Image
    Parameters: 4B
    License: Apache-2.0
    Downloads/mo: 482K

    Research Paper

    Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

    arxiv.org

    Build a pipeline with Qianfan-OCR

    Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.