Qianfan-OCR
by baidu
4B unified document intelligence model with Layout-as-Thought reasoning
baidu/Qianfan-OCR
mixpeek://image_extractor@v1/baidu_qianfan_ocr_4b_v1
Overview
Qianfan-OCR is Baidu's 4B-parameter end-to-end model that unifies document parsing, layout analysis, and document understanding within a single vision-language architecture. Its key innovation is Layout-as-Thought: an optional thinking phase where the model generates structured layout representations (bounding boxes, element types, reading order) before producing final text output, recovering layout grounding capabilities lost by pure end-to-end approaches.
On Mixpeek, Qianfan-OCR extracts text with layout awareness from complex documents including multi-column pages, tables, and forms, powering structured document search where reading order and spatial relationships matter.
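To make the role of reading order concrete, here is a minimal sketch of how layout elements (bounding box, element type, reading order) could be represented and reassembled into text. The `LayoutElement` schema and the `assemble_in_reading_order` helper are hypothetical illustrations; they are not the model's actual output format or part of any SDK.

```python
from dataclasses import dataclass


@dataclass
class LayoutElement:
    """One predicted layout region (hypothetical schema, not the model's real format)."""
    kind: str                         # element type, e.g. "title", "paragraph", "table"
    bbox: tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels
    order: int                        # position in the predicted reading order
    text: str                         # text recognized inside the region


def assemble_in_reading_order(elements: list[LayoutElement]) -> str:
    """Join element text following the predicted reading order."""
    ordered = sorted(elements, key=lambda e: e.order)
    return "\n".join(e.text for e in ordered)


# A two-column page: sorting purely by vertical position would interleave the
# columns, while the predicted reading order finishes the left column first.
page = [
    LayoutElement("paragraph", (320, 100, 600, 300), 2, "Right column, first paragraph."),
    LayoutElement("title", (20, 20, 600, 60), 0, "Quarterly Report"),
    LayoutElement("paragraph", (20, 100, 300, 500), 1, "Left column, full height."),
]
print(assemble_in_reading_order(page))
```

On a multi-column page, this is the difference between text that reads as the author intended and text that alternates between columns line by line.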
Architecture
The model is a three-component VLM: a Qianfan-ViT vision encoder (24 Transformer layers, AnyResolution input up to 4K, 256 visual tokens per 448x448 tile), a 2-layer MLP cross-modal adapter (1024-dim to 2560-dim, GELU activation), and a Qwen3-4B language model backbone (36 layers, grouped-query attention with 32 query heads and 8 key-value heads, 32K context extendable to 131K). Layout-as-Thought uses think tokens to produce the structured layout prediction.
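The tile figures above give a rough way to estimate how many visual tokens an image contributes to the context window. The sketch below assumes a plain ceil-division grid of 448x448 tiles; the real AnyResolution preprocessing may resize, pad, or add a global thumbnail tile, none of which is modeled here. The function name `estimate_visual_tokens` is hypothetical.

```python
import math

TILE_SIZE = 448         # tile side in pixels, from the architecture description
TOKENS_PER_TILE = 256   # visual tokens per tile, from the architecture description


def estimate_visual_tokens(width: int, height: int) -> int:
    """Estimate visual tokens, ASSUMING a simple ceil-division tile grid."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    cols = math.ceil(width / TILE_SIZE)
    rows = math.ceil(height / TILE_SIZE)
    return cols * rows * TOKENS_PER_TILE


print(estimate_visual_tokens(448, 448))    # one tile: 256
print(estimate_visual_tokens(1792, 1344))  # 4 x 3 tiles: 3072
```

Under this assumption, even large scans stay well inside the 32K context, leaving room for the layout prediction and the extracted text.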
Mixpeek SDK Integration
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_KEY")

mx.ingest(
    collection_id="enterprise-docs",
    source="s3://forms-and-tables/",
    extractors=[
        {
            "type": "ocr",
            "model": "baidu/Qianfan-OCR",
            "output_feature": "extracted_text",
        }
    ],
)
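When the same extractor entry is reused across several ingest calls, it can help to build it in one place. The sketch below uses only the keys that appear in the snippet above (`type`, `model`, `output_feature`). The `ocr_extractor` helper is hypothetical and is not part of the Mixpeek SDK.

```python
def ocr_extractor(output_feature: str = "extracted_text",
                  model: str = "baidu/Qianfan-OCR") -> dict:
    """Build one extractor entry with the same keys as the SDK snippet.

    Hypothetical helper for illustration; not provided by the Mixpeek SDK.
    """
    if not output_feature:
        raise ValueError("output_feature must be a non-empty string")
    return {"type": "ocr", "model": model, "output_feature": output_feature}


# The resulting list has the same shape as the `extractors` argument above.
extractors = [ocr_extractor()]
print(extractors)
```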
Capabilities
- Top score on OmniDocBench v1.5 (93.12) among end-to-end models
- Layout-as-Thought reasoning for structured document understanding
- AnyResolution processing of images up to 4K
- OCRBench score of 880 (ahead of Qwen3-VL-4B at 873)
- Open source under the Apache 2.0 license
Use Cases on Mixpeek
Benchmarks
| Dataset | Metric | Score | Source |
|---|---|---|---|
| OmniDocBench v1.5 | Score | 93.12 | Baidu, March 2026 — Qianfan-OCR paper |
| OCRBench | Score | 880 | Baidu, March 2026 — Qianfan-OCR paper |
| DocVQA | Accuracy | 92.8% | Baidu, March 2026 — Qianfan-OCR paper |
Performance
Specification
Research Paper
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
arxiv.org
Build a pipeline with Qianfan-OCR
Add this model to a processing pipeline alongside other extractors. Combine with retrieval stages for end-to-end search.