Qianfan-OCR

by baidu

4B unified document intelligence model with Layout-as-Thought reasoning

482K downloads/month

4B params

Identifiers

Model ID

baidu/Qianfan-OCR

Feature URI

mixpeek://image_extractor@v1/baidu_qianfan_ocr_4b_v1
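
The Feature URI above appears to follow a `mixpeek://<extractor>@<version>/<feature>` shape. The sketch below splits it into those three parts. The component names (`extractor`, `version`, `feature`) and the `parseFeatureUri` helper are assumptions inferred from this one example, not a documented Mixpeek format or SDK function.

```typescript
// Hypothetical breakdown of a Feature URI; field names are guesses
// based on the single example shown in this page.
type FeatureUri = {
  extractor: string; // e.g. "image_extractor"
  version: string;   // e.g. "v1"
  feature: string;   // e.g. "baidu_qianfan_ocr_4b_v1"
};

// Returns the parsed parts, or null when the string does not match
// the assumed mixpeek://<extractor>@<version>/<feature> shape.
function parseFeatureUri(uri: string): FeatureUri | null {
  const m = /^mixpeek:\/\/([^@\/]+)@([^\/]+)\/(.+)$/.exec(uri);
  if (m === null) return null;
  const [, extractor = "", version = "", feature = ""] = m;
  return { extractor, version, feature };
}

const parsed = parseFeatureUri(
  "mixpeek://image_extractor@v1/baidu_qianfan_ocr_4b_v1",
);
console.log(parsed);
```

Returning `null` instead of throwing keeps the helper easy to use when validating URIs that come from user input.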

Overview

Qianfan-OCR is Baidu's 4B-parameter end-to-end model that unifies document parsing, layout analysis, and document understanding within a single vision-language architecture. Its key innovation is Layout-as-Thought: an optional thinking phase where the model generates structured layout representations (bounding boxes, element types, reading order) before producing final text output, recovering layout grounding capabilities lost by pure end-to-end approaches.
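
To make the Layout-as-Thought idea concrete, here is a minimal sketch of what a structured layout representation could look like and how reading order turns it into linear text. The `LayoutElement` type, its field names, and the `linearize` helper are hypothetical illustrations; the model's actual output schema is not specified here.

```typescript
// Hypothetical shape of one layout element. The overview names three
// pieces of information: bounding box, element type, and reading order.
type LayoutElement = {
  type: "title" | "paragraph" | "table" | "figure" | "form_field";
  bbox: [number, number, number, number]; // [x0, y0, x1, y1] in pixels
  readingOrder: number;                   // position in the reading sequence
  text: string;
};

// Join element texts in reading order rather than in the order the
// elements happen to be listed (or their top-to-bottom position).
function linearize(elements: LayoutElement[]): string {
  return [...elements] // copy so the caller's array is not mutated
    .sort((a, b) => a.readingOrder - b.readingOrder)
    .map((e) => e.text)
    .join("\n");
}

// Two-column page: the right column sits at the same height as the left
// one, so only the explicit reading order puts it last.
const samplePage: LayoutElement[] = [
  { type: "paragraph", bbox: [320, 100, 600, 400], readingOrder: 2, text: "Right column" },
  { type: "title", bbox: [20, 20, 600, 60], readingOrder: 0, text: "Title" },
  { type: "paragraph", bbox: [20, 100, 300, 400], readingOrder: 1, text: "Left column" },
];
console.log(linearize(samplePage));
```

The point of the sketch is that reading order is carried as data alongside each box, so multi-column pages can be flattened without guessing from coordinates.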

On Mixpeek, Qianfan-OCR extracts text with layout awareness from complex documents including multi-column pages, tables, and forms, powering structured document search where reading order and spatial relationships matter.

Architecture

Qianfan-OCR is a three-component vision-language model (VLM):

Qianfan-ViT vision encoder: 24 Transformer layers, AnyResolution input up to 4K, 256 visual tokens per 448x448 tile
Cross-modal adapter: 2-layer MLP with GELU activation, projecting 1024-dim visual features to 2560-dim
Qwen3-4B language model backbone: 36 layers, grouped-query attention (GQA) with 32 query heads and 8 key-value heads, 32K context extendable to 131K

Layout-as-Thought uses think tokens for structured layout prediction.
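
As a rough worked example of the tile figures above, the following sketch estimates how many visual tokens an image produces at 256 tokens per 448x448 tile. It assumes a plain grid that rounds each dimension up to whole tiles, with no extra thumbnail tile and no resizing; the actual AnyResolution tiling strategy may differ, and `estimateVisualTokens` is a hypothetical helper.

```typescript
const TILE_SIZE = 448;       // tile edge in pixels, from the architecture description
const TOKENS_PER_TILE = 256; // visual tokens per tile, from the architecture description

// Estimate the visual token count under the assumed plain-grid tiling:
// each dimension is rounded up to a whole number of tiles.
function estimateVisualTokens(width: number, height: number): number {
  if (width <= 0 || height <= 0) {
    throw new RangeError("width and height must be positive");
  }
  const cols = Math.ceil(width / TILE_SIZE);
  const rows = Math.ceil(height / TILE_SIZE);
  return cols * rows * TOKENS_PER_TILE;
}

// 1344x896 covers a 3x2 grid: 6 tiles * 256 tokens = 1536 tokens.
console.log(estimateVisualTokens(1344, 896)); // 1536
```

Under the same assumption, a single 448x448 image costs 256 tokens, and one extra pixel of width doubles that to 512 because a second tile is needed.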

Mixpeek SDK Integration

import { Mixpeek } from "mixpeek";

const mx = new Mixpeek({ apiKey: "API_KEY" });

// Managed: create a collection over a bucket; Mixpeek runs this model's extractor
const collection = await mx.collections.create({
  namespace_id: "my-namespace",
  collection_name: "my-collection",
  source: { type: "bucket", bucket_ids: ["bkt_your_bucket"] },
  feature_extractor: {
    feature_extractor_name: "ocr",
    version: "v1",
    parameters: { model_id: "baidu/Qianfan-OCR" },
  },
});

Capabilities

Top score on OmniDocBench v1.5 (93.12) among end-to-end models
Layout-as-Thought reasoning for structured document understanding
AnyResolution processing up to 4K images
OCRBench score of 880 (ahead of Qwen3-VL-4B at 873)
Apache 2.0 open-source

Use Cases on Mixpeek

Complex document parsing where layout and reading order matter

Table extraction and structured data output from scanned forms

Enterprise document search with layout-aware text extraction

Benchmarks

Dataset	Metric	Score	Source
OmniDocBench v1.5	Score	93.12	Baidu, March 2026: Qianfan-OCR paper
OCRBench	Score	880	Baidu, March 2026: Qianfan-OCR paper
DocVQA	Accuracy	92.8%	Baidu, March 2026: Qianfan-OCR paper