The Problem: Documents Are Visual, But Pipelines Are Textual
Every organization has a document problem. Invoices, contracts, insurance claims, scientific papers, tax forms -- the world produces trillions of pages annually, and the information locked inside them is structured (tables, key-value pairs, hierarchies) but stored in an unstructured format (pixels on a page).
Traditional OCR pipelines treat this as a two-stage problem: first recognize characters, then apply rules to extract structure. That separation is the root of most failures. A table that spans two columns, a form field whose label is three lines above its value, a chart embedded next to a paragraph -- OCR sees characters. It does not see layout.
The shift happening now is fundamental: vision-language models (VLMs) understand the entire page as a visual object and generate structured output directly. No intermediate text layer. No template rules. The model sees the document the way a human does -- spatially -- and reasons about what it contains.
This guide covers the algorithms behind that shift, when each approach works, and how to build extraction pipelines that handle real-world documents.
How Traditional OCR Works (and Where It Breaks)
A classical document extraction pipeline has four stages:
1. Image preprocessing -- deskewing, binarization, noise removal
2. Text detection -- finding bounding boxes of text regions (EAST, CRAFT, DBNet)
3. Text recognition -- converting each region to a character string (CRNN, TrOCR)
4. Post-processing -- rule-based or ML-based extraction of fields, tables, key-value pairs
Each stage introduces compounding error. A slightly skewed scan produces bad bounding boxes, which produce garbled characters, which break downstream extraction rules. But the deeper issue is architectural: OCR pipelines discard spatial relationships between text regions. Once characters are recognized, the pipeline works with a flat text stream. The fact that "Total" appeared directly left of "$4,287.50" in a table row is lost unless you explicitly reconstruct it from coordinates.
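To make the cost of that reconstruction concrete, here is a minimal sketch of pairing a label with the value to its right using only bounding boxes. The `Word` type, the `value_right_of` helper, and the coordinates are hypothetical; real OCR engines expose boxes in their own formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def value_right_of(label: str, words: list[Word]) -> Optional[str]:
    """Return the nearest word to the right of `label` on the same row."""
    anchor = next((w for w in words if w.text == label), None)
    if anchor is None:
        return None
    anchor_mid = (anchor.y0 + anchor.y1) / 2
    # Same row: the anchor's vertical midpoint falls inside the candidate's box.
    candidates = [
        w for w in words
        if w is not anchor and w.x0 >= anchor.x1 and w.y0 <= anchor_mid <= w.y1
    ]
    if not candidates:
        return None
    # Pick the candidate with the smallest horizontal gap.
    return min(candidates, key=lambda w: w.x0 - anchor.x1).text


# Illustrative boxes for two rows of an invoice summary.
words = [
    Word("Subtotal", 50, 100, 120, 115),
    Word("$3,900.00", 400, 100, 470, 115),
    Word("Total", 50, 130, 95, 145),
    Word("$4,287.50", 400, 130, 470, 145),
]
print(value_right_of("Total", words))
```

Even this toy version encodes assumptions about row alignment and reading direction, which is why hand-written coordinate rules tend to be brittle.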
This works for simple documents with fixed templates (utility bills, standard invoices). It breaks on:
The industry response was to add more stages -- table detection models, layout analysis models, handwriting recognizers -- each with its own training data, failure modes, and maintenance burden. A production document extraction pipeline in 2024 might chain 6-8 separate models.
The VLM Alternative: See the Page, Speak the Structure
Vision-language models collapse that entire pipeline into a single forward pass. The architecture has two components:
Visual Encoder
A vision transformer (ViT) processes the document image. Unlike OCR's character-level detection, the ViT sees the full page at once. Its self-attention mechanism naturally captures spatial relationships: it knows that a column header is above a data cell, that a footnote is at the bottom of the page, that a checkbox is next to a label.
Most document VLMs use high-resolution encoders (1024x1024 or higher) or tile-based approaches that split large documents into overlapping patches, process each patch, and fuse the representations. This preserves fine details like small text and thin table borders.
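The tiling step can be sketched as a pure coordinate calculation. This is an illustration only: the function name `tile_boxes`, the 1024-pixel tile size, and the 128-pixel overlap are assumptions, and actual models choose their own tiling and fusion schemes.

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, top, right, bottom) boxes that cover the page with overlapping tiles."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    stride = tile - overlap

    def starts(length):
        # A page smaller than one tile needs a single tile starting at 0.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)  # final tile sits flush with the far edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


for box in tile_boxes(2048, 1024):
    print(box)
```

Each box could then be cropped from the rendered page and passed to the encoder; how the per-tile features are fused is model-specific and not shown here.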
Language Decoder
A transformer decoder generates structured output conditioned on the visual representation. The output format varies:
The key insight is that the model learns the mapping from visual layout to structured output end-to-end, from millions of document-output pairs. It does not need explicit rules for "if the bold text is above the table, it is the table title." It learns these patterns implicitly.
Why This Works Better
The information flow is fundamentally different:
```
Traditional OCR: Image > Text Detection > Text Recognition > Flat Text > Rules > Structure
VLM:             Image > Visual Features > Structured Output (directly)
```
The VLM never creates an intermediate flat-text representation. Structure is preserved from pixels to output because the visual encoder maintains spatial relationships and the decoder was trained to produce structured formats.
Three Architectures for Document Extraction
Not all VLM-based extractors work the same way. The field has converged on three distinct approaches, each with different tradeoffs.
1. Encoder-Decoder with Structured Output Training
Examples: GOT-OCR 2.0, Nougat, Donut
These models are trained specifically for document-to-structure conversion. The training data pairs document images with their structured representations (Markdown, LaTeX, HTML). The model learns to "read" a document and produce the corresponding structured text.
GOT-OCR 2.0 (General OCR Theory) is notable because it handles OCR, table recognition, formula recognition, and document layout in a single model. It uses a high-resolution visual encoder paired with a Qwen language decoder, trained on a mixture of:
The model switches output format based on what it sees in the image, without explicit instructions.
Tradeoffs:
2. Instruction-Following Document VLMs
Examples: InternVL, Qwen-VL, Granite Vision
These are general-purpose VLMs fine-tuned on document understanding tasks. Unlike encoder-decoder models, they accept natural language instructions: "Extract all line items from this invoice as JSON" or "What is the total amount in the table on page 2?"
The architecture adds an instruction-following capability on top of the visual understanding:
```
Input:  [Document Image] + [Instruction: "Extract the table as JSON"]
Output: {"headers": ["Item", "Qty", "Price"], "rows": [...]}
```
IBM Granite 4.0 3B Vision is a recent example designed specifically for enterprise document extraction. At only 3B parameters, it is small enough for on-premise deployment while achieving competitive accuracy on table extraction and key-value parsing benchmarks.
Tradeoffs:
3. Modular Document Understanding
Examples: SmolDocling, Docling, LayoutLMv3
These models decompose the problem differently from traditional OCR. Instead of going from characters to text to rules, they run layout analysis, then region classification, then per-region extraction. Unlike traditional pipelines, the layout analysis is learned end-to-end, and region extraction uses specialized models for each content type.
SmolDocling (256M parameters) is designed to be the "perception layer" in a larger pipeline. It produces a structured document representation:
The key advantage is composability: SmolDocling handles layout and region extraction, but you can plug in specialized models for specific content types (a better table extractor, a domain-specific formula parser).
Tradeoffs:
Building an Extraction Pipeline
A production extraction pipeline needs more than a model. Here is the architecture:
Document Ingestion
```
PDF/Image > Page Splitting > Resolution Normalization > Model Input
```
PDFs are not images. A PDF may contain:
The first decision is whether to render the PDF to images (treating everything as visual) or extract vector text directly and only use the VLM for non-text content. The render-everything approach is simpler and more robust; the hybrid approach is faster when most content is vector text.
For multi-page documents, process each page independently and merge results. Most VLMs operate on single pages. Cross-page tables require post-processing to detect continuation (look for repeated headers, missing top borders, or row indices that continue from the previous page).
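A minimal sketch of that continuation check, assuming each page's table has already been parsed into a list of rows of strings. It covers only two of the signals mentioned above (repeated headers and continuing row indices); the function names are hypothetical.

```python
def is_continuation(prev, nxt):
    """Guess whether table `nxt` continues table `prev` from the previous page."""
    if not prev or not nxt or len(prev[0]) != len(nxt[0]):
        return False
    if nxt[0] == prev[0]:  # header row repeated on the new page
        return True
    # Row indices continue: "3" on the last row, "4" on the next page's first row.
    last, first = prev[-1][0], nxt[0][0]
    return last.isdigit() and first.isdigit() and int(first) == int(last) + 1


def merge_page_tables(page_tables):
    """Merge a page-ordered list of tables wherever a continuation is detected."""
    merged = []
    for table in page_tables:
        if merged and is_continuation(merged[-1], table):
            # Drop the repeated header before appending the body rows.
            body = table[1:] if table[0] == merged[-1][0] else table
            merged[-1] = merged[-1] + body
        else:
            merged.append(list(table))
    return merged


page3 = [["#", "Item", "Price"], ["1", "Widget", "10.00"], ["2", "Gadget", "12.50"]]
page4 = [["#", "Item", "Price"], ["3", "Bolt", "0.40"]]
page5 = [["4", "Nut", "0.10"]]
other = [["Name", "Role"], ["Ada", "Engineer"]]
print(merge_page_tables([page3, page4, page5, other]))
```

Visual cues such as a missing top border are not available at this level and would have to come from the layout model.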
Extraction Strategy by Content Type
Different content types need different extraction approaches:
Tables: Tables are the hardest extraction target. The model must identify row/column structure, handle merged cells, distinguish headers from data, and parse cell contents. HTML output format preserves structure better than Markdown for complex tables because it supports rowspan and colspan attributes.
For high-accuracy table extraction, a two-pass approach works well:
1. First pass: detect table regions and extract as HTML
2. Second pass: validate the HTML table structure (equal column counts, no orphan cells) and re-extract failures with a different prompt or model
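The structural check in the second pass might look like the sketch below, built on the standard library's `html.parser`. It assumes a single, non-nested table with numeric `colspan`/`rowspan` values; `row_widths` and `is_rectangular` are illustrative names, not part of any extraction library.

```python
from html.parser import HTMLParser


class _CellCollector(HTMLParser):
    """Collect (colspan, rowspan) for every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self.rows[-1].append((int(a.get("colspan") or 1), int(a.get("rowspan") or 1)))


def row_widths(html):
    """Return the number of occupied columns per row after expanding spans."""
    parser = _CellCollector()
    parser.feed(html)
    widths = []
    carry = {}  # column index -> rows below still covered by a rowspan
    for cells in parser.rows:
        occupied = set(carry)  # columns filled by cells spanning down from above
        carry = {c: n - 1 for c, n in carry.items() if n > 1}
        col = 0
        for colspan, rowspan in cells:
            while col in occupied:
                col += 1
            for c in range(col, col + colspan):
                occupied.add(c)
                if rowspan > 1:
                    carry[c] = rowspan - 1
            col += colspan
        widths.append(len(occupied))
    return widths


def is_rectangular(html):
    """True when every row occupies the same number of columns."""
    widths = row_widths(html)
    return bool(widths) and len(set(widths)) == 1


good = ("<table><tr><th rowspan='2'>Item</th><th colspan='2'>Price</th></tr>"
        "<tr><td>Net</td><td>Gross</td></tr>"
        "<tr><td>Widget</td><td>10</td><td>12</td></tr></table>")
bad = "<table><tr><th>A</th><th>B</th><th>C</th></tr><tr><td>1</td><td>2</td></tr></table>"
print(is_rectangular(good), is_rectangular(bad))
```

Tables that fail the check are the ones to send back for re-extraction.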
Key-Value Pairs: Forms and invoices have fields like "Invoice Number: 12345" or "Date: 2026-05-20". These are best extracted with an instruction-following VLM prompted to output JSON:
```
Prompt: "Extract all labeled fields from this form as a JSON object. Keys should be the field labels, values should be the field contents."
Output: {"Invoice Number": "12345", "Date": "2026-05-20", "Total": "$4,287.50"}
```
Hierarchical Content: Contracts, reports, and papers have nested structure (sections, subsections, paragraphs). Markdown output captures this naturally. The model produces headings that map to the document hierarchy.
Mathematical Content: Scientific papers and engineering documents contain formulas. LaTeX output is the standard. Models like GOT-OCR handle this natively; general VLMs may need explicit prompting ("output formulas as LaTeX").
Post-Processing and Validation
VLMs can hallucinate. A model asked to extract invoice fields may invent a "Discount" field that does not exist, or misread "$4,287.50" as "$4,287.00". Post-processing is essential:
1. Schema validation: check that the output conforms to the expected structure (valid JSON, valid HTML table, required fields present)
2. Confidence scoring: some models output token-level probabilities. Low-confidence regions flag potential errors.
3. Cross-reference: if the document has redundant information (a line-item total that should equal the sum of quantities times unit prices), check arithmetic consistency
4. Human-in-the-loop: for high-stakes extraction (legal contracts, financial documents), route low-confidence extractions to human review
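Schema validation and the arithmetic cross-check can be sketched together. The field names, the `line_items` layout with `qty` and `unit_price`, and the `check_invoice` helper are all assumptions made for illustration; a real schema depends on the documents being processed.

```python
import json
from decimal import Decimal

# Hypothetical schema for this example.
REQUIRED_FIELDS = {"Invoice Number", "Date", "Total", "line_items"}


def parse_money(text):
    """Convert a string such as "$4,287.50" to a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def check_invoice(raw_output):
    """Return a list of problems found in the model's JSON output (empty if none)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - set(data))]
    if problems:
        return problems
    # Cross-reference: sum of quantity times unit price should equal the stated total.
    expected = sum(
        (Decimal(str(item["qty"])) * parse_money(item["unit_price"])
         for item in data["line_items"]),
        Decimal("0"),
    )
    if expected != parse_money(data["Total"]):
        problems.append(
            f"total mismatch: line items sum to {expected}, Total says {data['Total']}"
        )
    return problems


good = json.dumps({
    "Invoice Number": "12345",
    "Date": "2026-05-20",
    "Total": "$4,287.50",
    "line_items": [
        {"qty": 2, "unit_price": "$2,000.00"},
        {"qty": 1, "unit_price": "$287.50"},
    ],
})
print(check_invoice(good))
```

Using `Decimal` rather than floats avoids rounding noise in the comparison. Any non-empty result is a candidate for the human review queue.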
Evaluation: How to Measure Extraction Quality
You cannot improve what you do not measure. Document extraction has specific metrics:
Table Extraction
Key-Value Extraction
Full-Document Extraction
Benchmarks to Know
When to Use Which Approach
| Scenario | Approach | Why |
| --- | --- | --- |
| High-volume invoice processing | Encoder-decoder (GOT-OCR) | Speed matters, format is predictable |
| Ad-hoc extraction from varied documents | Instruction-following VLM | Flexibility to handle anything |
| Pipeline with specialized downstream models | Modular (SmolDocling) | Composability, fine-grained control |
| Scanned handwritten forms | Instruction-following VLM | Best accuracy on degraded input |
| Scientific papers to structured data | Encoder-decoder (Nougat/GOT-OCR) | Native LaTeX and table support |
| Mixed document types in same pipeline | Modular + routing | Route each page to the best model |
Mixpeek Implementation
Mixpeek ingestion pipelines handle structured extraction as part of the indexing process. When a PDF or image enters the pipeline, the extraction model decomposes it into searchable features.
Ingest: Extract Structure from Documents
```python
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_KEY")

# Ingest a batch of documents with structured extraction
mx.ingest(
    collection_id="contracts",
    source="s3://legal-docs/2026-q2/",
    extractors=[
        {
            "type": "document_structure",
            "model": "docling-project/SmolDocling-256M-preview",
            "output_features": [
                "page_layout",
                "table_html",
                "text_markdown",
            ],
        },
        {
            "type": "ocr",
            "model": "stepfun-ai/GOT-OCR-2.0-hf",
            "output_features": [
                "full_text",
                "formula_latex",
            ],
        },
        {
            "type": "text_embedding",
            "model": "BAAI/bge-m3",
            "input_field": "text_markdown",
            "output_feature": "text_embedding",
        },
    ],
)
```
This pipeline runs two extraction models in parallel (SmolDocling for layout and tables, GOT-OCR for full text and formulas), then embeds the extracted Markdown for semantic search. Each document page becomes a searchable object with structured features.
Retrieve: Search Extracted Content
```python
# Reuses the `mx` client created in the ingest example.
# `await` must run inside an async function or an async-aware REPL.
results = await mx.retrievers.retrieve(
    queries=[{
        "type": "text",
        "value": "limitation of liability not exceeding total fees paid",
    }],
    collection_ids=["contracts"],
    stages=[
        {"type": "feature_search", "feature": "text_embedding", "top_k": 50},
        {"type": "filter", "condition": {"table_html": {"$exists": True}}},
        {"type": "rerank", "model": "Qwen/Qwen3-Reranker-0.6B", "top_k": 10},
    ],
)
```
The pipeline searches over embedded text but can filter by structural features (only pages containing tables, only sections with specific heading patterns). This combines the semantic understanding of embeddings with the structural awareness of document extraction.
Common Failure Modes and Mitigations
Failure: Tables rendered as images inside PDFs
Some PDF generators rasterize tables as images. The VLM handles this fine (it sees images natively), but if your pipeline extracts vector text first, it will miss these tables entirely.
Mitigation: always check for embedded images and route them through the visual pipeline.

Failure: Multi-page tables
Most VLMs process one page at a time. A table that spans pages 3-5 produces three partial tables.
Mitigation: detect continuation tables (no header row, consistent column widths) and merge them in post-processing.

Failure: Low-resolution scans
Fax-quality scans (100-150 DPI) degrade extraction accuracy significantly.
Mitigation: super-resolution preprocessing (Real-ESRGAN or similar) before extraction, or use models trained on degraded inputs.

Failure: Handwritten annotations on printed forms
The model may confuse handwritten and printed text, or miss handwritten content entirely.
Mitigation: use a handwriting-aware model (GOT-OCR handles this) and validate that all visible text regions produced output.

Failure: Hallucinated fields
Instruction-following VLMs may generate plausible but nonexistent fields.
Mitigation: schema validation (reject fields not in the expected schema), confidence thresholding, and cross-referencing against the raw OCR output.
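As a rough illustration of the last mitigation, the sketch below rejects fields outside an allowed schema and fields whose values never appear in the raw OCR text. The `filter_fields` helper and the sample data are hypothetical, and the substring match is deliberately naive: it will also reject values the model legitimately reformatted (dates, for example).

```python
def filter_fields(extracted, allowed_keys, ocr_text):
    """Split extracted fields into accepted fields and rejected fields with a reason."""
    # Collapse whitespace and lowercase so layout differences do not block a match.
    normalized_ocr = " ".join(ocr_text.split()).lower()
    accepted, rejected = {}, {}
    for key, value in extracted.items():
        if key not in allowed_keys:
            rejected[key] = "not in schema"
        elif " ".join(str(value).split()).lower() not in normalized_ocr:
            rejected[key] = "value not found in OCR text"
        else:
            accepted[key] = value
    return accepted, rejected


extracted = {"Invoice Number": "12345", "Total": "$4,287.00", "Discount": "10%"}
allowed = {"Invoice Number", "Date", "Total"}
ocr_text = "Invoice Number: 12345\nDate: 2026-05-20\nTotal   $4,287.50"

accepted, rejected = filter_fields(extracted, allowed, ocr_text)
print(accepted)
print(rejected)
```

Rejected fields are better routed to review than silently dropped, since a mismatch can also mean the OCR text is the side that is wrong.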