Structured Extraction from Unstructured Documents: How Vision-Language Models Replace OCR Pipelines

The Problem: Documents Are Visual, But Pipelines Are Textual

Every organization has a document problem. Invoices, contracts, insurance claims, scientific papers, tax forms -- the world produces trillions of pages annually, and the information locked inside them is structured (tables, key-value pairs, hierarchies) but stored in an unstructured format (pixels on a page).

Traditional OCR pipelines treat this as a two-stage problem: first recognize characters, then apply rules to extract structure. That separation is the root of most failures. A table that spans two columns, a form field whose label is three lines above its value, a chart embedded next to a paragraph -- OCR sees characters. It does not see layout.

The shift happening now is fundamental: vision-language models (VLMs) understand the entire page as a visual object and generate structured output directly. No intermediate text layer. No template rules. The model sees the document the way a human does -- spatially -- and reasons about what it contains.

This guide covers the algorithms behind that shift, when each approach works, and how to build extraction pipelines that handle real-world documents.

How Traditional OCR Works (and Where It Breaks)

A classical document extraction pipeline has four stages:

1. Image preprocessing -- deskewing, binarization, noise removal

2. Text detection -- finding bounding boxes of text regions (EAST, CRAFT, DBNet)

3. Text recognition -- converting each region to a character string (CRNN, TrOCR)

4. Post-processing -- rule-based or ML-based extraction of fields, tables, key-value pairs

Each stage introduces compounding error. A slightly skewed scan produces bad bounding boxes, which produce garbled characters, which break downstream extraction rules. But the deeper issue is architectural: OCR pipelines discard spatial relationships between text regions. Once characters are recognized, the pipeline works with a flat text stream. The fact that "Total" appeared directly left of "$4,287.50" in a table row is lost unless you explicitly reconstruct it from coordinates.
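To make the coordinate-reconstruction step concrete, here is a minimal sketch of how a classical pipeline might regroup recognized words into rows. The `Word` type, its fields, and the `tolerance` parameter are assumptions made for illustration; real OCR engines expose bounding boxes in their own formats.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float       # left edge of the bounding box
    y: float       # vertical center of the bounding box
    height: float  # box height, used to scale the "same row" tolerance


def group_into_rows(words: list[Word], tolerance: float = 0.5) -> list[list[str]]:
    """Cluster words with similar vertical centers, then order each row left to right."""
    rows: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Join the current row if this word sits close enough to the row's last word.
        if rows and abs(word.y - rows[-1][-1].y) <= tolerance * word.height:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


# Words arrive in arbitrary order, as they would from a flat OCR stream.
words = [
    Word("$4,287.50", x=420, y=701, height=12),
    Word("Total", x=300, y=700, height=12),
    Word("Subtotal", x=300, y=680, height=12),
    Word("$3,950.00", x=420, y=681, height=12),
]
rows = group_into_rows(words)
```

Even this small heuristic needs a tuned tolerance and breaks on skewed scans or multi-line cells, which is part of why rule-based reconstruction is fragile.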

This works for simple documents with fixed templates (utility bills, standard invoices). It breaks on:

Complex tables: merged cells, nested headers, spanning rows, tables without visible borders

Mixed content: documents interleaving text, tables, charts, diagrams, and images

Variable layouts: the same document type from different issuers with different formatting

Handwriting: OCR accuracy can drop from 95%+ on clean printed text to 60-80% on handwritten annotations

Multilingual content: documents mixing scripts (Latin + CJK + Arabic) in the same page

The industry response was to add more stages -- table detection models, layout analysis models, handwriting recognizers -- each with its own training data, failure modes, and maintenance burden. A production document extraction pipeline in 2024 might chain 6-8 separate models.

The VLM Alternative: See the Page, Speak the Structure

Vision-language models collapse that entire pipeline into a single model: a page image goes in, and structured text comes out, generated token by token. The architecture has two components:

Visual Encoder

A vision transformer (ViT) processes the document image. Unlike OCR's character-level detection, the ViT sees the full page at once. Its self-attention mechanism naturally captures spatial relationships: it knows that a column header is above a data cell, that a footnote is at the bottom of the page, that a checkbox is next to a label.

Most document VLMs use high-resolution encoders (1024x1024 or higher) or tile-based approaches that split large documents into overlapping patches, process each patch, and fuse the representations. This preserves fine details like small text and thin table borders.
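The tiling idea can be sketched in a few lines. The function below only computes tile coordinates; the 1024-pixel tile size and 128-pixel overlap are illustrative assumptions, not the settings of any particular model.

```python
def tile_boxes(
    width: int, height: int, tile: int = 1024, overlap: int = 128
) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes that cover the page with overlapping tiles."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    stride = tile - overlap

    def starts(extent: int) -> list[int]:
        # A page smaller than one tile needs a single tile starting at 0.
        if extent <= tile:
            return [0]
        positions = list(range(0, extent - tile, stride))
        positions.append(extent - tile)  # last tile sits flush with the far edge
        return positions

    return [
        (left, top, min(left + tile, width), min(top + tile, height))
        for top in starts(height)
        for left in starts(width)
    ]


# An A4 page rendered at 300 DPI is 2480 x 3508 pixels.
boxes = tile_boxes(2480, 3508)
```

Each box can then be cropped from the rendered page and encoded separately; the overlap keeps a line of text or a table border that falls on a tile boundary fully visible in at least one tile.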

Language Decoder

A transformer decoder generates structured output conditioned on the visual representation. The output format varies:

Markdown for general documents (headings, paragraphs, tables, lists)

HTML for tables (preserving rowspan/colspan)

JSON for key-value extraction

LaTeX for mathematical content

The key insight is that the model learns the mapping from visual layout to structured output end-to-end, from millions of document-output pairs. It does not need explicit rules for "if the bold text is above the table, it is the table title." It learns these patterns implicitly.

Why This Works Better

The information flow is fundamentally different:

```
Traditional OCR: Image > Text Detection > Text Recognition > Flat Text > Rules > Structure

VLM:             Image > Visual Features > Structured Output (directly)
```

The VLM never creates an intermediate flat-text representation. Structure is preserved from pixels to output because the visual encoder maintains spatial relationships and the decoder was trained to produce structured formats.

Three Architectures for Document Extraction

Not all VLM-based extractors work the same way. The field has converged on three distinct approaches, each with different tradeoffs.

1. Encoder-Decoder with Structured Output Training

Examples: GOT-OCR 2.0, Nougat, Donut

These models are trained specifically for document-to-structure conversion. The training data pairs document images with their structured representations (Markdown, LaTeX, HTML). The model learns to "read" a document and produce the corresponding structured text.

GOT-OCR 2.0 (General OCR Theory) is notable because it handles OCR, table recognition, formula recognition, and document layout in a single model. It uses a high-resolution visual encoder paired with a Qwen language decoder, trained on a mixture of:

Scene text (signs, labels)

Document text (printed, handwritten)

Mathematical formulas (LaTeX output)

Tables (HTML output)

Sheet music, molecular diagrams, geometric shapes

The model switches output format based on what it sees in the image, without explicit instructions.

Tradeoffs:

Fast inference (one compact model, one call per page)

Limited to output formats seen in training

Cannot follow arbitrary extraction instructions

Best for: high-throughput document digitization

2. Instruction-Following Document VLMs

Examples: InternVL, Qwen-VL, Granite Vision

These are general-purpose VLMs fine-tuned on document understanding tasks. Unlike encoder-decoder models, they accept natural language instructions: "Extract all line items from this invoice as JSON" or "What is the total amount in the table on page 2?"

The architecture adds an instruction-following capability on top of the visual understanding:

```
Input:  [Document Image] + [Instruction: "Extract the table as JSON"]
Output: {"headers": ["Item", "Qty", "Price"], "rows": [...]}
```

IBM Granite 4.0 3B Vision is a recent example designed specifically for enterprise document extraction. At only 3B parameters it is small enough for on-premise deployment while achieving competitive accuracy on table extraction and key-value parsing benchmarks.

Tradeoffs:

Flexible (any extraction task expressible in natural language)

Slower (larger models, instruction parsing overhead)

May hallucinate fields not present in the document

Best for: variable document types, ad-hoc extraction tasks

3. Modular Document Understanding

Examples: SmolDocling, Docling, LayoutLMv3

These models decompose the problem differently from traditional OCR. Instead of character-to-text-to-rules, they do layout analysis, region classification, then per-region extraction. But unlike traditional pipelines, the layout analysis is learned end-to-end and region extraction uses specialized models for each content type.

SmolDocling (256M parameters) is designed to be the "perception layer" in a larger pipeline. It produces a structured document representation:

Identifies content types: text, table, figure, formula, code, list, heading

Outputs bounding boxes and content type labels

Generates structured output for each region (Markdown for text, HTML for tables, LaTeX for formulas)

The key advantage is composability: SmolDocling handles layout and region extraction, but you can plug in specialized models for specific content types (a better table extractor, a domain-specific formula parser).

Tradeoffs:

Small and fast (256M vs. 3-8B for full VLMs)

Composable with other models

Requires orchestration logic

Best for: pipeline architectures where you need fine-grained control

Building an Extraction Pipeline

A production extraction pipeline needs more than a model. Here is the architecture:

Document Ingestion

```
PDF/Image > Page Splitting > Resolution Normalization > Model Input
```

PDFs are not images. A PDF may contain:

Vector text (selectable, already structured)

Rasterized text (scanned, needs visual processing)

Embedded images (charts, logos, photographs)

Mixed content (vector text around a scanned table)

The first decision is whether to render the PDF to images (treating everything as visual) or extract vector text directly and only use the VLM for non-text content. The render-everything approach is simpler and more robust; the hybrid approach is faster when most content is vector text.

For multi-page documents, process each page independently and merge results. Most VLMs operate on single pages. Cross-page tables require post-processing to detect continuation (look for repeated headers, missing top borders, or row indices that continue from the previous page).
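One way to sketch the continuation check is shown below. The `PageTable` type is hypothetical, and the heuristic uses only two of the signals mentioned above (a repeated or missing header, plus a matching column count); a production version would likely add border and row-index checks.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageTable:
    page: int                      # page where this fragment (or merged table) ends
    header: Optional[list[str]]    # None when the fragment has no header row
    rows: list[list[str]] = field(default_factory=list)


def is_continuation(prev: PageTable, nxt: PageTable) -> bool:
    """Heuristic: consecutive pages, same column count, header repeated or absent."""
    if nxt.page != prev.page + 1 or not prev.rows or not nxt.rows:
        return False
    if len(prev.rows[0]) != len(nxt.rows[0]):
        return False
    return nxt.header is None or nxt.header == prev.header


def merge_tables(tables: list[PageTable]) -> list[PageTable]:
    merged: list[PageTable] = []
    for table in sorted(tables, key=lambda t: t.page):
        if merged and is_continuation(merged[-1], table):
            prev = merged[-1]
            # Keep the first header, append the new rows, and advance the end page
            # so a table spanning three or more pages keeps chaining.
            merged[-1] = PageTable(table.page, prev.header, prev.rows + table.rows)
        else:
            merged.append(table)
    return merged


tables = [
    PageTable(3, ["Item", "Qty"], [["A", "1"]]),
    PageTable(4, None, [["B", "2"]]),
    PageTable(5, ["Item", "Qty"], [["C", "3"]]),
    PageTable(7, ["Name", "Role", "Team"], [["X", "Y", "Z"]]),
]
merged = merge_tables(tables)
```

Here the fragments on pages 3 through 5 collapse into one table, while the unrelated table on page 7 stays separate.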

Extraction Strategy by Content Type

Different content types need different extraction approaches:

Tables: Tables are the hardest extraction target. The model must identify row/column structure, handle merged cells, distinguish headers from data, and parse cell contents. HTML output format preserves structure better than Markdown for complex tables because it supports rowspan and colspan attributes.

For high-accuracy table extraction, a two-pass approach works well:

1. First pass: detect table regions and extract as HTML

2. Second pass: validate the HTML table structure (equal column counts, no orphan cells) and re-extract failures with a different prompt or model
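The structural check in the second pass can be approximated with the standard library alone. This is a sketch under simplifying assumptions: it expects a single, non-nested table with numeric `colspan`/`rowspan` values, and the helper names are made up for the example.

```python
from html.parser import HTMLParser


class _CellCollector(HTMLParser):
    """Record (colspan, rowspan) for every cell, grouped by table row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[tuple[int, int]]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self.rows[-1].append((int(a.get("colspan") or 1), int(a.get("rowspan") or 1)))


def row_widths(html: str) -> list[int]:
    """Effective column count of each row, counting cells that span down from above."""
    parser = _CellCollector()
    parser.feed(html)
    carry: list[int] = []  # carry[i]: columns already occupied i rows below the current one
    widths: list[int] = []
    for cells in parser.rows:
        width = carry.pop(0) if carry else 0
        for colspan, rowspan in cells:
            width += colspan
            for i in range(rowspan - 1):
                if i >= len(carry):
                    carry.append(0)
                carry[i] += colspan
        widths.append(width)
    return widths


def is_rectangular(html: str) -> bool:
    widths = row_widths(html)
    return bool(widths) and len(set(widths)) == 1


good = "<table><tr><td rowspan='2'>A</td><td>B</td></tr><tr><td>C</td></tr></table>"
bad = "<table><tr><td>A</td><td>B</td></tr><tr><td>C</td></tr></table>"
```

Tables that fail `is_rectangular` are the ones worth sending back for re-extraction.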

Key-Value Pairs: Forms and invoices have fields like "Invoice Number: 12345" or "Date: 2026-05-20". These are best extracted with an instruction-following VLM prompted to output JSON:

```
Prompt: "Extract all labeled fields from this form as a JSON object. Keys should be the field labels, values should be the field contents."
Output: {"Invoice Number": "12345", "Date": "2026-05-20", "Total": "$4,287.50"}
```

Hierarchical Content: Contracts, reports, and papers have nested structure (sections, subsections, paragraphs). Markdown output captures this naturally. The model produces headings that map to the document hierarchy.

Mathematical Content: Scientific papers and engineering documents contain formulas. LaTeX output is the standard. Models like GOT-OCR handle this natively; general VLMs may need explicit prompting ("output formulas as LaTeX").

Post-Processing and Validation

VLMs can hallucinate. A model asked to extract invoice fields may invent a "Discount" field that does not exist, or misread "$4,287.50" as "$4,287.00". Post-processing is essential:

1. Schema validation: check that the output conforms to the expected structure (valid JSON, valid HTML table, required fields present)

2. Confidence scoring: some models output token-level probabilities. Low-confidence regions flag potential errors.

3. Cross-reference: if the document has redundant information (a line-item total that should equal the sum of quantities times unit prices), check arithmetic consistency

4. Human-in-the-loop: for high-stakes extraction (legal contracts, financial documents), route low-confidence extractions to human review
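Schema validation and the arithmetic cross-check can be combined in one small function. The sketch below assumes an invoice schema invented for this example: the `Line Items` key and its `qty` and `unit_price` fields are hypothetical, as are the required and allowed field sets.

```python
import json
from decimal import Decimal

# Hypothetical schema for the example invoice.
REQUIRED = {"Invoice Number", "Date", "Total"}
ALLOWED = REQUIRED | {"Line Items"}


def parse_money(value: str) -> Decimal:
    return Decimal(value.replace("$", "").replace(",", ""))


def validate_invoice(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the extraction passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = [f"missing field: {k}" for k in sorted(REQUIRED - set(data))]
    problems += [f"unexpected field: {k}" for k in sorted(set(data) - ALLOWED)]

    items = data.get("Line Items") or []
    if items and "Total" in data:
        try:
            computed = sum(
                Decimal(str(item["qty"])) * parse_money(item["unit_price"]) for item in items
            )
            if computed != parse_money(data["Total"]):
                problems.append(f"total mismatch: line items sum to {computed}")
        except (KeyError, TypeError, AttributeError, ArithmeticError):
            problems.append("line items could not be checked")
    return problems


ok = (
    '{"Invoice Number": "12345", "Date": "2026-05-20", "Total": "$4,287.50", '
    '"Line Items": [{"qty": 2, "unit_price": "$2,143.75"}]}'
)
suspect = '{"Invoice Number": "12345", "Total": "$4,287.50", "Discount": "$10.00"}'
```

`Decimal` is used instead of `float` so that currency comparisons are exact. Any non-empty result is a natural trigger for re-extraction or human review.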

Evaluation: How to Measure Extraction Quality

You cannot improve what you do not measure. Document extraction has specific metrics:

Table Extraction

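TEDS (Tree-Edit-Distance-based Similarity): measures structural similarity between predicted and ground-truth HTML tables, treating each table as a tree. It is computed as 1 - EditDist(T_pred, T_true) / max(|T_pred|, |T_true|), where |T| is the number of nodes in the tree. Accounts for both structure (rows, columns, spans) and content. Score ranges from 0 to 1, with 1 meaning an exact match.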

Cell-level F1: precision and recall on individual cell contents, ignoring structure
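Cell-level F1 is simple enough to compute directly. The sketch below treats each table as a bag of cell strings, which is one reasonable reading of "ignoring structure"; other implementations may normalize text differently.

```python
from collections import Counter


def cell_f1(predicted: list[str], truth: list[str]) -> float:
    """F1 over cell contents, compared as multisets so duplicates count separately."""
    pred = Counter(cell.strip() for cell in predicted)
    gold = Counter(cell.strip() for cell in truth)
    overlap = sum((pred & gold).values())  # cells present in both, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


predicted = ["Item", "Qty", "Price", "Widget", "2", "$5.00"]
truth = ["Item", "Qty", "Price", "Widget", "2", "$5.50"]
score = cell_f1(predicted, truth)
```

With five of six cells matching on both sides, precision and recall are both 5/6, so the score is about 0.83. Because cell order is ignored, a table with correct contents in the wrong cells would still score 1.0, which is why this metric is usually reported alongside TEDS.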

Key-Value Extraction

Field-level accuracy: percentage of fields where extracted value exactly matches ground truth

Fuzzy match: Levenshtein distance for fields where minor character errors are acceptable
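Both key-value metrics fit in a short sketch. The 0.8 similarity threshold is an arbitrary assumption for the example and should be tuned per field type.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def field_scores(pred: dict[str, str], truth: dict[str, str], threshold: float = 0.8):
    """Return (exact-match accuracy, fuzzy-match accuracy) over ground-truth fields."""
    if not truth:
        raise ValueError("ground truth must contain at least one field")
    exact = sum(pred.get(key) == value for key, value in truth.items())
    fuzzy = sum(similarity(pred.get(key, ""), value) >= threshold for key, value in truth.items())
    return exact / len(truth), fuzzy / len(truth)


truth = {"Invoice Number": "12345", "Date": "2026-05-20", "Total": "$4,287.50"}
pred = {"Invoice Number": "12345", "Date": "2026-05-20", "Total": "$4,287.00"}
scores = field_scores(pred, truth)
```

In this example the misread total fails the exact match but passes the fuzzy match, since only one of nine characters differs. That is a reminder to reserve fuzzy matching for free-text fields such as names and addresses, and to require exact matches for amounts, dates, and identifiers.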

Full-Document Extraction

Markdown similarity: BLEU or ROUGE between predicted and ground-truth Markdown

Layout-aware metrics: penalize structural errors (wrong heading level, misplaced table) more than content errors

Benchmarks to Know

DocVQA: question answering on documents (50,000 questions across 12,767 document images)

TableBank: 417K table images for table detection and recognition

FUNSD: 199 scanned forms for key-value extraction

PubTabNet: 568K table images from scientific papers

When to Use Which Approach

Scenario	Approach	Why

High-volume invoice processing	Encoder-decoder (GOT-OCR)	Speed matters, format is predictable
Ad-hoc extraction from varied documents	Instruction-following VLM	Flexibility to handle anything
Pipeline with specialized downstream models	Modular (SmolDocling)	Composability, fine-grained control
Scanned handwritten forms	Instruction-following VLM	Best accuracy on degraded input
Scientific papers to structured data	Encoder-decoder (Nougat/GOT-OCR)	Native LaTeX and table support
Mixed document types in same pipeline	Modular + routing	Route each page to the best model

Mixpeek Implementation

Mixpeek ingestion pipelines handle structured extraction as part of the indexing process. When a PDF or image enters the pipeline, the extraction model decomposes it into searchable features.

Ingest: Extract Structure from Documents

```python
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_KEY")

# Ingest a batch of documents with structured extraction
mx.ingest(
    collection_id="contracts",
    source="s3://legal-docs/2026-q2/",
    extractors=[
        {
            # Layout analysis plus per-region structured output
            "type": "document_structure",
            "model": "docling-project/SmolDocling-256M-preview",
            "output_features": [
                "page_layout",
                "table_html",
                "text_markdown",
            ],
        },
        {
            # Full-text OCR and formula recognition
            "type": "ocr",
            "model": "stepfun-ai/GOT-OCR-2.0-hf",
            "output_features": [
                "full_text",
                "formula_latex",
            ],
        },
        {
            # Embed the extracted Markdown for semantic search
            "type": "text_embedding",
            "model": "BAAI/bge-m3",
            "input_field": "text_markdown",
            "output_feature": "text_embedding",
        },
    ],
)
```

This pipeline runs two extraction models in parallel (SmolDocling for layout and tables, GOT-OCR for full text and formulas), then embeds the extracted Markdown for semantic search. Each document page becomes a searchable object with structured features.

Retrieve: Search Extracted Content

```python
# `await` must run inside an async function (or a notebook/REPL with top-level await)
results = await mx.retrievers.execute(
    retriever_id="your-retriever-id",
    query="limitation of liability not exceeding total fees paid",
)
```

The pipeline searches over embedded text but can filter by structural features (only pages containing tables, only sections with specific heading patterns). This combines the semantic understanding of embeddings with the structural awareness of document extraction.

Common Failure Modes and Mitigations

Failure: Tables rendered as images inside PDFs

Some PDF generators rasterize tables as images. The VLM handles this fine (it sees images natively), but if your pipeline extracts vector text first, it will miss these tables entirely. Mitigation: always check for embedded images and route them through the visual pipeline.

Failure: Multi-page tables

Most VLMs process one page at a time. A table that spans pages 3-5 produces three partial tables. Mitigation: detect continuation tables (no header row, consistent column widths) and merge them in post-processing.

Failure: Low-resolution scans

Fax-quality scans (100-150 DPI) degrade extraction accuracy significantly. Mitigation: super-resolution preprocessing (Real-ESRGAN or similar) before extraction, or use models trained on degraded inputs.

Failure: Handwritten annotations on printed forms

The model may confuse handwritten and printed text, or miss handwritten content entirely. Mitigation: use a handwriting-aware model (GOT-OCR handles this) and validate that all visible text regions produced output.

Failure: Hallucinated fields

Instruction-following VLMs may generate plausible but nonexistent fields. Mitigation: schema validation (reject fields not in the expected schema), confidence thresholding, and cross-referencing against the raw OCR output.
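The cross-referencing mitigation for hallucinated fields can be sketched as a simple grounding check: flag any extracted value that does not appear in the raw OCR text of the same page. The function name and the whitespace and case normalization are assumptions for illustration; values the model legitimately reformats (dates, for instance) would need field-specific normalization.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences do not cause false alarms."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_fields(extracted: dict[str, str], ocr_text: str) -> list[str]:
    """Return the keys whose values cannot be found anywhere in the raw OCR text."""
    haystack = _normalize(ocr_text)
    return [key for key, value in extracted.items() if _normalize(str(value)) not in haystack]


ocr_text = "Invoice Number: 12345\nDate: 2026-05-20\nTotal    $4,287.50"
extracted = {
    "Invoice Number": "12345",
    "Date": "2026-05-20",
    "Total": "$4,287.50",
    "Discount": "$100.00",  # not present on the page
}
flagged = unsupported_fields(extracted, ocr_text)
```

Flagged fields can be dropped or routed to review. The check is deliberately conservative: it catches invented values but not a real value attached to the wrong label.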

Related Guides

Visual Document Retrieval -- searching documents by visual similarity without OCR

Open-Vocabulary Object Detection -- detecting arbitrary objects in document images

Multi-Stage Retrieval -- combining structural and semantic search

Cross-Encoder Reranking -- precision reranking for document search results

Models -- browse OCR, document structure, and embedding models for document pipelines