    Document Understanding
    20 min read
    Updated 2026-05-20

    Structured Extraction from Unstructured Documents: How Vision-Language Models Replace OCR Pipelines

    A technical guide to extracting structured data from PDFs, invoices, forms, and tables using vision-language models. Covers how VLMs understand document layout natively, the architectures behind models like GOT-OCR, SmolDocling, and Granite Vision, and how to build extraction pipelines that handle mixed-content documents without templates.

    Document AI
    OCR
    VLM
    Structured Extraction
    Tables
    Forms

    The Problem: Documents Are Visual, But Pipelines Are Textual



    Every organization has a document problem. Invoices, contracts, insurance claims, scientific papers, tax forms -- the world produces trillions of pages annually, and the information locked inside them is structured (tables, key-value pairs, hierarchies) but stored in an unstructured format (pixels on a page).

    Traditional OCR pipelines treat this as a two-stage problem: first recognize characters, then apply rules to extract structure. That separation is the root of most failures. A table that spans two columns, a form field whose label is three lines above its value, a chart embedded next to a paragraph -- OCR sees characters. It does not see layout.

    The shift happening now is fundamental: vision-language models (VLMs) understand the entire page as a visual object and generate structured output directly. No intermediate text layer. No template rules. The model sees the document the way a human does -- spatially -- and reasons about what it contains.

    This guide covers the algorithms behind that shift, when each approach works, and how to build extraction pipelines that handle real-world documents.

    How Traditional OCR Works (and Where It Breaks)



    A classical document extraction pipeline has four stages:

  1. Image preprocessing -- deskewing, binarization, noise removal
  2. Text detection -- finding bounding boxes of text regions (EAST, CRAFT, DBNet)
  3. Text recognition -- converting each region to a character string (CRNN, TrOCR)
  4. Post-processing -- rule-based or ML-based extraction of fields, tables, key-value pairs

    Each stage introduces compounding error. A slightly skewed scan produces bad bounding boxes, which produce garbled characters, which break downstream extraction rules. But the deeper issue is architectural: OCR pipelines discard spatial relationships between text regions. Once characters are recognized, the pipeline works with a flat text stream. The fact that "Total" appeared directly left of "$4,287.50" in a table row is lost unless you explicitly reconstruct it from coordinates.

    This works for simple documents with fixed templates (utility bills, standard invoices). It breaks on:

  1. Complex tables: merged cells, nested headers, spanning rows, tables without visible borders
  2. Mixed content: documents interleaving text, tables, charts, diagrams, and images
  3. Variable layouts: the same document type from different issuers with different formatting
  4. Handwriting: recognition accuracy that exceeds 95% on clean print can fall to roughly 60-80% on handwritten annotations
  5. Multilingual content: documents mixing scripts (Latin + CJK + Arabic) in the same page


    The industry response was to add more stages -- table detection models, layout analysis models, handwriting recognizers -- each with its own training data, failure modes, and maintenance burden. A production document extraction pipeline in 2024 might chain 6-8 separate models.
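    To make the lost-layout problem concrete, here is a minimal sketch of what "reconstructing it from coordinates" involves: finding the value printed to the right of a label on the same row. The `Word` record, the `value_right_of` helper, and the `y_tolerance` threshold are illustrative assumptions, not the output format of any particular OCR engine.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def value_right_of(words, label, y_tolerance=0.5):
    """Return the nearest word to the right of `label` on the same row, or None."""
    anchor = next((w for w in words if w.text == label), None)
    if anchor is None:
        return None
    anchor_mid = (anchor.y0 + anchor.y1) / 2
    anchor_height = anchor.y1 - anchor.y0
    # A word is on the same row if its vertical midpoint is close to the label's.
    same_row = [
        w for w in words
        if w is not anchor
        and w.x0 >= anchor.x1
        and abs((w.y0 + w.y1) / 2 - anchor_mid) <= y_tolerance * anchor_height
    ]
    if not same_row:
        return None
    return min(same_row, key=lambda w: w.x0).text


words = [
    Word("Subtotal", 50, 400, 120, 412),
    Word("$3,900.00", 300, 400, 370, 412),
    Word("Total", 50, 420, 95, 432),
    Word("$4,287.50", 300, 421, 370, 433),
]
print(value_right_of(words, "Total"))
```

    Even this small heuristic needs a tolerance that depends on font size and scan skew, which is why hand-written layout rules tend to be brittle across issuers.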

    The VLM Alternative: See the Page, Speak the Structure



    Vision-language models collapse that entire pipeline into a single forward pass. The architecture has two components:

    Visual Encoder



    A vision transformer (ViT) processes the document image. Unlike OCR's character-level detection, the ViT sees the full page at once. Its self-attention mechanism naturally captures spatial relationships: it knows that a column header is above a data cell, that a footnote is at the bottom of the page, that a checkbox is next to a label.

    Most document VLMs use high-resolution encoders (1024x1024 or higher) or tile-based approaches that split large documents into overlapping patches, process each patch, and fuse the representations. This preserves fine details like small text and thin table borders.
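    The tiling step can be sketched as plain box arithmetic. This is a minimal illustration, assuming square tiles of 1024 pixels and a 128-pixel overlap; `tile_boxes` is a hypothetical helper, and real models choose their own tile sizes and fusion strategies.

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, top, right, bottom) boxes that cover the page with overlap."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    stride = tile - overlap

    def starts(length):
        # A page that fits in one tile needs a single start position.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)  # last tile sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# A US Letter page rendered at 300 DPI is 2550 x 3300 pixels.
boxes = tile_boxes(2550, 3300)
print(len(boxes), boxes[0], boxes[-1])
```

    Placing the last tile flush with the page edge avoids a thin partial tile, at the cost of a slightly larger overlap on the final row and column.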

    Language Decoder



    A transformer decoder generates structured output conditioned on the visual representation. The output format varies:
  1. Markdown for general documents (headings, paragraphs, tables, lists)
  2. HTML for tables (preserving rowspan/colspan)
  3. JSON for key-value extraction
  4. LaTeX for mathematical content

    The key insight is that the model learns the mapping from visual layout to structured output end-to-end, from millions of document-output pairs. It does not need explicit rules for "if the bold text is above the table, it is the table title." It learns these patterns implicitly.

    Why This Works Better



    The information flow is fundamentally different:

```
Traditional OCR: Image > Text Detection > Text Recognition > Flat Text > Rules > Structure

VLM: Image > Visual Features > Structured Output (directly)
```

    The VLM never creates an intermediate flat-text representation. Structure is preserved from pixels to output because the visual encoder maintains spatial relationships and the decoder was trained to produce structured formats.

    Three Architectures for Document Extraction



    Not all VLM-based extractors work the same way. The field has converged on three distinct approaches, each with different tradeoffs.

    1. Encoder-Decoder with Structured Output Training



    Examples: GOT-OCR 2.0, Nougat, Donut

    These models are trained specifically for document-to-structure conversion. The training data pairs document images with their structured representations (Markdown, LaTeX, HTML). The model learns to "read" a document and produce the corresponding structured text.

    GOT-OCR 2.0 (General OCR Theory) is notable because it handles OCR, table recognition, formula recognition, and document layout in a single model. It uses a high-resolution visual encoder paired with a Qwen language decoder, trained on a mixture of:
  1. Scene text (signs, labels)
  2. Document text (printed, handwritten)
  3. Mathematical formulas (LaTeX output)
  4. Tables (HTML output)
  5. Sheet music, molecular diagrams, geometric shapes


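    In its formatted-output mode the model picks the notation that fits the content it sees in the image (for example, LaTeX for formulas). Plain versus formatted output is selected with a task prompt or flag, not with free-form instructions.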

    Tradeoffs:
  1. Fast inference (single pass)
  2. Limited to output formats seen in training
  3. Cannot follow arbitrary extraction instructions
  4. Best for: high-throughput document digitization

    2. Instruction-Following Document VLMs



    Examples: InternVL, Qwen-VL, Granite Vision

    These are general-purpose VLMs fine-tuned on document understanding tasks. Unlike encoder-decoder models, they accept natural language instructions: "Extract all line items from this invoice as JSON" or "What is the total amount in the table on page 2?"

    The architecture adds an instruction-following capability on top of the visual understanding:

```
Input:  [Document Image] + [Instruction: "Extract the table as JSON"]
Output: {"headers": ["Item", "Qty", "Price"], "rows": [...]}
```

    IBM Granite 4.0 3B Vision is a recent example designed specifically for enterprise document extraction. At only 3B parameters it is small enough for on-premise deployment while achieving competitive accuracy on table extraction and key-value parsing benchmarks.

    Tradeoffs:
  1. Flexible (any extraction task expressible in natural language)
  2. Slower (larger models, instruction parsing overhead)
  3. May hallucinate fields not present in the document
  4. Best for: variable document types, ad-hoc extraction tasks

    3. Modular Document Understanding



    Examples: SmolDocling, Docling, LayoutLMv3

    These models decompose the problem differently from traditional OCR. Instead of going from characters to text to rules, they run layout analysis, then region classification, then per-region extraction. Unlike traditional pipelines, the layout analysis is learned end-to-end and region extraction uses specialized models for each content type.

    SmolDocling (256M parameters) is designed to be the "perception layer" in a larger pipeline. It produces a structured document representation:
  1. Identifies content types: text, table, figure, formula, code, list, heading
  2. Outputs bounding boxes and content type labels
  3. Generates structured output for each region (Markdown for text, HTML for tables, LaTeX for formulas)

    The key advantage is composability: SmolDocling handles layout and region extraction, but you can plug in specialized models for specific content types (a better table extractor, a domain-specific formula parser).

    Tradeoffs:
  1. Small and fast (256M vs. 3-8B for full VLMs)
  2. Composable with other models
  3. Requires orchestration logic
  4. Best for: pipeline architectures where you need fine-grained control
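    The orchestration logic a modular pipeline requires can be quite small. The sketch below assumes the perception layer returns regions as dictionaries with `type`, `bbox`, and `raw` keys; that shape, the `extract_regions` function, and the stand-in handlers are hypothetical and only illustrate the dispatch pattern.

```python
def extract_regions(regions, handlers, default_handler):
    """Send each detected region to the handler registered for its content type."""
    results = []
    for region in regions:
        # Fall back to the default when no specialized model is registered.
        handler = handlers.get(region["type"], default_handler)
        results.append({
            "type": region["type"],
            "bbox": region["bbox"],
            "content": handler(region),
        })
    return results


# Stand-in regions; a real pipeline would get these from the layout model.
regions = [
    {"type": "heading", "bbox": (40, 30, 500, 60), "raw": "Payment Terms"},
    {"type": "table", "bbox": (40, 80, 500, 300),
     "raw": "<table><tr><td>Net 30</td></tr></table>"},
]

# Stand-in handler; a real one would call a specialized table extractor.
handlers = {"table": lambda region: {"html": region["raw"]}}

extracted = extract_regions(regions, handlers, default_handler=lambda r: r["raw"])
print(extracted)
```

    Keeping handlers in a dictionary means a better table extractor can be swapped in by changing one entry, without touching the layout stage.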


    Building an Extraction Pipeline



    A production extraction pipeline needs more than a model. Here is the architecture:

    Document Ingestion



```
PDF/Image > Page Splitting > Resolution Normalization > Model Input
```

    PDFs are not images. A PDF may contain:
  1. Vector text (selectable, already structured)
  2. Rasterized text (scanned, needs visual processing)
  3. Embedded images (charts, logos, photographs)
  4. Mixed content (vector text around a scanned table)

    The first decision is whether to render the PDF to images (treating everything as visual) or extract vector text directly and only use the VLM for non-text content. The render-everything approach is simpler and more robust; the hybrid approach is faster when most content is vector text.
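    One way to make that decision per page is a simple heuristic over two statistics most PDF libraries can provide. In this sketch, `vector_chars`, `image_area_ratio`, the function name, and both thresholds are assumptions chosen for illustration, not tuned values.

```python
def choose_extraction_path(vector_chars, image_area_ratio,
                           min_chars=200, max_image_ratio=0.2):
    """Pick a processing path for one PDF page.

    vector_chars: number of selectable characters in the page's text layer.
    image_area_ratio: fraction of the page area covered by embedded images (0-1).
    """
    if vector_chars < min_chars:
        return "render"   # scanned page or empty text layer: treat it as an image
    if image_area_ratio > max_image_ratio:
        return "hybrid"   # keep the vector text, send embedded images to the VLM
    return "vector"       # the text layer alone is enough


print(choose_extraction_path(vector_chars=0, image_area_ratio=1.0))
print(choose_extraction_path(vector_chars=3000, image_area_ratio=0.45))
print(choose_extraction_path(vector_chars=3000, image_area_ratio=0.02))
```

    When in doubt, defaulting to "render" trades speed for robustness, which matches the simpler render-everything approach.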

    For multi-page documents, process each page independently and merge results. Most VLMs operate on single pages. Cross-page tables require post-processing to detect continuation (look for repeated headers, missing top borders, or row indices that continue from the previous page).
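    A minimal sketch of the continuation check, using only the repeated-header and column-count signals. It assumes one table per page, supplied in page order, each as a dictionary with a `header` (a list, or `None` when the page fragment has no header row) and `rows`; these names are hypothetical. Two unrelated tables with identical headers on consecutive pages would be merged by this heuristic.

```python
def column_count(table):
    """Number of columns, taken from the header or else the first row."""
    if table["header"]:
        return len(table["header"])
    return len(table["rows"][0]) if table["rows"] else 0


def is_continuation(previous, table):
    """True if `table` looks like the next page of `previous`."""
    same_width = column_count(table) == column_count(previous)
    header = table["header"]
    return same_width and (header is None or header == previous["header"])


def merge_continued_tables(page_tables):
    """Merge table fragments that continue across consecutive pages."""
    merged = []
    for table in page_tables:
        rows = [list(row) for row in table["rows"]]
        if merged and is_continuation(merged[-1], table):
            merged[-1]["rows"].extend(rows)  # repeated header is dropped
        else:
            merged.append({"header": table["header"], "rows": rows})
    return merged


pages = [
    {"header": ["Item", "Qty", "Price"], "rows": [["A", "1", "10"]]},
    {"header": None, "rows": [["B", "2", "20"]]},
    {"header": ["Item", "Qty", "Price"], "rows": [["C", "3", "30"]]},
    {"header": ["Name", "Role"], "rows": [["X", "Y"]]},
]
tables = merge_continued_tables(pages)
print(len(tables), len(tables[0]["rows"]))
```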

    Extraction Strategy by Content Type



    Different content types need different extraction approaches:

    Tables: Tables are the hardest extraction target. The model must identify row/column structure, handle merged cells, distinguish headers from data, and parse cell contents. HTML output format preserves structure better than Markdown for complex tables because it supports rowspan and colspan attributes.

    For high-accuracy table extraction, a two-pass approach works well:

  1. First pass: detect table regions and extract as HTML
  2. Second pass: validate the HTML table structure (equal column counts, no orphan cells) and re-extract failures with a different prompt or model
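    The structural check in the second pass can be done with the standard library. This sketch counts the effective width of every row, accounting for `colspan` and `rowspan`, and treats a table as valid when all rows agree. It assumes a single, non-nested table per HTML string; the `row_widths` and `is_rectangular` names are illustrative.

```python
from html.parser import HTMLParser


class _TableGrid(HTMLParser):
    """Collect (colspan, rowspan) for every cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self.rows[-1].append(
                (int(a.get("colspan") or 1), int(a.get("rowspan") or 1))
            )


def row_widths(html):
    """Return the number of grid columns each row occupies."""
    parser = _TableGrid()
    parser.feed(html)
    widths = []
    carry = {}  # row index -> columns already occupied by earlier rowspans
    for i, cells in enumerate(parser.rows):
        width = carry.pop(i, 0)
        for colspan, rowspan in cells:
            width += colspan
            for later in range(i + 1, i + rowspan):
                carry[later] = carry.get(later, 0) + colspan
        widths.append(width)
    return widths


def is_rectangular(html):
    """True if the table has at least one row and all rows have equal width."""
    widths = row_widths(html)
    return bool(widths) and len(set(widths)) == 1


good = ("<table><tr><th colspan='2'>Item</th><th>Price</th></tr>"
        "<tr><td>A</td><td>x</td><td>1</td></tr></table>")
bad = "<table><tr><td>A</td><td>B</td></tr><tr><td>C</td></tr></table>"
print(is_rectangular(good), is_rectangular(bad))
```

    Tables that fail the check are the ones to send back for re-extraction with a different prompt or model.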

    Key-Value Pairs: Forms and invoices have fields like "Invoice Number: 12345" or "Date: 2026-05-20". These are best extracted with an instruction-following VLM prompted to output JSON:

```
Prompt: "Extract all labeled fields from this form as a JSON object. Keys should be the field labels, values should be the field contents."
Output: {"Invoice Number": "12345", "Date": "2026-05-20", "Total": "$4,287.50"}
```

    Hierarchical Content: Contracts, reports, and papers have nested structure (sections, subsections, paragraphs). Markdown output captures this naturally. The model produces headings that map to the document hierarchy.

    Mathematical Content: Scientific papers and engineering documents contain formulas. LaTeX output is the standard. Models like GOT-OCR handle this natively; general VLMs may need explicit prompting ("output formulas as LaTeX").

    Post-Processing and Validation



    VLMs can hallucinate. A model asked to extract invoice fields may invent a "Discount" field that does not exist, or misread "$4,287.50" as "$4,287.00". Post-processing is essential:

  1. Schema validation: check that the output conforms to the expected structure (valid JSON, valid HTML table, required fields present)
  2. Confidence scoring: some models output token-level probabilities. Low-confidence regions flag potential errors.
  3. Cross-reference: if the document has redundant information (a line-item total that should equal the sum of quantities times unit prices), check arithmetic consistency
  4. Human-in-the-loop: for high-stakes extraction (legal contracts, financial documents), route low-confidence extractions to human review
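    Schema validation and the arithmetic cross-check can be combined in one small function. The sketch below assumes a hypothetical invoice schema (`invoice_number`, `line_items` with `quantity` and `unit_price`, and `total`) and US-style currency strings; it uses `Decimal` so that cent-level misreads are not hidden by floating-point rounding.

```python
import json
from decimal import Decimal

REQUIRED_FIELDS = {"invoice_number", "line_items", "total"}  # hypothetical schema


def parse_money(text):
    """Convert a string such as "$4,287.50" to a Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


def validate_invoice(raw_output):
    """Return a list of problems found in a model's JSON output (empty if none)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = [f"missing field: {name}"
                for name in sorted(REQUIRED_FIELDS - set(data))]
    if problems:
        return problems
    # Cross-reference: the stated total must equal the sum of the line items.
    computed = sum(
        Decimal(str(item["quantity"])) * parse_money(item["unit_price"])
        for item in data["line_items"]
    )
    if computed != parse_money(data["total"]):
        problems.append(
            f"total {data['total']} does not match line items ({computed})"
        )
    return problems


items = [
    {"quantity": 2, "unit_price": "$1,000.00"},
    {"quantity": 1, "unit_price": "$2,287.50"},
]
correct = json.dumps({"invoice_number": "12345", "line_items": items,
                      "total": "$4,287.50"})
misread = json.dumps({"invoice_number": "12345", "line_items": items,
                      "total": "$4,287.00"})
print(validate_invoice(correct))
print(validate_invoice(misread))
```

    Any non-empty result is a natural trigger for the human-review route.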

    Evaluation: How to Measure Extraction Quality



    You cannot improve what you do not measure. Document extraction has specific metrics:

    Table Extraction

  1. TEDS (Tree-Edit-Distance-based Similarity): measures structural similarity between predicted and ground-truth HTML tables. Accounts for both structure (rows, columns, spans) and content. Score ranges from 0 to 1.
  2. Cell-level F1: precision and recall on individual cell contents, ignoring structure
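    Cell-level F1 is straightforward to compute once both tables are lists of rows. This is one reasonable formulation, assuming cells are compared as whitespace-trimmed strings and each table is treated as a multiset of cells; benchmark implementations may normalize differently.

```python
from collections import Counter


def cell_f1(predicted, truth):
    """F1 over cell strings, ignoring where in the table each cell appears."""
    pred = Counter(cell.strip() for row in predicted for cell in row)
    gold = Counter(cell.strip() for row in truth for cell in row)
    overlap = sum((pred & gold).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


truth = [["Item", "Qty"], ["Widget", "2"]]
predicted = [["Item", "Qty"], ["Widget", "3"]]
print(cell_f1(predicted, truth))
```

    Because position is ignored, a table with every cell correct but rows shuffled still scores 1.0, which is why this metric is usually reported alongside TEDS.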


    Key-Value Extraction

  1. Field-level accuracy: percentage of fields where extracted value exactly matches ground truth
  2. Fuzzy match: Levenshtein distance for fields where minor character errors are acceptable
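    Both key-value metrics fit in a few lines. The sketch below reports exact-match accuracy and a normalized Levenshtein similarity (1 minus edit distance divided by the longer string's length), averaged over ground-truth fields. The normalization and the `field_scores` name are choices made for this example; missing fields are scored as empty strings.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def field_scores(predicted, truth):
    """Return (exact-match accuracy, mean normalized similarity) over truth fields."""
    exact = 0
    similarity = 0.0
    for key, expected in truth.items():
        got = predicted.get(key, "")
        exact += got == expected
        longest = max(len(got), len(expected)) or 1
        similarity += 1 - levenshtein(got, expected) / longest
    n = len(truth) or 1
    return exact / n, similarity / n


truth = {"Invoice Number": "12345", "Total": "$4,287.50"}
predicted = {"Invoice Number": "12345", "Total": "$4,287.00"}
accuracy, fuzzy = field_scores(predicted, truth)
print(accuracy, round(fuzzy, 3))
```

    The gap between the two numbers shows how much of the error is near-misses. For monetary fields a near-miss is still wrong, so exact match is usually the metric that matters there.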


    Full-Document Extraction

  1. Markdown similarity: BLEU or ROUGE between predicted and ground-truth Markdown
  2. Layout-aware metrics: penalize structural errors (wrong heading level, misplaced table) more than content errors

    Benchmarks to Know

  1. DocVQA: question answering on documents (50,000 questions across 12,767 document images)
  2. TableBank: 417K table images for table detection and recognition
  3. FUNSD: 199 scanned forms for key-value extraction
  4. PubTabNet: 568K table images from scientific papers

    When to Use Which Approach



| Scenario | Approach | Why |
| --- | --- | --- |
| High-volume invoice processing | Encoder-decoder (GOT-OCR) | Speed matters, format is predictable |
| Ad-hoc extraction from varied documents | Instruction-following VLM | Flexibility to handle anything |
| Pipeline with specialized downstream models | Modular (SmolDocling) | Composability, fine-grained control |
| Scanned handwritten forms | Instruction-following VLM | Best accuracy on degraded input |
| Scientific papers to structured data | Encoder-decoder (Nougat/GOT-OCR) | Native LaTeX and table support |
| Mixed document types in same pipeline | Modular + routing | Route each page to the best model |

    Mixpeek Implementation



    Mixpeek ingestion pipelines handle structured extraction as part of the indexing process. When a PDF or image enters the pipeline, the extraction model decomposes it into searchable features.

    Ingest: Extract Structure from Documents



```python
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_KEY")

# Ingest a batch of documents with structured extraction
mx.ingest(
    collection_id="contracts",
    source="s3://legal-docs/2026-q2/",
    extractors=[
        {
            "type": "document_structure",
            "model": "docling-project/SmolDocling-256M-preview",
            "output_features": ["page_layout", "table_html", "text_markdown"],
        },
        {
            "type": "ocr",
            "model": "stepfun-ai/GOT-OCR-2.0-hf",
            "output_features": ["full_text", "formula_latex"],
        },
        {
            "type": "text_embedding",
            "model": "BAAI/bge-m3",
            "input_field": "text_markdown",
            "output_feature": "text_embedding",
        },
    ],
)
```

    This pipeline runs two extraction models in parallel (SmolDocling for layout and tables, GOT-OCR for full text and formulas), then embeds the extracted Markdown for semantic search. Each document page becomes a searchable object with structured features.

    Retrieve: Search Extracted Content



```python
# Assumes an async context (an async function or a notebook with top-level await)
# and the `mx` client created during ingestion.
results = await mx.retrievers.retrieve(
    queries=[{
        "type": "text",
        "value": "limitation of liability not exceeding total fees paid",
    }],
    collection_ids=["contracts"],
    stages=[
        {"type": "feature_search", "feature": "text_embedding", "top_k": 50},
        {"type": "filter", "condition": {"table_html": {"$exists": True}}},
        {"type": "rerank", "model": "Qwen/Qwen3-Reranker-0.6B", "top_k": 10},
    ],
)
```

    The pipeline searches over embedded text but can filter by structural features (only pages containing tables, only sections with specific heading patterns). This combines the semantic understanding of embeddings with the structural awareness of document extraction.

    Common Failure Modes and Mitigations



    Failure: Tables rendered as images inside PDFs
    Some PDF generators rasterize tables as images. The VLM handles this fine (it sees images natively), but if your pipeline extracts vector text first, it will miss these tables entirely.
    Mitigation: always check for embedded images and route them through the visual pipeline.

    Failure: Multi-page tables
    Most VLMs process one page at a time. A table that spans pages 3-5 produces three partial tables.
    Mitigation: detect continuation tables (no header row, consistent column widths) and merge them in post-processing.

    Failure: Low-resolution scans
    Fax-quality scans (100-150 DPI) degrade extraction accuracy significantly.
    Mitigation: super-resolution preprocessing (Real-ESRGAN or similar) before extraction, or use models trained on degraded inputs.

    Failure: Handwritten annotations on printed forms
    The model may confuse handwritten and printed text, or miss handwritten content entirely.
    Mitigation: use a handwriting-aware model (GOT-OCR handles this) and validate that all visible text regions produced output.

    Failure: Hallucinated fields
    Instruction-following VLMs may generate plausible but nonexistent fields.
    Mitigation: schema validation (reject fields not in the expected schema), confidence thresholding, and cross-referencing against the raw OCR output.
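    The hallucination mitigation can be sketched as a filter that flags any field outside the expected schema and any value that does not appear in the raw OCR text. The function and argument names are hypothetical, and the substring match is deliberately strict: because OCR text can itself contain errors, a flag means "review this field", not "this field is wrong".

```python
import re


def normalize(text):
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_fields(extracted, ocr_text, allowed_keys):
    """Return fields that are outside the schema or absent from the OCR text."""
    haystack = normalize(ocr_text)
    flagged = {}
    for key, value in extracted.items():
        if key not in allowed_keys:
            flagged[key] = "not in schema"
        elif normalize(str(value)) not in haystack:
            flagged[key] = "value not found in OCR text"
    return flagged


ocr_text = "Invoice Number: 12345\nDate: 2026-05-20\nTotal   $4,287.50"
extracted = {"Invoice Number": "12345", "Total": "$4,287.00", "Discount": "10%"}
allowed = {"Invoice Number", "Date", "Total"}
print(unsupported_fields(extracted, ocr_text, allowed))
```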

    Related Guides



  1. Visual Document Retrieval -- searching documents by visual similarity without OCR
  2. Open-Vocabulary Object Detection -- detecting arbitrary objects in document images
  3. Multi-Stage Retrieval -- combining structural and semantic search
  4. Cross-Encoder Reranking -- precision reranking for document search results
  5. Models -- browse OCR, document structure, and embedding models for document pipelines