Document Graph Extractor

Built-in extractor names are a deprecated alias — collections are now created by picking features. This pipeline is selected with features: ["document_layout"] (layout/structure extraction on top of the document_search base). Existing feature_extractor configs keep working; see the migration guide.


Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.

[Figure: document graph extractor pipeline showing PDF parsing, layout detection, and block extraction]

The document graph extractor decomposes PDFs into spatial blocks — paragraphs, tables, forms, lists, headers, footers, figures, and handwritten content — each with a bounding box, a layout class, and a confidence score. Low-confidence blocks can be corrected by a vision language model (Gemini, GPT-4V, or Claude). Block text is optionally embedded with E5-Large (1024-d) for semantic search. It is the right tool for archival documents, scanned files, and anything that needs spatial understanding rather than a flat text dump.

View extractor details at api.mixpeek.com/v1/collections/features/extractors/document_graph_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.

Pipeline Steps

1. Layout detection (if use_layout_detection, the default): find all document elements with layout_detector (pymupdf, fast and rule-based, or docling, SOTA ML/DiT). Text regions and non-text elements (scanned images, figures, charts) are detected as separate blocks.
2. Block grouping: when layout detection is off, group text spans into blocks using vertical_threshold / horizontal_threshold, and drop blocks shorter than min_text_length.
3. Confidence scoring: assign base_confidence to native text, apply penalties for OCR artifacts and encoding issues, then tag each block A/B/C/D.
4. VLM correction (if use_vlm_correction and confidence < min_confidence_for_vlm): re-read low-confidence blocks with vlm_provider/vlm_model. Skipped entirely in fast_mode.
5. Text embedding (if run_text_embedding): embed block text with E5-Large (1024-d).
6. Thumbnails (if generate_thumbnails): render full-page and/or per-block thumbnails per thumbnail_mode.
7. Output: one document per block with layout class, bbox, text, confidence, and optional embedding/thumbnails.
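
The branching in the steps above can be sketched as a small function. This is an illustrative reading of the step list, not the extractor's implementation: `planned_steps` and the step labels are hypothetical names, and the defaults are copied from the parameter tables on this page.

```python
from typing import Any, Dict, List, Optional

# Defaults copied from the parameter tables on this page.
DEFAULTS: Dict[str, Any] = {
    "use_layout_detection": True,
    "use_vlm_correction": True,
    "fast_mode": False,
    "run_text_embedding": True,
    "generate_thumbnails": True,
}


def planned_steps(parameters: Optional[Dict[str, Any]] = None) -> List[str]:
    """Return the pipeline steps that are enabled for a parameter set."""
    p = {**DEFAULTS, **(parameters or {})}
    # Layout detection and block grouping are alternatives.
    steps = ["layout_detection" if p["use_layout_detection"] else "block_grouping"]
    steps.append("confidence_scoring")
    # fast_mode overrides use_vlm_correction.
    if p["use_vlm_correction"] and not p["fast_mode"]:
        steps.append("vlm_correction")
    if p["run_text_embedding"]:
        steps.append("text_embedding")
    if p["generate_thumbnails"]:
        steps.append("thumbnails")
    steps.append("output")
    return steps


print(planned_steps({"fast_mode": True, "generate_thumbnails": False}))
```

Even when `vlm_correction` appears in the plan, it only touches blocks whose confidence falls below `min_confidence_for_vlm`.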

When to Use

Use Case	Description
Archival / scanned PDFs	OCR + layout recovery with VLM correction for noisy scans and historical records
Forms & tables	Classify and isolate form fields, tables, and structured regions
Spatial search	Retrieve blocks by location and type, not just text
Confidence-gated pipelines	Route low-confidence blocks (tags C/D) to review
Multi-layout documents	Reports, contracts, and multi-column layouts that need block-level granularity

When NOT to Use

Scenario	Recommended Alternative
Plain text you already have, or born-digital PDFs with perfect text	`text_extractor` (faster, simpler)
Whole-document multimodal embedding	`multimodal_extractor` / `universal_extractor`
Maximum throughput, no spatial detail	`text_extractor`, or this extractor with `fast_mode: true`
Images only	`image_extractor`
Non-PDF inputs	This extractor is PDF-only

Input Schema

Field	Type	Required	Description
`pdf`	string	Yes	URL or path to the PDF file. Supports multi-page PDFs. Populated from `input_mappings`.

{
  "pdf": "s3://my-bucket/contracts/lease.pdf"
}

Input Examples:

Type	Example
Invoice	`s3://documents/invoices/inv-001.pdf`
Contract	`https://cdn.example.com/contracts/lease.pdf`
Scanned document	`s3://archive/scanned/1985-report.pdf`
Form	`s3://forms/application-form.pdf`

Supported input types: PDF only (max 1 PDF per object). Max file size: 100 MB. For scanned documents, originals at 150–300 DPI give the best OCR results.
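
A client-side preflight check can catch obviously unsuitable inputs before upload. The sketch below is an assumption-laden illustration: `preflight` is a hypothetical helper, the size limit is read as 100 MiB (the page does not say whether MB is decimal or binary), and the header test relies on the convention that PDF files begin with `%PDF-`.

```python
from typing import List

# Assumption: the documented "100MB" limit is interpreted as 100 MiB.
MAX_BYTES = 100 * 1024 * 1024


def preflight(data: bytes) -> List[str]:
    """Return a list of problems found; an empty list means the input looks acceptable."""
    problems = []
    # PDF files conventionally start with the "%PDF-" header.
    if not data.startswith(b"%PDF-"):
        problems.append("not a PDF (missing %PDF- header)")
    if len(data) > MAX_BYTES:
        problems.append(f"file too large: {len(data)} bytes > {MAX_BYTES}")
    return problems


print(preflight(b"%PDF-1.7\n"))    # no problems reported
print(preflight(b"PK\x03\x04"))    # a ZIP-based file such as .docx is flagged
```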

Output Schema

One document per detected block:

Field	Type	Description
`page_number`	integer	Page number in the original PDF (1-indexed)
`object_type`	enum	`paragraph`, `table`, `form`, `list`, `header`, `footer`, `figure`, or `handwritten`
`block_index`	integer	Block index within the page (0-indexed)
`bbox`	object	Bounding box: `{ x0, y0, x1, y1 }`
`text_raw`	string	Original extracted text
`text_corrected`	string \| null	Cleaned / VLM-corrected text
`overall_confidence`	float	Confidence score 0.0–1.0
`confidence_tag`	enum	`A` (`≥0.85`), `B` (`≥0.70`), `C` (`≥0.50`), `D` (`<0.50`)
`document_graph_extractor_v1_text_embedding`	float[1024] \| null	E5-Large block embedding (when `run_text_embedding`)
`thumbnail_url` / `segment_thumbnail_url`	string \| null	Full-page / per-block thumbnail URLs
`total_pages`	integer \| null	Total pages in the source PDF
`source_file`	string \| null	Original source file name

{
  "page_number": 1,
  "object_type": "table",
  "block_index": 3,
  "bbox": { "x0": 72.0, "y0": 220.4, "x1": 540.0, "y1": 410.9 },
  "text_raw": "Item  Qty  Price\nWidget  10  $4.00",
  "text_corrected": "Item | Qty | Price\nWidget | 10 | $4.00",
  "overall_confidence": 0.91,
  "confidence_tag": "A"
}
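
Because every block carries a page number, an `object_type`, and a `bbox`, location-based filtering can also be done client-side on retrieved documents. A minimal sketch, assuming all bounding boxes on a page share one coordinate system; `intersects` and `find_blocks` are hypothetical helpers, not part of any SDK.

```python
from typing import Dict, List, Optional

BBox = Dict[str, float]


def intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap with non-zero area."""
    return (a["x0"] < b["x1"] and b["x0"] < a["x1"]
            and a["y0"] < b["y1"] and b["y0"] < a["y1"])


def find_blocks(blocks: List[dict], page_number: int, region: BBox,
                object_type: Optional[str] = None) -> List[dict]:
    """Select blocks on one page that overlap a region, optionally by type."""
    return [
        b for b in blocks
        if b["page_number"] == page_number
        and (object_type is None or b["object_type"] == object_type)
        and intersects(b["bbox"], region)
    ]


# The example block shown above.
blocks = [{
    "page_number": 1,
    "object_type": "table",
    "block_index": 3,
    "bbox": {"x0": 72.0, "y0": 220.4, "x1": 540.0, "y1": 410.9},
}]
upper_region = {"x0": 0.0, "y0": 0.0, "x1": 612.0, "y1": 396.0}
print(find_blocks(blocks, 1, upper_region, object_type="table"))
```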

Parameters

Layout Detection

Parameter	Type	Default	Range	Description
`use_layout_detection`	boolean	`true`	—	Enable layout detection to find all elements (text, images, tables, figures), using the engine selected by `layout_detector`. When disabled, falls back to text-only extraction (faster, but misses images).
`layout_detector`	string	`"pymupdf"`	`pymupdf`, `docling`	Engine used when `use_layout_detection=true`. `pymupdf`: fast rule-based (~15 pages/sec). `docling`: SOTA ML/DiT — better semantic type detection and true table structure (~3–8 sec/doc).
`render_dpi`	integer	`150`	72–300	DPI for page rendering (used for VLM correction). 72 fast/lower quality, 150 balanced, 300 high quality/slower.
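
To get a feel for what a DPI setting costs, the rendered image size can be estimated from the page size. The sketch assumes page dimensions are in PDF points (1 point = 1/72 inch); `page_pixels` is a hypothetical helper.

```python
from typing import Tuple


def page_pixels(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI (1 pt = 1/72 in)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# US Letter is 612 x 792 points.
for dpi in (72, 150, 300):
    print(dpi, page_pixels(612, 792, dpi))
```

Pixel count grows with the square of the DPI, so moving from 150 to 300 quadruples the image area that has to be rendered.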

Spatial Clustering (text-only fallback)

Only used when use_layout_detection=false.

Parameter	Type	Default	Range	Description
`vertical_threshold`	float	`15.0`	1.0–100.0	Max vertical gap (points) between lines grouped into the same block
`horizontal_threshold`	float	`50.0`	1.0–200.0	Max horizontal distance (points) for overlap/column detection
`min_text_length`	integer	`20`	1–500	Minimum block text length (chars); filters noise/fragments
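
The effect of `vertical_threshold` and `min_text_length` can be illustrated with a simplified grouping routine. This is a sketch under stated assumptions, not the extractor's algorithm: it ignores `horizontal_threshold`, assumes y coordinates grow downward, and uses a hypothetical `Line` type and `group_lines` function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str
    y0: float  # top edge, in points
    y1: float  # bottom edge, in points


def group_lines(lines: List[Line], vertical_threshold: float = 15.0,
                min_text_length: int = 20) -> List[str]:
    """Merge vertically adjacent lines into blocks, then drop short blocks."""
    blocks: List[List[Line]] = []
    current: List[Line] = []
    for line in sorted(lines, key=lambda l: l.y0):
        # Start a new block when the gap to the previous line is too large.
        if current and line.y0 - current[-1].y1 > vertical_threshold:
            blocks.append(current)
            current = []
        current.append(line)
    if current:
        blocks.append(current)
    texts = [" ".join(l.text for l in block) for block in blocks]
    # Filter out noise and fragments.
    return [t for t in texts if len(t) >= min_text_length]


lines = [
    Line("First paragraph line one", 100.0, 112.0),
    Line("continues here", 114.0, 126.0),
    Line("p. 3", 700.0, 710.0),
]
print(group_lines(lines))
```

Here the first two lines are 2 points apart and merge into one block, while the isolated "p. 3" fragment forms its own block and is dropped for being shorter than `min_text_length`.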

Confidence & VLM Correction

Parameter	Type	Default	Range	Description
`base_confidence`	float	`0.85`	0.0–1.0	Base confidence score for embedded (native) text
`min_confidence_for_vlm`	float	`0.6`	0.0–1.0	Confidence threshold below which VLM correction is triggered (only when `use_vlm_correction=true`)
`use_vlm_correction`	boolean	`true`	—	Enable VLM correction for low-confidence blocks
`fast_mode`	boolean	`false`	—	Skip VLM correction entirely for max throughput (~15 pages/sec). Overrides `use_vlm_correction`
`vlm_provider`	string	`"google"`	`google`, `openai`, `anthropic`	LLM provider for VLM correction
`vlm_model`	string	`"gemini-2.5-flash"`	—	Correction model, e.g. `gemini-2.5-flash`, `gpt-4o`, `claude-3-5-sonnet`
`llm_api_key`	string \| null	`null`	—	BYOK key for VLM correction. Supports secret references, e.g. `{{SECRET.openai_api_key}}`. Falls back to Mixpeek’s default keys if unset.
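
How these flags combine for a single block can be written as one predicate. The function below is a hypothetical sketch of the documented rules (threshold comparison, `fast_mode` overriding `use_vlm_correction`), not code taken from the extractor.

```python
def should_run_vlm(confidence: float,
                   use_vlm_correction: bool = True,
                   fast_mode: bool = False,
                   min_confidence_for_vlm: float = 0.6) -> bool:
    """Decide whether a block is sent to the VLM for correction."""
    # fast_mode skips correction entirely, whatever use_vlm_correction says.
    if fast_mode or not use_vlm_correction:
        return False
    return confidence < min_confidence_for_vlm


print(should_run_vlm(0.55))
print(should_run_vlm(0.65))
print(should_run_vlm(0.65, min_confidence_for_vlm=0.75))
```

With the default threshold of 0.6, a block scored 0.65 carries tag C but is not corrected; raising `min_confidence_for_vlm` to 0.70 would cover every C and D block.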

Embedding & Thumbnails

Parameter	Type	Default	Range	Description
`run_text_embedding`	boolean	`true`	—	Generate E5-Large (1024-d) text embeddings for block content
`generate_thumbnails`	boolean	`true`	—	Generate thumbnail images for blocks
`thumbnail_mode`	string	`"both"`	`full_page`, `segment`, `both`	Which thumbnails to render (`segment` = cropped to the block’s bbox)
`thumbnail_dpi`	integer	`72`	36–150	DPI for thumbnail generation. Lower DPI = smaller files.

Configuration Examples

Default parameters with metadata passthrough:

{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": {
      "pdf": "file_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.invoice_id" },
      { "source_path": "metadata.vendor" }
    ],
    "parameters": {}
  }
}

Maximum throughput, with VLM correction and thumbnails turned off:

{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": {
      "pdf": "file_url"
    },
    "parameters": {
      "fast_mode": true,
      "generate_thumbnails": false
    }
  }
}

Scanned documents, using docling layout detection and a higher VLM correction threshold:

{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": {
      "pdf": "scanned_doc"
    },
    "parameters": {
      "layout_detector": "docling",
      "min_confidence_for_vlm": 0.75,
      "vlm_provider": "anthropic",
      "vlm_model": "claude-3-5-sonnet",
      "render_dpi": 200
    }
  }
}

Text-only fallback, with layout detection off and clustering thresholds set explicitly:

{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": {
      "pdf": "pdf_url"
    },
    "parameters": {
      "use_layout_detection": false,
      "vertical_threshold": 15.0,
      "horizontal_threshold": 50.0,
      "run_text_embedding": true
    }
  }
}
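
Configurations like these can be generated and range-checked in code before they are submitted. The sketch below is a hypothetical client-side helper (`build_config` is not an SDK function); the ranges and allowed values are copied from the parameter tables on this page, and server-side validation may differ.

```python
from typing import Any, Dict

# Ranges and choices copied from the parameter tables above.
RANGES = {
    "render_dpi": (72, 300),
    "vertical_threshold": (1.0, 100.0),
    "horizontal_threshold": (1.0, 200.0),
    "min_text_length": (1, 500),
    "base_confidence": (0.0, 1.0),
    "min_confidence_for_vlm": (0.0, 1.0),
    "thumbnail_dpi": (36, 150),
}
CHOICES = {
    "layout_detector": {"pymupdf", "docling"},
    "vlm_provider": {"google", "openai", "anthropic"},
    "thumbnail_mode": {"full_page", "segment", "both"},
}


def build_config(pdf_field: str, **parameters: Any) -> Dict[str, Any]:
    """Build a feature_extractor config, rejecting out-of-range parameters."""
    for name, value in parameters.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}..{high}")
        if name in CHOICES and value not in CHOICES[name]:
            raise ValueError(f"{name}={value!r} is not one of {sorted(CHOICES[name])}")
    return {
        "feature_extractor": {
            "feature_extractor_name": "document_graph_extractor",
            "version": "v1",
            "input_mappings": {"pdf": pdf_field},
            "parameters": parameters,
        }
    }


config = build_config("scanned_doc", layout_detector="docling", render_dpi=200)
print(config["feature_extractor"]["parameters"])
```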

Layout Types

The extractor classifies blocks into these object_type values:

Type	Description	Example
`paragraph`	Body text blocks	Article content, descriptions
`table`	Tabular data	Financial tables, data grids
`form`	Form fields and labels	Application forms, surveys
`list`	Bulleted or numbered lists	Requirements, instructions
`header`	Page headers	Document titles, section headers
`footer`	Page footers	Page numbers, disclaimers
`figure`	Images and captions	Charts, diagrams, photos
`handwritten`	Handwritten text	Signatures, annotations

Confidence Tags

Extraction quality is graded with confidence tags (thresholds on overall_confidence):

Tag	Confidence	Description	Action
A	≥ 0.85	Excellent	Use directly
B	≥ 0.70	Good	Reliable, minor issues
C	≥ 0.50	Fair	Verify or trigger VLM correction
D	< 0.50	Poor	Needs VLM correction
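
The thresholds in the table translate directly into a lookup, which is handy for re-deriving tags or routing blocks to review. A minimal sketch using the documented cut-offs; `confidence_tag` and `needs_review` are hypothetical helper names.

```python
def confidence_tag(score: float) -> str:
    """Map overall_confidence to its A/B/C/D tag using the documented thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {score}")
    if score >= 0.85:
        return "A"
    if score >= 0.70:
        return "B"
    if score >= 0.50:
        return "C"
    return "D"


def needs_review(block: dict) -> bool:
    """Route tag C and D blocks to verification."""
    return confidence_tag(block["overall_confidence"]) in {"C", "D"}


print(confidence_tag(0.91))
print(needs_review({"overall_confidence": 0.62}))
```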

Performance & Costs

Metric	Value
Cost	Billed per page — see Billing & Pricing; rates come from `GET /v1/billing/pricing`
`pymupdf` throughput	~15 pages/sec (rule-based)
`docling` throughput	~3–8 sec/doc (ML/DiT)
Fast mode	~15 pages/sec (skips VLM correction)

Vector Index

Property	Value
Index name	`document_graph_extractor_v1_text_embedding`
Dimensions	1024
Type	Dense
Distance metric	Cosine
Inference model	`multilingual_e5_large_instruct_v1` (`intfloat/multilingual-e5-large-instruct`)
Normalization	L2 normalized

The embedding is optional. Set run_text_embedding: false for a layout-only extraction with no vector index.
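
Because the stored vectors are L2 normalized, cosine similarity between two of them reduces to a plain dot product. The sketch below demonstrates that identity on toy 2-d vectors with the standard library only; the real embeddings are 1024-d and the helper names are illustrative.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity for arbitrary (non-normalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# Both expressions give the same similarity, up to floating-point rounding.
print(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)))
```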

Limitations

PDF only: Accepts a single PDF per object; does not process Word docs, images, or other formats.
VLM cost: Each correction adds cost (see Billing & Pricing); gate it with min_confidence_for_vlm or disable via fast_mode.
Layout-detector tradeoff: docling is more accurate but markedly slower than pymupdf.
Memory: Large PDFs (100+ pages) may require increased memory.
Language / handwriting: OCR works best with Latin scripts; handwriting detection is experimental and less reliable.
External dependency: VLM correction depends on the selected provider’s availability.

Search the Extracted Text

Extracted block text (native or OCR/VLM-corrected) is embedded into the document_graph_extractor_v1_text_embedding index, so you search it with a feature_search stage against mixpeek://document_graph_extractor@v1/intfloat__multilingual_e5_large_instruct (input_mode: "text"). For a ready-to-copy retriever — including confidence and layout-type filtering — see Cookbook → Search OCR Text from Scanned PDFs.
