Extract spatial blocks from PDFs with layout classification, confidence scoring, optional VLM correction, and 1024-d E5 text embeddings
Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.
The document graph extractor decomposes PDFs into spatial blocks — paragraphs, tables, forms, lists, headers, footers, figures, and handwritten content — each with a bounding box, a layout class, and a confidence score. Low-confidence blocks can be corrected by a vision language model (Gemini, GPT-4V, or Claude). Block text is optionally embedded with E5-Large (1024-d) for semantic search. It is the right tool for archival documents, scanned files, and anything that needs spatial understanding rather than a flat text dump.
1. Layout detection (if use_layout_detection, the default): find all document elements with layout_detector (pymupdf: fast, rule-based; docling: SOTA ML/DiT). Text regions and non-text elements (scanned images, figures, charts) are detected as separate blocks.
2. Block grouping: when layout detection is off, group text spans into blocks using vertical_threshold / horizontal_threshold, and drop blocks shorter than min_text_length.
3. Confidence scoring: assign base_confidence to native text, apply penalties for OCR artifacts and encoding issues, then tag each block A/B/C/D.
4. VLM correction (if use_vlm_correction and confidence < min_confidence_for_vlm): re-read low-confidence blocks with vlm_provider/vlm_model. Skipped entirely in fast_mode.
5. Text embedding (if run_text_embedding): embed block text with E5-Large (1024-d).
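The gating rules in these steps can be sketched in a few lines of Python. This is an illustrative model, not the extractor's implementation: the parameter names come from the step descriptions, but the `ExtractorParams` class, both helper functions, the stage labels, and every default value other than `use_layout_detection` being on are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ExtractorParams:
    # Field names mirror the documented parameters. Defaults are assumed
    # for illustration, except use_layout_detection (documented as on).
    use_layout_detection: bool = True
    use_vlm_correction: bool = False
    fast_mode: bool = False
    run_text_embedding: bool = True
    min_confidence_for_vlm: float = 0.6  # hypothetical threshold


def planned_stages(params: ExtractorParams) -> list[str]:
    """Return the stages that would run for a document, in order."""
    # Layout detection and block grouping are alternatives, not a sequence.
    stages = ["layout_detection" if params.use_layout_detection else "block_grouping"]
    stages.append("confidence_scoring")
    # fast_mode skips VLM correction even when it is enabled.
    if params.use_vlm_correction and not params.fast_mode:
        stages.append("vlm_correction")
    if params.run_text_embedding:
        stages.append("text_embedding")
    return stages


def needs_vlm(confidence: float, params: ExtractorParams) -> bool:
    """Per-block gate: only blocks below the threshold are re-read."""
    return (
        params.use_vlm_correction
        and not params.fast_mode
        and confidence < params.min_confidence_for_vlm
    )
```

With `use_vlm_correction=True` and the assumed threshold of 0.6, a block scored 0.4 would be sent to the VLM and a block scored 0.9 would not. Setting `fast_mode=True` suppresses the VLM stage regardless of the score.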
use_layout_detection (enabled by default)
Enables layout detection to find all elements (text, images, tables, figures). When disabled, extraction falls back to text-only mode, which is faster but misses images.
layout_detector (string, default "pymupdf"; allowed values: pymupdf, docling)
Engine used when use_layout_detection=true. pymupdf: fast, rule-based (~15 pages/sec). docling: SOTA ML/DiT with better semantic type detection and true table structure (~3–8 sec/doc).
render_dpi (integer, default 150; range 72–300)
DPI for page rendering (used for VLM correction). 72 is fast but lower quality, 150 is balanced, and 300 is high quality but slower.
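As a quick check before submitting a configuration, the documented constraints on these parameters can be validated client-side. The sketch below assumes nothing about the service itself: `build_params` is a hypothetical helper, and only the parameter names, defaults, allowed values, and DPI range are taken from the reference above.

```python
ALLOWED_LAYOUT_DETECTORS = ("pymupdf", "docling")
RENDER_DPI_RANGE = (72, 300)  # inclusive bounds from the parameter reference


def build_params(layout_detector: str = "pymupdf", render_dpi: int = 150) -> dict:
    """Validate the documented parameters and return them as a dict."""
    if layout_detector not in ALLOWED_LAYOUT_DETECTORS:
        raise ValueError(
            f"layout_detector must be one of {ALLOWED_LAYOUT_DETECTORS}, "
            f"got {layout_detector!r}"
        )
    low, high = RENDER_DPI_RANGE
    if not low <= render_dpi <= high:
        raise ValueError(f"render_dpi must be between {low} and {high}, got {render_dpi}")
    return {
        "use_layout_detection": True,  # layout_detector only applies when this is on
        "layout_detector": layout_detector,
        "render_dpi": render_dpi,
    }
```

For example, `build_params("docling", 300)` trades speed for better type detection and sharper page renders, while `build_params(render_dpi=600)` raises a `ValueError` because 600 is outside the documented range.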
Extracted block text (native or OCR/VLM-corrected) is embedded into the document_graph_extractor_v1_text_embedding index, so you search it with a feature_search stage against mixpeek://document_graph_extractor@v1/intfloat__multilingual_e5_large_instruct (input_mode: "text"). For a ready-to-copy retriever — including confidence and layout-type filtering — see Cookbook → Search OCR Text from Scanned PDFs.
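For orientation, here is a rough sketch of how such a stage might be assembled in Python. Only the stage type (feature_search), the feature URI, and input_mode "text" come from the text above. The dictionary layout, the key names `stage_type`, `query`, and `top_k`, and the helper `text_search_stage` are hypothetical, so check the cookbook entry for the actual retriever schema before using it.

```python
import json

# Feature URI exactly as given in this reference.
FEATURE_URI = (
    "mixpeek://document_graph_extractor@v1/"
    "intfloat__multilingual_e5_large_instruct"
)


def text_search_stage(query: str, top_k: int = 10) -> dict:
    """Build an illustrative feature_search stage for text queries."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    return {
        "stage_type": "feature_search",  # hypothetical key name
        "feature_uri": FEATURE_URI,
        "input_mode": "text",
        "query": query,  # hypothetical key name
        "top_k": top_k,  # hypothetical key name
    }


if __name__ == "__main__":
    print(json.dumps(text_search_stage("handwritten delivery date"), indent=2))
```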