The document graph extractor decomposes PDFs into spatial blocks — paragraphs, tables, forms, lists, headers, footers, figures, and handwritten content — each with a bounding box, a layout class, and a confidence score. Low-confidence blocks can be corrected by a vision language model (Gemini, GPT-4V, or Claude). Block text is optionally embedded with E5-Large (1024-d) for semantic search. It is the right tool for archival documents, scanned files, and anything that needs spatial understanding rather than a flat text dump.
Pipeline Steps
1. Layout detection (if use_layout_detection): find all document elements with layout_detector (pymupdf: fast, rule-based; docling: SOTA ML/DiT).
2. Block grouping: group lines into blocks using vertical_threshold / horizontal_threshold; drop blocks shorter than min_text_length.
3. Confidence scoring: assign base_confidence to native text; tag each block A/B/C/D.
4. VLM correction (if use_vlm_correction and below min_confidence_for_vlm): re-read low-confidence blocks with vlm_provider/vlm_model. Skipped entirely in fast_mode.
5. Text embedding (if run_text_embedding): embed block text with E5-Large (1024-d).
6. Thumbnails (if generate_thumbnails): render full-page and/or per-block thumbnails per thumbnail_mode.
7. Output: one document per block with layout class, bbox, text, confidence, and optional embedding/thumbnails.
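The block-grouping step can be pictured with a small sketch. This is an illustration only, not the extractor's actual implementation: the Line type and the group_lines function are hypothetical, coordinates are assumed to be in points with y increasing down the page, and only the vertical gap is checked (horizontal_threshold is ignored for brevity).

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    y0: float  # top edge, in points (assumed)
    y1: float  # bottom edge, in points (assumed)


def group_lines(lines, vertical_threshold=15.0, min_text_length=20):
    """Group vertically adjacent lines into blocks, then drop short blocks."""
    groups, current = [], []
    for line in sorted(lines, key=lambda item: item.y0):
        # Start a new block when the gap to the previous line is too large.
        if current and line.y0 - current[-1].y1 > vertical_threshold:
            groups.append(current)
            current = []
        current.append(line)
    if current:
        groups.append(current)
    texts = [" ".join(item.text for item in group) for group in groups]
    # Filter out noise and fragments shorter than min_text_length characters.
    return [text for text in texts if len(text) >= min_text_length]


lines = [
    Line("Lease agreement between", 72.0, 84.0),
    Line("the parties named below.", 88.0, 100.0),
    Line("Page 1", 700.0, 712.0),
]
blocks = group_lines(lines)
print(blocks)
```

With the default thresholds, the first two lines merge into one block and the short "Page 1" fragment is dropped. Lowering min_text_length to 1 would keep it as a second block.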
When to Use
| Use Case | Description |
|---|---|
| Archival/scanned PDFs | OCR + layout recovery with VLM correction for noisy scans |
| Forms & tables | Classify and isolate form fields, tables, and structured regions |
| Spatial search | Retrieve blocks by location and type, not just text |
| Confidence-gated pipelines | Route low-confidence blocks (tags C/D) to review |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Plain text you already have | text_extractor |
| Whole-document multimodal embedding | multimodal_extractor / universal_extractor |
| Maximum throughput, no spatial detail | text_extractor, or this extractor with fast_mode: true |
| Non-PDF inputs | This extractor is PDF-only |
| Field | Type | Required | Description |
|---|---|---|---|
| pdf | string | Yes | URL or path to the PDF file. Populated from input_mappings. |
{
  "pdf": "s3://my-bucket/contracts/lease.pdf"
}
Supported input types: PDF (max 1 PDF per object).
Output Schema
One document per detected block:
| Field | Type | Description |
|---|---|---|
| page_number | integer | Page number in the original PDF (1-indexed) |
| object_type | enum | paragraph, table, form, list, header, footer, figure, or handwritten |
| block_index | integer | Block index within the page (0-indexed) |
| bbox | object | Bounding box: { x0, y0, x1, y1 } |
| text_raw | string | Original extracted text |
| text_corrected | string \| null | Cleaned / VLM-corrected text |
| overall_confidence | float | Confidence score 0.0–1.0 |
| confidence_tag | enum | A (≥0.85), B (≥0.70), C (≥0.50), D (<0.50) |
| document_graph_extractor_v1_text_embedding | float[1024] \| null | E5-Large block embedding (when run_text_embedding) |
| thumbnail_url / segment_thumbnail_url | string \| null | Full-page / per-block thumbnail URLs |
| total_pages | integer \| null | Total pages in the source PDF |
| source_file | string \| null | Original source file name |
{
  "page_number": 1,
  "object_type": "table",
  "block_index": 3,
  "bbox": { "x0": 72.0, "y0": 220.4, "x1": 540.0, "y1": 410.9 },
  "text_raw": "Item Qty Price\nWidget 10 $4.00",
  "text_corrected": "Item | Qty | Price\nWidget | 10 | $4.00",
  "overall_confidence": 0.91,
  "confidence_tag": "A"
}
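The confidence_tag thresholds in the schema map directly onto a small function, which is handy for confidence-gated routing on the client side. The sketch below is hypothetical helper code, not part of the extractor: the thresholds come from the schema table, while the function name and the sample blocks are made up for illustration.

```python
def confidence_tag(score: float) -> str:
    """Map overall_confidence to the A/B/C/D tag using the documented cutoffs."""
    if score >= 0.85:
        return "A"
    if score >= 0.70:
        return "B"
    if score >= 0.50:
        return "C"
    return "D"


# Made-up blocks carrying only the fields needed for routing.
blocks = [
    {"block_index": 0, "overall_confidence": 0.91},
    {"block_index": 1, "overall_confidence": 0.64},
    {"block_index": 2, "overall_confidence": 0.42},
]

# Route tags C and D to a manual review queue.
review_queue = [
    block["block_index"]
    for block in blocks
    if confidence_tag(block["overall_confidence"]) in {"C", "D"}
]
print(review_queue)
```

In practice the extractor already returns confidence_tag on every block, so filtering on that field directly is enough. The function is mainly useful when re-tagging with scores adjusted downstream.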
Parameters
Layout Detection
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| use_layout_detection | boolean | true | - | Enable ML-based layout detection to find all elements (text, images, tables, figures) |
| layout_detector | string | "pymupdf" | pymupdf, docling | Engine. pymupdf: fast rule-based (~15 pages/sec). docling: SOTA ML/DiT (~3–8 sec/doc) |
| vertical_threshold | float | 15.0 | 1.0–100.0 | Max vertical gap (points) between lines grouped into the same block |
| horizontal_threshold | float | 50.0 | 1.0–200.0 | Max horizontal distance (points) for overlap detection |
| min_text_length | integer | 20 | 1–500 | Minimum block text length (chars); filters noise/fragments |
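Because these parameters have documented ranges, it can be convenient to check a configuration locally before submitting it. The following is a minimal sketch of such a check. The validate_layout_params helper is hypothetical and not part of any SDK; only the ranges and allowed values are taken from the table above.

```python
# Ranges copied from the Layout Detection table; the helper itself is hypothetical.
LAYOUT_RANGES = {
    "vertical_threshold": (1.0, 100.0),
    "horizontal_threshold": (1.0, 200.0),
    "min_text_length": (1, 500),
}
LAYOUT_DETECTORS = {"pymupdf", "docling"}


def validate_layout_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    detector = params.get("layout_detector", "pymupdf")
    if detector not in LAYOUT_DETECTORS:
        problems.append(f"layout_detector must be one of {sorted(LAYOUT_DETECTORS)}")
    for name, (low, high) in LAYOUT_RANGES.items():
        # Parameters that are not supplied fall back to server-side defaults.
        if name in params and not (low <= params[name] <= high):
            problems.append(f"{name}={params[name]} is outside {low}-{high}")
    return problems


print(validate_layout_params({"layout_detector": "docling", "vertical_threshold": 150.0}))
```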
Confidence & VLM Correction
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| base_confidence | float | 0.85 | 0.0–1.0 | Base confidence score for embedded (native) text |
| min_confidence_for_vlm | float | 0.6 | 0.0–1.0 | Threshold below which VLM correction is triggered |
| use_vlm_correction | boolean | true | - | Enable VLM correction for low-confidence blocks |
| fast_mode | boolean | false | - | Skip VLM correction for max throughput (~15 pages/sec). Overrides use_vlm_correction |
| vlm_provider | string | "google" | - | LLM provider: google, openai, anthropic |
| vlm_model | string | "gemini-2.5-flash" | - | Correction model, e.g. gemini-2.5-flash, gpt-4o, claude-3-5-sonnet |
| llm_api_key | string \| null | null | - | BYOK key for VLM correction. Supports {{SECRET.openai_api_key}} references |
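The interaction between fast_mode, use_vlm_correction, and min_confidence_for_vlm can be summarized as a single predicate. This is a sketch of the gating logic as the table describes it, not the extractor's source: the should_run_vlm function is hypothetical, and the strict less-than comparison is an assumption based on the phrase "below which".

```python
def should_run_vlm(
    confidence: float,
    *,
    use_vlm_correction: bool = True,
    fast_mode: bool = False,
    min_confidence_for_vlm: float = 0.6,
) -> bool:
    """Decide whether a block would be sent for VLM correction."""
    # fast_mode overrides use_vlm_correction and skips correction entirely.
    if fast_mode or not use_vlm_correction:
        return False
    # Assumption: correction triggers only strictly below the threshold.
    return confidence < min_confidence_for_vlm


print(should_run_vlm(0.42))                  # low-confidence block, defaults
print(should_run_vlm(0.42, fast_mode=True))  # same block in fast mode
```

Under the defaults, a block scored at base_confidence (0.85) sits above the 0.6 threshold and would not be corrected. Raising min_confidence_for_vlm widens the set of corrected blocks and, with it, the per-correction cost.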
Embedding & Rendering
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| run_text_embedding | boolean | true | - | Generate E5-Large (1024-d) text embeddings for block content |
| render_dpi | integer | 150 | 72–300 | DPI for page rendering (used for VLM correction) |
| generate_thumbnails | boolean | true | - | Generate thumbnail images for blocks |
| thumbnail_mode | string | "both" | full_page, segment, both | Which thumbnails to render |
| thumbnail_dpi | integer | 72 | 36–150 | DPI for thumbnail generation |
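To relate a block's bbox to a rendered page image, coordinates need to be scaled by the rendering DPI. The sketch below assumes bbox values are in PDF points (1/72 inch). The documentation states points for the grouping thresholds but does not say so explicitly for bbox, so treat that as an assumption. The bbox_to_pixels helper is hypothetical.

```python
def bbox_to_pixels(bbox: dict, dpi: int = 150) -> dict:
    """Scale a bbox from PDF points (assumed) to pixel coordinates at the given DPI."""
    # One PDF point is 1/72 inch, so pixels = points * dpi / 72.
    return {key: round(value * dpi / 72) for key, value in bbox.items()}


bbox = {"x0": 72.0, "y0": 220.4, "x1": 540.0, "y1": 410.9}
print(bbox_to_pixels(bbox, dpi=150))  # coordinates on a render_dpi=150 page image
print(bbox_to_pixels(bbox, dpi=72))   # coordinates on a thumbnail_dpi=72 thumbnail
```

At 72 DPI the scale factor is 1, so pixel coordinates match the point coordinates after rounding.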
Configuration Examples
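Default (layout + VLM + embeddings): leave parameters empty.
Fast Mode (max throughput, no VLM): set fast_mode to true.
High-Accuracy Scanned Docs (docling + VLM): set layout_detector to "docling" and leave use_vlm_correction enabled.
Only the default configuration is shown in full: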
{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": {
      "pdf": "file_url"
    },
    "parameters": {}
  }
}
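The Fast Mode and High-Accuracy variants named above can be derived from the default configuration by overriding parameters. The sketch below builds them in Python. The parameter choices are inferred from the variant titles and the parameter tables, not copied from original examples, and the with_parameters helper is hypothetical.

```python
import copy
import json

# Default configuration, as shown above.
BASE_CONFIG = {
    "feature_extractor": {
        "feature_extractor_name": "document_graph_extractor",
        "version": "v1",
        "input_mappings": {"pdf": "file_url"},
        "parameters": {},
    }
}


def with_parameters(**params) -> dict:
    """Return a copy of the default config with the given parameters set."""
    config = copy.deepcopy(BASE_CONFIG)
    config["feature_extractor"]["parameters"].update(params)
    return config


# Inferred variants: fast mode skips VLM correction; high accuracy uses docling plus VLM.
fast_config = with_parameters(fast_mode=True)
high_accuracy_config = with_parameters(layout_detector="docling", use_vlm_correction=True)

print(json.dumps(fast_config, indent=2))
print(json.dumps(high_accuracy_config, indent=2))
```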
| Metric | Value |
|---|---|
| Cost | 5 credits per page + 20 credits per VLM correction |
| pymupdf throughput | ~15 pages/sec (rule-based) |
| docling throughput | ~3–8 sec/doc (ML/DiT) |
| Fast mode | ~15 pages/sec (skips VLM correction) |
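The cost line lends itself to a quick estimate. The sketch below assumes the two charges are simply additive, as the table states. The estimate_credits function is a hypothetical helper, and the page and correction counts are made-up inputs.

```python
CREDITS_PER_PAGE = 5
CREDITS_PER_VLM_CORRECTION = 20


def estimate_credits(pages: int, vlm_corrections: int = 0) -> int:
    """Estimate credits: 5 per page plus 20 per VLM-corrected block."""
    if pages < 0 or vlm_corrections < 0:
        raise ValueError("counts must be non-negative")
    return CREDITS_PER_PAGE * pages + CREDITS_PER_VLM_CORRECTION * vlm_corrections


# A 40-page scan where 12 blocks fall below min_confidence_for_vlm.
print(estimate_credits(40, 12))  # 440
# The same document in fast_mode, where no corrections run.
print(estimate_credits(40))      # 200
```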
Vector Index
| Property | Value |
|---|---|
| Index name | document_graph_extractor_v1_text_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | intfloat/multilingual-e5-large-instruct |
| Normalization | L2 normalized |
The embedding is optional. Set run_text_embedding: false for a layout-only extraction with no vector index.
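Because the stored vectors are L2 normalized, cosine similarity between two embeddings reduces to a plain dot product. The sketch below demonstrates this with toy 3-d vectors standing in for the real 1024-d embeddings. The block names and values are made up, and the ranking is done in memory rather than through the vector index.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for block embeddings (real ones are 1024-d).
blocks = {
    "rent clause": l2_normalize([0.9, 0.1, 0.0]),
    "signature page": l2_normalize([0.0, 0.2, 0.9]),
}
query = l2_normalize([1.0, 0.0, 0.1])

# Rank blocks by similarity to the query, most similar first.
ranked = sorted(blocks, key=lambda name: dot(query, blocks[name]), reverse=True)
print(ranked)
```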
Limitations
PDF only: Accepts a single PDF per object.
VLM cost: Each correction adds 20 credits; gate it with min_confidence_for_vlm or disable via fast_mode.
Layout-detector tradeoff: docling is more accurate but markedly slower than pymupdf.
External dependency: VLM correction depends on the selected provider’s availability.