Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.
[Figure: Document graph extractor pipeline showing PDF parsing, layout detection, and block extraction]
The document graph extractor decomposes PDFs into spatial blocks — paragraphs, tables, forms, lists, headers, footers, figures, and handwritten content — each with a bounding box, a layout class, and a confidence score. Low-confidence blocks can be corrected by a vision language model (Gemini, GPT-4V, or Claude). Block text is optionally embedded with E5-Large (1024-d) for semantic search. It is the right tool for archival documents, scanned files, and anything that needs spatial understanding rather than a flat text dump.
View extractor details at api.mixpeek.com/v1/collections/features/extractors/document_graph_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
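As a rough illustration of the programmatic route, the sketch below builds (but does not send) the GET request with Python's standard library. The `https://` scheme, the bearer-token header, and the helper name `build_extractor_request` are assumptions for illustration; check the API reference for the actual authentication scheme.

```python
from urllib.request import Request

# Assumption: the API is served over HTTPS at this host (the page gives the
# host and path but no scheme).
BASE_URL = "https://api.mixpeek.com"


def build_extractor_request(feature_extractor_id: str, api_key: str) -> Request:
    """Build, but do not send, a GET request for one extractor's details."""
    url = f"{BASE_URL}/v1/collections/features/extractors/{feature_extractor_id}"
    # Assumption: bearer-token auth. This header is illustrative only.
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}
    return Request(url, headers=headers, method="GET")


req = build_extractor_request("document_graph_extractor_v1", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access or credentials.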

Pipeline Steps

  1. Layout detection (if use_layout_detection, the default) — find all document elements with layout_detector (pymupdf fast rule-based, or docling SOTA ML/DiT). Detects text regions and non-text elements (scanned images, figures, charts) as separate blocks.
  2. Block grouping — when layout detection is off, group text spans into blocks using vertical_threshold / horizontal_threshold; drop blocks shorter than min_text_length.
  3. Confidence scoring — assign base_confidence to native text (penalties for OCR artifacts / encoding issues), then tag each block A/B/C/D.
  4. VLM correction (if use_vlm_correction and confidence < min_confidence_for_vlm) — re-read low-confidence blocks with vlm_provider/vlm_model. Skipped entirely in fast_mode.
  5. Text embedding (if run_text_embedding) — embed block text with E5-Large (1024-d).
  6. Thumbnails (if generate_thumbnails) — render full-page and/or per-block thumbnails per thumbnail_mode.
  7. Output — one document per block with layout class, bbox, text, confidence, and optional embedding/thumbnails.
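The way the flags in the steps above switch stages on and off can be sketched as a small planner. This is only a reading of the step descriptions, not the extractor's code: the function name and the stage labels are hypothetical, and the defaults are the ones listed under Parameters.

```python
def planned_stages(params: dict) -> list[str]:
    """Return the pipeline stages that would run for a parameter set."""
    # Defaults as documented under Parameters; caller values override them.
    p = {
        "use_layout_detection": True,
        "use_vlm_correction": True,
        "fast_mode": False,
        "run_text_embedding": True,
        "generate_thumbnails": True,
        **params,
    }
    # Steps 1 and 2 are alternatives: grouping is the text-only fallback.
    stages = ["layout_detection" if p["use_layout_detection"] else "block_grouping"]
    stages.append("confidence_scoring")
    # Step 4 is skipped entirely in fast_mode, whatever use_vlm_correction says.
    if p["use_vlm_correction"] and not p["fast_mode"]:
        stages.append("vlm_correction")
    if p["run_text_embedding"]:
        stages.append("text_embedding")
    if p["generate_thumbnails"]:
        stages.append("thumbnails")
    stages.append("output")
    return stages


print(planned_stages({}))
print(planned_stages({"fast_mode": True, "use_layout_detection": False}))
```

Even when the VLM stage is planned, it only touches blocks whose confidence falls below min_confidence_for_vlm.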

When to Use

| Use Case | Description |
| --- | --- |
| Archival / scanned PDFs | OCR + layout recovery with VLM correction for noisy scans and historical records |
| Forms & tables | Classify and isolate form fields, tables, and structured regions |
| Spatial search | Retrieve blocks by location and type, not just text |
| Confidence-gated pipelines | Route low-confidence blocks (tags C/D) to review |
| Multi-layout documents | Reports, contracts, and multi-column layouts that need block-level granularity |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Plain text you already have, or born-digital PDFs with perfect text | text_extractor (faster, simpler) |
| Whole-document multimodal embedding | multimodal_extractor / universal_extractor |
| Maximum throughput, no spatial detail | text_extractor, or this extractor with fast_mode: true |
| Images only | image_extractor |
| Non-PDF inputs | This extractor is PDF-only |

Input Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| pdf | string | Yes | URL or path to the PDF file. Supports multi-page PDFs. Populated from input_mappings. |

{
  "pdf": "s3://my-bucket/contracts/lease.pdf"
}

Input Examples:

| Type | Example |
| --- | --- |
| Invoice | s3://documents/invoices/inv-001.pdf |
| Contract | https://cdn.example.com/contracts/lease.pdf |
| Scanned document | s3://archive/scanned/1985-report.pdf |
| Form | s3://forms/application-form.pdf |

Supported input types: PDF only (max 1 PDF per object). Max file size 100MB. For scanned documents, 150–300 DPI originals give the best OCR.
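A client-side pre-flight check can catch inputs that violate these limits before a job is submitted. The sketch below is an assumption-laden illustration: it judges the file type by extension only, treats "100MB" as 100 MiB, and `check_pdf_input` is a hypothetical helper, not part of any SDK.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: "100MB" is read as 100 MiB; the service may count differently.
MAX_PDF_BYTES = 100 * 1024 * 1024


def check_pdf_input(url: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the input looks acceptable."""
    problems = []
    # Works for s3://, https:// and bare paths: only the path part is inspected.
    path = urlparse(url).path
    if PurePosixPath(path).suffix.lower() != ".pdf":
        problems.append("not a .pdf")
    if size_bytes > MAX_PDF_BYTES:
        problems.append("larger than 100MB")
    return problems


print(check_pdf_input("s3://documents/invoices/inv-001.pdf", 2_000_000))
print(check_pdf_input("https://cdn.example.com/contracts/lease.docx", 2_000_000))
```

An extension check is a heuristic; a file can carry a .pdf suffix without being a valid PDF.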

Output Schema

One document per detected block:

| Field | Type | Description |
| --- | --- | --- |
| page_number | integer | Page number in the original PDF (1-indexed) |
| object_type | enum | paragraph, table, form, list, header, footer, figure, or handwritten |
| block_index | integer | Block index within the page (0-indexed) |
| bbox | object | Bounding box: { x0, y0, x1, y1 } |
| text_raw | string | Original extracted text |
| text_corrected | string or null | Cleaned / VLM-corrected text |
| overall_confidence | float | Confidence score 0.0–1.0 |
| confidence_tag | enum | A (≥0.85), B (≥0.70), C (≥0.50), D (<0.50) |
| document_graph_extractor_v1_text_embedding | float[1024] or null | E5-Large block embedding (when run_text_embedding) |
| thumbnail_url / segment_thumbnail_url | string or null | Full-page / per-block thumbnail URLs |
| total_pages | integer or null | Total pages in the source PDF |
| source_file | string or null | Original source file name |

{
  "page_number": 1,
  "object_type": "table",
  "block_index": 3,
  "bbox": { "x0": 72.0, "y0": 220.4, "x1": 540.0, "y1": 410.9 },
  "text_raw": "Item  Qty  Price\nWidget  10  $4.00",
  "text_corrected": "Item | Qty | Price\nWidget | 10 | $4.00",
  "overall_confidence": 0.91,
  "confidence_tag": "A"
}
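Because every block carries a page number, a layout class, and a bounding box, simple spatial queries can be run directly over the output documents. The sketch below filters blocks that overlap a rectangular region; `BBox`, `blocks_in_region`, the region coordinates, and the second (footer) document are all made up for illustration, and the example assumes bbox coordinates share one coordinate system per page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "BBox") -> bool:
        """True when the two rectangles share any interior area."""
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def blocks_in_region(docs, page, region, object_type=None):
    """Select output documents on `page` whose bbox overlaps `region`."""
    hits = []
    for doc in docs:
        if doc["page_number"] != page:
            continue
        if object_type is not None and doc["object_type"] != object_type:
            continue
        if BBox(**doc["bbox"]).intersects(region):
            hits.append(doc)
    return hits


docs = [
    # The table block from the example output above (text fields omitted).
    {"page_number": 1, "object_type": "table", "block_index": 3,
     "bbox": {"x0": 72.0, "y0": 220.4, "x1": 540.0, "y1": 410.9}},
    # Hypothetical footer block, added only to give the filter something to reject.
    {"page_number": 1, "object_type": "footer", "block_index": 9,
     "bbox": {"x0": 72.0, "y0": 740.0, "x1": 540.0, "y1": 760.0}},
]
region = BBox(0.0, 200.0, 612.0, 450.0)
hits = blocks_in_region(docs, page=1, region=region)
print([d["block_index"] for d in hits])  # [3]
```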

Parameters

Layout Detection

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| use_layout_detection | boolean | true | | Enable ML-based layout detection to find all elements (text, images, tables, figures). When disabled, falls back to text-only extraction (faster, misses images). |
| layout_detector | string | "pymupdf" | pymupdf, docling | Engine used when use_layout_detection=true. pymupdf: fast rule-based (~15 pages/sec). docling: SOTA ML/DiT, with better semantic type detection and true table structure (~3–8 sec/doc). |
| render_dpi | integer | 150 | 72–300 | DPI for page rendering (used for VLM correction). 72 fast/lower quality, 150 balanced, 300 high quality/slower. |

Spatial Clustering (text-only fallback)

Only used when use_layout_detection=false.

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| vertical_threshold | float | 15.0 | 1.0–100.0 | Max vertical gap (points) between lines grouped into the same block |
| horizontal_threshold | float | 50.0 | 1.0–200.0 | Max horizontal distance (points) for overlap/column detection |
| min_text_length | integer | 20 | 1–500 | Minimum block text length (chars); filters noise/fragments |
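To make the role of vertical_threshold and min_text_length concrete, here is a deliberately simplified grouping pass. It is not the extractor's algorithm: it assumes a single column (so horizontal_threshold is ignored), y coordinates that grow downward, and a hypothetical `Line` record invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float     # y of the line's top edge, in points
    bottom: float  # y of the line's bottom edge, in points


def group_lines(lines, vertical_threshold=15.0, min_text_length=20):
    """Merge lines separated by small vertical gaps, then drop short blocks."""
    blocks, current = [], []
    for line in sorted(lines, key=lambda l: l.top):
        # A gap larger than the threshold closes the current block.
        if current and line.top - current[-1].bottom > vertical_threshold:
            blocks.append(current)
            current = []
        current.append(line)
    if current:
        blocks.append(current)
    texts = [" ".join(l.text for l in block) for block in blocks]
    # Blocks shorter than min_text_length are treated as noise and dropped.
    return [t for t in texts if len(t) >= min_text_length]


lines = [
    Line("Quarterly revenue rose", 100.0, 112.0),
    Line("across all regions.", 116.0, 128.0),  # 4 pt gap: same block
    Line("p. 3", 300.0, 310.0),                 # 172 pt gap: new block, too short
]
print(group_lines(lines))  # ['Quarterly revenue rose across all regions.']
```

Raising vertical_threshold merges more loosely spaced lines into one block; lowering min_text_length keeps short fragments such as page numbers.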

Confidence & VLM Correction

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| base_confidence | float | 0.85 | 0.0–1.0 | Base confidence score for embedded (native) text |
| min_confidence_for_vlm | float | 0.6 | 0.0–1.0 | Confidence threshold below which VLM correction is triggered (only when use_vlm_correction=true) |
| use_vlm_correction | boolean | true | | Enable VLM correction for low-confidence blocks |
| fast_mode | boolean | false | | Skip VLM correction entirely for max throughput (~15 pages/sec). Overrides use_vlm_correction. |
| vlm_provider | string | "google" | google, openai, anthropic | LLM provider for VLM correction |
| vlm_model | string | "gemini-2.5-flash" | | Correction model, e.g. gemini-2.5-flash, gpt-4o, claude-3-5-sonnet |
| llm_api_key | string or null | null | | BYOK key for VLM correction. Supports secret references, e.g. {{SECRET.openai_api_key}}. Falls back to Mixpeek’s default keys if unset. |
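The interaction between these three switches can be written as a single per-block predicate. This is a minimal sketch of the documented behavior, with `needs_vlm` as a hypothetical name; it treats the threshold as strict ("below"), as the description of min_confidence_for_vlm states.

```python
def needs_vlm(confidence: float, *, use_vlm_correction: bool = True,
              fast_mode: bool = False, min_confidence_for_vlm: float = 0.6) -> bool:
    """Decide whether one block would be sent for VLM correction."""
    # fast_mode overrides use_vlm_correction: nothing is corrected.
    if fast_mode or not use_vlm_correction:
        return False
    # Only blocks strictly below the threshold are re-read.
    return confidence < min_confidence_for_vlm


print(needs_vlm(0.45))                  # True
print(needs_vlm(0.60))                  # False
print(needs_vlm(0.45, fast_mode=True))  # False
```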

Embedding & Thumbnails

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| run_text_embedding | boolean | true | | Generate E5-Large (1024-d) text embeddings for block content |
| generate_thumbnails | boolean | true | | Generate thumbnail images for blocks |
| thumbnail_mode | string | "both" | full_page, segment, both | Which thumbnails to render (segment = cropped to the block’s bbox) |
| thumbnail_dpi | integer | 72 | 36–150 | DPI for thumbnail generation. Lower DPI = smaller files. |
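For a feel of what the DPI settings mean in pixels: a PDF point is 1/72 inch, so a page renders at its size in points multiplied by dpi / 72. The helper below is a back-of-the-envelope sketch; the actual renderer's rounding is an assumption, and `rendered_size` is a hypothetical name.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel size of a page or block rendered at the given DPI."""
    # 72 points per inch, so pixels = points * dpi / 72.
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# US Letter is 612 x 792 points.
print(rendered_size(612, 792, 72))   # (612, 792)   thumbnail_dpi default
print(rendered_size(612, 792, 36))   # (306, 396)   smallest thumbnail_dpi
print(rendered_size(612, 792, 150))  # (1275, 1650) render_dpi default
```

Pixel count grows with the square of the DPI, which is why lower thumbnail_dpi values shrink files quickly.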

Configuration Examples

{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": {
      "pdf": "file_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.invoice_id" },
      { "source_path": "metadata.vendor" }
    ],
    "parameters": {}
  }
}
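The example above leaves parameters empty, so every default applies. As a sketch of a non-default configuration, the helper below fills parameters and checks numeric values against the ranges documented in the parameter tables before building the same structure. `build_config` and the client-side range check are illustrative assumptions; whether the service enforces these ranges itself is not stated here.

```python
import json

# Ranges copied from the parameter tables on this page.
RANGES = {
    "render_dpi": (72, 300),
    "thumbnail_dpi": (36, 150),
    "base_confidence": (0.0, 1.0),
    "min_confidence_for_vlm": (0.0, 1.0),
    "vertical_threshold": (1.0, 100.0),
    "horizontal_threshold": (1.0, 200.0),
    "min_text_length": (1, 500),
}


def build_config(**parameters):
    """Build a feature_extractor config, rejecting out-of-range numeric values."""
    for name, value in parameters.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}..{high}")
    return {
        "feature_extractor": {
            "feature_extractor_name": "document_graph_extractor",
            "version": "v1",
            "input_mappings": {"pdf": "file_url"},
            "parameters": parameters,
        }
    }


# Higher-accuracy setup for noisy scans: ML layout detection, sharper renders,
# and a more aggressive VLM threshold.
config = build_config(layout_detector="docling", render_dpi=300,
                      min_confidence_for_vlm=0.7)
print(json.dumps(config, indent=2))
```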

Layout Types

The extractor classifies blocks into these object_type values:

| Type | Description | Example |
| --- | --- | --- |
| paragraph | Body text blocks | Article content, descriptions |
| table | Tabular data | Financial tables, data grids |
| form | Form fields and labels | Application forms, surveys |
| list | Bulleted or numbered lists | Requirements, instructions |
| header | Page headers | Document titles, section headers |
| footer | Page footers | Page numbers, disclaimers |
| figure | Images and captions | Charts, diagrams, photos |
| handwritten | Handwritten text | Signatures, annotations |

Confidence Tags

Extraction quality is graded with confidence tags (thresholds on overall_confidence):

| Tag | Confidence | Description | Action |
| --- | --- | --- | --- |
| A | ≥ 0.85 | Excellent | Use directly |
| B | ≥ 0.70 | Good | Reliable, minor issues |
| C | ≥ 0.50 | Fair | Verify or trigger VLM correction |
| D | < 0.50 | Poor | Needs VLM correction |
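The thresholds translate directly into a lookup that a downstream pipeline can use to route blocks. A minimal sketch, with `confidence_tag` and `needs_review` as hypothetical helper names:

```python
def confidence_tag(score: float) -> str:
    """Map overall_confidence to its A/B/C/D tag using the thresholds above."""
    if score >= 0.85:
        return "A"
    if score >= 0.70:
        return "B"
    if score >= 0.50:
        return "C"
    return "D"


def needs_review(score: float) -> bool:
    """Route tags C and D to review, as in a confidence-gated pipeline."""
    return confidence_tag(score) in {"C", "D"}


for score in (0.91, 0.72, 0.55, 0.30):
    print(score, confidence_tag(score), needs_review(score))
```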

Performance & Costs

| Metric | Value |
| --- | --- |
| Cost | 5 credits per page + 20 credits per VLM correction |
| pymupdf throughput | ~15 pages/sec (rule-based) |
| docling throughput | ~3–8 sec/doc (ML/DiT) |
| Fast mode | ~15 pages/sec (skips VLM correction) |
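The cost line reduces to simple arithmetic. The sketch below applies it to a made-up job; `estimate_credits` is a hypothetical helper, and the number of VLM corrections must be estimated by the caller since it depends on how many blocks fall below min_confidence_for_vlm.

```python
def estimate_credits(pages: int, vlm_corrections: int = 0) -> int:
    """5 credits per page plus 20 credits per VLM correction."""
    return 5 * pages + 20 * vlm_corrections


# Hypothetical 40-page scan where 6 blocks need VLM correction.
print(estimate_credits(40, 6))  # 320
# The same document in fast_mode (no VLM corrections).
print(estimate_credits(40))     # 200
```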

Vector Index

| Property | Value |
| --- | --- |
| Index name | document_graph_extractor_v1_text_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | multilingual_e5_large_instruct_v1 (intfloat/multilingual-e5-large-instruct) |
| Normalization | L2 normalized |

The embedding is optional. Set run_text_embedding: false for a layout-only extraction with no vector index.
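One practical consequence of L2 normalization is that cosine similarity between two stored vectors equals their plain dot product. The toy example below checks this with 3-dimensional vectors standing in for the 1024-dimensional embeddings; the vectors and helper names are invented for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0, 0.0], [4.0, 3.0, 0.0]
ua, ub = l2_normalize(a), l2_normalize(b)

# Both values agree up to floating-point error (24/25 = 0.96).
print(cosine(a, b), dot(ua, ub))
```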

Limitations

  • PDF only: Accepts a single PDF per object; does not process Word docs, images, or other formats.
  • VLM cost: Each correction adds 20 credits; gate it with min_confidence_for_vlm or disable via fast_mode.
  • Layout-detector tradeoff: docling is more accurate but markedly slower than pymupdf.
  • Memory: Large PDFs (100+ pages) may require increased memory.
  • Language / handwriting: OCR works best with Latin scripts; handwriting detection is experimental and less reliable.
  • External dependency: VLM correction depends on the selected provider’s availability.

Search the Extracted Text

Extracted block text (native or OCR/VLM-corrected) is embedded into the document_graph_extractor_v1_text_embedding index, so you search it with a feature_search stage against mixpeek://document_graph_extractor@v1/intfloat__multilingual_e5_large_instruct (input_mode: "text"). For a ready-to-copy retriever — including confidence and layout-type filtering — see Cookbook → Search OCR Text from Scanned PDFs.