The document graph extractor decomposes PDFs into spatial blocks (paragraphs, tables, forms, lists, headers, footers, figures, and handwritten content), each with a bounding box, a layout class, and a confidence score. Low-confidence blocks can be corrected by a vision language model (Gemini, GPT-4V, or Claude). Block text is optionally embedded with E5-Large (1024-d) for semantic search. Use this extractor for archival documents, scanned files, and anything that needs spatial understanding rather than a flat text dump.
View extractor details at api.mixpeek.com/v1/collections/features/extractors/document_graph_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.

Pipeline Steps

  1. Layout detection (if use_layout_detection) — find all document elements with layout_detector (pymupdf fast rule-based, or docling SOTA ML/DiT).
  2. Block grouping — group lines into blocks using vertical_threshold / horizontal_threshold; drop blocks shorter than min_text_length.
  3. Confidence scoring — assign base_confidence to native text; tag each block A/B/C/D.
  4. VLM correction (if use_vlm_correction and below min_confidence_for_vlm) — re-read low-confidence blocks with vlm_provider/vlm_model. Skipped entirely in fast_mode.
  5. Text embedding (if run_text_embedding) — embed block text with E5-Large (1024-d).
  6. Thumbnails (if generate_thumbnails) — render full-page and/or per-block thumbnails per thumbnail_mode.
  7. Output — one document per block with layout class, bbox, text, confidence, and optional embedding/thumbnails.
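
Step 2 (block grouping) can be pictured with a short sketch. The extractor's real grouping logic is not documented here, so this is an illustration under assumptions: lines are hypothetical dicts with `text` and a `bbox` tuple in points, y grows downward, and `horizontal_threshold` is read as the largest allowed horizontal gap between a line and the block it joins.

```python
def group_lines(lines, vertical_threshold=15.0, horizontal_threshold=50.0,
                min_text_length=20):
    """Greedy top-to-bottom grouping (illustrative, not the extractor's code)."""
    blocks = []
    for line in sorted(lines, key=lambda l: l["bbox"][1]):  # sort by top edge
        x0, y0, x1, y1 = line["bbox"]
        if blocks:
            bx0, by0, bx1, by1 = blocks[-1]["bbox"]
            v_gap = y0 - by1                      # gap below the current block
            h_gap = max(x0 - bx1, bx0 - x1, 0.0)  # 0 when x-ranges overlap
            if v_gap <= vertical_threshold and h_gap <= horizontal_threshold:
                # Close enough: extend the current block and grow its bbox.
                blocks[-1]["text"] += "\n" + line["text"]
                blocks[-1]["bbox"] = (min(bx0, x0), min(by0, y0),
                                      max(bx1, x1), max(by1, y1))
                continue
        blocks.append({"text": line["text"], "bbox": (x0, y0, x1, y1)})
    # Drop fragments shorter than min_text_length (page numbers, stray marks).
    return [b for b in blocks if len(b["text"]) >= min_text_length]
```

Raising `vertical_threshold` merges loosely spaced lines into larger blocks; raising `min_text_length` filters more aggressively.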

When to Use

| Use Case | Description |
| --- | --- |
| Archival/scanned PDFs | OCR + layout recovery with VLM correction for noisy scans |
| Forms & tables | Classify and isolate form fields, tables, and structured regions |
| Spatial search | Retrieve blocks by location and type, not just text |
| Confidence-gated pipelines | Route low-confidence blocks (tags C/D) to review |
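
As a rough illustration of spatial search, the sketch below filters already-retrieved block documents by region and type on the client side. The field names (`bbox`, `object_type`) follow the Output Schema; the helper `blocks_in_region` is hypothetical and not part of any Mixpeek API.

```python
def blocks_in_region(blocks, region, object_types=None):
    """Return blocks whose bbox lies fully inside region (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = region
    hits = []
    for block in blocks:
        box = block["bbox"]
        inside = (box["x0"] >= rx0 and box["y0"] >= ry0
                  and box["x1"] <= rx1 and box["y1"] <= ry1)
        # object_types=None means "any layout class".
        if inside and (object_types is None or block["object_type"] in object_types):
            hits.append(block)
    return hits
```

For example, `blocks_in_region(blocks, (0, 200, 612, 450), {"table"})` keeps only tables that sit entirely inside that band of the page.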

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Plain text you already have | text_extractor |
| Whole-document multimodal embedding | multimodal_extractor / universal_extractor |
| Maximum throughput, no spatial detail | text_extractor, or this extractor with fast_mode: true |
| Non-PDF inputs | This extractor is PDF-only |

Input Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| pdf | string | Yes | URL or path to the PDF file. Populated from input_mappings. |
{
  "pdf": "s3://my-bucket/contracts/lease.pdf"
}
Supported input types: PDF (max 1 PDF per object).

Output Schema

One document per detected block:
| Field | Type | Description |
| --- | --- | --- |
| page_number | integer | Page number in the original PDF (1-indexed) |
| object_type | enum | paragraph, table, form, list, header, footer, figure, or handwritten |
| block_index | integer | Block index within the page (0-indexed) |
| bbox | object | Bounding box: { x0, y0, x1, y1 } |
| text_raw | string | Original extracted text |
| text_corrected | string or null | Cleaned / VLM-corrected text |
| overall_confidence | float | Confidence score 0.0–1.0 |
| confidence_tag | enum | A (≥0.85), B (≥0.70), C (≥0.50), D (<0.50) |
| document_graph_extractor_v1_text_embedding | float[1024] or null | E5-Large block embedding (when run_text_embedding) |
| thumbnail_url / segment_thumbnail_url | string or null | Full-page / per-block thumbnail URLs |
| total_pages | integer or null | Total pages in the source PDF |
| source_file | string or null | Original source file name |
{
  "page_number": 1,
  "object_type": "table",
  "block_index": 3,
  "bbox": { "x0": 72.0, "y0": 220.4, "x1": 540.0, "y1": 410.9 },
  "text_raw": "Item  Qty  Price\nWidget  10  $4.00",
  "text_corrected": "Item | Qty | Price\nWidget | 10 | $4.00",
  "overall_confidence": 0.91,
  "confidence_tag": "A"
}
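
The `confidence_tag` thresholds in the table above map to a few comparisons. The following is a minimal sketch of that mapping, plus a hypothetical `needs_review` helper for confidence-gated pipelines; neither function is part of the extractor's API.

```python
def confidence_tag(score: float) -> str:
    """Map overall_confidence to the documented A/B/C/D tag."""
    if score >= 0.85:
        return "A"
    if score >= 0.70:
        return "B"
    if score >= 0.50:
        return "C"
    return "D"


def needs_review(block: dict) -> bool:
    """Route tags C and D to review (hypothetical policy helper)."""
    return block["confidence_tag"] in {"C", "D"}
```

The sample block above has `overall_confidence` 0.91, which falls in the A band.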

Parameters

Layout Detection

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| use_layout_detection | boolean | true | | Enable layout detection to find all elements (text, images, tables, figures); the engine is set by layout_detector |
| layout_detector | string | "pymupdf" | pymupdf, docling | Engine. pymupdf: fast rule-based (~15 pages/sec). docling: SOTA ML/DiT (~3–8 sec/doc) |
| vertical_threshold | float | 15.0 | 1.0–100.0 | Max vertical gap (points) between lines grouped into the same block |
| horizontal_threshold | float | 50.0 | 1.0–200.0 | Max horizontal distance (points) for overlap detection |
| min_text_length | integer | 20 | 1–500 | Minimum block text length (chars); filters noise/fragments |

Confidence & VLM Correction

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| base_confidence | float | 0.85 | 0.0–1.0 | Base confidence score for embedded (native) text |
| min_confidence_for_vlm | float | 0.6 | 0.0–1.0 | Threshold below which VLM correction is triggered |
| use_vlm_correction | boolean | true | | Enable VLM correction for low-confidence blocks |
| fast_mode | boolean | false | | Skip VLM correction for max throughput (~15 pages/sec). Overrides use_vlm_correction |
| vlm_provider | string | "google" | | LLM provider: google, openai, anthropic |
| vlm_model | string | "gemini-2.5-flash" | | Correction model, e.g. gemini-2.5-flash, gpt-4o, claude-3-5-sonnet |
| llm_api_key | string or null | null | | BYOK key for VLM correction. Supports {{SECRET.openai_api_key}} references |
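
How these flags interact can be summarised in a few lines. This is a sketch of the gating rule as described in this section (fast_mode overrides use_vlm_correction; correction triggers strictly below the threshold), not the extractor's actual code; `should_run_vlm` is a hypothetical name.

```python
def should_run_vlm(overall_confidence, *, use_vlm_correction=True,
                   fast_mode=False, min_confidence_for_vlm=0.6):
    """Decide whether a block would be sent to the VLM for correction."""
    if fast_mode or not use_vlm_correction:
        return False  # fast_mode wins even when use_vlm_correction is true
    return overall_confidence < min_confidence_for_vlm
```

With the defaults, a native-text block that keeps its `base_confidence` of 0.85 stays above the 0.6 threshold and is not corrected.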

Embedding & Rendering

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| run_text_embedding | boolean | true | | Generate E5-Large (1024-d) text embeddings for block content |
| render_dpi | integer | 150 | 72–300 | DPI for page rendering (used for VLM correction) |
| generate_thumbnails | boolean | true | | Generate thumbnail images for blocks |
| thumbnail_mode | string | "both" | full_page, segment, both | Which thumbnails to render |
| thumbnail_dpi | integer | 72 | 36–150 | DPI for thumbnail generation |
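
To overlay a block on a rendered page or thumbnail, its bbox has to be scaled to pixels. The sketch below assumes `bbox` values are in PDF points (1/72 inch), which this page states for the grouping thresholds but not explicitly for `bbox`; `bbox_to_pixels` is a hypothetical helper.

```python
def bbox_to_pixels(bbox, dpi=150):
    """Scale a bbox from points (assumed) to pixel coordinates at the given DPI."""
    # One point is 1/72 inch, so pixels = points * dpi / 72.
    return {key: round(value * dpi / 72) for key, value in bbox.items()}
```

At the default `thumbnail_dpi` of 72 the scale factor is 1, so pixel and point coordinates coincide up to rounding.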

Configuration Examples

{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": {
      "pdf": "file_url"
    },
    "parameters": {}
  }
}
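
The example above leaves `parameters` empty, so every default applies. As a sketch of how variants might be assembled, the hypothetical helper below builds the same structure and checks numeric values against the ranges listed under Parameters. It is not part of any SDK, and the `file_url` mapping is simply copied from the example.

```python
import json

# Documented numeric ranges, copied from the Parameters tables.
RANGES = {
    "vertical_threshold": (1.0, 100.0),
    "horizontal_threshold": (1.0, 200.0),
    "min_text_length": (1, 500),
    "base_confidence": (0.0, 1.0),
    "min_confidence_for_vlm": (0.0, 1.0),
    "render_dpi": (72, 300),
    "thumbnail_dpi": (36, 150),
}


def build_config(**parameters):
    """Build an extractor config dict, rejecting out-of-range numeric values."""
    for name, value in parameters.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low} to {high}")
    return {
        "feature_extractor": {
            "feature_extractor_name": "document_graph_extractor",
            "version": "v1",
            "input_mappings": {"pdf": "file_url"},
            "parameters": parameters,
        }
    }


# Throughput-oriented: no VLM correction, no thumbnails.
fast = build_config(fast_mode=True, generate_thumbnails=False)
# Accuracy-oriented: ML layout detection and a higher correction threshold.
accurate = build_config(layout_detector="docling",
                        min_confidence_for_vlm=0.7, render_dpi=200)
print(json.dumps(fast, indent=2))
```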

Performance & Costs

| Metric | Value |
| --- | --- |
| Cost | 5 credits per page + 20 credits per VLM correction |
| pymupdf throughput | ~15 pages/sec (rule-based) |
| docling throughput | ~3–8 sec/doc (ML/DiT) |
| Fast mode | ~15 pages/sec (skips VLM correction) |
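
The cost line works out to simple arithmetic. A minimal sketch, assuming credits are charged exactly as stated (5 per page, 20 per corrected block) with no other fees; `estimate_credits` is a hypothetical helper:

```python
def estimate_credits(pages: int, vlm_corrections: int = 0) -> int:
    """Estimate credits: 5 per page plus 20 per VLM correction."""
    return 5 * pages + 20 * vlm_corrections
```

For example, a 40-page scan with 12 corrected blocks comes to 5 × 40 + 20 × 12 = 440 credits, while the same file in fast_mode costs 200.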

Vector Index

| Property | Value |
| --- | --- |
| Index name | document_graph_extractor_v1_text_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | intfloat/multilingual-e5-large-instruct |
| Normalization | L2 normalized |
The embedding is optional. Set run_text_embedding: false for a layout-only extraction with no vector index.
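
Because the stored vectors are L2 normalized, cosine similarity between two of them reduces to a plain dot product. The sketch below demonstrates this with short toy vectors in pure Python; real embeddings have 1024 dimensions, and the function names are illustrative only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a, b):
    """Cosine similarity for arbitrary non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query, block = [3.0, 4.0], [4.0, 3.0]
# After normalization the dot product equals the cosine similarity.
similarity = dot(l2_normalize(query), l2_normalize(block))
```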

Limitations

  • PDF only: Accepts a single PDF per object.
  • VLM cost: Each correction adds 20 credits; gate it with min_confidence_for_vlm or disable via fast_mode.
  • Layout-detector tradeoff: docling is more accurate but markedly slower than pymupdf.
  • External dependency: VLM correction depends on the selected provider’s availability.