# Document Graph Extractor

> Extract spatial blocks from PDFs with layout classification, confidence scoring, optional VLM correction, and 1024-d E5 text embeddings

Runnable reference for this extractor: inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.

Document graph extractor pipeline showing PDF parsing, layout detection, and block extraction

The document graph extractor decomposes PDFs into **spatial blocks** (paragraphs, tables, forms, lists, headers, footers, figures, and handwritten content), each with a bounding box, a layout class, and a confidence score. Low-confidence blocks can be corrected by a vision language model (Gemini, GPT-4V, or Claude). Block text is optionally embedded with E5-Large (1024-d) for semantic search. It is the right tool for archival documents, scanned files, and anything that needs spatial understanding rather than a flat text dump.

View extractor details at [api.mixpeek.com/v1/collections/features/extractors/document\_graph\_extractor\_v1](https://api.mixpeek.com/v1/collections/features/extractors/document_graph_extractor_v1) or fetch programmatically with `GET /v1/collections/features/extractors/{feature_extractor_id}`.

## Pipeline Steps

1. **Layout detection** (if `use_layout_detection`, the default): find all document elements with `layout_detector` (`pymupdf` fast rule-based, or `docling` SOTA ML/DiT). Detects text regions **and** non-text elements (scanned images, figures, charts) as separate blocks.
2. **Block grouping**: when layout detection is off, group text spans into blocks using `vertical_threshold` / `horizontal_threshold`; drop blocks shorter than `min_text_length`.
3. **Confidence scoring**: assign `base_confidence` to native text (with penalties for OCR artifacts and encoding issues), then tag each block A/B/C/D.
4. **VLM correction** (if `use_vlm_correction` and confidence \< `min_confidence_for_vlm`): re-read low-confidence blocks with `vlm_provider`/`vlm_model`. Skipped entirely in `fast_mode`.
5. **Text embedding** (if `run_text_embedding`): embed block text with E5-Large (1024-d).
6. **Thumbnails** (if `generate_thumbnails`): render full-page and/or per-block thumbnails per `thumbnail_mode`.
7. **Output**: one document per block with layout class, bbox, text, confidence, and optional embedding/thumbnails.

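The confidence tagging and VLM gating in steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not the extractor's actual implementation: the tag thresholds and parameter defaults are taken from the tables on this page, while the names `confidence_tag`, `VlmSettings`, and `needs_vlm_correction` are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass

# Tag floors as listed in this page's Confidence Tags table.
TAG_THRESHOLDS = (("A", 0.85), ("B", 0.70), ("C", 0.50))


def confidence_tag(score: float) -> str:
    """Map an overall_confidence value (0.0-1.0) to its A/B/C/D tag."""
    for tag, floor in TAG_THRESHOLDS:
        if score >= floor:
            return tag
    return "D"


@dataclass
class VlmSettings:
    """Hypothetical container; defaults mirror the documented parameter defaults."""
    use_vlm_correction: bool = True
    min_confidence_for_vlm: float = 0.6
    fast_mode: bool = False


def needs_vlm_correction(score: float, settings: VlmSettings) -> bool:
    """Assumed gate: fast_mode overrides everything, then the flag, then the threshold."""
    if settings.fast_mode or not settings.use_vlm_correction:
        return False
    return score < settings.min_confidence_for_vlm


if __name__ == "__main__":
    defaults = VlmSettings()
    for score in (0.91, 0.72, 0.55, 0.30):
        print(score, confidence_tag(score), needs_vlm_correction(score, defaults))
```

With the default threshold of `0.6`, a block scored `0.55` is tagged `C` and would be sent for correction, while one scored `0.72` is tagged `B` and left alone. Raising `min_confidence_for_vlm` widens the set of blocks that get re-read.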
## When to Use

| Use Case | Description |
| --- | --- |
| **Archival / scanned PDFs** | OCR + layout recovery with VLM correction for noisy scans and historical records |
| **Forms & tables** | Classify and isolate form fields, tables, and structured regions |
| **Spatial search** | Retrieve blocks by location and type, not just text |
| **Confidence-gated pipelines** | Route low-confidence blocks (tags C/D) to review |
| **Multi-layout documents** | Reports, contracts, and multi-column layouts that need block-level granularity |

## When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Plain text you already have, or born-digital PDFs with perfect text | [`text_extractor`](/processing/extractors/text) (faster, simpler) |
| Whole-document multimodal embedding | [`multimodal_extractor`](/processing/extractors/multimodal) / [`universal_extractor`](/processing/extractors/universal) |
| Maximum throughput, no spatial detail | `text_extractor`, or this extractor with `fast_mode: true` |
| Images only | [`image_extractor`](/processing/extractors/image) |
| Non-PDF inputs | This extractor is PDF-only |

## Input Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `pdf` | string | **Yes** | URL or path to the PDF file. Supports multi-page PDFs. Populated from `input_mappings`. |

```json theme={null}
{
  "pdf": "s3://my-bucket/contracts/lease.pdf"
}
```

**Input Examples:**

| Type | Example |
| --- | --- |
| Invoice | `s3://documents/invoices/inv-001.pdf` |
| Contract | `https://cdn.example.com/contracts/lease.pdf` |
| Scanned document | `s3://archive/scanned/1985-report.pdf` |
| Form | `s3://forms/application-form.pdf` |

Supported input types: **PDF only** (max 1 PDF per object). Max file size 100MB. For scanned documents, 150–300 DPI originals give the best OCR.

## Output Schema

One document per detected block:

| Field | Type | Description |
| --- | --- | --- |
| `page_number` | integer | Page number in the original PDF (1-indexed) |
| `object_type` | enum | `paragraph`, `table`, `form`, `list`, `header`, `footer`, `figure`, or `handwritten` |
| `block_index` | integer | Block index within the page (0-indexed) |
| `bbox` | object | Bounding box: `{ x0, y0, x1, y1 }` |
| `text_raw` | string | Original extracted text |
| `text_corrected` | string \| null | Cleaned / VLM-corrected text |
| `overall_confidence` | float | Confidence score 0.0–1.0 |
| `confidence_tag` | enum | `A` (`≥0.85`), `B` (`≥0.70`), `C` (`≥0.50`), `D` (`<0.50`) |
| `document_graph_extractor_v1_text_embedding` | float\[1024] \| null | E5-Large block embedding (when `run_text_embedding`) |
| `thumbnail_url` / `segment_thumbnail_url` | string \| null | Full-page / per-block thumbnail URLs |
| `total_pages` | integer \| null | Total pages in the source PDF |
| `source_file` | string \| null | Original source file name |

```json theme={null}
{
  "page_number": 1,
  "object_type": "table",
  "block_index": 3,
  "bbox": { "x0": 72.0, "y0": 220.4, "x1": 540.0, "y1": 410.9 },
  "text_raw": "Item Qty Price\nWidget 10 $4.00",
  "text_corrected": "Item | Qty | Price\nWidget | 10 | $4.00",
  "overall_confidence": 0.91,
  "confidence_tag": "A"
}
```

## Parameters

### Layout Detection

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| `use_layout_detection` | boolean | `true` | n/a | Enable ML-based layout detection to find all elements (text, images, tables, figures). When disabled, falls back to text-only extraction (faster, misses images). |
| `layout_detector` | string | `"pymupdf"` | `pymupdf`, `docling` | Engine used when `use_layout_detection=true`. `pymupdf`: fast rule-based (\~15 pages/sec). `docling`: SOTA ML/DiT, with better semantic type detection and true table structure (\~3–8 sec/doc). |
| `render_dpi` | integer | `150` | 72–300 | DPI for page rendering (used for VLM correction). 72 fast/lower quality, 150 balanced, 300 high quality/slower. |

### Spatial Clustering (text-only fallback)

Only used when `use_layout_detection=false`.


| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| `vertical_threshold` | float | `15.0` | 1.0–100.0 | Max vertical gap (points) between lines grouped into the same block |
| `horizontal_threshold` | float | `50.0` | 1.0–200.0 | Max horizontal distance (points) for overlap/column detection |
| `min_text_length` | integer | `20` | 1–500 | Minimum block text length (chars); filters noise/fragments |

### Confidence & VLM Correction

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| `base_confidence` | float | `0.85` | 0.0–1.0 | Base confidence score for embedded (native) text |
| `min_confidence_for_vlm` | float | `0.6` | 0.0–1.0 | Confidence threshold below which VLM correction is triggered (only when `use_vlm_correction=true`) |
| `use_vlm_correction` | boolean | `true` | n/a | Enable VLM correction for low-confidence blocks |
| `fast_mode` | boolean | `false` | n/a | Skip VLM correction entirely for max throughput (\~15 pages/sec). Overrides `use_vlm_correction` |
| `vlm_provider` | string | `"google"` | `google`, `openai`, `anthropic` | LLM provider for VLM correction |
| `vlm_model` | string | `"gemini-2.5-flash"` | n/a | Correction model, e.g. `gemini-2.5-flash`, `gpt-4o`, `claude-3-5-sonnet` |
| `llm_api_key` | string \| null | `null` | n/a | BYOK key for VLM correction. Supports secret references, e.g. `{{SECRET.openai_api_key}}`. Falls back to Mixpeek's default keys if unset. |

### Embedding & Thumbnails

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| `run_text_embedding` | boolean | `true` | n/a | Generate E5-Large (1024-d) text embeddings for block content |
| `generate_thumbnails` | boolean | `true` | n/a | Generate thumbnail images for blocks |
| `thumbnail_mode` | string | `"both"` | `full_page`, `segment`, `both` | Which thumbnails to render (`segment` = cropped to the block's bbox) |
| `thumbnail_dpi` | integer | `72` | 36–150 | DPI for thumbnail generation. Lower DPI = smaller files. |

## Configuration Examples

```json Default (layout + VLM + embeddings) theme={null}
{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": { "pdf": "file_url" },
    "field_passthrough": [
      { "source_path": "metadata.invoice_id" },
      { "source_path": "metadata.vendor" }
    ],
    "parameters": {}
  }
}
```

```json Fast Mode (max throughput, no VLM) theme={null}
{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": { "pdf": "file_url" },
    "parameters": {
      "fast_mode": true,
      "generate_thumbnails": false
    }
  }
}
```

```json High-Accuracy Scanned Docs (docling + VLM) theme={null}
{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": { "pdf": "scanned_doc" },
    "parameters": {
      "layout_detector": "docling",
      "min_confidence_for_vlm": 0.75,
      "vlm_provider": "anthropic",
      "vlm_model": "claude-3-5-sonnet",
      "render_dpi": 200
    }
  }
}
```

```json Text-Only Fallback (no layout detection) theme={null}
{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": { "pdf": "pdf_url" },
    "parameters": {
      "use_layout_detection": false,
      "vertical_threshold": 15.0,
      "horizontal_threshold": 50.0,
      "run_text_embedding": true
    }
  }
}
```

## Layout Types

The extractor classifies blocks into these `object_type` values:

| Type | Description | Example |
| --- | --- | --- |
| `paragraph` | Body text blocks | Article content, descriptions |
| `table` | Tabular data | Financial tables, data grids |
| `form` | Form fields and labels | Application forms, surveys |
| `list` | Bulleted or numbered lists | Requirements, instructions |
| `header` | Page headers | Document titles, section headers |
| `footer` | Page footers | Page numbers, disclaimers |
| `figure` | Images and captions | Charts, diagrams, photos |
| `handwritten` | Handwritten text | Signatures, annotations |

## Confidence Tags

Extraction quality is graded with confidence tags (thresholds on `overall_confidence`):

| Tag | Confidence | Description | Action |
| --- | --- | --- | --- |
| **A** | ≥ 0.85 | Excellent | Use directly |
| **B** | ≥ 0.70 | Good | Reliable, minor issues |
| **C** | ≥ 0.50 | Fair | Verify or trigger VLM correction |
| **D** | \< 0.50 | Poor | Needs VLM correction |

## Performance & Costs

| Metric | Value |
| --- | --- |
| **Cost** | 5 credits per page + 20 credits per VLM correction |
| **`pymupdf` throughput** | \~15 pages/sec (rule-based) |
| **`docling` throughput** | \~3–8 sec/doc (ML/DiT) |
| **Fast mode** | \~15 pages/sec (skips VLM correction) |

## Vector Index

| Property | Value |
| --- | --- |
| **Index name** | `document_graph_extractor_v1_text_embedding` |
| **Dimensions** | 1024 |
| **Type** | Dense |
| **Distance metric** | Cosine |
| **Inference model** | `multilingual_e5_large_instruct_v1` (`intfloat/multilingual-e5-large-instruct`) |
| **Normalization** | L2 normalized |

The embedding is optional.
Set `run_text_embedding: false` for a layout-only extraction with no vector index.

## Limitations

* **PDF only**: Accepts a single PDF per object; does not process Word docs, images, or other formats.
* **VLM cost**: Each correction adds 20 credits; gate it with `min_confidence_for_vlm` or disable via `fast_mode`.
* **Layout-detector tradeoff**: `docling` is more accurate but markedly slower than `pymupdf`.
* **Memory**: Large PDFs (100+ pages) may require increased memory.
* **Language / handwriting**: OCR works best with Latin scripts; handwriting detection is experimental and less reliable.
* **External dependency**: VLM correction depends on the selected provider's availability.

## Search the Extracted Text

Extracted block text (native or OCR/VLM-corrected) is embedded into the `document_graph_extractor_v1_text_embedding` index, so you search it with a [`feature_search`](/retrieval/stages/feature-search) stage against `mixpeek://document_graph_extractor@v1/intfloat__multilingual_e5_large_instruct` (`input_mode: "text"`).

For a ready-to-copy retriever, including confidence and layout-type filtering, see [Cookbook → Search OCR Text from Scanned PDFs](/retrieval/cookbook#search-ocr-text-from-scanned-pdfs).

## Related

* [Feature Extractors Overview](/processing/feature-extractors)
* [Text Extractor](/processing/extractors/text)
* [Image Extractor](/processing/extractors/image)
* [Multimodal Extractor](/processing/extractors/multimodal)
* [Universal Extractor](/processing/extractors/universal)
* [Retrieval Cookbook: OCR / scanned-document search](/retrieval/cookbook#search-ocr-text-from-scanned-pdfs)
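
As a closing worked example, the pricing listed under Performance & Costs (5 credits per page plus 20 credits per VLM correction) can be turned into a rough budget check. The sketch below assumes the two charges simply add and that no other fees apply; `estimate_credits` is a hypothetical helper, not part of any Mixpeek SDK.

```python
# Rates quoted in the Performance & Costs table on this page.
CREDITS_PER_PAGE = 5
CREDITS_PER_VLM_CORRECTION = 20


def estimate_credits(total_pages: int, vlm_corrections: int = 0) -> int:
    """Rough estimate: per-page charge plus per-correction charge (assumed additive)."""
    if total_pages < 0 or vlm_corrections < 0:
        raise ValueError("page and correction counts must be non-negative")
    return total_pages * CREDITS_PER_PAGE + vlm_corrections * CREDITS_PER_VLM_CORRECTION


if __name__ == "__main__":
    # A 40-page scan where 12 blocks fall below min_confidence_for_vlm.
    print(estimate_credits(40, 12))  # 40*5 + 12*20 = 440
    # The same document in fast_mode, where VLM correction is skipped.
    print(estimate_credits(40))      # 40*5 = 200
```

Under these assumptions the VLM corrections account for more than half of the first estimate, which is why gating them with `min_confidence_for_vlm` or `fast_mode` matters for noisy scans.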