# Document Graph Extractor

> Extract spatial blocks from PDFs with layout classification, confidence scoring, optional VLM correction, and 1024-d E5 text embeddings

<Card title="View on GitHub" icon="github" href="https://github.com/mixpeek/mixpeek-extractors/blob/main/extractors/document_graph_extractor/README.md" horizontal>
  Runnable reference for this extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry.
</Card>

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/extractors/document.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=0984be250546b00316f53538ab0a9d33" alt="Document graph extractor pipeline showing PDF parsing, layout detection, and block extraction" width="1100" height="480" data-path="assets/extractors/document.svg" />
</Frame>

The document graph extractor decomposes PDFs into **spatial blocks** — paragraphs, tables, forms, lists, headers, footers, figures, and handwritten content — each with a bounding box, a layout class, and a confidence score. Low-confidence blocks can be corrected by a vision language model (Gemini, GPT-4V, or Claude). Block text is optionally embedded with E5-Large (1024-d) for semantic search. It is the right tool for archival documents, scanned files, and anything that needs spatial understanding rather than a flat text dump.

<Note>
  View extractor details at [api.mixpeek.com/v1/collections/features/extractors/document\_graph\_extractor\_v1](https://api.mixpeek.com/v1/collections/features/extractors/document_graph_extractor_v1) or fetch programmatically with `GET /v1/collections/features/extractors/{feature_extractor_id}`.
</Note>

## Pipeline Steps

1. **Layout detection** (if `use_layout_detection`, the default) — find all document elements with `layout_detector` (`pymupdf` fast rule-based, or `docling` SOTA ML/DiT). Detects text regions **and** non-text elements (scanned images, figures, charts) as separate blocks.
2. **Block grouping** — when layout detection is off, group text spans into blocks using `vertical_threshold` / `horizontal_threshold`; drop blocks shorter than `min_text_length`.
3. **Confidence scoring** — assign `base_confidence` to native text (penalties for OCR artifacts / encoding issues), then tag each block A/B/C/D.
4. **VLM correction** (if `use_vlm_correction` and confidence \< `min_confidence_for_vlm`) — re-read low-confidence blocks with `vlm_provider`/`vlm_model`. Skipped entirely in `fast_mode`.
5. **Text embedding** (if `run_text_embedding`) — embed block text with E5-Large (1024-d).
6. **Thumbnails** (if `generate_thumbnails`) — render full-page and/or per-block thumbnails per `thumbnail_mode`.
7. **Output** — one document per block with layout class, bbox, text, confidence, and optional embedding/thumbnails.
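
The step order above can be sketched as a small planning function. This is an illustrative model of the documented control flow, not the extractor's actual code; `plan_steps` and the step names are hypothetical, and the defaults are copied from the parameter tables on this page.

```python
# Defaults copied from the parameter tables on this page.
DEFAULTS = {
    "use_layout_detection": True,
    "use_vlm_correction": True,
    "fast_mode": False,
    "run_text_embedding": True,
    "generate_thumbnails": True,
}


def plan_steps(parameters=None):
    """Return the pipeline steps that would be enabled, in order (hypothetical helper)."""
    p = {**DEFAULTS, **(parameters or {})}
    steps = []
    # Steps 1 and 2 are alternatives: layout detection, or the text-only fallback.
    steps.append("layout_detection" if p["use_layout_detection"] else "block_grouping")
    steps.append("confidence_scoring")
    # fast_mode overrides use_vlm_correction.
    if p["use_vlm_correction"] and not p["fast_mode"]:
        steps.append("vlm_correction")
    if p["run_text_embedding"]:
        steps.append("text_embedding")
    if p["generate_thumbnails"]:
        steps.append("thumbnails")
    steps.append("output")
    return steps


default_plan = plan_steps()
fast_plan = plan_steps({"fast_mode": True, "generate_thumbnails": False})
```

Even when `vlm_correction` is enabled, it only applies to blocks whose confidence falls below `min_confidence_for_vlm`.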

## When to Use

| Use Case                       | Description                                                                      |
| ------------------------------ | -------------------------------------------------------------------------------- |
| **Archival / scanned PDFs**    | OCR + layout recovery with VLM correction for noisy scans and historical records |
| **Forms & tables**             | Classify and isolate form fields, tables, and structured regions                 |
| **Spatial search**             | Retrieve blocks by location and type, not just text                              |
| **Confidence-gated pipelines** | Route low-confidence blocks (tags C/D) to review                                 |
| **Multi-layout documents**     | Reports, contracts, and multi-column layouts that need block-level granularity   |

## When NOT to Use

| Scenario                                                            | Recommended Alternative                                                                                                 |
| ------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| Plain text you already have, or born-digital PDFs with perfect text | [`text_extractor`](/processing/extractors/text) (faster, simpler)                                                       |
| Whole-document multimodal embedding                                 | [`multimodal_extractor`](/processing/extractors/multimodal) / [`universal_extractor`](/processing/extractors/universal) |
| Maximum throughput, no spatial detail                               | `text_extractor`, or this extractor with `fast_mode: true`                                                              |
| Images only                                                         | [`image_extractor`](/processing/extractors/image)                                                                       |
| Non-PDF inputs                                                      | This extractor is PDF-only                                                                                              |

## Input Schema

| Field | Type   | Required | Description                                                                             |
| ----- | ------ | -------- | --------------------------------------------------------------------------------------- |
| `pdf` | string | **Yes**  | URL or path to the PDF file. Supports multi-page PDFs. Populated from `input_mappings`. |

```json
{
  "pdf": "s3://my-bucket/contracts/lease.pdf"
}
```

**Input Examples:**

| Type             | Example                                       |
| ---------------- | --------------------------------------------- |
| Invoice          | `s3://documents/invoices/inv-001.pdf`         |
| Contract         | `https://cdn.example.com/contracts/lease.pdf` |
| Scanned document | `s3://archive/scanned/1985-report.pdf`        |
| Form             | `s3://forms/application-form.pdf`             |

Supported input types: **PDF only** (max 1 PDF per object). Maximum file size: 100 MB. For scanned documents, 150–300 DPI originals give the best OCR results.

## Output Schema

One document per detected block:

| Field                                        | Type                 | Description                                                                          |
| -------------------------------------------- | -------------------- | ------------------------------------------------------------------------------------ |
| `page_number`                                | integer              | Page number in the original PDF (1-indexed)                                          |
| `object_type`                                | enum                 | `paragraph`, `table`, `form`, `list`, `header`, `footer`, `figure`, or `handwritten` |
| `block_index`                                | integer              | Block index within the page (0-indexed)                                              |
| `bbox`                                       | object               | Bounding box: `{ x0, y0, x1, y1 }`                                                   |
| `text_raw`                                   | string               | Original extracted text                                                              |
| `text_corrected`                             | string \| null       | Cleaned / VLM-corrected text                                                         |
| `overall_confidence`                         | float                | Confidence score 0.0–1.0                                                             |
| `confidence_tag`                             | enum                 | `A` (`≥0.85`), `B` (`≥0.70`), `C` (`≥0.50`), `D` (`<0.50`)                           |
| `document_graph_extractor_v1_text_embedding` | float\[1024] \| null | E5-Large block embedding (when `run_text_embedding`)                                 |
| `thumbnail_url` / `segment_thumbnail_url`    | string \| null       | Full-page / per-block thumbnail URLs                                                 |
| `total_pages`                                | integer \| null      | Total pages in the source PDF                                                        |
| `source_file`                                | string \| null       | Original source file name                                                            |

```json
{
  "page_number": 1,
  "object_type": "table",
  "block_index": 3,
  "bbox": { "x0": 72.0, "y0": 220.4, "x1": 540.0, "y1": 410.9 },
  "text_raw": "Item  Qty  Price\nWidget  10  $4.00",
  "text_corrected": "Item | Qty | Price\nWidget | 10 | $4.00",
  "overall_confidence": 0.91,
  "confidence_tag": "A"
}
```
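
A minimal sketch of consuming one output document on the client side, assuming the fields shown above. The `BBox` and `Block` classes and the `parse_block` helper are hypothetical names for illustration; they are not part of any Mixpeek SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BBox:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


@dataclass
class Block:
    page_number: int
    object_type: str
    block_index: int
    bbox: BBox
    text_raw: str
    text_corrected: Optional[str]
    overall_confidence: float
    confidence_tag: str

    @property
    def best_text(self) -> str:
        # Prefer the corrected text when present, else fall back to the raw text.
        return self.text_corrected if self.text_corrected is not None else self.text_raw


def parse_block(doc: dict) -> Block:
    """Build a Block from one output document; a missing text_corrected becomes None."""
    return Block(
        page_number=doc["page_number"],
        object_type=doc["object_type"],
        block_index=doc["block_index"],
        bbox=BBox(**doc["bbox"]),
        text_raw=doc["text_raw"],
        text_corrected=doc.get("text_corrected"),
        overall_confidence=doc["overall_confidence"],
        confidence_tag=doc["confidence_tag"],
    )


# The example document from this section.
example = {
    "page_number": 1,
    "object_type": "table",
    "block_index": 3,
    "bbox": {"x0": 72.0, "y0": 220.4, "x1": 540.0, "y1": 410.9},
    "text_raw": "Item  Qty  Price\nWidget  10  $4.00",
    "text_corrected": "Item | Qty | Price\nWidget | 10 | $4.00",
    "overall_confidence": 0.91,
    "confidence_tag": "A",
}
block = parse_block(example)
```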

## Parameters

### Layout Detection

| Parameter              | Type    | Default     | Range                | Description                                                                                                                                                                                  |
| ---------------------- | ------- | ----------- | -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `use_layout_detection` | boolean | `true`      | —                    | Enable layout detection to find all elements (text, images, tables, figures) with the engine set in `layout_detector`. When disabled, falls back to text-only extraction (faster, but misses images). |
| `layout_detector`      | string  | `"pymupdf"` | `pymupdf`, `docling` | Engine used when `use_layout_detection=true`. `pymupdf`: fast rule-based (\~15 pages/sec). `docling`: SOTA ML/DiT — better semantic type detection and true table structure (\~3–8 sec/doc). |
| `render_dpi`           | integer | `150`       | 72–300               | DPI for page rendering (used for VLM correction). 72 fast/lower quality, 150 balanced, 300 high quality/slower.                                                                              |

### Spatial Clustering (text-only fallback)

Only used when `use_layout_detection=false`.

| Parameter              | Type    | Default | Range     | Description                                                         |
| ---------------------- | ------- | ------- | --------- | ------------------------------------------------------------------- |
| `vertical_threshold`   | float   | `15.0`  | 1.0–100.0 | Max vertical gap (points) between lines grouped into the same block |
| `horizontal_threshold` | float   | `50.0`  | 1.0–200.0 | Max horizontal distance (points) for overlap/column detection       |
| `min_text_length`      | integer | `20`    | 1–500     | Minimum block text length (chars); filters noise/fragments          |
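
To show how these three thresholds interact, here is a simplified sketch of gap-based grouping. It is an assumption about the general technique, not the extractor's actual algorithm: the `group_lines` helper, the line dictionaries, and the choice to compare left edges for the horizontal check are all hypothetical, and coordinates are assumed to grow downward.

```python
def group_lines(lines, vertical_threshold=15.0, horizontal_threshold=50.0,
                min_text_length=20):
    """Group text lines into blocks by spatial proximity (illustrative only).

    Each line is a dict with x0, y0, x1, y1 (points) and text.
    """
    blocks = []
    current = []
    # Read top-to-bottom, then left-to-right.
    for line in sorted(lines, key=lambda l: (l["y0"], l["x0"])):
        if current:
            prev = current[-1]
            vertical_gap = line["y0"] - prev["y1"]
            horizontal_shift = abs(line["x0"] - prev["x0"])
            if (vertical_gap <= vertical_threshold
                    and horizontal_shift <= horizontal_threshold):
                current.append(line)
                continue
            # Gap too large: close the current block and start a new one.
            blocks.append(current)
        current = [line]
    if current:
        blocks.append(current)

    texts = ["\n".join(l["text"] for l in block) for block in blocks]
    # Drop fragments shorter than min_text_length characters.
    return [t for t in texts if len(t) >= min_text_length]


sample_lines = [
    {"x0": 72, "y0": 100, "x1": 300, "y1": 112, "text": "First paragraph, line one."},
    {"x0": 72, "y0": 116, "x1": 300, "y1": 128, "text": "First paragraph, line two."},
    {"x0": 72, "y0": 200, "x1": 300, "y1": 212, "text": "Second paragraph starts here."},
    {"x0": 72, "y0": 300, "x1": 90, "y1": 312, "text": "p. 3"},
]
grouped = group_lines(sample_lines)
```

In this sample the first two lines sit 4 points apart and merge into one block, the third line starts a new block, and the short `p. 3` fragment is filtered out by `min_text_length`.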

### Confidence & VLM Correction

| Parameter                | Type           | Default              | Range                           | Description                                                                                                                               |
| ------------------------ | -------------- | -------------------- | ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `base_confidence`        | float          | `0.85`               | 0.0–1.0                         | Base confidence score for embedded (native) text                                                                                          |
| `min_confidence_for_vlm` | float          | `0.6`                | 0.0–1.0                         | Confidence threshold below which VLM correction is triggered (only when `use_vlm_correction=true`)                                        |
| `use_vlm_correction`     | boolean        | `true`               | —                               | Enable VLM correction for low-confidence blocks                                                                                           |
| `fast_mode`              | boolean        | `false`              | —                               | Skip VLM correction entirely for max throughput (\~15 pages/sec). Overrides `use_vlm_correction`                                          |
| `vlm_provider`           | string         | `"google"`           | `google`, `openai`, `anthropic` | Provider of the vision language model used for correction                                                                                 |
| `vlm_model`              | string         | `"gemini-2.5-flash"` | —                               | Correction model, e.g. `gemini-2.5-flash`, `gpt-4o`, `claude-3-5-sonnet`                                                                  |
| `llm_api_key`            | string \| null | `null`               | —                               | BYOK key for VLM correction. Supports secret references, e.g. `{{SECRET.openai_api_key}}`. Falls back to Mixpeek's default keys if unset. |
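
The gating rules in this table can be sketched as a per-block predicate. `should_run_vlm` is a hypothetical helper that mirrors the documented behavior; it assumes "below" means strictly less than the threshold.

```python
def should_run_vlm(confidence, use_vlm_correction=True,
                   min_confidence_for_vlm=0.6, fast_mode=False):
    """Return True when a block would be sent for VLM correction (illustrative only)."""
    # fast_mode overrides use_vlm_correction and skips correction entirely.
    if fast_mode or not use_vlm_correction:
        return False
    # Assumption: the comparison is strict, per "below which" in the table.
    return confidence < min_confidence_for_vlm


default_decision = should_run_vlm(0.55)
fast_decision = should_run_vlm(0.55, fast_mode=True)
```

Raising `min_confidence_for_vlm` (for example to `0.75`) sends more blocks to the VLM, which improves noisy scans at a higher credit cost.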

### Embedding & Thumbnails

| Parameter             | Type    | Default  | Range                          | Description                                                          |
| --------------------- | ------- | -------- | ------------------------------ | -------------------------------------------------------------------- |
| `run_text_embedding`  | boolean | `true`   | —                              | Generate E5-Large (1024-d) text embeddings for block content         |
| `generate_thumbnails` | boolean | `true`   | —                              | Generate thumbnail images (full-page and/or per-block, per `thumbnail_mode`) |
| `thumbnail_mode`      | string  | `"both"` | `full_page`, `segment`, `both` | Which thumbnails to render (`segment` = cropped to the block's bbox) |
| `thumbnail_dpi`       | integer | `72`     | 36–150                         | DPI for thumbnail generation. Lower DPI = smaller files.             |

## Configuration Examples

<CodeGroup>
  ```json Default (layout + VLM + embeddings)
  {
    "feature_extractor": {
      "feature_extractor_name": "document_graph_extractor",
      "version": "v1",
      "input_mappings": {
        "pdf": "file_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.invoice_id" },
        { "source_path": "metadata.vendor" }
      ],
      "parameters": {}
    }
  }
  ```

  ```json Fast Mode (max throughput, no VLM)
  {
    "feature_extractor": {
      "feature_extractor_name": "document_graph_extractor",
      "version": "v1",
      "input_mappings": {
        "pdf": "file_url"
      },
      "parameters": {
        "fast_mode": true,
        "generate_thumbnails": false
      }
    }
  }
  ```

  ```json High-Accuracy Scanned Docs (docling + VLM)
  {
    "feature_extractor": {
      "feature_extractor_name": "document_graph_extractor",
      "version": "v1",
      "input_mappings": {
        "pdf": "scanned_doc"
      },
      "parameters": {
        "layout_detector": "docling",
        "min_confidence_for_vlm": 0.75,
        "vlm_provider": "anthropic",
        "vlm_model": "claude-3-5-sonnet",
        "render_dpi": 200
      }
    }
  }
  ```

  ```json Text-Only Fallback (no layout detection)
  {
    "feature_extractor": {
      "feature_extractor_name": "document_graph_extractor",
      "version": "v1",
      "input_mappings": {
        "pdf": "pdf_url"
      },
      "parameters": {
        "use_layout_detection": false,
        "vertical_threshold": 15.0,
        "horizontal_threshold": 50.0,
        "run_text_embedding": true
      }
    }
  }
  ```
</CodeGroup>
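
If you assemble these configurations in code, a small builder can catch out-of-range values before the request is sent. This is a sketch: `build_extractor_config` is a hypothetical helper, the ranges and allowed values are copied from the parameter tables on this page, and treating the ranges as inclusive is an assumption.

```python
# (min, max) ranges copied from the parameter tables; assumed inclusive.
NUMERIC_RANGES = {
    "render_dpi": (72, 300),
    "vertical_threshold": (1.0, 100.0),
    "horizontal_threshold": (1.0, 200.0),
    "min_text_length": (1, 500),
    "base_confidence": (0.0, 1.0),
    "min_confidence_for_vlm": (0.0, 1.0),
    "thumbnail_dpi": (36, 150),
}
ALLOWED_VALUES = {
    "layout_detector": {"pymupdf", "docling"},
    "vlm_provider": {"google", "openai", "anthropic"},
    "thumbnail_mode": {"full_page", "segment", "both"},
}


def build_extractor_config(pdf_field, **parameters):
    """Return a feature_extractor config dict after basic client-side validation."""
    for name, value in parameters.items():
        if name in NUMERIC_RANGES:
            low, high = NUMERIC_RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}-{high}")
        if name in ALLOWED_VALUES and value not in ALLOWED_VALUES[name]:
            allowed = sorted(ALLOWED_VALUES[name])
            raise ValueError(f"{name}={value!r} must be one of {allowed}")
    return {
        "feature_extractor": {
            "feature_extractor_name": "document_graph_extractor",
            "version": "v1",
            "input_mappings": {"pdf": pdf_field},
            "parameters": parameters,
        }
    }


# Mirrors the "High-Accuracy Scanned Docs" example above.
config = build_extractor_config(
    "scanned_doc",
    layout_detector="docling",
    min_confidence_for_vlm=0.75,
    vlm_provider="anthropic",
    vlm_model="claude-3-5-sonnet",
    render_dpi=200,
)
```

Parameters without a documented range, such as `vlm_model`, pass through unchecked.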

## Layout Types

The extractor classifies blocks into these `object_type` values:

| Type          | Description                | Example                          |
| ------------- | -------------------------- | -------------------------------- |
| `paragraph`   | Body text blocks           | Article content, descriptions    |
| `table`       | Tabular data               | Financial tables, data grids     |
| `form`        | Form fields and labels     | Application forms, surveys       |
| `list`        | Bulleted or numbered lists | Requirements, instructions       |
| `header`      | Page headers               | Document titles, section headers |
| `footer`      | Page footers               | Page numbers, disclaimers        |
| `figure`      | Images and captions        | Charts, diagrams, photos         |
| `handwritten` | Handwritten text           | Signatures, annotations          |

## Confidence Tags

Extraction quality is graded with confidence tags (thresholds on `overall_confidence`):

| Tag   | Confidence | Description | Action                           |
| ----- | ---------- | ----------- | -------------------------------- |
| **A** | ≥ 0.85     | Excellent   | Use directly                     |
| **B** | ≥ 0.70     | Good        | Reliable, minor issues           |
| **C** | ≥ 0.50     | Fair        | Verify or trigger VLM correction |
| **D** | \< 0.50    | Poor        | Needs VLM correction             |
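
The thresholds in this table map to a simple function. The sketch below is a client-side mirror of the documented cutoffs (useful, for example, when re-tagging with your own scores); `confidence_tag` and `needs_review` are hypothetical helper names.

```python
def confidence_tag(score):
    """Map an overall_confidence score (0.0-1.0) to its documented tag."""
    if score >= 0.85:
        return "A"
    if score >= 0.70:
        return "B"
    if score >= 0.50:
        return "C"
    return "D"


def needs_review(score):
    """Route tags C and D to review, as in a confidence-gated pipeline."""
    return confidence_tag(score) in {"C", "D"}


tag = confidence_tag(0.91)
```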

## Performance & Costs

| Metric                   | Value                                              |
| ------------------------ | -------------------------------------------------- |
| **Cost**                 | 5 credits per page + 20 credits per VLM correction |
| **`pymupdf` speed**      | \~15 pages/sec (rule-based)                        |
| **`docling` speed**      | \~3–8 sec/doc (ML/DiT)                             |
| **Fast mode**            | \~15 pages/sec (skips VLM correction)              |
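
A rough budget estimate follows directly from the cost row above. `estimate_credits` is a hypothetical helper, and the number of VLM corrections is something you would have to estimate from your own documents.

```python
CREDITS_PER_PAGE = 5
CREDITS_PER_VLM_CORRECTION = 20


def estimate_credits(pages, vlm_corrections=0):
    """Estimate credits: 5 per page plus 20 per VLM correction."""
    return CREDITS_PER_PAGE * pages + CREDITS_PER_VLM_CORRECTION * vlm_corrections


# A 10-page scan where 3 blocks fall below min_confidence_for_vlm.
estimate = estimate_credits(pages=10, vlm_corrections=3)  # 50 + 60 = 110
```

With `fast_mode: true`, the second term is always zero because no corrections run.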

## Vector Index

| Property            | Value                                                                           |
| ------------------- | ------------------------------------------------------------------------------- |
| **Index name**      | `document_graph_extractor_v1_text_embedding`                                    |
| **Dimensions**      | 1024                                                                            |
| **Type**            | Dense                                                                           |
| **Distance metric** | Cosine                                                                          |
| **Inference model** | `multilingual_e5_large_instruct_v1` (`intfloat/multilingual-e5-large-instruct`) |
| **Normalization**   | L2 normalized                                                                   |

<Info>
  The embedding is optional. Set `run_text_embedding: false` for a layout-only extraction with no vector index.
</Info>
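
Because the vectors are L2 normalized, cosine similarity reduces to a plain dot product. The sketch below demonstrates this with toy 2-d vectors standing in for the 1024-d embeddings; the helper names are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity for arbitrary (non-normalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
cosine = cosine_similarity(a, b)
# On unit-length vectors, the dot product gives the same value.
normalized_dot = dot(l2_normalize(a), l2_normalize(b))
```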

## Limitations

* **PDF only**: Accepts a single PDF per object; does not process Word docs, images, or other formats.
* **VLM cost**: Each correction adds 20 credits; gate it with `min_confidence_for_vlm` or disable via `fast_mode`.
* **Layout-detector tradeoff**: `docling` is more accurate but markedly slower than `pymupdf`.
* **Memory**: Large PDFs (100+ pages) may require increased memory.
* **Language / handwriting**: OCR works best with Latin scripts; handwriting detection is experimental and less reliable.
* **External dependency**: VLM correction depends on the selected provider's availability.

## Search the Extracted Text

Extracted block text (native or OCR/VLM-corrected) is embedded into the `document_graph_extractor_v1_text_embedding` index, so you search it with a [`feature_search`](/retrieval/stages/feature-search) stage against `mixpeek://document_graph_extractor@v1/intfloat__multilingual_e5_large_instruct` (`input_mode: "text"`). For a ready-to-copy retriever — including confidence and layout-type filtering — see [Cookbook → Search OCR Text from Scanned PDFs](/retrieval/cookbook#search-ocr-text-from-scanned-pdfs).

## Related

* [Feature Extractors Overview](/processing/feature-extractors)
* [Text Extractor](/processing/extractors/text)
* [Image Extractor](/processing/extractors/image)
* [Multimodal Extractor](/processing/extractors/multimodal)
* [Universal Extractor](/processing/extractors/universal)
* [Retrieval Cookbook — OCR / scanned-document search](/retrieval/cookbook#search-ocr-text-from-scanned-pdfs)
