# Visual Document Retrieval

> Search PDFs, scanned pages, and figure-heavy reports by visual content using cross-modal embeddings — no OCR required

<Tip>Visual document retrieval treats each page as an image and embeds it with a cross-modal model, so a text query can find the right page by what it *looks like* — tables, charts, diagrams, scans — without relying on OCR.</Tip>

## Why Visual Document Retrieval

Traditional document search pipelines drop information at every stage:

1. **OCR** mangles tables, equations, and low-contrast scans.
2. **Layout parsers** miss chart and diagram semantics.
3. **Text-only embeddings** never see the visual structure of the page.

Mixpeek's `multimodal_extractor` embeds page images directly with Google's Vertex multimodal model into a shared text-image space (`vertex_multimodal_embedding`). Because the space is cross-modal, a **text query retrieves visually relevant pages**: the words "revenue breakdown by region" can match a page dominated by a financial table, even when the page has no clean extractable text.

<Note>
  This is single-vector cross-modal retrieval (one embedding per page). Mixpeek does not currently offer ColPali-style multi-vector *late-interaction* (per-patch MaxSim) scoring. For born-digital, text-heavy PDFs where you want extracted text and OCR, use the [universal extractor](/processing/extractors/universal) instead (see [Document Intelligence](/tutorials/document-intelligence)).
</Note>

## 1. Create a bucket

Create a bucket to hold the page images. The cross-modal model embeds images, so render each PDF page to an image (PNG/JPG) and store the page image URL.

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "visual-documents",
    "bucket_schema": {
      "properties": {
        "page_image": { "type": "image" },
        "document_title": { "type": "text" },
        "doc_type": { "type": "string" },
        "page_number": { "type": "integer" }
      }
    }
  }'
```

<Tip>
  Render PDF pages to images client-side (e.g. `pdftoppm`, `pdf2image`) — one object per page — so each page becomes an independently retrievable result with its own `page_number`.
</Tip>
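
As a rough sketch of that client-side rendering step, the helpers below use `pdftoppm` (from poppler-utils) to write one PNG per page and then recover the page number from each file name. The function names, the 150 DPI setting, and the `page` file prefix are illustrative choices, not Mixpeek requirements.

```bash
# Render every page of a PDF to PNG files named <out_dir>/page-<N>.png.
# Assumes pdftoppm (poppler-utils) is installed; 150 DPI is an arbitrary choice.
render_pages() {
  local pdf="$1" out_dir="$2"
  mkdir -p "$out_dir"
  pdftoppm -png -r 150 "$pdf" "$out_dir/page"
}

# Recover the integer page number from a rendered file name.
# pdftoppm zero-pads the number (page-042.png), so leading zeros are stripped.
page_number_from_file() {
  local name="${1##*/}"   # drop the directory part
  name="${name%.*}"       # drop the extension
  name="${name##*-}"      # keep what follows the last hyphen
  printf '%s\n' "$name" | sed 's/^0*\([0-9]\)/\1/'
}
```

For example, `render_pages filings/acme-10k-2024.pdf out/acme-10k-2024` writes the page images, and `page_number_from_file out/acme-10k-2024/page-042.png` prints `42`. Upload the images to your own storage before ingesting them.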

## 2. Create a collection

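Embed each page image with `multimodal_extractor`. Map the extractor's `image` input to your `page_image` field, and list the metadata fields under `field_passthrough` so they are carried onto each indexed page. Replace `bkt_visual_documents` with the ID of the bucket created in step 1.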

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "visual-doc-index",
    "source": { "type": "bucket", "bucket_ids": ["bkt_visual_documents"] },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": { "image": "page_image" },
      "field_passthrough": [
        { "source_path": "document_title" },
        { "source_path": "doc_type" },
        { "source_path": "page_number" }
      ]
    }
  }'
```

## 3. Ingest pages

Create one object per rendered page, pairing the page image URL with its metadata.

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_visual_documents/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "blobs": [
      { "property": "page_image", "type": "image", "data": "s3://my-bucket/filings/acme-10k-2024/page-042.png" },
      { "property": "document_title", "type": "text", "data": "Acme Corp 2024 10-K" },
      { "property": "doc_type", "type": "text", "data": "financial" },
      { "property": "page_number", "type": "text", "data": "42" }
    ]
  }'
```
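
Since every page becomes its own object, it can help to generate the request body rather than write it by hand. The helper below is a minimal sketch that mirrors the request above. `page_object_payload` is a hypothetical name, and the function assumes its arguments contain no double quotes or backslashes because it does no JSON escaping.

```bash
# Build the JSON body for one page object, using the same blob layout as the
# request above. Assumption: values need no JSON escaping.
page_object_payload() {
  local image_url="$1" title="$2" doc_type="$3" page_number="$4"
  printf '{"blobs":['
  printf '{"property":"page_image","type":"image","data":"%s"},' "$image_url"
  printf '{"property":"document_title","type":"text","data":"%s"},' "$title"
  printf '{"property":"doc_type","type":"text","data":"%s"},' "$doc_type"
  printf '{"property":"page_number","type":"text","data":"%s"}' "$page_number"
  printf ']}\n'
}
```

Pass the result to the same endpoint with `-d "$(page_object_payload "$url" "$title" "$doc_type" "$page")"`, once per page.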

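## 4. Process

Create a batch from the ingested object IDs, then submit it to start extraction. `obj_001` stands in for an object ID from step 3, and `{batch_id}` for the ID of the batch created by the first call.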

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_visual_documents/batches" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["obj_001"] }'

curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_visual_documents/batches/{batch_id}/submit" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE"
```
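
A multi-page document produces many objects, so the `object_ids` array usually holds more than one entry. The sketch below builds that request body from a list of IDs. `batch_payload` is a hypothetical helper, and it assumes the IDs are plain tokens such as `obj_001` that need no JSON escaping. Whether a batch has a size limit is not covered here.

```bash
# Build {"object_ids":[...]} from any number of object IDs.
# Assumption: IDs are plain tokens that need no JSON escaping.
batch_payload() {
  local ids="" id
  for id in "$@"; do
    ids="${ids:+$ids,}\"$id\""   # comma-join the quoted IDs
  done
  printf '{"object_ids":[%s]}\n' "$ids"
}
```

For example, `batch_payload obj_001 obj_002` prints `{"object_ids":["obj_001","obj_002"]}`, which can be passed to the batch creation call with `-d "$(batch_payload obj_001 obj_002)"`.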

## 5. Create a retriever

Search the cross-modal page embeddings with a text query.

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/retrievers" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "visual-doc-retriever",
    "collection_identifiers": ["visual-doc-index"],
    "input_schema": { "query": { "type": "text", "required": true } },
    "stages": [
      {
        "stage_name": "search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 50
              }
            ],
            "final_top_k": 10
          }
        }
      }
    ]
  }'
```

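## 6. Query

Execute the retriever with a text query. The `filters` clause restricts results to pages whose `doc_type` is `financial`; replace `{retriever_id}` with the ID of the retriever created in step 5.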

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "segment revenue breakdown by region with year-over-year growth" },
    "filters": { "field": "doc_type", "operator": "eq", "value": "financial" }
  }'
```

Each result is a page, ranked by cross-modal similarity, carrying its `document_title` and `page_number` so you can deep-link to the exact page.
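
One way to build such a deep link, sketched under assumptions: common browser PDF viewers honor a `#page=N` URL fragment, so a result's `page_number` can be appended to the URL of the source PDF. The bucket schema in this tutorial does not store that URL, so `pdf_url` here is a value you would track yourself, and `page_deep_link` is a hypothetical helper name.

```bash
# Append a #page=N fragment to a PDF URL, dropping any fragment already present.
page_deep_link() {
  local pdf_url="${1%%#*}" page_number="$2"
  printf '%s#page=%s\n' "$pdf_url" "$page_number"
}
```

For example, `page_deep_link "https://example.com/acme-10k-2024.pdf" 42` prints `https://example.com/acme-10k-2024.pdf#page=42`.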

## When to use visual vs text retrieval

<CardGroup cols={2}>
  <Card title="Use visual retrieval when" icon="image">
    * Pages are visually complex (tables, charts, infographics, equations)
    * OCR quality is unreliable (scans, handwriting, multi-column layouts)
    * Figures and diagrams carry meaning text alone cannot capture
  </Card>

  <Card title="Use text retrieval when" icon="font">
    * Documents are born-digital and text-only (use [universal extractor](/processing/extractors/universal))
    * You need extracted text, NER, or summaries (see [Document Intelligence](/tutorials/document-intelligence))
    * You want exact-keyword matching (add a [lexical/BM25 search](/retrieval/stages/feature-search#lexical-bm25-search))
  </Card>
</CardGroup>

## Next steps

<CardGroup cols={2}>
  <Card title="Classify by layout" icon="tags" href="/enrichment/taxonomies">
    Auto-classify pages (financial report, slide deck, research paper) with a taxonomy.
  </Card>

  <Card title="Discover layout patterns" icon="diagram-project" href="/enrichment/clusters">
    Cluster page embeddings to surface visual document patterns.
  </Card>

  <Card title="Combine with text" icon="layer-group" href="/tutorials/document-intelligence">
    Pair visual retrieval with extracted-text search for hybrid document QA.
  </Card>

  <Card title="Get notified" icon="bell" href="/enrichment/alerts">
    Alert when new documents of a given type are indexed.
  </Card>
</CardGroup>

## Further reading

* [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449) — the late-interaction approach that inspired visual document retrieval
* [ViDoRe Benchmark](https://huggingface.co/spaces/vidore/vidore-leaderboard) — visual document retrieval leaderboard
