Visual document retrieval treats each page as an image and embeds it with a cross-modal model, so a text query can find the right page by what it looks like — tables, charts, diagrams, scans — without relying on OCR.

Why Visual Document Retrieval

Traditional document search pipelines drop information at every stage:
  1. OCR mangles tables, equations, and low-contrast scans.
  2. Layout parsers miss chart and diagram semantics.
  3. Text-only embeddings never see the visual structure of the page.
Mixpeek’s multimodal_extractor embeds page images directly with Google’s Vertex multimodal model into a shared text-image space (vertex_multimodal_embedding). Because the space is cross-modal, a text query retrieves visually relevant pages: the words “revenue breakdown by region” can match a page dominated by a financial table, even when the page has no clean extractable text.
This is single-vector cross-modal retrieval (one embedding per page). Mixpeek does not currently offer ColPali-style multi-vector late-interaction (per-patch MaxSim) scoring. For born-digital, text-heavy PDFs where you want extracted text + OCR, use the universal extractor instead (see Document Intelligence).
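To make the distinction concrete, here is a minimal sketch of the two scoring schemes on toy vectors. It assumes cosine similarity for the single-vector case (the page does not state which metric Mixpeek uses), and the vectors and page IDs are made up for illustration. The maxsim function is included only for contrast with the late-interaction approach that is not offered.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity (assumed metric; not confirmed by the source)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def rank_pages(query_vec, page_vecs, top_k=10):
    """Single-vector retrieval: one embedding per page, one score per page."""
    scored = [(page_id, cosine(query_vec, vec)) for page_id, vec in page_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

def maxsim(query_token_vecs, page_patch_vecs):
    """Late interaction, for contrast: each query token keeps its best-matching
    page patch, and the per-token maxima are summed."""
    return sum(max(dot(q, p) for p in page_patch_vecs) for q in query_token_vecs)

# Toy data: three pages, each reduced to a single 3-dimensional vector.
page_vecs = {
    "acme-10k/page-042": [0.9, 0.1, 0.0],
    "acme-10k/page-007": [0.1, 0.9, 0.1],
    "acme-10k/page-113": [0.5, 0.5, 0.5],
}
query_vec = [1.0, 0.0, 0.0]
ranking = rank_pages(query_vec, page_vecs, top_k=2)
print([page_id for page_id, _ in ranking])

# Late interaction needs many vectors per page and per query.
late_score = maxsim([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

The single-vector path stores and compares one vector per page, which keeps the index small. The late-interaction path compares every query token against every page patch, which is why it needs a different index layout.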

1. Create a bucket

The bucket holds the page images. Render each PDF page to an image (PNG/JPG) and store the page image URL, since the cross-modal model embeds images.
curl -sS -X POST "$MP_API_URL/v1/buckets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "visual-documents",
    "bucket_schema": {
      "properties": {
        "page_image": { "type": "image" },
        "document_title": { "type": "text" },
        "doc_type": { "type": "string" },
        "page_number": { "type": "integer" }
      }
    }
  }'
Render PDF pages to images client-side (e.g. pdftoppm, pdf2image) — one object per page — so each page becomes an independently retrievable result with its own page_number.
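One way to do the client-side rendering is to shell out to pdftoppm, sketched below. It assumes Poppler's pdftoppm is installed and on PATH. The helper names, the 150 DPI default, and the output directory layout are illustrative choices, not part of Mixpeek.

```python
import re
import subprocess
from pathlib import Path

def pdftoppm_command(pdf_path, out_prefix, dpi=150):
    """Build the pdftoppm invocation that writes one PNG per page."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]

def page_number_from_name(filename):
    """Extract the page number from pdftoppm output names like 'report-042.png'."""
    match = re.search(r"-(\d+)\.png$", filename)
    if match is None:
        raise ValueError(f"not a pdftoppm page image: {filename}")
    return int(match.group(1))

def render_pages(pdf_path, out_dir, dpi=150):
    """Render every page and return (page_number, image_path) pairs in page order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / Path(pdf_path).stem
    # Requires the pdftoppm binary; raises CalledProcessError if rendering fails.
    subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)
    images = out_dir.glob(f"{prefix.name}-*.png")
    return sorted((page_number_from_name(p.name), p) for p in images)
```

Calling render_pages("acme-10k-2024.pdf", "pages/") would return one (page_number, path) pair per page. Each image then has to be uploaded somewhere the API can read it before ingestion.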

2. Create a collection

Embed each page image with multimodal_extractor. Map the extractor’s image input to your page_image field.
curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "visual-doc-index",
    "source": { "type": "bucket", "bucket_ids": ["bkt_visual_documents"] },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": { "image": "page_image" },
      "field_passthrough": [
        { "source_path": "document_title" },
        { "source_path": "doc_type" },
        { "source_path": "page_number" }
      ]
    }
  }'

3. Ingest pages

curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_visual_documents/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "blobs": [
      { "property": "page_image", "type": "image", "data": "s3://my-bucket/filings/acme-10k-2024/page-042.png" },
      { "property": "document_title", "type": "text", "data": "Acme Corp 2024 10-K" },
      { "property": "doc_type", "type": "text", "data": "financial" },
      { "property": "page_number", "type": "text", "data": "42" }
    ]
  }'
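Because each page is its own object, a real filing needs one request body per page. The sketch below generates those bodies in the same blob shape as the request above. The helper names and the page-NNN.png naming scheme are assumptions taken from the example URL, and the field types simply mirror the request shown here.

```python
import json

def page_object(image_url, document_title, doc_type, page_number):
    """Build one request body in the blob shape used by the ingest call above."""
    return {
        "blobs": [
            {"property": "page_image", "type": "image", "data": image_url},
            {"property": "document_title", "type": "text", "data": document_title},
            {"property": "doc_type", "type": "text", "data": doc_type},
            {"property": "page_number", "type": "text", "data": str(page_number)},
        ]
    }

def page_objects(url_prefix, document_title, doc_type, page_count):
    """One body per page, assuming images are stored as <prefix>/page-NNN.png."""
    return [
        page_object(f"{url_prefix}/page-{n:03d}.png", document_title, doc_type, n)
        for n in range(1, page_count + 1)
    ]

payloads = page_objects(
    "s3://my-bucket/filings/acme-10k-2024", "Acme Corp 2024 10-K", "financial", 3
)
print(json.dumps(payloads[0], indent=2))
```

Each element of payloads can be sent as the -d body of its own POST to the objects endpoint.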

4. Process

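Create a batch from the ingested object IDs, then submit it for processing. Replace {batch_id} in the second request with the ID of the batch you created.
curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_visual_documents/batches" \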
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["obj_001"] }'

curl -sS -X POST "$MP_API_URL/v1/buckets/bkt_visual_documents/batches/{batch_id}/submit" \
  -H "Authorization: Bearer $MP_API_KEY" -H "X-Namespace: $MP_NAMESPACE"

5. Create a retriever

Search the cross-modal page embeddings with a text query.
curl -sS -X POST "$MP_API_URL/v1/retrievers" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "visual-doc-retriever",
    "collection_identifiers": ["visual-doc-index"],
    "input_schema": { "query": { "type": "text", "required": true } },
    "stages": [
      {
        "stage_name": "search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 50
              }
            ],
            "final_top_k": 10
          }
        }
      }
    ]
  }'

6. Query

curl -sS -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "segment revenue breakdown by region with year-over-year growth" },
    "filters": { "field": "doc_type", "operator": "eq", "value": "financial" }
  }'
Each result is a page, ranked by cross-modal similarity, carrying its document_title and page_number so you can deep-link to the exact page.
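The deep-linking step could look like the sketch below. The response shape is not shown on this page, so the code assumes each result is a dict exposing the passthrough fields document_title and page_number. The pdf_urls lookup table and the example.com URL are hypothetical. The #page=N fragment is the PDF open-parameter convention that most browser PDF viewers honor.

```python
def page_link(pdf_url, page_number):
    """Link to a specific page of a PDF using the #page=N open parameter."""
    return f"{pdf_url}#page={int(page_number)}"

def to_links(results, pdf_urls):
    """Turn ranked page results into deep links, preserving rank order.

    results:  list of dicts with 'document_title' and 'page_number' (assumed shape)
    pdf_urls: hypothetical mapping from document title to the source PDF URL
    """
    links = []
    for result in results:
        title = result["document_title"]
        page = int(result["page_number"])  # tolerate page numbers stored as text
        links.append({"title": title, "page": page, "url": page_link(pdf_urls[title], page)})
    return links

# Toy results in the assumed shape.
results = [
    {"document_title": "Acme Corp 2024 10-K", "page_number": "42"},
    {"document_title": "Acme Corp 2024 10-K", "page_number": "17"},
]
pdf_urls = {"Acme Corp 2024 10-K": "https://example.com/filings/acme-10k-2024.pdf"}
links = to_links(results, pdf_urls)
```

The int() conversion is there because page_number may come back as text or as an integer, depending on how it was ingested.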

When to use visual vs text retrieval

Use visual retrieval when

  • Pages are visually complex (tables, charts, infographics, equations)
  • OCR quality is unreliable (scans, handwriting, multi-column layouts)
  • Figures and diagrams carry meaning text alone cannot capture

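Use text retrieval when

  • Documents are born-digital and text-heavy, and you want extracted text plus OCR (use the universal extractor; see Document Intelligence)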

Next steps

Classify by layout

Auto-classify pages (financial report, slide deck, research paper) with a taxonomy.

Discover layout patterns

Cluster page embeddings to surface visual document patterns.

Combine with text

Pair visual retrieval with extracted-text search for hybrid document QA.

Get notified

Alert when new documents of a given type are indexed.

Further reading