Search PDFs, scanned documents, and figure-heavy reports using ColPali patch embeddings and late interaction scoring — no OCR required
ColPali replaces the usual OCR → layout parser → text embedder stack with a single vision-language model that emits one embedding per image patch, then scores queries against documents using ColBERT-style MaxSim late interaction. On the ViDoRe benchmark it beats OCR+BGE pipelines by 15–20 nDCG@5 points on figure- and table-heavy documents.
Traditional document search pipelines drop information at every stage:

- OCR mangles tables, equations, and low-contrast scans
- Layout parsers miss chart and diagram semantics
- Single-vector embeddings pool a whole page into one 768-dimensional vector, losing per-region detail
ColPali (Faysse et al., 2024) skips all three. It ingests each page as an image, runs it through a fine-tuned PaliGemma vision-language model, and emits a grid of ~1,024 patch embeddings of 128 dimensions each. Queries are embedded token-by-token with the same model.
For each query token q_i, take the maximum cosine similarity against all document patch vectors d_j, then sum across query tokens. This preserves fine-grained alignment — the word “revenue” can latch onto the exact table cell it refers to, while “Q3 2024” latches onto a different region of the same page — without forcing the model to compress everything into one vector up front. Patch embeddings are computed once at index time, so query latency is dominated by the MaxSim sum, not by model inference.
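The scoring rule above fits in a few lines of NumPy. This is a minimal sketch with illustrative shapes (4 query tokens against 1,024 patches, 128 dimensions each), not the model's actual embedding code:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take the max
    cosine similarity over all document patches, then sum over tokens.

    query_vecs: (num_query_tokens, dim), L2-normalized rows
    doc_vecs:   (num_patches, dim), L2-normalized rows
    """
    # (num_query_tokens, num_patches) cosine-similarity matrix
    sim = query_vecs @ doc_vecs.T
    # best patch per token, summed across query tokens
    return float(sim.max(axis=1).sum())

# Toy example: random unit vectors standing in for real embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128))
d = rng.normal(size=(1024, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim_score(q, d)
```

Because each token contributes a cosine in [-1, 1], the score of a 4-token query is bounded by 4; ranking, not the absolute value, is what matters.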
Multi-vector payloads are bigger. A single-vector index stores ~3 KB per page; a ColPali index stores ~512 KB per page (1,024 patches × 128 dims × 4 bytes). Production deployments mitigate with:

- Binary or scalar quantization — a 4–8x shrink with minimal recall loss
- Two-stage retrieval — a cheap pooled single-vector stage filters to the top 200, then MaxSim reranks
- Multi-vector HNSW — vector stores that natively support grouped payloads avoid row explosion
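As a sketch of the first mitigation, symmetric int8 scalar quantization gives the 4x end of the shrink range: each float32 dimension becomes one signed byte plus a shared scale factor. This is a generic illustration, not Mixpeek's specific quantizer:

```python
import numpy as np

def scalar_quantize(vecs: np.ndarray):
    """Symmetric int8 quantization: 4x smaller than float32 storage.

    Returns int8 codes plus the scale needed to approximately
    reconstruct the original values.
    """
    scale = float(np.abs(vecs).max()) / 127.0
    codes = np.clip(np.round(vecs / scale), -127, 127).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

# One page worth of patch embeddings: 1,024 x 128 float32
rng = np.random.default_rng(1)
patches = rng.normal(size=(1024, 128)).astype(np.float32)
codes, scale = scalar_quantize(patches)
approx = dequantize(codes, scale)
err = float(np.abs(patches - approx).max())
```

Per-vector or per-dimension scales tighten the error further; binary quantization (1 bit per dimension) trades more recall for an even larger shrink.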
Mixpeek implements ColPali as a feature extractor that writes multi-vector payloads to the vector store, plus a retriever stage that runs MaxSim at query time. The steps below build an end-to-end visual document retrieval pipeline.
Setting `emit_pooled_vector: true` also writes a single-vector average of the patch embeddings. That pooled vector powers a cheap first-stage filter in the retriever.
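The two-stage flow can be sketched as: cosine against the pooled vector to shortlist candidates, then exact MaxSim over only those. A minimal sketch — mean-pooling followed by re-normalization is an assumption about what the pooled vector contains; Mixpeek's internals may differ:

```python
import numpy as np

def pooled_vector(patch_vecs: np.ndarray) -> np.ndarray:
    """Mean-pool patch embeddings into one unit vector for cheap stage 1."""
    v = patch_vecs.mean(axis=0)
    return v / np.linalg.norm(v)

def maxsim(query_vecs: np.ndarray, patch_vecs: np.ndarray) -> float:
    return float((query_vecs @ patch_vecs.T).max(axis=1).sum())

def two_stage_search(query_vecs, corpus, k_filter=200, k_final=10):
    """Stage 1: rank all docs by pooled cosine. Stage 2: MaxSim rerank."""
    q_pooled = pooled_vector(query_vecs)
    pooled = np.stack([pooled_vector(p) for p in corpus])
    candidates = np.argsort(pooled @ q_pooled)[::-1][:k_filter]
    reranked = sorted(candidates,
                      key=lambda i: maxsim(query_vecs, corpus[i]),
                      reverse=True)
    return reranked[:k_final]

# Toy corpus: 5 documents of 50 patches each, 4-token query
# (unnormalized random vectors standing in for real embeddings)
rng = np.random.default_rng(2)
corpus = [rng.normal(size=(50, 16)) for _ in range(5)]
q = rng.normal(size=(4, 16))
top = two_stage_search(q, corpus, k_filter=4, k_final=2)
```

In production, stage 1 runs as an ANN query against the pooled vectors, so the expensive MaxSim touches only a few hundred pages instead of the whole corpus.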
```
POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query": "segment revenue breakdown by region with YoY growth",
    "doc_type": "financial"
  },
  "limit": 10
}
```
Each result includes the `document_id`, `page_number`, the MaxSim score, and — if you ask for it via `return_explain: true` — the per-query-token alignment matrix. That matrix is useful for UI overlays that highlight the patches each query token attended to.
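Turning the alignment matrix into an overlay reduces to an argmax per query token, mapping each flat patch index back to its cell in the patch grid. A sketch, assuming a square 32×32 grid (consistent with ~1,024 patches) laid out row-major:

```python
import numpy as np

def token_highlights(alignment: np.ndarray, grid_side: int = 32):
    """Map each query token to the (row, col) of its best-matching patch.

    alignment: (num_query_tokens, num_patches) similarity matrix, with
    patches laid out row-major over a grid_side x grid_side grid.
    """
    best = alignment.argmax(axis=1)  # strongest patch per token
    return [(int(p) // grid_side, int(p) % grid_side) for p in best]

# Toy matrix: token 0 fires strongest on patch 33, token 1 on patch 1000
align = np.zeros((2, 1024))
align[0, 33] = 0.9
align[1, 1000] = 0.8
cells = token_highlights(align)  # grid cells to highlight in the UI
```

Scaling each (row, col) cell by the rendered page size gives the pixel rectangles to draw; thresholding the max similarity filters out tokens with no strong match.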
The ViDoRe leaderboard tracks ColPali variants against OCR+text-embedding baselines across figure-heavy, table-heavy, and infographic documents. As of publication, ColPali-based models lead every category by double-digit nDCG@5 margins — the biggest gap is on infographic-style pages, where OCR pipelines lose the most information.