ColPali replaces the usual OCR → layout parser → text embedder stack with a single vision-language model that emits one embedding per image patch, then scores queries against documents using ColBERT-style MaxSim late interaction. On the ViDoRe benchmark it beats OCR+BGE pipelines by 15–20 nDCG@5 points on figure- and table-heavy documents.

Why Visual Document Retrieval

Traditional document search pipelines drop information at every stage:
  1. OCR mangles tables, equations, and low-contrast scans
  2. Layout parsers miss chart and diagram semantics
  3. Single-vector embeddings pool a whole page into one 768-dimensional vector, losing per-region detail
ColPali (Faysse et al., 2024) skips all three. It ingests each page as an image, runs it through a fine-tuned PaliGemma vision-language model, and emits a grid of ~1,024 patch embeddings of 128 dimensions each. Queries are embedded token-by-token with the same model.

Late Interaction Scoring

Scoring uses the MaxSim operator from ColBERT:
score(q, d) = Σ_i max_j ⟨q_i, d_j⟩
For each query token q_i, take the maximum cosine similarity against all document patch vectors d_j, then sum across query tokens. This preserves fine-grained alignment — the word “revenue” can latch onto the exact table cell it refers to, while “Q3 2024” latches onto a different region of the same page — without forcing the model to compress everything into one vector up front. Patch embeddings are computed once at index time, so query latency stays dominated by the MaxSim sum, not by any model inference.
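The MaxSim operator reduces to a few lines of numpy. This is a minimal sketch, assuming both query token embeddings and patch embeddings are already L2-normalized, so dot products equal cosine similarities:

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim late-interaction score.

    query_emb: (n_query_tokens, dim) L2-normalized query token embeddings
    doc_emb:   (n_patches, dim)      L2-normalized document patch embeddings
    """
    # Full similarity matrix: one row per query token, one column per patch.
    sim = query_emb @ doc_emb.T
    # Each query token keeps only its best-matching patch; sum over tokens.
    return float(sim.max(axis=1).sum())
```

Because the document side is precomputed, the per-query cost is one matrix multiply and a row-wise max per candidate page.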

Storage Tradeoff

Multi-vector payloads are bigger. A single-vector index stores ~3KB per page; a ColPali index stores ~512KB per page (1,024 × 128 × 4 bytes). Production deployments mitigate with:
  • Binary or scalar quantization — 4–8x shrink with minimal recall loss
  • Two-stage retrieval — a cheap pooled single-vector stage filters to top-200, then MaxSim reranks
  • Multi-vector HNSW — vector stores that natively support grouped payloads avoid row explosion
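The storage figures above follow from simple arithmetic; this sketch assumes a 768-dimensional float32 single-vector baseline (matching the ~3KB figure) and 1-bit binary quantization:

```python
# Back-of-envelope storage per page, using the figures from the text.
patches, dim, bytes_per_float = 1024, 128, 4

full_multivector = patches * dim * bytes_per_float  # 524,288 B ~= 512 KB
single_vector = 768 * bytes_per_float               # 3,072 B ~= 3 KB
binary_quantized = patches * dim // 8               # 1 bit/dim -> 16 KB

ratio = full_multivector // single_vector           # ~170x size difference
```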

Pipeline Overview

[Diagram: Visual Document Retrieval Pipeline]
Mixpeek implements ColPali as a feature extractor that writes multi-vector payloads to the vector store, plus a retriever stage that runs MaxSim at query time. The steps below build an end-to-end visual document retrieval pipeline.

1. Create a Bucket

Buckets hold the raw PDFs or page images. Use the document blob type so each PDF is automatically split into per-page images downstream.
POST /v1/buckets
{
  "bucket_name": "visual-documents",
  "schema": {
    "properties": {
      "document_url": { "type": "url", "required": true },
      "document_title": { "type": "text" },
      "doc_type": { "type": "text" }
    }
  }
}

2. Create a Collection with the Visual Document Extractor

The collection runs each page image through the VLM and stores a multi-vector payload per page.
POST /v1/collections
{
  "collection_name": "visual-doc-index",
  "source": { "type": "bucket", "bucket_id": "bkt_visual_documents" },
  "feature_extractor": {
    "feature_extractor_name": "visual_document_extractor",
    "version": "v1",
    "input_mappings": { "document_url": "document_url" },
    "parameters": {
      "pages_per_chunk": 1,
      "embedding_dim": 128,
      "patches_per_page": 1024,
      "emit_pooled_vector": true
    },
    "field_passthrough": [
      { "source_path": "document_title" },
      { "source_path": "doc_type" }
    ]
  }
}
emit_pooled_vector: true also writes a single-vector average of the patch embeddings. That pooled vector powers a cheap first-stage filter in the retriever.
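A sketch of what that pooling produces, assuming (per the note above) the pooled vector is the mean of the patch embeddings, re-normalized for cosine ANN search:

```python
import numpy as np

# Stand-in patch grid for one page: 1,024 patches x 128 dims, L2-normalized.
rng = np.random.default_rng(0)
patch_embeddings = rng.standard_normal((1024, 128))
patch_embeddings /= np.linalg.norm(patch_embeddings, axis=1, keepdims=True)

# Pooled single vector: average the patches, then re-normalize so
# cosine scores in the stage-1 ANN index stay comparable.
pooled = patch_embeddings.mean(axis=0)
pooled /= np.linalg.norm(pooled)
```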

3. Ingest Documents

POST /v1/buckets/{bucket_id}/objects
{
  "metadata": {
    "document_title": "Acme Corp 2024 10-K",
    "doc_type": "financial"
  },
  "blobs": [
    {
      "property": "document_url",
      "type": "document",
      "url": "s3://my-bucket/filings/acme-10k-2024.pdf"
    }
  ]
}

4. Process the Batch

POST /v1/buckets/{bucket_id}/batches
{ "object_ids": ["obj_001"] }

POST /v1/buckets/{bucket_id}/batches/{batch_id}/submit

5. Create a Two-Stage Retriever

Stage 1 uses the pooled vector for a fast ANN shortlist. Stage 2 reranks with full MaxSim late interaction against the patch embeddings.
POST /v1/retrievers
{
  "retriever_name": "visual-doc-retriever",
  "collection_ids": ["col_visual_doc_index"],
  "input_schema": {
    "properties": {
      "query": { "type": "text", "required": true },
      "doc_type": { "type": "text" }
    }
  },
  "stages": [
    {
      "stage_name": "filter",
      "version": "v1",
      "parameters": {
        "filters": {
          "field": "metadata.doc_type",
          "operator": "eq",
          "value": "{{inputs.doc_type}}"
        }
      }
    },
    {
      "stage_name": "knn_search",
      "version": "v1",
      "parameters": {
        "feature_address": "mixpeek://visual_document_extractor@v1/pooled_embedding",
        "input_mapping": { "text": "query" },
        "limit": 200
      }
    },
    {
      "stage_name": "late_interaction_rerank",
      "version": "v1",
      "parameters": {
        "scoring": "maxsim",
        "query_field": "visual_document_extractor@v1/patch_embeddings",
        "limit": 10
      }
    }
  ]
}
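The two stages can be sketched in brute-force Python. The data structures here are hypothetical stand-ins: a real deployment backs stage 1 with an ANN index rather than a full scan.

```python
import numpy as np

def two_stage_search(query_tokens, pooled_index, patch_index, shortlist=200, k=10):
    """Pooled ANN-style filter, then MaxSim rerank over the shortlist.

    query_tokens: (n_tokens, dim) L2-normalized query token embeddings
    pooled_index: dict page_id -> (dim,) pooled page vector
    patch_index:  dict page_id -> (n_patches, dim) patch embeddings
    """
    # Stage 1: score every page against the pooled (mean) query vector.
    pooled_query = query_tokens.mean(axis=0)
    pooled_query /= np.linalg.norm(pooled_query)
    stage1 = sorted(pooled_index,
                    key=lambda p: -float(pooled_index[p] @ pooled_query))[:shortlist]

    # Stage 2: exact MaxSim, but only over the shortlisted pages.
    def maxsim(pid):
        sim = query_tokens @ patch_index[pid].T
        return float(sim.max(axis=1).sum())

    return sorted(stage1, key=lambda p: -maxsim(p))[:k]
```

The shortlist size (200 here, matching the `limit` in the knn_search stage) trades recall against rerank cost: MaxSim work grows linearly with it.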

6. Query

POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query": "segment revenue breakdown by region with YoY growth",
    "doc_type": "financial"
  },
  "limit": 10
}
Each result includes the document_id, page_number, the MaxSim score, and the per-query-token alignment matrix if you ask for it via return_explain: true. That matrix is useful for UI overlays that highlight the patches each query token attended to.
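A hypothetical overlay helper built on that matrix: for each query token, the argmax patch is the region it aligned with. The `sim` array here is synthetic stand-in data, and the 32×32 patch grid layout is an assumption:

```python
import numpy as np

# Stand-in alignment matrix: one row per query token, one column per patch.
rng = np.random.default_rng(0)
sim = rng.random((5, 1024))

best_patch = sim.argmax(axis=1)                 # winning patch per query token
rows, cols = best_patch // 32, best_patch % 32  # grid coordinates to highlight
```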

Use ColPali when

  • Pages are visually complex (tables, charts, infographics, equations)
  • OCR quality is unreliable (scans, handwriting, multi-column layouts)
  • You need cross-lingual retrieval without per-language tuning
  • Figures and diagrams carry meaning text alone cannot capture

Use text search when

  • Documents are born-digital and text-only
  • Storage cost dominates (pooled single-vector is ~170x smaller)
  • You need sub-10ms query latency at very high QPS
  • Corpus is narrow and a single embedder already hits your recall target

Benchmarks

The ViDoRe leaderboard tracks ColPali variants against OCR+text-embedding baselines across figure-heavy, table-heavy, and infographic documents. As of publication, ColPali-based models lead every category by double-digit nDCG@5 margins — the biggest gap is on infographic-style pages, where OCR pipelines lose the most information.

Further Reading