A retrieval approach that applies late-interaction matching (inspired by ColBERT) to document page images, enabling search over visually rich documents without requiring OCR or text extraction.
ColPali treats each document page as an image and encodes it using a vision-language model into a set of patch-level embedding vectors. Rather than producing a single vector per page, it generates one vector per image patch, preserving fine-grained spatial information about text, tables, figures, and layout. At query time, the text query is encoded into token-level vectors, and a late-interaction scoring function computes similarity by matching each query token against the most relevant document patches. This approach captures both textual content and visual structure without requiring a separate OCR step.
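The late-interaction scoring described above can be sketched in a few lines. This is a minimal illustration with random arrays standing in for real model outputs; the embedding dimension and token/patch counts are illustrative, not ColPali's actual values.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, page_patches: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one page.

    query_tokens:  (n_tokens, dim)  L2-normalized query token embeddings
    page_patches:  (n_patches, dim) L2-normalized page patch embeddings
    """
    # Cosine similarity of every query token against every patch: (n_tokens, n_patches)
    sim = query_tokens @ page_patches.T
    # For each query token, keep its best-matching patch, then sum over tokens.
    return float(sim.max(axis=1).sum())

def unit(x: np.ndarray) -> np.ndarray:
    # L2-normalize along the last axis so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy embeddings standing in for model output.
rng = np.random.default_rng(0)
query = unit(rng.standard_normal((8, 128)))    # 8 query tokens
page = unit(rng.standard_normal((1024, 128)))  # 1024 image patches
score = maxsim_score(query, page)
```

Because each per-token similarity is a cosine, the score of a query with n tokens is bounded above by n, attained when every token finds a perfectly matching patch.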
ColPali builds on the ColBERT late-interaction paradigm, replacing ColBERT's document-side text encoder with a vision-language model (typically based on PaliGemma or similar architectures). Each document page is rendered as an image and processed through the vision encoder to produce patch embeddings, while the query is encoded through the language model into token embeddings. Scoring uses MaxSim: for each query token, the maximum similarity to any document patch is computed, and these per-token scores are summed. This enables efficient retrieval: document patch embeddings are pre-computed offline, and approximate nearest-neighbor search can be applied at the patch level.
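The offline/online split above can be illustrated end to end: patch embeddings are computed once per page, and at query time each page is scored with MaxSim and the corpus is ranked. The sketch below uses random arrays in place of real encoder outputs (an exhaustive scan rather than an ANN index, for clarity); all shapes are illustrative.

```python
import numpy as np

def maxsim(q: np.ndarray, p: np.ndarray) -> float:
    # Sum, over query tokens, of each token's best patch similarity.
    return float((q @ p.T).max(axis=1).sum())

def unit(x: np.ndarray) -> np.ndarray:
    # L2-normalize so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(1)

# Offline: patch embeddings for each page, computed once by the vision encoder
# and stored. Here: 5 pages, 256 patches each, 128-dim embeddings (toy sizes).
corpus = [unit(rng.standard_normal((256, 128))) for _ in range(5)]

# Online: encode the query into token embeddings, score every page, rank.
query = unit(rng.standard_normal((6, 128)))
scores = [maxsim(query, page) for page in corpus]
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

In production systems the exhaustive loop over pages is typically replaced by an approximate nearest-neighbor index over the stored patch vectors, with MaxSim re-scoring applied to the candidate pages it returns.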