Search PDFs, scanned documents, and figure-heavy reports using ColPali patch embeddings and late interaction scoring — no OCR required
ColPali replaces the usual OCR → layout parser → text embedder stack with a single vision-language model that emits one embedding per image patch, then scores queries against documents using ColBERT-style MaxSim late interaction. On the ViDoRe benchmark it beats OCR+BGE pipelines by 15–20 nDCG@5 points on figure- and table-heavy documents.
Traditional document search pipelines drop information at every stage:

- OCR mangles tables, equations, and low-contrast scans
- Layout parsers miss chart and diagram semantics
- Single-vector embeddings pool a whole page into one 768-dimensional vector, losing per-region detail
ColPali (Faysse et al., 2024) skips all three. It ingests each page as an image, runs it through a fine-tuned PaliGemma vision-language model, and emits a grid of ~1,024 patch embeddings of 128 dimensions each. Queries are embedded token-by-token with the same model.
For each query token q_i, take the maximum cosine similarity against all document patch vectors d_j, then sum across query tokens. This preserves fine-grained alignment — the word “revenue” can latch onto the exact table cell it refers to, while “Q3 2024” latches onto a different region of the same page — without forcing the model to compress everything into one vector up front. Patch embeddings are computed once at index time, so query latency is dominated by the MaxSim sum, not by model inference.
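The scoring rule above fits in a few lines of NumPy. This is a minimal sketch with illustrative shapes (4 query tokens against 1,024 patches, 128 dimensions each), not the model's actual embedding code:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take the max
    cosine similarity over all document patches, then sum over tokens.

    query_vecs: (num_query_tokens, dim), L2-normalized rows
    doc_vecs:   (num_patches, dim), L2-normalized rows
    """
    # (num_query_tokens, num_patches) cosine-similarity matrix
    sim = query_vecs @ doc_vecs.T
    # best patch per token, summed across query tokens
    return float(sim.max(axis=1).sum())

# Toy example: random unit vectors standing in for real embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128))
d = rng.normal(size=(1024, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim_score(q, d)
```

Because each token contributes a cosine in [-1, 1], the score of a 4-token query is bounded by 4; ranking, not the absolute value, is what matters.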
Multi-vector payloads are bigger. A single-vector index stores ~3 KB per page; a ColPali index stores ~512 KB per page (1,024 patches × 128 dims × 4 bytes). Production deployments mitigate with:

- Binary or scalar quantization — a 4–8x shrink with minimal recall loss
- Two-stage retrieval — a cheap pooled single-vector stage filters to the top 200, then MaxSim reranks
- Multi-vector HNSW — vector stores that natively support grouped payloads avoid row explosion
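As a sketch of the first mitigation, symmetric int8 scalar quantization gives the 4x end of the shrink range: each float32 dimension becomes one signed byte plus a shared scale factor. This is a generic illustration, not Mixpeek's specific quantizer:

```python
import numpy as np

def scalar_quantize(vecs: np.ndarray):
    """Symmetric int8 quantization: 4x smaller than float32 storage.

    Returns int8 codes plus the scale needed to approximately
    reconstruct the original values.
    """
    scale = float(np.abs(vecs).max()) / 127.0
    codes = np.clip(np.round(vecs / scale), -127, 127).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

# One page worth of patch embeddings: 1,024 x 128 float32
rng = np.random.default_rng(1)
patches = rng.normal(size=(1024, 128)).astype(np.float32)
codes, scale = scalar_quantize(patches)
approx = dequantize(codes, scale)
err = float(np.abs(patches - approx).max())
```

Per-vector or per-dimension scales tighten the error further; binary quantization (1 bit per dimension) trades more recall for an even larger shrink.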
Mixpeek implements ColPali as a feature extractor that writes multi-vector payloads to the vector store, plus a retriever stage that runs MaxSim at query time. The steps below build an end-to-end visual document retrieval pipeline.
Setting `emit_pooled_vector: true` also writes a single-vector average of the patch embeddings. That pooled vector powers a cheap first-stage filter in the retriever.
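The two-stage flow can be sketched as: cosine against the pooled vector to shortlist candidates, then exact MaxSim over only those. A minimal sketch — mean-pooling followed by re-normalization is an assumption about what the pooled vector contains; Mixpeek's internals may differ:

```python
import numpy as np

def pooled_vector(patch_vecs: np.ndarray) -> np.ndarray:
    """Mean-pool patch embeddings into one unit vector for cheap stage 1."""
    v = patch_vecs.mean(axis=0)
    return v / np.linalg.norm(v)

def maxsim(query_vecs: np.ndarray, patch_vecs: np.ndarray) -> float:
    return float((query_vecs @ patch_vecs.T).max(axis=1).sum())

def two_stage_search(query_vecs, corpus, k_filter=200, k_final=10):
    """Stage 1: rank all docs by pooled cosine. Stage 2: MaxSim rerank."""
    q_pooled = pooled_vector(query_vecs)
    pooled = np.stack([pooled_vector(p) for p in corpus])
    candidates = np.argsort(pooled @ q_pooled)[::-1][:k_filter]
    reranked = sorted(candidates,
                      key=lambda i: maxsim(query_vecs, corpus[i]),
                      reverse=True)
    return reranked[:k_final]

# Toy corpus: 5 documents of 50 patches each, 4-token query
# (unnormalized random vectors standing in for real embeddings)
rng = np.random.default_rng(2)
corpus = [rng.normal(size=(50, 16)) for _ in range(5)]
q = rng.normal(size=(4, 16))
top = two_stage_search(q, corpus, k_filter=4, k_final=2)
```

In production, stage 1 runs as an ANN query against the pooled vectors, so the expensive MaxSim touches only a few hundred pages instead of the whole corpus.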
```
POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query": "segment revenue breakdown by region with YoY growth",
    "doc_type": "financial"
  },
  "limit": 10
}
```
Each result includes the `document_id`, `page_number`, the MaxSim score, and — if you ask for it via `return_explain: true` — the per-query-token alignment matrix. That matrix is useful for UI overlays that highlight the patches each query token attended to.
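Turning the alignment matrix into an overlay reduces to an argmax per query token, mapping each flat patch index back to its cell in the patch grid. A sketch, assuming a square 32×32 grid (consistent with ~1,024 patches) laid out row-major:

```python
import numpy as np

def token_highlights(alignment: np.ndarray, grid_side: int = 32):
    """Map each query token to the (row, col) of its best-matching patch.

    alignment: (num_query_tokens, num_patches) similarity matrix, with
    patches laid out row-major over a grid_side x grid_side grid.
    """
    best = alignment.argmax(axis=1)  # strongest patch per token
    return [(int(p) // grid_side, int(p) % grid_side) for p in best]

# Toy matrix: token 0 fires strongest on patch 33, token 1 on patch 1000
align = np.zeros((2, 1024))
align[0, 33] = 0.9
align[1, 1000] = 0.8
cells = token_highlights(align)  # grid cells to highlight in the UI
```

Scaling each (row, col) cell by the rendered page size gives the pixel rectangles to draw; thresholding the max similarity filters out tokens with no strong match.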
The ViDoRe leaderboard tracks ColPali variants against OCR+text-embedding baselines across figure-heavy, table-heavy, and infographic documents. As of publication, ColPali-based models lead every category by double-digit nDCG@5 margins — the biggest gap is on infographic-style pages, where OCR pipelines lose the most information.