Why Visual Document Retrieval
Traditional document search pipelines drop information at every stage:
- OCR mangles tables, equations, and low-contrast scans
- Layout parsers miss chart and diagram semantics
- Single-vector embeddings pool a whole page into one 768d vector, losing per-region detail
Late Interaction Scoring
Scoring uses the MaxSim operator from ColBERT: for each query token vector q_i, take the maximum cosine similarity against all document patch vectors d_j, then sum across query tokens. This preserves fine-grained alignment — the word “revenue” can latch onto the exact table cell it refers to, while “Q3 2024” latches onto a different region of the same page — without forcing the model to compress everything into one vector up front.
Patch embeddings are computed once at index time, so query latency stays dominated by the MaxSim sum, not by any model inference.
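To make the operator concrete, here is a minimal NumPy sketch of MaxSim over pre-normalized embeddings. The shapes and variable names are illustrative only; they are not tied to any particular model's output format.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_patches: np.ndarray) -> float:
    """MaxSim: for each query token q_i, take the best cosine similarity over
    all document patch vectors d_j, then sum across query tokens.

    query_tokens: (num_query_tokens, dim), L2-normalized rows
    doc_patches:  (num_patches, dim), L2-normalized rows
    """
    # Cosine similarity between every query token and every patch.
    sims = query_tokens @ doc_patches.T        # (num_query_tokens, num_patches)
    # Best-matching patch per query token, summed over the query.
    return float(sims.max(axis=1).sum())

# Toy example with ColPali-like shapes: 8 query tokens, 1,024 patches, 128 dims.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128));    q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(1024, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))
```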
Storage Tradeoff
Multi-vector payloads are bigger. A single-vector index stores ~3KB per page; a ColPali index stores ~512KB per page (1,024 × 128 × 4 bytes; the arithmetic is checked after this list). Production deployments mitigate with:
- Binary or scalar quantization — 4–8x shrink with minimal recall loss
- Two-stage retrieval — a cheap pooled single-vector stage filters to top-200, then MaxSim reranks
- Multi-vector HNSW — vector stores that natively support grouped payloads avoid row explosion
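The figures above follow directly from the payload shapes. A quick back-of-the-envelope check, using only the patch count, dimensions, and byte widths quoted in this section:

```python
# ColPali-style multi-vector payload: 1,024 patches x 128 dims x float32 (4 bytes).
multi_vector_bytes = 1024 * 128 * 4              # 524,288 bytes ≈ 512 KB per page

# Pooled single-vector payload: one 768-dim float32 embedding.
pooled_bytes = 768 * 4                           # 3,072 bytes ≈ 3 KB per page

print(round(multi_vector_bytes / pooled_bytes))  # ≈ 171, the "~170x smaller" figure

# Scalar (int8) quantization keeps 1 byte per dimension instead of 4,
# i.e. the 4x end of the 4-8x range quoted above.
int8_bytes = 1024 * 128 * 1                      # 131,072 bytes ≈ 128 KB per page
print(multi_vector_bytes / int8_bytes)           # 4.0
```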
Pipeline Overview
1. Create a Bucket
Buckets hold the raw PDFs or page images. Use the document blob type so each PDF is automatically split into per-page images downstream.
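A minimal sketch of the bucket-creation call using a plain Python requests client. The base URL, endpoint path, and field names here are assumptions for illustration; check the Mixpeek API reference for the exact request shape.

```python
import requests

BASE_URL = "https://api.mixpeek.com"   # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Create a bucket for raw PDFs; the "document" blob type lets the platform
# split each PDF into per-page images downstream.
resp = requests.post(
    f"{BASE_URL}/buckets",             # illustrative path, not verbatim
    headers=HEADERS,
    json={
        "name": "quarterly-reports",
        "blob_type": "document",       # assumed field name
    },
)
print(resp.json())
```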
2. Create a Collection with the Visual Document Extractor
The collection runs each page image through the VLM and stores a multi-vector payload per page. Setting emit_pooled_vector: true also writes a single-vector average of the patch embeddings. That pooled vector powers a cheap first-stage filter in the retriever.
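A hedged sketch of the collection setup. The extractor identifier and payload schema are assumptions, but it shows where emit_pooled_vector: true sits in the configuration.

```python
import requests

BASE_URL = "https://api.mixpeek.com"   # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Collection bound to the visual document extractor: each page image is run
# through the VLM and stored as a multi-vector (patch embedding) payload.
resp = requests.post(
    f"{BASE_URL}/collections",          # illustrative path, not verbatim
    headers=HEADERS,
    json={
        "name": "report-pages",
        "source": {"bucket": "quarterly-reports"},
        "feature_extractors": [
            {
                "name": "visual_document_extractor",  # assumed identifier
                "parameters": {
                    # Also write the pooled average of the patch embeddings,
                    # used by the cheap first stage of the retriever below.
                    "emit_pooled_vector": True,
                },
            }
        ],
    },
)
print(resp.json())
```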
3. Ingest Documents
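Ingestion is a per-file upload into the bucket. The endpoint path and multipart field below are assumptions sketched for orientation only.

```python
import requests

BASE_URL = "https://api.mixpeek.com"   # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Upload one PDF into the bucket; because the bucket uses the document blob
# type, its pages are split into images when the batch is processed.
with open("q3-2024-earnings.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/buckets/quarterly-reports/objects",  # illustrative path
        headers=HEADERS,
        files={"file": ("q3-2024-earnings.pdf", f, "application/pdf")},
    )
print(resp.json())
```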
4. Process the Batch
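Processing runs the extractor over everything ingested since the last run. A speculative sketch, with the same caveats about endpoint and field names:

```python
import requests

BASE_URL = "https://api.mixpeek.com"   # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Trigger processing: each pending PDF is split into page images and every
# page is embedded by the visual document extractor into the collection.
resp = requests.post(
    f"{BASE_URL}/buckets/quarterly-reports/process",  # illustrative path
    headers=HEADERS,
    json={"collection": "report-pages"},
)
# The response is assumed to include a batch/task identifier you can poll.
print(resp.json())
```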
5. Create a Two-Stage Retriever
Stage 1 uses the pooled vector for a fast ANN shortlist. Stage 2 reranks with full MaxSim late interaction against the patch embeddings.
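A hedged configuration sketch for the two-stage retriever. The stage types, vector names, and top-200 cutoff mirror the description above and earlier in this page; the exact schema is an assumption.

```python
import requests

BASE_URL = "https://api.mixpeek.com"   # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.post(
    f"{BASE_URL}/retrievers",            # illustrative path, not verbatim
    headers=HEADERS,
    json={
        "name": "visual-docs-two-stage",
        "collection": "report-pages",
        "stages": [
            {
                # Stage 1: fast ANN shortlist over the pooled single vectors.
                "type": "knn",
                "vector": "pooled_vector",      # assumed field name
                "limit": 200,
            },
            {
                # Stage 2: MaxSim late-interaction rerank over patch embeddings.
                "type": "late_interaction_rerank",  # assumed stage type
                "vector": "patch_embeddings",       # assumed field name
                "limit": 20,
            },
        ],
    },
)
print(resp.json())
```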
6. Query
Each result includes document_id, page_number, the MaxSim score, and the per-query-token alignment matrix if you ask for it via return_explain: true. That matrix is useful for UI overlays that highlight the patches each query token attended to.
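Querying the retriever with return_explain: true, sketched under the same assumptions about endpoint and payload shape:

```python
import requests

BASE_URL = "https://api.mixpeek.com"   # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.post(
    f"{BASE_URL}/retrievers/visual-docs-two-stage/query",  # illustrative path
    headers=HEADERS,
    json={
        "query": "Q3 2024 revenue by segment",
        "return_explain": True,   # also return the per-query-token alignment matrix
    },
)
for hit in resp.json().get("results", []):
    # Fields named above: document_id, page_number, and the MaxSim score.
    print(hit.get("document_id"), hit.get("page_number"), hit.get("score"))
```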
When to Use ColPali vs Text Search
Use ColPali when
- Pages are visually complex (tables, charts, infographics, equations)
- OCR quality is unreliable (scans, handwriting, multi-column layouts)
- You need cross-lingual retrieval without per-language tuning
- Figures and diagrams carry meaning text alone cannot capture
Use text search when
- Documents are born-digital and text-only
- Storage cost dominates (pooled single-vector is ~170x smaller)
- You need sub-10ms query latency at very high QPS
- Corpus is narrow and a single embedder already hits your recall target
Benchmarks
The ViDoRe leaderboard tracks ColPali variants against OCR+text-embedding baselines across figure-heavy, table-heavy, and infographic documents. As of publication, ColPali-based models lead every category by double-digit nDCG@5 margins — the biggest gap is on infographic-style pages, where OCR pipelines lose the most information.
Classify with Taxonomies
Auto-classify documents by visual layout (financial report, slide deck, research paper). The taxonomy assigns an auto_doc_type label based on visual similarity to reference pages. See Taxonomies.
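A speculative sketch of attaching a layout taxonomy. The auto_doc_type output field and the node names follow the prose above; the endpoint, schema, and reference document IDs are assumptions.

```python
import requests

BASE_URL = "https://api.mixpeek.com"   # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Taxonomy whose nodes are defined by reference pages; new pages get an
# auto_doc_type label based on visual similarity to those references.
resp = requests.post(
    f"{BASE_URL}/taxonomies",             # illustrative path, not verbatim
    headers=HEADERS,
    json={
        "name": "doc-layout",
        "output_field": "auto_doc_type",
        "nodes": [
            {"name": "financial_report", "reference_documents": ["doc_123"]},
            {"name": "slide_deck",       "reference_documents": ["doc_456"]},
            {"name": "research_paper",   "reference_documents": ["doc_789"]},
        ],
    },
)
print(resp.json())
```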
Discover Clusters
Find visual document patterns across your corpus.
Set Up Alerts
Get notified when new documents match specific types.
Set Up Webhooks
Monitor document processing.
Further Reading
- ColPali: Efficient Document Retrieval with Vision Language Models — the original paper
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT — MaxSim origins
- ViDoRe Benchmark — public leaderboard
- PaliGemma model card — the VLM backbone ColPali fine-tunes

