The OCR Pipeline Is Breaking
For twenty years, document search has followed the same recipe: extract text with OCR, detect layout regions, chunk the text, embed each chunk, store it in a vector database, and run similarity search at query time. This pipeline works well for clean, text-heavy documents. It falls apart the moment your documents contain tables, charts, diagrams, mixed layouts, handwritten annotations, or multilingual content.
The failure modes are predictable:
The traditional pipeline has another structural problem: it is a cascade of independent components. Each stage can only pass limited information to the next. The OCR model does not know what the retrieval model needs. The layout detector does not know which regions matter for the user's query. Errors compound through the chain.
What Visual Document Retrieval Actually Is
Visual document retrieval treats each document page as an image. Instead of extracting text and then embedding it, you embed the page image directly using a vision-language model. The retriever searches over these visual embeddings, matching queries to pages based on what the model *sees* — text, tables, charts, layout, and all.
This is not a small architectural tweak. It eliminates the entire OCR-layout-chunking chain. One model replaces four or five pipeline stages.
The key insight: modern vision-language models already understand text in images. When you show a VLM a photograph of a document page, it can read the text, understand the table structure, interpret the chart, and grasp the spatial relationships between elements. All of this information is encoded into the embedding. The retriever inherits this understanding for free.
Late Interaction: The Scoring Mechanism
The most effective visual document retrieval systems use late interaction scoring, an approach pioneered by ColBERT for text retrieval and extended to vision by ColPali.
How Late Interaction Works
In a standard bi-encoder retriever, each document and each query are compressed into a single vector. Relevance is computed as a dot product between two vectors. This is fast but lossy — the entire document's meaning must fit into 768 or 1024 dimensions.
Late interaction keeps *multiple* vectors per document. Each image patch (a 16x16 or 14x14 pixel region) produces its own vector. A document page might generate 1024 patch vectors. The query also produces multiple token vectors.
Scoring uses MaxSim (Maximum Similarity):
$$
\text{score}(\text{query}, \text{document}) = \sum_{i} \max_{j \in \text{doc}} \text{sim}(q_i, d_j)
$$
For each query token vector `q_i`, find the document patch vector `d_j` with the highest similarity. Sum these maximum similarities across all query tokens.
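The scoring rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any library's implementation: the helper names `dot` and `maxsim_score` are made up for this example, and the dot product is assumed to be the similarity function (it equals cosine similarity when the vectors are unit-normalized).

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """For each query token, take its best-matching patch, then sum the maxima."""
    if not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy data: 2 query token vectors and 3 patch vectors in 2 dimensions.
query = [[1.0, 0.0], [0.0, 1.0]]
page = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

# The first query token matches patch 0 best (0.9), the second matches patch 1 (0.8).
print(maxsim_score(query, page))
```

Each query token is free to pick a different patch, which is what lets separate parts of a query land on separate regions of the page.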
Why this works better than single-vector matching:
1. Fine-grained matching. The query "Q3 revenue" can match the specific patch containing "Q3" and the nearby patch containing the revenue number. A single-vector model must compress the entire page into one point, losing this spatial precision.
2. No information bottleneck. With 1024 patch vectors, the model retains far more information about the page than a single 768-dim vector ever could.
3. Layout awareness. Patch positions encode spatial relationships. The model knows that a number is *in* a specific table cell, not just that the number exists somewhere on the page.
The Efficiency Trade-off
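Late interaction is more expensive than single-vector retrieval. Storing 1024 vectors per page instead of 1 means 1024 times as many vectors in the index; even when each patch vector is projected down to 128 dimensions, that is roughly 170 times the storage of a single 768-dimensional vector. MaxSim scoring also requires comparing every query token against every document patch.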
In practice, this is managed with a two-stage pipeline: a fast first stage (BM25 or single-vector ANN search) retrieves candidate pages, and late interaction rescores the top-k candidates. This brings latency back to acceptable levels while preserving the accuracy gains.
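A minimal sketch of the two-stage idea follows. It makes several assumptions for illustration: brute-force single-vector scoring stands in for the fast first stage (a real system would use BM25 or an ANN index), both indexes are plain dictionaries keyed by page id, and the names `two_stage_search`, `k_candidates`, and `top_n` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: List[Vector], patch_vecs: List[Vector]) -> float:
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)


def two_stage_search(
    query_single: Vector,
    query_multi: List[Vector],
    single_index: Dict[str, Vector],
    multi_index: Dict[str, List[Vector]],
    k_candidates: int = 100,
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap scoring over every page, keeping only the best k candidates.
    candidates = sorted(
        single_index,
        key=lambda page_id: dot(query_single, single_index[page_id]),
        reverse=True,
    )[:k_candidates]
    # Stage 2: expensive MaxSim scoring, but only on the surviving candidates.
    rescored = [(page_id, maxsim(query_multi, multi_index[page_id])) for page_id in candidates]
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored[:top_n]


single_index = {"p1": [1.0, 0.0], "p2": [0.8, 0.2], "p3": [0.0, 1.0]}
multi_index = {
    "p1": [[0.6, 0.0], [0.0, 0.1]],
    "p2": [[0.9, 0.0], [0.0, 0.9]],
    "p3": [[0.0, 1.0]],
}
results = two_stage_search(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], single_index, multi_index,
    k_candidates=2, top_n=2,
)
print([page_id for page_id, _ in results])  # ['p2', 'p1']
```

In this toy data, `p3` would have outscored `p1` under MaxSim but never reaches the second stage. That is the cost of the shortcut: the first stage caps recall, so `k_candidates` has to be generous enough for the rescoring stage to see the right pages.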
ColPali: The Architecture That Started It All
ColPali (2024) was the first model to demonstrate that vision-language models could directly replace OCR pipelines for document retrieval. The architecture is straightforward:
1. Vision encoder: PaliGemma (a 3B VLM based on SigLIP + Gemma) processes the document page image.
2. Patch embeddings: The vision encoder produces one embedding per image patch. A 448x448 image with 14x14 patches yields 1024 patch vectors.
3. Projection layer: A linear layer maps patch embeddings to 128 dimensions for efficient storage and scoring.
4. Late interaction scoring: Queries are tokenized and embedded with the same model's text encoder. MaxSim scores the query tokens against document patches.
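The patch arithmetic in step 2 is easy to check. The helper `patch_count` below is a hypothetical name used only for this example, and it assumes a square image whose side is an exact multiple of the patch size.

```python
def patch_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


num_patches = patch_count(448, 14)   # 32 patches per side
embedding_shape = (num_patches, 128)  # one 128-dim vector per patch after projection
print(num_patches, embedding_shape)  # 1024 (1024, 128)
```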
ColPali's key result: it matched or beat OCR-based retrieval pipelines on the ViDoRe (Visual Document Retrieval) benchmark, while being dramatically simpler. No OCR. No layout detection. No chunking. No text extraction at all.
What ColPali Sees
When ColPali processes a page, the patch embeddings capture everything visible:
A query like "comparison of model accuracy on ImageNet" will match patches containing accuracy tables, bar charts showing model comparisons, or even figure captions mentioning ImageNet — all without any text extraction step.
ColQwen: Scaling Up
ColQwen2.5 (from the ViDoRe team) replaces PaliGemma with Qwen2.5-VL as the backbone, bringing several improvements:
At the time of writing, ColQwen2.5-v0.2 leads the ViDoRe v2 benchmark, outperforming ColPali v1.3 and several OCR-based baselines.
Production Architecture: Building a Visual Document Retrieval Pipeline
A production system typically combines visual document retrieval with traditional methods in a hybrid pipeline:
Stage 1: Indexing
For each document:
1. Render each page as an image (PDF → PNG at 300 DPI)
2. Run the VLM encoder to produce patch embeddings per page
3. Store patch embeddings in a multi-vector index
4. Optionally: also run OCR and store text chunks for hybrid search
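The indexing steps above might be organized as in the sketch below. Everything here is illustrative: `PageRecord`, `index_document`, and the `encode_page` and `run_ocr` callables are hypothetical names, page rendering is assumed to have happened already, and a plain dictionary stands in for a real multi-vector index.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


@dataclass
class PageRecord:
    doc_id: str
    page_number: int
    patch_vectors: List[Vector]     # one vector per image patch
    ocr_text: Optional[str] = None  # only filled in when hybrid search is wanted


def index_document(
    doc_id: str,
    page_images: Sequence[object],                   # pages already rendered as images
    encode_page: Callable[[object], List[Vector]],   # stand-in for the VLM encoder
    index: Dict[str, PageRecord],                    # stand-in for a multi-vector store
    run_ocr: Optional[Callable[[object], str]] = None,
) -> List[str]:
    """Encode every page and store its patch vectors under a per-page key."""
    keys = []
    for page_number, image in enumerate(page_images, start=1):
        key = f"{doc_id}#p{page_number}"
        index[key] = PageRecord(
            doc_id=doc_id,
            page_number=page_number,
            patch_vectors=encode_page(image),
            ocr_text=run_ocr(image) if run_ocr else None,
        )
        keys.append(key)
    return keys


# Usage with a fake encoder that returns 3 patch vectors of 4 dimensions per page.
def fake_encoder(image: object) -> List[Vector]:
    return [[0.0] * 4 for _ in range(3)]


index: Dict[str, PageRecord] = {}
print(index_document("report", ["page-1-image", "page-2-image"], fake_encoder, index))
# ['report#p1', 'report#p2']
```

Passing the encoder and OCR step in as callables keeps the visual and text paths independent, so the OCR branch can be switched on per collection without touching the rest of the indexing loop.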
Stage 2: Retrieval
For each query:
1. Fast candidate retrieval (BM25 on OCR text, or single-vector ANN)
2. Late interaction rescoring on top-k candidates (MaxSim over patches)
3. Optional: cross-encoder reranking on final top-n
4. Return ranked pages with bounding box highlights
Stage 3: Post-Retrieval
Once you have the relevant pages, a generative VLM can answer questions directly from the page image — no text extraction needed. The user asks a question, the retriever finds the right page, and the VLM reads the answer from the image.
Index Size and Latency
Practical numbers for planning:
| Metric | Single-vector | Late interaction (128-dim, 1024 patches) |
| --- | --- | --- |
| Storage per page | ~3 KB | ~512 KB |
| 1M page index | ~3 GB | ~500 GB |
| Query latency (ANN) | ~5 ms | ~50 ms (with pre-filtering) |
| Accuracy (ViDoRe) | ~65 nDCG@5 | ~85 nDCG@5 |
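The storage rows can be reproduced with simple arithmetic. The sketch below assumes uncompressed float32 values (4 bytes each) and a 768-dimensional single vector; neither assumption is stated next to the table, but together they give numbers that match it. The helper name `bytes_per_page` is made up for this example.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as uncompressed float32


def bytes_per_page(num_vectors: int, dim: int) -> int:
    return num_vectors * dim * BYTES_PER_FLOAT32


single_vector = bytes_per_page(1, 768)        # 3072 bytes, i.e. 3 KiB
late_interaction = bytes_per_page(1024, 128)  # 524288 bytes, i.e. 512 KiB

pages = 1_000_000
print(f"single-vector index: {single_vector * pages / 1e9:.1f} GB")
print(f"late-interaction index: {late_interaction * pages / 1e9:.1f} GB")
print(f"storage ratio: {late_interaction / single_vector:.0f}x")
```

Quantizing to float16 or int8 would cut the late-interaction figures by a factor of 2 or 4 respectively, at whatever accuracy cost the quantization introduces.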
When OCR Still Wins
Visual document retrieval is not universally superior. OCR-based pipelines remain the better choice when:
The strongest production systems use both: OCR for text extraction and keyword search, visual retrieval for semantic understanding and layout-aware queries. The two approaches are complementary, not competing.
MetaEmbed: Test-Time Compute Scaling for Multi-Vector Retrieval
MetaEmbed (ICLR 2026 Oral) introduced a technique called Meta Tokens that addresses the storage and compute cost of late interaction retrieval.
The core idea: instead of storing all 1024 patch vectors per page, train the model to compress them into a smaller set of "meta tokens" — say 32 or 64 vectors — that preserve the most retrieval-relevant information. At query time, you can choose how many meta tokens to use based on your latency budget.
This is analogous to Matryoshka embeddings (where you truncate a single vector to fewer dimensions) but applied to the multi-vector setting. A 1024-patch page compressed to 64 meta tokens uses 16x less storage while retaining most of the accuracy.
The result is a smooth accuracy-efficiency curve: use 16 meta tokens for fast, approximate retrieval; use 256 for high-accuracy rescoring. The same index supports both operating points.
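One way to picture the adjustable budget is the sketch below. It rests on a loudly stated assumption: that a page's meta tokens are stored in an order where any prefix is usable on its own, in the same spirit as Matryoshka truncation. The function `score_with_budget` is a hypothetical name, the vectors are toy values, and only the document side is truncated here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def maxsim(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
        for q in query_vecs
    )


def score_with_budget(query_vecs: List[Vector], meta_tokens: List[Vector], budget: int) -> float:
    """MaxSim restricted to the first `budget` meta tokens (assumed prefix-ordered)."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    return maxsim(query_vecs, meta_tokens[:budget])


query = [[1.0, 0.0], [0.0, 1.0]]
page_meta_tokens = [[0.5, 0.5], [0.9, 0.0], [0.0, 0.8], [0.1, 0.1]]

# Same stored page, three different operating points.
scores = {b: score_with_budget(query, page_meta_tokens, b) for b in (1, 2, 4)}
print(scores)

# Storage reduction quoted in the text: 1024 patch vectors down to 64 meta tokens.
print(1024 // 64)  # 16
```

The MaxSim score can only stay the same or grow as the budget increases, because each maximum is taken over a larger set. Whether ranking quality improves along with it depends on how the model was trained, which this toy example does not capture.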
Building This with Mixpeek
Mixpeek's pipeline architecture maps directly to visual document retrieval:
1. Ingest documents via the Assets API: PDF, DOCX, images, and scanned forms are all accepted
2. Configure extractors: run visual embeddings (SigLIP, ColPali) alongside traditional OCR and text embeddings on the same documents
3. Build a retriever with multiple stages: BM25 on extracted text for fast candidate generation, feature search on visual embeddings for semantic rescoring
4. Query across modalities: a single retriever query searches text, visual, and layout features simultaneously, with RRF fusion to combine scores
The key advantage of this approach: you do not have to choose between OCR and visual retrieval. Both run in the same pipeline, and the retriever fuses their results. Documents with clean text benefit from OCR precision. Documents with complex layouts benefit from visual understanding. The system adapts per-document without manual intervention.
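For readers unfamiliar with RRF (reciprocal rank fusion), here is a generic sketch of the technique. It is not Mixpeek's API or configuration: the function name `rrf_fuse` is hypothetical, and the constant `k = 60` is simply the value commonly used for RRF.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Combine ranked lists: each list contributes 1 / (k + rank) per page."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text_ranking = ["p3", "p1", "p2"]    # e.g. from BM25 over OCR text
visual_ranking = ["p1", "p2", "p4"]  # e.g. from visual embeddings

fused = rrf_fuse([text_ranking, visual_ranking])
print([page_id for page_id, _ in fused])  # ['p1', 'p2', 'p3', 'p4']
```

RRF uses only rank positions, so BM25 scores and embedding similarities never need to be put on a common scale. Pages that appear in both lists (`p1`, `p2`) rise above pages that only one retriever found.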