Text-Only RAG Was Never Going to Work for Real Documents
Real-world PDFs are not paragraphs of clean prose. They are charts, tables, equations, multi-column layouts, footnotes, stamped scans, and rendered figures. Every step of a traditional OCR + chunk + embed pipeline silently throws information away.
OCR Is the Wrong Abstraction
OCR was designed to recover plain text from scans. It does not understand charts, tables, mathematical notation, or layout — and a single bad character can cascade, poisoning every downstream chunk.
Chunking Loses the Page
Splitting a document into 512-token chunks discards the visual hierarchy that gives the page its meaning. A figure caption ends up far from its figure, a table header far from its rows.
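A toy sketch makes the failure concrete. Here fixed-size character chunks stand in for 512-token chunks; the caption and its table land in different chunks, so a retriever can match one without ever seeing the other (the document text and chunk size are illustrative):

```python
# Naive fixed-size chunking: the caption and the table it describes
# end up in different chunks, breaking their association.
doc = (
    "Figure 3: Q3 free cash flow vs guidance. "
    + "filler " * 100  # intervening body text
    + "| Quarter | FCF | Guidance |\n| Q3 | $1.2B | $1.0B |"
)
chunk_size = 200  # characters, standing in for a 512-token window
chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

print("caption in chunk 0:", "Figure 3" in chunks[0])   # True
print("table in chunk 0:", "Quarter" in chunks[0])      # False
```

A query about "Q3 free cash flow" can now retrieve the caption chunk while the numbers sit unretrieved in a later chunk.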
Single Vectors Lose Detail
Averaging an entire page (or chunk) into one dense vector erases the fine-grained alignment between a question and the specific region of the page that answers it. Late interaction recovers that alignment.
How Visual Document Retrieval Works
Two hops instead of four. Render the page, embed it with a vision-language model, search with late interaction, return exact pages.
Render Pages as Images
Each PDF page becomes a rendered image. No OCR pipeline, no layout parser, no markdown converter — the raw visual content of the page is the source of truth.
Embed with a Vision-Language Model
ColPali, ColQwen2, or ColSmol encode each page into a grid of patch embeddings. The model has already learned to read text, charts, tables, equations, and layout jointly.
Match with Late Interaction (MaxSim)
At query time, the question is encoded as token embeddings. MaxSim scoring compares each query token against every page patch and sums the best matches — recovering fine-grained relevance text-only RAG misses.
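The MaxSim scoring described above can be sketched in a few lines of NumPy. The shapes are toy values; real ColPali-style models produce on the order of 1024 patch embeddings of 128 dimensions per page:

```python
import numpy as np

def maxsim(query_embs: np.ndarray, page_embs: np.ndarray) -> float:
    """Late-interaction (MaxSim) score.

    For each query token embedding, take the maximum dot product
    over all page patch embeddings, then sum those maxima.
    """
    sim = query_embs @ page_embs.T        # (tokens, patches) similarity matrix
    return float(sim.max(axis=1).sum())   # best patch per token, summed

rng = np.random.default_rng(0)
query = rng.standard_normal((8, 128)).astype(np.float32)    # 8 query tokens
pages = [rng.standard_normal((1024, 128)).astype(np.float32)
         for _ in range(3)]                                  # 3 candidate pages

scores = [maxsim(query, p) for p in pages]
print("best page:", int(np.argmax(scores)))
```

Because each query token is matched independently, a token like "Q3" can lock onto the one table cell that mentions it, instead of being diluted into a page-level average.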
Return Exact Pages, Not Chunks
Results come back as the source PDF, page number, page image, and similarity score. Feed the matched page images straight into a multimodal LLM for grounded answers — no chunk reassembly required.
Mixpeek's render → embed pipeline runs on the same warehouse that handles your video, image, and audio collections. Build a unified retrieval surface across every modality your business produces.
ColPali, ColQwen2, ColSmol — Choose Your Encoder
Different visual document retrievers trade off recall, latency, multilingual coverage, and cost. Mixpeek runs all of them on managed GPU infrastructure.
| Model | Backbone | Embedding Shape | ViDoRe v1 nDCG@5 | License | When to Use |
|---|---|---|---|---|---|
| ColPali | PaliGemma 3B | ~1024 patches × 128 dims | 81.3 | Gemma / MIT (code) | The original late-interaction VLM retriever. Strong baseline, broad community support, easy to fine-tune. |
| ColQwen2 | Qwen2-VL 2B / 7B | ~1024 patches × 128 dims | 87.3 | Apache 2.0 (code) / Qwen license | Current SOTA on ViDoRe. Multilingual, handles dense text and small fonts. Default recommendation for production. |
| ColSmol | SmolVLM 256M / 500M | ~768 patches × 64 dims | 76.8 | Apache 2.0 | Tiny, CPU-friendly footprint. Best for edge, on-prem, or cost-constrained workloads. |
| DSE-Qwen2 | Qwen2-VL 2B | single 1536-d vector | 85.8 | Apache 2.0 | Single-vector alternative. Cheaper to store and search; modest recall trade-off vs. multi-vector ColQwen2. |
Visual Document Retrieval vs OCR + RAG vs Text Embeddings
A side-by-side look at where each approach holds up — and where it falls apart.
| Capability | Visual Document Retrieval | OCR + RAG | Text Embeddings |
|---|---|---|---|
| Charts, tables, and figures | Native — encoded as part of the page | Lost or mangled by OCR + chunker | Invisible |
| Scanned and stamped pages | Works directly on the rendered image | OCR errors compound through chunking | Fails entirely |
| Multilingual & non-Latin scripts | Handled by the VLM tokenizer | Requires language-specific OCR stacks | Embedding model dependent |
| Mathematical equations | Preserved in pixel space | Often garbled by general-purpose OCR | Lost |
| Layout, columns, footnotes | Implicit in the patch grid | Depends on layout parser quality | Flattened to a single stream |
| Pipeline complexity | Render → embed (1 hop) | OCR → layout → chunk → embed (4+ hops) | Parse → chunk → embed (3 hops) |
| Storage cost per page | ~50–500 KB (multi-vector) | ~5–20 KB (text + vectors) | ~5–20 KB |
| Recall on visual queries | State of the art (ViDoRe SOTA) | Weak on charts/figures | Near zero on visual content |
Where Visual Document Retrieval Wins
Anywhere the page itself carries the meaning — financial filings, claims forms, contracts, scientific papers, manuals, lab reports.
Financial Filings & Reports
Search 10-Ks, prospectuses, and earnings decks by what they actually look like. Charts, tables, footnotes, and pixel-rendered figures all become first-class queryable content — no brittle OCR + chunking pipeline.
Insurance Claims & Forms
Index scanned claim forms, ACORD documents, and adjuster reports with their full visual layout. Retrieve by question, by region, or by similar claim — even when handwriting and stamps would defeat OCR.
Legal Contracts & Discovery
Match clauses, signature blocks, exhibits, and stamped pages across millions of contracts. Late-interaction retrieval recovers structure and formatting that text-only RAG silently throws away.
Scientific & Pharma Papers
Question-answer over figures, equations, and chemical structures. Retrieve the exact page region that contains the relevant chart or assay table — not the closest paragraph of body text.
Technical Manuals & SOPs
Index diagrams, exploded views, and parts tables alongside body copy. Field technicians query in natural language and get back the right page of the right manual, instantly.
Clinical & Lab Reports
Search across PDFs of trial protocols, lab results, and patient charts where structure and visual hierarchy carry the meaning. Visual document retrieval preserves both.
Build Visual Document Retrieval in Minutes
Drop in your PDFs, pick a vision-language encoder, and call a single retriever endpoint. No OCR pipeline to maintain.
from mixpeek import Mixpeek
client = Mixpeek(api_key="YOUR_API_KEY")
# 1. Create a namespace for your document corpus
client.namespaces.create(
namespace_name="visual-docs",
description="Visual document retrieval over PDFs with ColPali / ColQwen",
)
# 2. Define a collection that renders each PDF page to an image
# and embeds it with a vision-language model (ColPali / ColQwen2 family).
# No OCR. No layout parsing. No brittle chunking.
client.collections.create(
collection_name="pdf-pages",
feature_extractors=[
{"type": "pdf_page_render", "dpi": 144},
{
"type": "image_embedding",
"model": "colqwen2", # or "colpali", "colsmol"
"multi_vector": True, # late-interaction (MaxSim)
},
],
)
# 3. Upload PDFs and trigger automatic processing
client.buckets.upload(
bucket_name="enterprise-pdfs",
files=["10k_2024.pdf", "claim_form_4471.pdf", "..."],
auto_process=True,
)
# 4. Build a visual document retriever with late-interaction scoring
retriever = client.retrievers.create(
retriever_name="visual_doc_retriever",
inputs=[{"name": "question", "type": "text"}],
settings={
"stages": [
{"type": "feature_search", "method": "multi_vector",
"scoring": "max_sim", "limit": 50},
{"type": "rerank", "model": "monoqwen-vision", "limit": 10},
]
},
)
# 5. Ask a natural-language question — get back exact page images
results = client.retrievers.execute(
retriever_id=retriever.retriever_id,
inputs={"question": "What was Q3 2024 free cash flow vs guidance?"},
)
for doc in results.documents:
print(f"{doc.metadata['source_pdf']} page={doc.metadata['page']} "
          f"score={doc.score:.3f} page_image={doc.metadata['page_image_url']}")
Managed GPU encoders
ColPali, ColQwen2, and ColSmol run on auto-scaling GPU infrastructure. No model serving, no quantization tuning, no Triton ops on your roadmap.
Multilingual by default
ColQwen2 handles Chinese, Japanese, Korean, Arabic, Hindi, and most major European scripts without language-specific OCR stacks.
Hybrid with text + filters
Compose multi-vector visual search with structured filters and text retrievers in a single retriever pipeline. Works the same way as the rest of Mixpeek.
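A hybrid pipeline can be sketched as a plain stage list, in the same shape as the retriever example above. The `filter` stage and its `conditions` fields (`jurisdiction`, `year`) are illustrative placeholders, not a documented schema:

```python
# Stage list for a hybrid retriever: metadata filter, then
# late-interaction visual search, then a vision reranker.
# Pass as client.retrievers.create(settings=...) per the example above.
settings = {
    "stages": [
        # 1. Narrow candidates on structured metadata first (illustrative fields).
        {"type": "filter",
         "conditions": {"jurisdiction": "US", "year": {"gte": 2022}}},
        # 2. Multi-vector (MaxSim) visual search over the survivors.
        {"type": "feature_search", "method": "multi_vector",
         "scoring": "max_sim", "limit": 50},
        # 3. Vision reranker for the final top 10.
        {"type": "rerank", "model": "monoqwen-vision", "limit": 10},
    ]
}
print([s["type"] for s in settings["stages"]])
```

Filtering first keeps the expensive MaxSim stage scoped to documents that can actually satisfy the query's structured constraints.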
Frequently Asked Questions
What is visual document retrieval?
Visual document retrieval is a technique for searching PDFs and other documents by treating each page as an image and embedding it with a vision-language model. Instead of running OCR, parsing layout, and chunking text, you render each page, encode it with a model like ColPali or ColQwen2, and match queries against the resulting page-level embeddings. The model has already learned to read text, tables, charts, and equations jointly — so charts, scanned pages, and complex layouts that defeat traditional OCR + RAG pipelines become first-class searchable content.
What is ColPali?
ColPali is a vision-language model retriever published by Faysse et al. (2024) and accepted at ICLR 2025. It is built on PaliGemma 3B and uses late-interaction scoring (MaxSim, the same idea as ColBERT) over patch-level embeddings produced for each rendered PDF page. ColPali introduced the ViDoRe benchmark and demonstrated that you can skip OCR entirely for document retrieval and still beat the strongest text-based RAG pipelines, especially on visually-rich content. It is the foundational model in the visual document retrieval family.
What is ColQwen2 and how is it different from ColPali?
ColQwen2 swaps PaliGemma for Qwen2-VL as the vision-language backbone, keeping the same late-interaction MaxSim training recipe. It is currently the state of the art on ViDoRe v1/v2 and handles multilingual content, dense text, and small fonts noticeably better than the original ColPali. For most production workloads in 2025–2026, ColQwen2 (2B or 7B) is the default recommendation. ColPali remains a strong, slightly cheaper alternative.
How does late interaction (MaxSim) work?
Late interaction encodes each document into many small embeddings (one per page patch) and each query into many small embeddings (one per token), then scores a (query, document) pair by, for each query token, taking the maximum dot product across all document patches and summing those maxima. This recovers fine-grained alignment between query terms and specific page regions — far more precise than averaging everything into a single vector. The trade-off is roughly 50–500 KB of storage per page instead of a few KB.
Do I need to run OCR to use visual document retrieval?
No. That is the entire point. The vision-language model encoder reads pixels directly — text, equations, tables, charts, signatures, stamps, handwriting — and produces embeddings that already encode all of that information. Skipping OCR removes a major source of pipeline complexity and silent recall failures.
How does visual document retrieval compare to traditional OCR + RAG?
Traditional OCR + RAG pipelines have four hops: OCR → layout parsing → text chunking → embedding. Each hop loses information, especially for charts, tables, multi-column layouts, scanned pages, and non-Latin scripts. Visual document retrieval has two hops: render → embed. On the ViDoRe benchmark, visual document retrievers (ColPali, ColQwen2) consistently beat the strongest text-based RAG baselines on visually-rich corpora, with the largest gaps on charts, tables, and scanned content.
Does visual document retrieval work on scanned PDFs and images?
Yes — and this is one of the strongest reasons to adopt it. Because the encoder operates on pixels, scanned pages, photos of forms, and image-only PDFs work without any OCR step. The same model that handles a born-digital 10-K also handles a 30-year-old scanned contract.
What languages are supported?
ColQwen2 inherits Qwen2-VL's multilingual coverage, which includes Chinese, Japanese, Korean, Arabic, Hindi, Russian, and most major European languages. ColPali (PaliGemma) is strongest on Latin scripts but handles other scripts at reduced quality. For non-English-dominant corpora, ColQwen2 is the default choice.
How much does it cost to store visual document embeddings at scale?
Multi-vector ColPali / ColQwen2 embeddings are roughly 50–500 KB per page depending on patch grid size and quantization. A 10-million page corpus is on the order of 1–5 TB raw, compressible 4–8x with binary or product quantization. For corpora where storage is a hard constraint, single-vector alternatives like DSE-Qwen2 trade a few points of recall for ~30x smaller indexes.
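The arithmetic behind those numbers, assuming a ColPali-style layout of 1024 patches × 128 dims per page in fp16 (2 bytes per value):

```python
# Back-of-envelope storage for multi-vector page embeddings.
patches, dims, bytes_per_value = 1024, 128, 2   # fp16
pages = 10_000_000

per_page = patches * dims * bytes_per_value     # 262,144 B ≈ 256 KB/page
total_tb = per_page * pages / 1e12              # ≈ 2.6 TB raw

print(f"raw: {per_page / 1024:.0f} KB/page, {total_tb:.1f} TB total")
for factor in (4, 8):                           # quantization factors from the text
    print(f"{factor}x compression: {per_page / factor / 1024:.0f} KB/page, "
          f"{total_tb / factor:.2f} TB total")
```

So a 10M-page corpus lands around 2.6 TB raw, or roughly 0.3–0.7 TB after 4–8x quantization — consistent with the ranges above.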
What is the ViDoRe benchmark?
ViDoRe (Visual Document Retrieval Benchmark) is the standard evaluation for visual document retrieval, introduced alongside ColPali. It measures nDCG@5 over a mix of academic, industrial, and synthetic document collections in multiple languages, covering charts, tables, infographics, and dense text. ViDoRe v1, v2, and v3 are the canonical leaderboards for comparing models like ColPali, ColQwen2, ColSmol, and DSE-Qwen2.
How does Mixpeek support visual document retrieval?
Mixpeek ingests PDFs through a bucket, renders each page to an image, and runs a vision-language model extractor (ColPali, ColQwen2, or ColSmol) to produce multi-vector embeddings. A retriever pipeline then runs late-interaction MaxSim search and optional reranking with a vision reranker (e.g., MonoQwen-Vision). The same warehouse stores text, image, audio, and video alongside your visual document embeddings — one ingestion API, one retrieval API, one billing line.
Can I combine visual document retrieval with metadata filters and text search?
Yes. Mixpeek retriever pipelines compose stages: filter on structured metadata (source, date, jurisdiction), then run multi-vector visual search, then optionally rerank with a vision reranker or fuse with a text retriever. This is the same pattern as hybrid search for text RAG, extended to visual documents.
