Text-Only RAG Was Never Going to Work for Real Documents
Real-world PDFs are not paragraphs of clean prose. They are charts, tables, equations, multi-column layouts, footnotes, stamped scans, and rendered figures. Every step of a traditional OCR + chunk + embed pipeline silently throws information away.
OCR Is the Wrong Abstraction
OCR was designed to recover plain text from scans. It does not understand charts, tables, mathematical notation, or layout — and a single bad character can cascade, poisoning every downstream chunk.
Chunking Loses the Page
Splitting a document into 512-token chunks discards the visual hierarchy that gives the page its meaning. A figure caption ends up far from its figure, a table header far from its rows.
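A toy sketch makes the failure concrete. Here fixed-size character chunks stand in for 512-token chunks; the caption and its table land in different chunks, so a retriever can match one without ever seeing the other (the document text and chunk size are illustrative):

```python
# Naive fixed-size chunking: the caption and the table it describes
# end up in different chunks, breaking their association.
doc = (
    "Figure 3: Q3 free cash flow vs guidance. "
    + "filler " * 100  # intervening body text
    + "| Quarter | FCF | Guidance |\n| Q3 | $1.2B | $1.0B |"
)
chunk_size = 200  # characters, standing in for a 512-token window
chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

print("caption in chunk 0:", "Figure 3" in chunks[0])   # True
print("table in chunk 0:", "Quarter" in chunks[0])      # False
```

A query about "Q3 free cash flow" can now retrieve the caption chunk while the numbers sit unretrieved in a later chunk.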
Single Vectors Lose Detail
Averaging an entire page (or chunk) into one dense vector erases the fine-grained alignment between a question and the specific region of the page that answers it. Late interaction recovers that alignment.
How Visual Document Retrieval Works
Two hops instead of four. Render the page, embed it with a vision-language model, search with late interaction, return exact pages.
Render Pages as Images
Each PDF page becomes a rendered image. No OCR pipeline, no layout parser, no markdown converter — the raw visual content of the page is the source of truth.
Embed with a Vision-Language Model
ColPali, ColQwen2, or ColSmol encode each page into a grid of patch embeddings. The model has already learned to read text, charts, tables, equations, and layout jointly.
Match with Late Interaction (MaxSim)
At query time, the question is encoded as token embeddings. MaxSim scoring compares each query token against every page patch and sums the best matches — recovering fine-grained relevance text-only RAG misses.
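The MaxSim scoring described above can be sketched in a few lines of NumPy. The shapes are toy values; real ColPali-style models produce on the order of 1024 patch embeddings of 128 dimensions per page:

```python
import numpy as np

def maxsim(query_embs: np.ndarray, page_embs: np.ndarray) -> float:
    """Late-interaction (MaxSim) score.

    For each query token embedding, take the maximum dot product
    over all page patch embeddings, then sum those maxima.
    """
    sim = query_embs @ page_embs.T        # (tokens, patches) similarity matrix
    return float(sim.max(axis=1).sum())   # best patch per token, summed

rng = np.random.default_rng(0)
query = rng.standard_normal((8, 128)).astype(np.float32)    # 8 query tokens
pages = [rng.standard_normal((1024, 128)).astype(np.float32)
         for _ in range(3)]                                  # 3 candidate pages

scores = [maxsim(query, p) for p in pages]
print("best page:", int(np.argmax(scores)))
```

Because each query token is matched independently, a token like "Q3" can lock onto the one table cell that mentions it, instead of being diluted into a page-level average.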
Return Exact Pages, Not Chunks
Results come back as the source PDF, page number, page image, and similarity score. Feed the matched page images straight into a multimodal LLM for grounded answers — no chunk reassembly required.
Mixpeek's render → embed pipeline runs on the same warehouse that handles your video, image, and audio collections. Build a unified retrieval surface across every modality your business produces.
ColPali, ColQwen2, ColSmol — Choose Your Encoder
Different visual document retrievers trade off recall, latency, multilingual coverage, and cost. Mixpeek runs all of them on managed GPU infrastructure.
| Model | Backbone | Embedding Shape | ViDoRe v1 nDCG@5 | License | When to Use |
|---|---|---|---|---|---|
| ColPali | PaliGemma 3B | ~1024 patches × 128 dims | 81.3 | Gemma / MIT (code) | The original late-interaction VLM retriever. Strong baseline, broad community support, easy to fine-tune. |
| ColQwen2 | Qwen2-VL 2B / 7B | ~1024 patches × 128 dims | 87.3 | Apache 2.0 (code) / Qwen license | Current SOTA on ViDoRe. Multilingual, handles dense text and small fonts. Default recommendation for production. |
| ColSmol | SmolVLM 256M / 500M | ~768 patches × 64 dims | 76.8 | Apache 2.0 | Tiny, CPU-friendly footprint. Best for edge, on-prem, or cost-constrained workloads. |
| DSE-Qwen2 | Qwen2-VL 2B | single 1536-d vector | 85.8 | Apache 2.0 | Single-vector alternative. Cheaper to store and search; modest recall trade-off vs. multi-vector ColQwen2. |
Visual Document Retrieval vs OCR + RAG vs Text Embeddings
A side-by-side look at where each approach holds up — and where it falls apart.
| Capability | Visual Document Retrieval | OCR + RAG | Text Embeddings |
|---|---|---|---|
| Charts, tables, and figures | Native — encoded as part of the page | Lost or mangled by OCR + chunker | Invisible |
| Scanned and stamped pages | Works directly on the rendered image | OCR errors compound through chunking | Fails entirely |
| Multilingual & non-Latin scripts | Handled by the VLM tokenizer | Requires language-specific OCR stacks | Embedding model dependent |
| Mathematical equations | Preserved in pixel space | Often garbled by general-purpose OCR | Lost |
| Layout, columns, footnotes | Implicit in the patch grid | Depends on layout parser quality | Flattened to a single stream |
| Pipeline complexity | Render → embed (1 hop) | OCR → layout → chunk → embed (4+ hops) | Parse → chunk → embed (3 hops) |
| Storage cost per page | ~50–500 KB (multi-vector) | ~5–20 KB (text + vectors) | ~5–20 KB |
| Recall on visual queries | State of the art (ViDoRe SOTA) | Weak on charts/figures | Near zero on visual content |
Where Visual Document Retrieval Wins
Anywhere the page itself carries the meaning — financial filings, claims forms, contracts, scientific papers, manuals, lab reports.
Financial Filings & Reports
Search 10-Ks, prospectuses, and earnings decks by what they actually look like. Charts, tables, footnotes, and pixel-rendered figures all become first-class queryable content — no brittle OCR + chunking pipeline.
Insurance Claims & Forms
Index scanned claim forms, ACORD documents, and adjuster reports with their full visual layout. Retrieve by question, by region, or by similar claim — even when handwriting and stamps would defeat OCR.
Legal Contracts & Discovery
Match clauses, signature blocks, exhibits, and stamped pages across millions of contracts. Late-interaction retrieval recovers structure and formatting that text-only RAG silently throws away.
Scientific & Pharma Papers
Question-answer over figures, equations, and chemical structures. Retrieve the exact page region that contains the relevant chart or assay table — not the closest paragraph of body text.
Technical Manuals & SOPs
Index diagrams, exploded views, and parts tables alongside body copy. Field technicians query in natural language and get back the right page of the right manual, instantly.
Clinical & Lab Reports
Search across PDFs of trial protocols, lab results, and patient charts where structure and visual hierarchy carry the meaning. Visual document retrieval preserves both.
Build Visual Document Retrieval in Minutes
Drop in your PDFs, pick a vision-language encoder, and call a single retriever endpoint. No OCR pipeline to maintain.
from mixpeek import Mixpeek
client = Mixpeek(api_key="YOUR_API_KEY")
# 1. Create a namespace for your document corpus
client.namespaces.create(
namespace_name="visual-docs",
description="Visual document retrieval over PDFs with ColPali / ColQwen",
)
# 2. Define a collection that renders each PDF page to an image
# and embeds it with a vision-language model (ColPali / ColQwen2 family).
# No OCR. No layout parsing. No brittle chunking.
client.collections.create(
collection_name="pdf-pages",
feature_extractors=[
{"type": "pdf_page_render", "dpi": 144},
{
"type": "image_embedding",
"model": "colqwen2", # or "colpali", "colsmol"
"multi_vector": True, # late-interaction (MaxSim)
},
],
)
# 3. Upload PDFs and trigger automatic processing
client.buckets.upload(
bucket_name="enterprise-pdfs",
files=["10k_2024.pdf", "claim_form_4471.pdf", "..."],
auto_process=True,
)
# 4. Build a visual document retriever with late-interaction scoring
retriever = client.retrievers.create(
retriever_name="visual_doc_retriever",
inputs=[{"name": "question", "type": "text"}],
settings={
"stages": [
{"type": "feature_search", "method": "multi_vector",
"scoring": "max_sim", "limit": 50},
{"type": "rerank", "model": "monoqwen-vision", "limit": 10},
]
},
)
# 5. Ask a natural-language question — get back exact page images
results = client.retrievers.execute(
retriever_id=retriever.retriever_id,
inputs={"question": "What was Q3 2024 free cash flow vs guidance?"},
)
for doc in results.documents:
print(f"{doc.metadata['source_pdf']} page={doc.metadata['page']} "
          f"score={doc.score:.3f} page_image={doc.metadata['page_image_url']}")
Managed GPU encoders
ColPali, ColQwen2, and ColSmol run on auto-scaling GPU infrastructure. No model serving, no quantization tuning, no Triton ops on your roadmap.
Multilingual by default
ColQwen2 handles Chinese, Japanese, Korean, Arabic, Hindi, and most major European scripts without language-specific OCR stacks.
Hybrid with text + filters
Compose multi-vector visual search with structured filters and text retrievers in a single retriever pipeline. Works the same way as the rest of Mixpeek.
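A hybrid pipeline can be sketched as a plain stage list, in the same shape as the retriever example above. The `filter` stage and its `conditions` fields (`jurisdiction`, `year`) are illustrative placeholders, not a documented schema:

```python
# Stage list for a hybrid retriever: metadata filter, then
# late-interaction visual search, then a vision reranker.
# Pass as client.retrievers.create(settings=...) per the example above.
settings = {
    "stages": [
        # 1. Narrow candidates on structured metadata first (illustrative fields).
        {"type": "filter",
         "conditions": {"jurisdiction": "US", "year": {"gte": 2022}}},
        # 2. Multi-vector (MaxSim) visual search over the survivors.
        {"type": "feature_search", "method": "multi_vector",
         "scoring": "max_sim", "limit": 50},
        # 3. Vision reranker for the final top 10.
        {"type": "rerank", "model": "monoqwen-vision", "limit": 10},
    ]
}
print([s["type"] for s in settings["stages"]])
```

Filtering first keeps the expensive MaxSim stage scoped to documents that can actually satisfy the query's structured constraints.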
Frequently Asked Questions
What is visual document retrieval?
Visual document retrieval is a technique for searching PDFs and other documents by treating each page as an image and embedding it with a vision-language model. Instead of running OCR, parsing layout, and chunking text, you render each page, encode it with a model like ColPali or ColQwen2, and match queries against the resulting page-level embeddings. The model has already learned to read text, tables, charts, and equations jointly — so charts, scanned pages, and complex layouts that defeat traditional OCR + RAG pipelines become first-class searchable content.
What is ColPali?
ColPali is a vision-language model retriever published by Faysse et al. (2024) and accepted at ICLR 2025. It is built on PaliGemma 3B and uses late-interaction scoring (MaxSim, the same idea as ColBERT) over patch-level embeddings produced for each rendered PDF page. ColPali introduced the ViDoRe benchmark and demonstrated that you can skip OCR entirely for document retrieval and still beat the strongest text-based RAG pipelines, especially on visually-rich content. It is the foundational model in the visual document retrieval family.
What is ColQwen2 and how is it different from ColPali?
ColQwen2 swaps PaliGemma for Qwen2-VL as the vision-language backbone, keeping the same late-interaction MaxSim training recipe. It is currently the state of the art on ViDoRe v1/v2 and handles multilingual content, dense text, and small fonts noticeably better than the original ColPali. For most production workloads in 2025–2026, ColQwen2 (2B or 7B) is the default recommendation. ColPali remains a strong, slightly cheaper alternative.
How does late interaction (MaxSim) work?
Late interaction encodes each document into many small embeddings (one per page patch) and each query into many small embeddings (one per token), then scores a (query, document) pair by, for each query token, taking the maximum dot product across all document patches and summing those maxima. This recovers fine-grained alignment between query terms and specific page regions — far more precise than averaging everything into a single vector. The trade-off is roughly 50–500 KB of storage per page instead of a few KB.
Do I need to run OCR to use visual document retrieval?
No. That is the entire point. The vision-language model encoder reads pixels directly — text, equations, tables, charts, signatures, stamps, handwriting — and produces embeddings that already encode all of that information. Skipping OCR removes a major source of pipeline complexity and silent recall failures.
How does visual document retrieval compare to traditional OCR + RAG?
Traditional OCR + RAG pipelines have four hops: OCR → layout parsing → text chunking → embedding. Each hop loses information, especially for charts, tables, multi-column layouts, scanned pages, and non-Latin scripts. Visual document retrieval has two hops: render → embed. On the ViDoRe benchmark, visual document retrievers (ColPali, ColQwen2) consistently beat the strongest text-based RAG baselines on visually-rich corpora, with the largest gaps on charts, tables, and scanned content.
Does visual document retrieval work on scanned PDFs and images?
Yes — and this is one of the strongest reasons to adopt it. Because the encoder operates on pixels, scanned pages, photos of forms, and image-only PDFs work without any OCR step. The same model that handles a born-digital 10-K also handles a 30-year-old scanned contract.
What languages are supported?
ColQwen2 inherits Qwen2-VL's multilingual coverage, which includes Chinese, Japanese, Korean, Arabic, Hindi, Russian, and most major European languages. ColPali (PaliGemma) is strongest on Latin scripts but handles other scripts at reduced quality. For non-English-dominant corpora, ColQwen2 is the default choice.
How much does it cost to store visual document embeddings at scale?
Multi-vector ColPali / ColQwen2 embeddings are roughly 50–500 KB per page depending on patch grid size and quantization. A 10-million page corpus is on the order of 1–5 TB raw, compressible 4–8x with binary or product quantization. For corpora where storage is a hard constraint, single-vector alternatives like DSE-Qwen2 trade a few points of recall for ~30x smaller indexes.
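The arithmetic behind those numbers, assuming a ColPali-style layout of 1024 patches × 128 dims per page in fp16 (2 bytes per value):

```python
# Back-of-envelope storage for multi-vector page embeddings.
patches, dims, bytes_per_value = 1024, 128, 2   # fp16
pages = 10_000_000

per_page = patches * dims * bytes_per_value     # 262,144 B ≈ 256 KB/page
total_tb = per_page * pages / 1e12              # ≈ 2.6 TB raw

print(f"raw: {per_page / 1024:.0f} KB/page, {total_tb:.1f} TB total")
for factor in (4, 8):                           # quantization factors from the text
    print(f"{factor}x compression: {per_page / factor / 1024:.0f} KB/page, "
          f"{total_tb / factor:.2f} TB total")
```

So a 10M-page corpus lands around 2.6 TB raw, or roughly 0.3–0.7 TB after 4–8x quantization — consistent with the ranges above.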
What is the ViDoRe benchmark?
ViDoRe (Visual Document Retrieval Benchmark) is the standard evaluation for visual document retrieval, introduced alongside ColPali. It measures nDCG@5 over a mix of academic, industrial, and synthetic document collections in multiple languages, covering charts, tables, infographics, and dense text. ViDoRe v1, v2, and v3 are the canonical leaderboards for comparing models like ColPali, ColQwen2, ColSmol, and DSE-Qwen2.
How does Mixpeek support visual document retrieval?
Mixpeek ingests PDFs through a bucket, renders each page to an image, and runs a vision-language model extractor (ColPali, ColQwen2, or ColSmol) to produce multi-vector embeddings. A retriever pipeline then runs late-interaction MaxSim search and optional reranking with a vision reranker (e.g., MonoQwen-Vision). The same warehouse stores text, image, audio, and video alongside your visual document embeddings — one ingestion API, one retrieval API, one billing line.
Can I combine visual document retrieval with metadata filters and text search?
Yes. Mixpeek retriever pipelines compose stages: filter on structured metadata (source, date, jurisdiction), then run multi-vector visual search, then optionally rerank with a vision reranker or fuse with a text retriever. This is the same pattern as hybrid search for text RAG, extended to visual documents.
