
    Visual Document Retrieval, Production-Ready

    Search PDFs by what they look like — charts, tables, equations, scans, and complex layouts — using ColPali, ColQwen2, and late-interaction MaxSim scoring. No OCR. No layout parser. No brittle chunking pipeline.

    Text-Only RAG Was Never Going to Work for Real Documents

    Real-world PDFs are not paragraphs of clean prose. They are charts, tables, equations, multi-column layouts, footnotes, stamped scans, and rendered figures. Every step of a traditional OCR + chunk + embed pipeline silently throws information away.

    OCR Is the Wrong Abstraction

    OCR was designed to recover plain text from scans. It does not understand charts, tables, mathematical notation, or layout — and a single misrecognized character cascades into every downstream chunk.

    Chunking Loses the Page

    Splitting a document into 512-token chunks discards the visual hierarchy that gives the page its meaning. A figure caption ends up far from its figure, a table header far from its rows.

    Single Vectors Lose Detail

    Averaging an entire page (or chunk) into one dense vector erases the fine-grained alignment between a question and the specific region of the page that answers it. Late interaction preserves that alignment.

    How Visual Document Retrieval Works

    Two hops instead of four. Render the page, embed it with a vision-language model, search with late interaction, return exact pages.

    Render Pages as Images

    Each PDF page becomes a rendered image. No OCR pipeline, no layout parser, no markdown converter — the raw visual content of the page is the source of truth.

    Embed with a Vision-Language Model

    ColPali, ColQwen2, or ColSmol encode each page into a grid of patch embeddings. The model has already learned to read text, charts, tables, equations, and layout jointly.

    Match with Late Interaction (MaxSim)

    At query time, the question is encoded as token embeddings. MaxSim scoring compares each query token against every page patch and sums the best matches — recovering fine-grained relevance text-only RAG misses.
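The scoring itself is a few lines of NumPy. This is a toy sketch: random unit vectors stand in for real embeddings, and the shapes only loosely mirror ColPali's (≈1,024 patches of 128 dims per page):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, page_patches: np.ndarray) -> float:
    """Late-interaction (MaxSim) score: for each query token, take its
    best-matching page patch, then sum those maxima over all tokens."""
    sims = query_tokens @ page_patches.T   # (n_tokens, n_patches) similarity matrix
    return float(sims.max(axis=1).sum())   # best patch per token, summed

def unit(x):  # L2-normalize rows so dot products are cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query  = unit(rng.normal(size=(4, 128)))                         # 4 query tokens
page_a = unit(rng.normal(size=(1024, 128)))                      # unrelated page
page_b = np.vstack([query, unit(rng.normal(size=(1020, 128)))])  # contains exact matches

print(maxsim_score(query, page_a) < maxsim_score(query, page_b))  # True
```

The page containing exact matches for the query tokens scores 4.0 (each token's best patch similarity is 1.0), while the unrelated page scores far lower.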

    Return Exact Pages, Not Chunks

    Results come back as the source PDF, page number, page image, and similarity score. Feed the matched page images straight into a multimodal LLM for grounded answers — no chunk reassembly required.

    One pipeline, every modality

    Mixpeek's render → embed pipeline runs on the same warehouse that handles your video, image, and audio collections. Build a unified retrieval surface across every modality your business produces.

    ColPali, ColQwen2, ColSmol — Choose Your Encoder

    Different visual document retrievers trade off recall, latency, multilingual coverage, and cost. Mixpeek runs all of them on managed GPU infrastructure.

    Model · Backbone · Embedding shape · ViDoRe v1 nDCG@5 · License · When to use
    ColPali · PaliGemma 3B · 1024 patches × 128-d · 81.3 · Gemma / MIT (code) · The original late-interaction VLM retriever. Strong baseline, broad community support, easy to fine-tune.
    ColQwen2 · Qwen2-VL 2B / 7B · 1024 patches × 128-d · 87.3 · Apache 2.0 (code) / Qwen license · Current SOTA on ViDoRe. Multilingual, handles dense text and small fonts. Default recommendation for production.
    ColSmol · SmolVLM 256M / 500M · 768 patches × 64-d · 76.8 · Apache 2.0 · Tiny, CPU-friendly footprint. Best for edge, on-prem, or cost-constrained workloads.
    DSE-Qwen2 · Qwen2-VL 2B · single 1536-d vector · 85.8 · Apache 2.0 · Single-vector alternative. Cheaper to store and search; modest recall trade-off vs. multi-vector ColQwen2.

    Visual Document Retrieval vs OCR + RAG vs Text Embeddings

    A side-by-side look at where each approach holds up — and where it falls apart.

    Capability · Visual Document Retrieval · OCR + RAG · Text Embeddings
    Charts, tables, and figures · Native — encoded as part of the page · Lost or mangled by OCR + chunker · Invisible
    Scanned and stamped pages · Works directly on the rendered image · OCR errors compound through chunking · Fails entirely
    Multilingual & non-Latin scripts · Handled natively by the VLM · Requires language-specific OCR stacks · Embedding-model dependent
    Mathematical equations · Preserved in pixel space · Garbled by general OCR; needs a dedicated math-OCR step · Lost
    Layout, columns, footnotes · Implicit in the patch grid · Depends on layout parser quality · Flattened to a single stream
    Pipeline complexity · Render → embed (1 hop) · OCR → layout → chunk → embed (4+ hops) · Parse → chunk → embed (3 hops)
    Storage cost per page · ~50–500 KB (multi-vector) · ~5–20 KB (text + vectors) · ~5–20 KB
    Recall on visual queries · State of the art (ViDoRe SOTA) · Weak on charts/figures · Near zero on visual content

    Where Visual Document Retrieval Wins

    Anywhere the page itself carries the meaning — financial filings, claims forms, contracts, scientific papers, manuals, lab reports.

    Build Visual Document Retrieval in Minutes

    Drop in your PDFs, pick a vision-language encoder, and call a single retriever endpoint. No OCR pipeline to maintain.

    visual_document_retrieval.py
    from mixpeek import Mixpeek
    
    client = Mixpeek(api_key="YOUR_API_KEY")
    
    # 1. Create a namespace for your document corpus
    client.namespaces.create(
        namespace_name="visual-docs",
        description="Visual document retrieval over PDFs with ColPali / ColQwen",
    )
    
    # 2. Define a collection that renders each PDF page to an image
    #    and embeds it with a vision-language model (ColPali / ColQwen2 family).
    #    No OCR. No layout parsing. No brittle chunking.
    client.collections.create(
        collection_name="pdf-pages",
        feature_extractors=[
            {"type": "pdf_page_render", "dpi": 144},
            {
                "type": "image_embedding",
                "model": "colqwen2",          # or "colpali", "colsmol"
                "multi_vector": True,          # late-interaction (MaxSim)
            },
        ],
    )
    
    # 3. Upload PDFs and trigger automatic processing
    client.buckets.upload(
        bucket_name="enterprise-pdfs",
        files=["10k_2024.pdf", "claim_form_4471.pdf", "..."],
        auto_process=True,
    )
    
    # 4. Build a visual document retriever with late-interaction scoring
    retriever = client.retrievers.create(
        retriever_name="visual_doc_retriever",
        inputs=[{"name": "question", "type": "text"}],
        settings={
            "stages": [
                {"type": "feature_search", "method": "multi_vector",
                 "scoring": "max_sim", "limit": 50},
                {"type": "rerank", "model": "monoqwen-vision", "limit": 10},
            ]
        },
    )
    
    # 5. Ask a natural-language question — get back exact page images
    results = client.retrievers.execute(
        retriever_id=retriever.retriever_id,
        inputs={"question": "What was Q3 2024 free cash flow vs guidance?"},
    )
    
    for doc in results.documents:
        print(f"{doc.metadata['source_pdf']}  page={doc.metadata['page']}  "
              f"score={doc.score:.3f}  page_image={doc.metadata['page_image_url']}")

    Managed GPU encoders

    ColPali, ColQwen2, and ColSmol run on auto-scaling GPU infrastructure. No model serving, no quantization tuning, no Triton ops on your roadmap.

    Multilingual by default

    ColQwen2 handles Chinese, Japanese, Korean, Arabic, Hindi, and most major European scripts without language-specific OCR stacks.

    Hybrid with text + filters

    Compose multi-vector visual search with structured filters and text retrievers in a single retriever pipeline. Works the same way as the rest of Mixpeek.

    Frequently Asked Questions

    What is visual document retrieval?

    Visual document retrieval is a technique for searching PDFs and other documents by treating each page as an image and embedding it with a vision-language model. Instead of running OCR, parsing layout, and chunking text, you render each page, encode it with a model like ColPali or ColQwen2, and match queries against the resulting page-level embeddings. The model has already learned to read text, tables, charts, and equations jointly — so charts, scanned pages, and complex layouts that defeat traditional OCR + RAG pipelines become first-class searchable content.

    What is ColPali?

    ColPali is a vision-language model retriever published by Faysse et al. (2024) and accepted at ICLR 2025. It is built on PaliGemma 3B and uses late-interaction scoring (MaxSim, the same idea as ColBERT) over patch-level embeddings produced for each rendered PDF page. ColPali introduced the ViDoRe benchmark and demonstrated that you can skip OCR entirely for document retrieval and still beat the strongest text-based RAG pipelines, especially on visually rich content. It is the foundational model in the visual document retrieval family.

    What is ColQwen2 and how is it different from ColPali?

    ColQwen2 swaps PaliGemma for Qwen2-VL as the vision-language backbone, keeping the same late-interaction MaxSim training recipe. It is currently the state of the art on ViDoRe v1/v2 and handles multilingual content, dense text, and small fonts noticeably better than the original ColPali. For most production workloads in 2025–2026, ColQwen2 (2B or 7B) is the default recommendation. ColPali remains a strong, slightly cheaper alternative.

    How does late interaction (MaxSim) work?

    Late interaction encodes each document into many small embeddings (one per page patch) and each query into many small embeddings (one per token), then scores a (query, document) pair by, for each query token, taking the maximum dot product across all document patches and summing those maxima. This recovers fine-grained alignment between query terms and specific page regions — far more precise than averaging everything into a single vector. The trade-off is roughly 50–500 KB of storage per page instead of a few KB.

    Do I need to run OCR to use visual document retrieval?

    No. That is the entire point. The vision-language model encoder reads pixels directly — text, equations, tables, charts, signatures, stamps, handwriting — and produces embeddings that already encode all of that information. Skipping OCR removes a major source of pipeline complexity and silent recall failures.

    How does visual document retrieval compare to traditional OCR + RAG?

    Traditional OCR + RAG pipelines have four hops: OCR → layout parsing → text chunking → embedding. Each hop loses information, especially for charts, tables, multi-column layouts, scanned pages, and non-Latin scripts. Visual document retrieval has two hops: render → embed. On the ViDoRe benchmark, visual document retrievers (ColPali, ColQwen2) consistently beat the strongest text-based RAG baselines on visually rich corpora, with the largest gaps on charts, tables, and scanned content.

    Does visual document retrieval work on scanned PDFs and images?

    Yes — and this is one of the strongest reasons to adopt it. Because the encoder operates on pixels, scanned pages, photos of forms, and image-only PDFs work without any OCR step. The same model that handles a born-digital 10-K also handles a 30-year-old scanned contract.

    What languages are supported?

    ColQwen2 inherits Qwen2-VL's multilingual coverage, which includes Chinese, Japanese, Korean, Arabic, Hindi, Russian, and most major European languages. ColPali (PaliGemma) is strongest on Latin scripts but handles other scripts at reduced quality. For non-English-dominant corpora, ColQwen2 is the default choice.

    How much does it cost to store visual document embeddings at scale?

    Multi-vector ColPali / ColQwen2 embeddings are roughly 50–500 KB per page depending on patch grid size and quantization. A 10-million page corpus is on the order of 1–5 TB raw, compressible 4–8x with binary or product quantization. For corpora where storage is a hard constraint, single-vector alternatives like DSE-Qwen2 trade a few points of recall for ~30x smaller indexes.
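The arithmetic is easy to sanity-check. Assuming a ColPali-like layout of ~1,024 patches × 128 dims per page stored as float16 (a representative point inside the range above, not an exact figure for any one model):

```python
# Back-of-envelope storage math for multi-vector page embeddings.
patches, dims, bytes_per_value = 1024, 128, 2       # ~ColPali shape, float16
page_kb = patches * dims * bytes_per_value / 1024   # 256 KB per page
corpus_pages = 10_000_000
raw_tb = corpus_pages * page_kb / 1024**3           # KB -> TB
quantized_tb = raw_tb / 8                           # taking the high end of 4-8x

print(f"{page_kb:.0f} KB/page, {raw_tb:.1f} TB raw, {quantized_tb:.2f} TB quantized")
```

That lands at roughly 256 KB per page and ~2.4 TB raw for 10 million pages, consistent with the ranges above.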

    What is the ViDoRe benchmark?

    ViDoRe (Visual Document Retrieval Benchmark) is the standard evaluation for visual document retrieval, introduced alongside ColPali. It measures nDCG@5 over a mix of academic, industrial, and synthetic document collections in multiple languages, covering charts, tables, infographics, and dense text. ViDoRe v1, v2, and v3 are the canonical leaderboards for comparing models like ColPali, ColQwen2, ColSmol, and DSE-Qwen2.

    How does Mixpeek support visual document retrieval?

    Mixpeek ingests PDFs through a bucket, renders each page to an image, and runs a vision-language model extractor (ColPali, ColQwen2, or ColSmol) to produce multi-vector embeddings. A retriever pipeline then runs late-interaction MaxSim search and optional reranking with a vision reranker (e.g., MonoQwen-Vision). The same warehouse stores text, image, audio, and video alongside your visual document embeddings — one ingestion API, one retrieval API, one billing line.

    Can I combine visual document retrieval with metadata filters and text search?

    Yes. Mixpeek retriever pipelines compose stages: filter on structured metadata (source, date, jurisdiction), then run multi-vector visual search, then optionally rerank with a vision reranker or fuse with a text retriever. This is the same pattern as hybrid search for text RAG, extended to visual documents.
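An illustrative sketch of what the composed stage list might look like. The search and rerank stages mirror the retriever example above; the filter-stage fields ("filter", "field", "operator", "value") are assumptions for illustration, not confirmed Mixpeek API:

```python
# Hypothetical hybrid retriever config: metadata filter, then multi-vector
# visual search, then a vision reranker. Filter-stage schema is assumed.
hybrid_stages = {
    "stages": [
        # 1. Narrow by structured metadata first (cheap)
        {"type": "filter", "field": "jurisdiction", "operator": "eq", "value": "US"},
        # 2. Multi-vector visual search with late-interaction scoring
        {"type": "feature_search", "method": "multi_vector",
         "scoring": "max_sim", "limit": 50},
        # 3. Rerank the top candidates with a vision reranker
        {"type": "rerank", "model": "monoqwen-vision", "limit": 10},
    ]
}
```

Ordering the cheap structured filter before the multi-vector stage keeps the expensive MaxSim scoring confined to documents that can actually match.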

    Stop Babysitting OCR Pipelines

    Render your PDFs, embed them with a vision-language model, and search by what your documents actually look like. Visual document retrieval is the production-ready way to ship PDF RAG that works on real-world content.