NEWManaged multimodal retrieval.Explore platform →
    Retrieval
    22 min read
    Updated 2026-05-21

    Late Interaction Retrieval: How ColBERT, ColPali, and ColQwen Search Without Losing Token-Level Detail

    A technical guide to late interaction retrieval models — the architecture that keeps per-token embeddings instead of compressing everything into a single vector. Covers how MaxSim scoring works, why it outperforms dense retrieval on complex queries, the evolution from ColBERT (text) to ColPali (visual documents) to ColQwen (multimodal), and how to build retrieval pipelines that combine late interaction with dense search.

    ColBERT
    ColPali
    ColQwen
    Late Interaction
    Retrieval
    Visual Search

    The Compression Problem in Dense Retrieval



    Dense retrieval works by compressing an entire document into a single vector — 768 or 1024 floating-point numbers that represent everything the document means. When you search, you compress your query the same way and find the nearest vectors.

    This works remarkably well for simple queries against short documents. "Find me articles about climate change" against a corpus of news paragraphs is a problem that single-vector retrieval handles cleanly.

    But compression is lossy. When you force a 4,000-token document into 768 dimensions, information gets destroyed. The model must decide what to keep and what to discard, and that decision happens at encoding time — before it knows what queries will be asked. A document about the 2024 Olympics contains information about swimming records, opening ceremony details, medal counts, venue logistics, and athlete biographies. A single vector cannot faithfully represent all of these facets simultaneously.

    The failure mode is predictable: dense retrieval misses documents that match on specific details rather than overall topic. A query like "What was the 100m freestyle time that broke the Olympic record?" requires matching on a precise factual detail buried in a long document. The single vector for that document emphasizes "Olympics" and "swimming" broadly, not the specific time value.

    Late interaction models solve this by refusing to compress. They keep every token's embedding and defer the comparison to query time.

    How Late Interaction Works



    The architecture has three phases: encoding, storage, and scoring.

    Encoding



    Both queries and documents pass through a transformer encoder (typically BERT-based), but instead of pooling the output into one vector, the model keeps all token embeddings. A 200-token document produces 200 vectors, each 128 dimensions (ColBERT's default).

    \`\`\` Dense encoding: "The 100m freestyle record was 46.40s" → [0.12, -0.34, 0.91, ...] (1 × 768)

    Late interaction encoding: "The 100m freestyle record was 46.40s" → [ [0.08, -0.21, ...], # "The" [0.45, 0.12, ...], # "100m" [0.67, -0.03, ...], # "freestyle" [0.33, 0.89, ...], # "record" [0.11, -0.44, ...], # "was" [0.92, 0.15, ...], # "46.40s" ] (6 × 128) \`\`\`

    The per-token dimension is smaller (128 vs 768) because each token carries less burden — it only needs to represent its own meaning in context, not the entire document's meaning.

    Storage



    Every token embedding is stored in the index. A corpus of 1 million documents with an average of 200 tokens each produces 200 million vectors. This is the primary cost of late interaction: storage scales with total tokens, not total documents.

    In practice, the vectors are quantized (2-bit or 4-bit) and indexed using optimized structures. ColBERTv2 introduced residual compression that reduces storage by ~6x while preserving >99% of retrieval quality.

    MaxSim Scoring



    The scoring function is what makes late interaction work. Given a query with Q tokens and a document with D tokens, the relevance score is computed as:

    \`\`\` score = sum over each query token q_i of: max over all document tokens d_j of: cosine_similarity(q_i, d_j) \`\`\`

    In plain language: for each query token, find the document token it matches best, then sum those best-match scores.

    This is the key insight. The query token "46.40s" will have a high cosine similarity with the document token "46.40s" and low similarity with everything else. In dense retrieval, that specific match would be diluted across 768 dimensions shared with every other concept in the document. In late interaction, the match is preserved at full strength.

    MaxSim handles partial matches gracefully. If a query has 8 tokens and a document matches 6 of them strongly, the score is high. If a different document matches all 8 but weakly, the score may be lower. This naturally captures the difference between broad topical relevance and specific factual matching.

    ColBERT: Where It Started



    ColBERT (Contextualized Late Interaction over BERT) was introduced in 2020 by Omar Khattab and Matei Zaharia. The original insight was that BERT's attention mechanism already produces rich per-token representations — the pooling step that follows is what destroys information.

    ColBERTv1



    The first version used BERT-base as the backbone, producing 128-dimensional token embeddings. It outperformed dense retrieval (DPR) on MS MARCO and Natural Questions while maintaining sub-second query latency through a two-stage approach:

    1. Candidate generation: use an approximate nearest-neighbor index over all document token vectors to find candidate documents (any document containing a token similar to any query token) 2. Exact scoring: compute full MaxSim only for the top candidates

    This two-stage approach meant ColBERT didn't need to score every document exhaustively — it used the token-level index as a first-pass filter.

    ColBERTv2



    The 2022 update addressed the storage problem. Three innovations:

    1. Residual compression: instead of storing each token vector directly, store the difference (residual) from the nearest centroid in a learned codebook. The centroid ID (2 bytes) plus the quantized residual (typically 2-4 bits × 128 dims) uses far less space than the full float vector.

    2. Denoised supervision: distill from a cross-encoder teacher to improve the quality of token-level representations. Cross-encoders are too slow for retrieval (they score query-document pairs jointly), but their training signal improves ColBERT's independent encodings.

    3. Multi-vector indexing: optimized PLAID (Performance-optimized Late Interaction Driver) engine for fast candidate generation and scoring.

    ColBERTv2 achieved SOTA on MS MARCO, BEIR, and LoTTE benchmarks while using 6-10x less storage than v1.

    RAGatouille and Practical ColBERT



    The RAGatouille library made ColBERT accessible for RAG applications. It wraps the indexing, retrieval, and reranking steps into a simple API:

    \`\`\`python from ragatouille import RAGPretrainedModel

    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0") RAG.index(collection=documents, index_name="my_index") results = RAG.search(query="46.40s freestyle record", k=10) \`\`\`

    The practical impact: late interaction retrieval went from a research concept to a production-ready tool. RAG pipelines using ColBERT as the retriever consistently outperform dense-only retrieval on factoid questions, entity-rich queries, and multi-hop reasoning.

    ColPali: Late Interaction Meets Visual Documents



    ColPali (2024) extended the late interaction concept to visual document retrieval. The key question it answered: can you search documents by looking at them as images, without OCR, using late interaction?

    The Architecture



    ColPali uses PaliGemma (Google's vision-language model) as the backbone instead of BERT. The input is a document page rendered as an image (not extracted text). The vision transformer processes the image into a grid of patch embeddings — each patch corresponds to a spatial region of the document page.

    \`\`\` Text-based ColBERT: Document text → BERT → per-token embeddings → MaxSim

    Visual ColPali: Document image → PaliGemma ViT → per-patch embeddings → MaxSim \`\`\`

    The per-patch embeddings capture both the visual appearance and the semantic content of each region. A patch containing a table header "Revenue Q3" has a different embedding from a patch containing a chart or a photograph, even though all are just pixels.

    Why Patches Work Like Tokens



    The analogy between text tokens and image patches is deeper than it appears. In a text document, a token represents a word or subword in context — its meaning depends on surrounding tokens via attention. In a vision transformer, a patch represents a spatial region in context — its meaning depends on surrounding patches via the same attention mechanism.

    MaxSim over patches works the same way as MaxSim over tokens: for each query token ("revenue chart"), find the image patch that matches best. If the document contains a revenue chart in the lower-right quadrant, those patches will have high similarity with the query tokens. The spatial location doesn't matter — MaxSim finds the match wherever it occurs on the page.

    Results



    ColPali outperformed all existing visual document retrieval methods and most text-based methods on benchmarks like DocVQA and ViDoRe (Visual Document Retrieval). The results were surprising because:

    1. No OCR is needed — the model sees the page as an image 2. Layout information is preserved — tables, charts, and formatting are visible 3. Multi-language support comes free — the vision encoder doesn't care what script the text is in 4. Scanned and born-digital documents are handled identically

    The limitation is the same as all late interaction models: storage. Each page produces ~1000+ patch embeddings (depending on resolution), each 128 dimensions. A corpus of 1 million pages needs storage for ~1 billion vectors.

    ColQwen: Multimodal Late Interaction



    ColQwen (2024-2025) replaced PaliGemma with Qwen2-VL as the backbone, gaining several capabilities:

    1. Higher resolution: Qwen2-VL processes images at higher resolution than PaliGemma, producing more patch embeddings with finer detail 2. Dynamic resolution: the model adapts its patch grid to the document aspect ratio instead of forcing square crops 3. Stronger vision encoder: Qwen2-VL's vision tower outperforms PaliGemma on document understanding benchmarks

    ColQwen 2.5 (the current version, \`vidore/colqwen2.5-v0.2\` on HuggingFace) further improved by training on a larger and more diverse document corpus.

    ColQwen-Omni: Text + Visual Late Interaction



    The latest evolution is ColQwen-Omni (\`vidore/colqwen-omni-v0.1\`), which handles both text and visual inputs in a single model. It can:

  1. Encode text documents as token embeddings (like ColBERT)
  2. Encode document images as patch embeddings (like ColPali)
  3. Score queries against either representation using MaxSim


  4. This matters for mixed corpora where some documents are born-digital (text available) and others are scanned (image only). A single retrieval index can contain both text-based and image-based embeddings, and queries work against both.

    When Late Interaction Beats Dense Retrieval



    Late interaction is not universally better. It excels in specific scenarios:

    1. Entity-Rich Queries



    "Find the contract clause mentioning Acme Corp with a liability cap of $5M"

    Dense retrieval compresses "Acme Corp" and "$5M" into the same vector as everything else in the query. Late interaction preserves both as distinct token embeddings, finding documents where both entities appear with high per-token similarity.

    2. Factoid / Exact-Match Queries



    "What is the melting point of tungsten?"

    The answer "3,422°C" is a specific token that dense retrieval may not preserve in its single-vector compression. Late interaction's per-token matching catches this.

    3. Long Documents



    Documents over ~500 tokens lose information under dense compression. Late interaction preserves all tokens regardless of document length — the 5,000th token is represented as faithfully as the 1st.

    4. Visual Documents with Mixed Content



    A page containing text, a table, a chart, and a photograph is exactly the scenario where single-vector compression fails most. ColPali/ColQwen preserve each content region as separate patch embeddings.

    When Dense Retrieval Is Better



  5. Topical similarity queries: "Find articles about machine learning" — topic-level matching doesn't need token precision
  6. Storage-constrained environments: late interaction uses 50-200x more storage than dense
  7. Ultra-low-latency requirements: MaxSim scoring is slower than cosine similarity on a single vector
  8. Small corpora: with fewer than 100K documents, the quality difference is often negligible


  9. The Storage-Quality Tradeoff



    The practical engineering decision is always about storage. Here's the math:

    \`\`\` Dense retrieval: 1M documents × 1 vector × 768 dims × 4 bytes = 3 GB

    Late interaction: 1M documents × 200 tokens × 128 dims × 4 bytes = 102 GB (with 2-bit quantization: ~6 GB) \`\`\`

    ColBERTv2's residual compression brings this down significantly. With PLAID indexing and 2-bit quantization, storage is roughly 2-4x that of dense retrieval — a manageable premium for the quality improvement.

    For visual late interaction (ColPali/ColQwen), the numbers are higher because images produce more patches than text produces tokens:

    \`\`\` ColPali / ColQwen: 1M pages × ~1024 patches × 128 dims × 4 bytes = 524 GB (with quantization: ~30-60 GB) \`\`\`

    This is why hybrid approaches are common: use dense retrieval for first-pass candidate generation (cheap, fast, broad), then rescore the top candidates with late interaction (expensive, precise).

    Building a Late Interaction Pipeline



    A production pipeline combines late interaction with other retrieval stages. The pattern:

    Stage 1: Dense Candidate Generation



    Use a standard dense embedding model (CLIP, SigLIP, bge-m3) to find the top 100-500 candidates. This is fast because it's a single vector comparison.

    Stage 2: Late Interaction Reranking



    Rescore the candidates using ColBERT/ColPali/ColQwen MaxSim. This narrows to the top 10-50 with much higher precision.

    Stage 3: Cross-Encoder (Optional)



    For maximum precision, a cross-encoder (Ettin, Qwen3-Reranker) scores the final candidates jointly. This is the slowest but most accurate scoring method.

    \`\`\` Retrieval funnel: 1M documents → Dense search (top 200) → Late interaction rescore (top 20) → Cross-encoder rerank (top 5) \`\`\`

    Each stage is more expensive and more accurate than the previous one. The dense stage handles the majority of the elimination work; late interaction handles the precision work; the cross-encoder handles the final discrimination.

    Mixpeek Implementation



    Mixpeek supports late interaction models in both ingestion and retrieval pipelines. The key is that late interaction embeddings are stored as multi-vector features — each document produces multiple vectors rather than one.

    Ingest: Index Documents with Late Interaction



    \`\`\`python from mixpeek import Mixpeek

    mx = Mixpeek(api_key="YOUR_KEY")

    # Index visual documents with ColQwen for late interaction search mx.ingest( collection_id="contracts", source="s3://legal-docs/", extractors=[ { "type": "visual_embedding", "model": "vidore/colqwen2.5-v0.2", "output_feature": "colqwen_patches" }, { "type": "text_embedding", "model": "BAAI/bge-m3", "output_feature": "dense_embedding" } ] ) \`\`\`

    Retrieve: Multi-Stage with Late Interaction



    \`\`\`python results = await mx.retrievers.retrieve( queries=[{ "type": "text", "value": "indemnification clause capping liability at $2M" }], collection_ids=["contracts"], stages=[ { "type": "feature_search", "feature": "dense_embedding", "top_k": 200 }, { "type": "feature_search", "feature": "colqwen_patches", "scoring": "max_sim", "top_k": 20 }, { "type": "rerank", "model": "cross-encoder/ettin-reranker-1b-v1", "top_k": 5 } ] ) \`\`\`

    This pipeline first uses dense bge-m3 embeddings to find 200 candidate documents, then rescores with ColQwen's late interaction patches to narrow to the 20 most precisely matching pages, then applies a cross-encoder reranker for the final top 5.

    Evaluation: Measuring Late Interaction Quality



    Late interaction models are evaluated on retrieval benchmarks. The key metrics:

    Standard Benchmarks



  10. MS MARCO: 8.8M passages, ~500K queries. ColBERTv2 achieves 44.6 MRR@10, vs. ~39 for dense retrieval.
  11. BEIR: 18 diverse retrieval datasets. Late interaction models show the largest advantage on entity-heavy datasets (e.g., TREC-COVID, FiQA).
  12. ViDoRe: visual document retrieval benchmark. ColQwen 2.5 is SOTA at 87.3 nDCG@5.
  13. LoTTE: long-tail topic retrieval. ColBERT outperforms dense models by 5-15% nDCG@10 on rare topics.


  14. What the Numbers Mean



    The consistent pattern across benchmarks: late interaction's advantage grows as queries become more specific and documents become longer. On broad topical queries, the gap narrows to 1-2%. On entity-rich factoid queries, the gap widens to 10-20%.

    This directly informs pipeline design: if your queries are mostly topical ("find documents about renewable energy"), dense retrieval may be sufficient. If your queries are factual or entity-specific ("find the clause where Party A agrees to pay $X by date Y"), late interaction is worth the storage cost.

    The Future: Omni-Modal Late Interaction



    The trajectory is clear: late interaction is expanding from text to images to video to audio. The underlying principle — keep per-token detail instead of compressing — applies to any modality:

  15. Text: per-token embeddings (ColBERT)
  16. Images/Documents: per-patch embeddings (ColPali, ColQwen)
  17. Video: per-frame or per-spatiotemporal-patch embeddings
  18. Audio: per-segment embeddings


  19. Models like ColQwen-Omni already handle text and images. The next step is video and audio late interaction, where per-frame visual patches and per-segment audio features are stored and searched with MaxSim.

    For agents that need to search unstructured content across modalities, late interaction offers a fundamentally different quality profile than dense embeddings. The tradeoff — more storage for more precision — is increasingly favorable as storage costs fall and query specificity rises.

    Related Guides



  20. Visual Document Retrieval -- searching documents as images without OCR
  21. Cross-Encoder Reranking -- the final stage after late interaction retrieval
  22. Multi-Stage Retrieval -- combining multiple retrieval strategies
  23. Omnimodal Embeddings -- single-vector multimodal search for comparison
  24. Contrastive Learning -- the training paradigm behind dense embeddings
  25. Models -- browse ColPali, ColQwen, and other late interaction models
  26. Already have embeddings?

    Skip extraction — bring your own vectors to MVS. Dense + sparse + BM25 hybrid search. First 1M vectors free.

    Build a Multimodal Search Pipeline

    Give agents searchable access to video, image, audio, and document evidence with Mixpeek.

    Start BuildingRead Docs