
    Understanding Late Interaction Models in Multimodal Retrieval

    Late interaction models enable precise retrieval from multimodal data like PDFs and images by comparing query tokens with token or patch embeddings—ideal for RAG, search, and document understanding.


    Late interaction models are redefining how we perform retrieval over multimodal data—not just text, but also images, PDFs, scanned documents, and hybrid content. These models retain per-token or per-patch granularity, enabling rich semantic comparisons across modalities.

Jo Kristian Bergum on X: "ColPali got much cheaper by pruning 'redundant' patch vectors from the PDF page representation. The intuition is that not all grid cell vectors are equally important."

    Why Late Interaction for Multimodal Data?

    Dense vector models collapse meaning into a single representation. That works for simple ranking, but falls short when you want to:

    🔍 Search a scanned PDF for visual elements

    Query:

    “Show me pages with line charts illustrating monthly revenue changes.”

    Why dense fails:
    Dense models collapse visual structure into a single vector, losing spatial context or visual semantics like "line chart."

    Why late interaction works:
Multimodal models like ColPali or ColQwen retain patch-level embeddings and can directly compare query tokens to image regions.


    🔍 Match a table and a caption from a financial report

    Query:

    “Find all tables labeled as ‘Quarterly Profit Margins’ with accompanying textual explanations.”

    Why dense fails:
Dense models can't capture the structured layout or semantically connect the visual table to its nearby caption.

    Why late interaction works:
    Token-level and patch-level embeddings let models localize the table and contextually associate nearby text via MaxSim.


    🔍 Pull relevant sections from a scientific paper with embedded figures

    Query:

    “Retrieve sections discussing ‘Figure 2’ showing the model’s performance curve.”

    Why dense fails:
    References to figures and visual content aren’t preserved in a pooled embedding. The link between the text and the figure is lost.

    Why late interaction works:
    Late interaction models retain localized representations, enabling the system to align the text around “Figure 2” with the image patch it refers to.


    Late interaction models enable multi-vector retrieval by retaining subcomponent embeddings—whether they're text tokens or image patches—and using the MaxSim operator to match query tokens to document segments.
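
Concretely, the MaxSim score sums, for each query token vector, its best dot-product match across all of the document's token or patch vectors (the standard ColBERT-style scoring rule):

$$\text{score}(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} \mathbf{E}_q^{(i)} \cdot \mathbf{E}_d^{(j)}$$

where $\mathbf{E}_q^{(i)}$ is the $i$-th query token embedding and $\mathbf{E}_d^{(j)}$ is the $j$-th document token or patch embedding.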

    An Overview of Late Interaction Retrieval Models: ColBERT, ColPali, and  ColQwen | Weaviate
    https://weaviate.io/blog/late-interaction-overview

    Types of Interaction in Retrieval Models

    1. No-Interaction (Bi-Encoder)

    • Compresses documents into one dense vector
    • Fast, cheap, but discards compositional nuance
    • Not suited for documents with structured layout or embedded media

    Example: OpenAI's text-embedding-3-small
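
For reference, here is a minimal bi-encoder sketch using the OpenAI Python SDK and the model named above; the document strings and helper are illustrative, and the call assumes an OPENAI_API_KEY in the environment:

```python
# Bi-encoder sketch: one pooled vector per text, ranked by cosine similarity.
# Assumes `pip install openai numpy` and OPENAI_API_KEY set in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

docs = [
    "Quarterly profit margins rose 4% year over year.",
    "Figure 2 shows the model's performance curve.",
]
query_vec = embed(["monthly revenue changes"])[0]
doc_vecs = embed(docs)

# The whole document collapses to a single vector before any comparison happens.
scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
print(scores)
```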

    2. Full-Interaction (Cross-Encoder)

    • Jointly encodes query + document with full attention
    • High accuracy, especially in structured text
    • Infeasible for image-rich, multi-document corpora

    Example: MiniLM, BERT cross-encoders
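
A minimal cross-encoder sketch with the sentence-transformers library; the ms-marco MiniLM checkpoint is one public example, not something prescribed by this post:

```python
# Full-interaction sketch: the query and each document are scored jointly.
# Assumes `pip install sentence-transformers`.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Find all tables labeled as 'Quarterly Profit Margins'"
docs = [
    "Table 3: Quarterly Profit Margins, with notes on seasonal effects.",
    "Appendix B lists office locations and headcount by region.",
]

# Every (query, doc) pair runs through the full transformer: accurate, but O(pairs) at query time.
scores = model.predict([(query, doc) for doc in docs])
print(scores)
```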

    3. Late-Interaction (Multi-vector Models)

    • Encodes each token or patch independently
    • Stores multi-vector representations (128d, compressed)
    • Matches query vectors to doc vectors via MaxSim
    • Supports multimodal search across hybrid documents
    https://qdrant.tech/articles/late-interaction-models/
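
And a minimal late-interaction sketch: random vectors stand in for real token or patch embeddings, but the MaxSim scoring is exactly the operation described above:

```python
# Late-interaction sketch: multi-vector documents scored with MaxSim.
# Random vectors stand in for embeddings from a model like ColBERT or ColPali.
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query_tokens = normalize(rng.normal(size=(8, 128)))   # 8 query tokens, 128-d each
doc_patches = [normalize(rng.normal(size=(n, 128)))   # variable patch count per document
               for n in (1024, 700, 300)]

def maxsim(q, d):
    # For each query token, keep only its best-matching document vector, then sum.
    return float((q @ d.T).max(axis=1).sum())

scores = [maxsim(query_tokens, d) for d in doc_patches]
print(np.argsort(scores)[::-1])   # document indices ranked best-first
```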

    How MaxSim Works in Multimodal Retrieval

    Let’s assume we want to search a financial PDF that contains charts, paragraphs, and footnotes.

1. Convert each page of the document to an image
    2. Divide the image into uniform patches
    3. Pass patches through a VLM (e.g., PaliGemma, Qwen2-VL)
    4. Map image patch embeddings into a shared semantic space
    5. Tokenize the text query and embed it with a language model
    6. Use MaxSim to compute similarity between query tokens and visual patches
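
Here is a rough end-to-end sketch of those six steps with the colpali-engine package; the class and method names (ColPali, ColPaliProcessor, process_images, process_queries, score_multi_vector) follow the library's published examples but may differ across versions, and the PDF path is a placeholder:

```python
# Page-as-image pipeline sketch (pip install colpali-engine pdf2image; pdf2image needs poppler).
import torch
from pdf2image import convert_from_path
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"
model = ColPali.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto").eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Steps 1-2: render each PDF page to an image; patching happens inside the model.
pages = convert_from_path("financial_report.pdf")  # placeholder path

# Steps 3-4: embed page images into per-patch vectors in the shared space.
batch_images = processor.process_images(pages).to(model.device)
with torch.no_grad():
    page_embeddings = model(**batch_images)

# Step 5: embed the text query into per-token vectors.
batch_queries = processor.process_queries(
    ["line charts illustrating monthly revenue changes"]
).to(model.device)
with torch.no_grad():
    query_embeddings = model(**batch_queries)

# Step 6: MaxSim scoring between query tokens and page patches.
scores = processor.score_multi_vector(query_embeddings, page_embeddings)
print(scores)  # one score per (query, page) pair
```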
    ColPali: Redefining Multimodal RAG with Gemini
    https://learnopencv.com/multimodal-rag-with-colpali/

    Key Multimodal Late Interaction Models

    ColBERT

    • Text-only late interaction over BERT
    • 128d token vectors, MaxSim scoring
    • Commonly used in RAG pipelines

ColPali

    • ColBERT over PaliGemma
    • Treats PDFs and scanned docs as images
• Text queries are embedded by the Gemma language backbone; image patches by PaliGemma's vision encoder
    • Shared vector space enables text-image interaction

    ColQwen

    • ColBERT over Qwen2-VL
• Fewer image patches per page (capped at 768), Apache 2.0 license
• Efficient and permissive alternative to ColPali

    Tradeoffs

Model           Modalities    Accuracy   Speed      Storage    Explainability
Bi-Encoder      Text          ⬆️          Fast       Low        Low
Cross-Encoder   Text          ⬆️⬆️        Slow       Minimal    Low
ColBERT         Text          ⬆️⬆️        Moderate   High       High
ColPali         Multimodal    ⬆️⬆️        Moderate   High       High

    Challenges in Multimodal Production Retrieval

• Storage Cost: Each patch or token vector adds up at corpus scale (see the rough estimate after this list)
• Vector Indexing: Most vector databases (e.g., Pinecone, Milvus, Weaviate) aren't built around multi-vector representations
    • Inference Scaling: MaxSim requires computing dot products across all patch vectors
    • Cross-modal alignment: Requires robust training to unify image+text into one embedding space
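
On the storage point, a back-of-envelope estimate; the roughly 1,030 vectors per page is ColPali's typical per-page output, so adjust it for your own model and precision:

```python
# Back-of-envelope storage estimate for multi-vector page embeddings.
vectors_per_page = 1030      # approximate ColPali output per page
dims = 128
bytes_per_value = 2          # float16 / bfloat16

per_page = vectors_per_page * dims * bytes_per_value
print(f"{per_page / 1024:.0f} KiB per page")                       # ~258 KiB
print(f"{per_page * 1_000_000 / 1024**3:.0f} GiB per 1M pages")    # ~246 GiB
```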

    Why Mixpeek?

    At Mixpeek, we provide a production-grade infrastructure for multimodal late interaction retrieval, purpose-built for teams working with complex content like PDFs, video frames, scanned documents, and more. Specifically, we:

    • Embed and index multimodal files (text, images, video, audio) using modular feature extractors, including ColBERT-style late interaction models for fine-grained semantic matching.
    • Store and query multi-vector representations (token or patch-level) using an optimized retrieval layer that supports real-time and batch use cases.
• Benchmark and version retrieval pipelines with tools for evaluation, A/B testing, rollbacks, and staged cutovers—so you can safely upgrade from ColBERT to ColPali or ColQwen.
    • Expose explainability out-of-the-box, showing which tokens or patches triggered a match—making semantic retrieval as transparent as keyword search.

    Summary

Late interaction retrieval is not just for text. ColPali and ColQwen extend these models into the multimodal realm—powering search over documents, charts, and PDFs with no manual chunking, OCR, or preprocessing. They enable:

    • Text-to-image search
    • Query-based patch attention
    • Deep explainability via MaxSim

    Multimodal RAG and semantic search are finally here. Let Mixpeek handle the infrastructure.

    Ethan Steininger

    May 8, 2025 · 4 min read