A retrieval approach that applies late-interaction matching (inspired by ColBERT) to document page images, enabling search over visually rich documents without requiring OCR or text extraction.
ColPali treats each document page as an image and encodes it using a vision-language model into a set of patch-level embedding vectors. Rather than producing a single vector per page, it generates one vector per image patch, preserving fine-grained spatial information about text, tables, figures, and layout. At query time, the text query is encoded into token-level vectors, and a late-interaction scoring function computes similarity by matching each query token against the most relevant document patches. This approach captures both textual content and visual structure without requiring a separate OCR step.
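The late-interaction scoring described above can be sketched in a few lines. This is a minimal illustration with random arrays standing in for real model outputs; the embedding dimension and token/patch counts are illustrative, not ColPali's actual values.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, page_patches: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one page.

    query_tokens:  (n_tokens, dim)  L2-normalized query token embeddings
    page_patches:  (n_patches, dim) L2-normalized page patch embeddings
    """
    # Cosine similarity of every query token against every patch: (n_tokens, n_patches)
    sim = query_tokens @ page_patches.T
    # For each query token, keep its best-matching patch, then sum over tokens.
    return float(sim.max(axis=1).sum())

def unit(x: np.ndarray) -> np.ndarray:
    # L2-normalize along the last axis so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy embeddings standing in for model output.
rng = np.random.default_rng(0)
query = unit(rng.standard_normal((8, 128)))    # 8 query tokens
page = unit(rng.standard_normal((1024, 128)))  # 1024 image patches
score = maxsim_score(query, page)
```

Because each per-token similarity is a cosine, the score of a query with n tokens is bounded above by n, attained when every token finds a perfectly matching patch.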
ColPali builds on the ColBERT late-interaction paradigm, replacing ColBERT's document-side text encoder with a vision-language model (typically based on PaliGemma or similar architectures). Each document page is rendered as an image and processed through the vision encoder to produce patch embeddings, while the query is encoded through the language model into token embeddings. Scoring uses MaxSim: for each query token, the maximum similarity to any document patch is computed, and these per-token scores are summed. This enables efficient retrieval: document patch embeddings are pre-computed offline, and approximate nearest-neighbor search can be applied at the patch level.
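The offline/online split above can be illustrated end to end: patch embeddings are computed once per page, and at query time each page is scored with MaxSim and the corpus is ranked. The sketch below uses random arrays in place of real encoder outputs (an exhaustive scan rather than an ANN index, for clarity); all shapes are illustrative.

```python
import numpy as np

def maxsim(q: np.ndarray, p: np.ndarray) -> float:
    # Sum, over query tokens, of each token's best patch similarity.
    return float((q @ p.T).max(axis=1).sum())

def unit(x: np.ndarray) -> np.ndarray:
    # L2-normalize so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(1)

# Offline: patch embeddings for each page, computed once by the vision encoder
# and stored. Here: 5 pages, 256 patches each, 128-dim embeddings (toy sizes).
corpus = [unit(rng.standard_normal((256, 128))) for _ in range(5)]

# Online: encode the query into token embeddings, score every page, rank.
query = unit(rng.standard_normal((6, 128)))
scores = [maxsim(query, page) for page in corpus]
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

In production systems the exhaustive loop over pages is typically replaced by an approximate nearest-neighbor index over the stored patch vectors, with MaxSim re-scoring applied to the candidate pages it returns.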