ColBERT (Contextualized Late Interaction over BERT) is a neural retrieval model that computes fine-grained similarity between queries and documents using per-token embeddings with a late interaction mechanism. It approaches the effectiveness of cross-encoders while retaining much of the efficiency of bi-encoders, thanks to its MaxSim operation over token-level representations.
ColBERT independently encodes the query and document into sets of contextualized token embeddings using BERT-based encoders. At matching time, each query token embedding is compared against all document token embeddings, and the maximum similarity (MaxSim) for each query token is computed. The final relevance score is the sum of these per-token maximum similarities. This late interaction mechanism captures fine-grained token-level matching while allowing document embeddings to be precomputed and indexed.
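The scoring step is compact enough to state in a few lines. Below is a minimal PyTorch sketch of MaxSim scoring; the function name and toy shapes are illustrative rather than ColBERT's actual API, and it assumes both sides have already been encoded into L2-normalized token embeddings:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction relevance score for one query-document pair.

    query_emb: (num_query_tokens, dim) L2-normalized query token embeddings.
    doc_emb:   (num_doc_tokens, dim)   L2-normalized document token embeddings.
    """
    # With unit-normalized embeddings, the dot product equals cosine similarity,
    # so one matmul yields every query-token / document-token similarity.
    sim = query_emb @ doc_emb.T                 # (num_query_tokens, num_doc_tokens)
    # MaxSim: keep only the best-matching document token per query token...
    per_query_token_max = sim.max(dim=1).values
    # ...and sum those maxima into the final relevance score.
    return per_query_token_max.sum()

# Toy usage with random unit vectors (dim=128, as in ColBERT).
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(200, 128), dim=-1)
print(maxsim_score(q, d).item())
```

Because the document side of this computation involves no query information, the doc_emb matrices can be built once offline and reused for every query.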
ColBERT uses a shared BERT backbone followed by a linear projection layer that maps each token embedding to a small dimension, typically 128; prepended marker tokens ([Q] for queries, [D] for documents) let the shared encoder distinguish the two input types. Document token embeddings are precomputed offline and stored in an index (optionally with compression). At query time, only query tokens are encoded online. The MaxSim operation computes cosine similarity between each query token and all document tokens, taking the maximum per query token. ColBERTv2 introduced residual compression that reduces index storage by roughly 6-10x with minimal quality loss. The model is trained with ranking losses on query-document pairs: originally a pairwise softmax cross-entropy over positive and negative passages, with ColBERTv2 adding distillation from a cross-encoder teacher.
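To make the encoder side concrete, here is a minimal sketch of a shared-backbone encoder with a linear projection to 128 dimensions, written against Hugging Face transformers. The class name is hypothetical, and it deliberately omits details of the real model such as the [Q]/[D] marker tokens, query augmentation with [MASK] padding, and punctuation masking on the document side:

```python
import torch
from transformers import AutoModel, AutoTokenizer

class MiniColBERTEncoder(torch.nn.Module):
    """Minimal sketch: shared BERT backbone + linear projection to 128 dims.

    Hypothetical simplification: omits ColBERT's [Q]/[D] marker tokens,
    query [MASK] padding, and document punctuation masking.
    """
    def __init__(self, model_name: str = "bert-base-uncased", dim: int = 128):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)   # shared for Q and D
        self.proj = torch.nn.Linear(self.bert.config.hidden_size, dim, bias=False)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        emb = self.proj(hidden)                              # (batch, seq_len, dim)
        # Unit-normalize so dot products in MaxSim are cosine similarities.
        return torch.nn.functional.normalize(emb, dim=-1)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = MiniColBERTEncoder()
batch = tok(["what is late interaction?"], return_tensors="pt")
q_emb = enc(batch["input_ids"], batch["attention_mask"])     # (1, seq_len, 128)
```

In an offline indexing pass, the same encoder would be run over every document and the resulting matrices stored (or residually compressed, as in ColBERTv2); the online cost per query is then a single encoder pass plus MaxSim against candidate documents.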