A cross-encoder is a neural model that processes a query and a document jointly through a single transformer encoder to produce a relevance score. By allowing full attention between query and document tokens, cross-encoders achieve the highest accuracy in relevance ranking, but at a significant computational cost that limits their use to reranking small candidate sets.
A cross-encoder takes the concatenation of a query and a document as input, separated by a special token (e.g., [SEP]). The combined input is processed through a transformer model (typically BERT), and a classification head on top of the [CLS] token outputs a relevance score, commonly mapped to the range 0 to 1 with a sigmoid. Because every self-attention layer sees the query and document tokens in the same sequence, the model captures fine-grained interactions between specific query terms and document passages. This full query-document attention is what gives cross-encoders their superior accuracy.
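The input packing and scoring head described above can be sketched in plain Python. This is a minimal illustration, not a real model: the names `pack_pair` and `relevance_score` are hypothetical, tokens are assumed to be pre-split strings, the default `max_len=512` assumes a BERT-style length limit, and the [CLS] vector and head weights are assumed to be supplied by the caller in place of a trained encoder.

```python
import math

CLS, SEP = "[CLS]", "[SEP]"


def pack_pair(query_tokens, doc_tokens, max_len=512):
    """Build '[CLS] query [SEP] document [SEP]', truncating the document to fit."""
    # Three positions are reserved for the special tokens.
    budget = max_len - len(query_tokens) - 3
    if budget < 0:
        raise ValueError("query alone exceeds max_len")
    doc_tokens = doc_tokens[:budget]
    tokens = [CLS] + query_tokens + [SEP] + doc_tokens + [SEP]
    # Segment ids: 0 for [CLS] + query + first [SEP], 1 for document + final [SEP].
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids


def relevance_score(cls_vector, weights, bias):
    """Linear classification head plus sigmoid over the [CLS] representation.

    cls_vector stands in for the encoder output at the [CLS] position;
    weights and bias are assumed to come from fine-tuning.
    """
    logit = sum(w * x for w, x in zip(weights, cls_vector)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

Truncating only the document side is one common choice, since queries are usually short; other truncation strategies are possible.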
Cross-encoders are typically fine-tuned from pretrained language models (BERT, RoBERTa, DeBERTa) using binary cross-entropy or margin-based ranking losses on labeled query-document pairs. For BERT-style models the input format is '[CLS] query [SEP] document [SEP]' with a maximum total length of 512 tokens (or longer with extended-context models). Because the score depends on the query and the document together, nothing can be precomputed per document offline: inference requires one forward pass per query-document pair, so the cost is O(n) forward passes per query, where n is the number of candidate documents. This cost restricts cross-encoders to reranking the 100 to 1,000 candidates retrieved by a faster first-stage method.
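The reranking step can be sketched as follows, to make the one-call-per-candidate cost concrete. This is an illustration under stated assumptions: `rerank` and `toy_overlap_score` are hypothetical names, and the toy scorer is a term-overlap stand-in for a real cross-encoder forward pass, not a trained model.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the top_k by score."""
    # One scorer call per candidate: with a real cross-encoder this is
    # one forward pass each, so n candidates cost n forward passes.
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer (assumption): fraction of query terms present in doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0
```

In a two-stage pipeline, `candidates` would be the output of the first-stage retriever, and `score_pair` would wrap the fine-tuned model.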