A cross-encoder is a neural model that jointly processes a query and a document through a single transformer encoder to produce a relevance score. By allowing full attention between query and document tokens, cross-encoders achieve the highest accuracy in relevance ranking, but at a significant computational cost that limits their use to reranking small candidate sets.
A cross-encoder takes the concatenation of a query and document as input, separated by a special token (e.g., [SEP]). The combined input is processed through a transformer model (typically BERT), and a classification head on top of the [CLS] representation outputs a relevance logit, usually mapped to a score in [0, 1] with a sigmoid. Because all attention layers see both the query and document simultaneously, the model captures fine-grained interactions between specific query terms and document passages. This full cross-attention is what gives cross-encoders their superior accuracy.
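The following is a minimal sketch of scoring a single query-document pair with Hugging Face Transformers. The checkpoint name is one publicly available example; any sequence-classification model fine-tuned for relevance ranking works the same way.

```python
# Minimal sketch of cross-encoder scoring (example checkpoint, not a prescribed choice).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "how do cross-encoders score relevance?"
document = "A cross-encoder concatenates the query and document and scores them jointly."

# Passing two text arguments yields the "[CLS] query [SEP] document [SEP]" format.
inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logit = model(**inputs).logits.squeeze()  # raw relevance logit from the [CLS] head
    score = torch.sigmoid(logit).item()       # map to (0, 1)

print(f"relevance score: {score:.3f}")
```

Note that the tokenizer handles the special tokens and segment IDs, so the query and document only need to be passed as two separate strings.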
Cross-encoders are typically fine-tuned from pretrained language models (BERT, RoBERTa, DeBERTa) using binary cross-entropy or margin-based ranking losses on labeled query-document pairs. The input format is '[CLS] query [SEP] document [SEP]' with a maximum total length of 512 tokens (or longer with extended-context models). Inference requires one forward pass per query-document pair, making it O(n) per query, where n is the number of candidate documents. This cost restricts cross-encoders to reranking roughly 100-1000 candidates retrieved by a faster first-stage method, as sketched below.
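A sketch of that two-stage pattern, using the CrossEncoder wrapper from the sentence-transformers library to rerank candidates; the checkpoint name and the hard-coded candidate list stand in for the output of a real first-stage retriever such as BM25 or a bi-encoder.

```python
# Two-stage retrieval sketch: fast first stage produces candidates,
# the cross-encoder reranks them (illustrative checkpoint and data).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

query = "effects of caffeine on sleep"
candidates = [  # e.g., top-k passages from BM25 or a bi-encoder
    "Caffeine blocks adenosine receptors and can delay sleep onset.",
    "Coffee is one of the most traded commodities in the world.",
    "Moderate caffeine intake late in the day reduces total sleep time.",
]

# One forward pass per (query, candidate) pair: cost is linear in the candidate count.
scores = reranker.predict([(query, passage) for passage in candidates])

# Sort candidates by descending relevance score.
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for passage, score in reranked:
    print(f"{score:.3f}  {passage}")
```

Because every candidate requires its own forward pass, the reranking budget (here, three passages; in practice, hundreds) directly determines latency, which is why the first stage must narrow the collection down before the cross-encoder is applied.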