A cross-encoder is a neural model that jointly processes a query and a document through a single transformer encoder to produce a relevance score. By allowing full attention between query and document tokens, cross-encoders achieve the highest accuracy in relevance ranking, but at a significant computational cost that limits their use to reranking small candidate sets.
A cross-encoder takes the concatenation of a query and document as input, separated by a special token (e.g., [SEP]). The combined input is processed through a transformer model (typically BERT), and a classification head on top of the [CLS] representation outputs a relevance logit, usually mapped to a score in [0, 1] with a sigmoid. Because all attention layers see both the query and document simultaneously, the model captures fine-grained interactions between specific query terms and document passages. This full cross-attention is what gives cross-encoders their superior accuracy.
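The following is a minimal sketch of scoring a single query-document pair with Hugging Face Transformers. The checkpoint name is one publicly available example; any sequence-classification model fine-tuned for relevance ranking works the same way.

```python
# Minimal sketch of cross-encoder scoring (example checkpoint, not a prescribed choice).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "how do cross-encoders score relevance?"
document = "A cross-encoder concatenates the query and document and scores them jointly."

# Passing two text arguments yields the "[CLS] query [SEP] document [SEP]" format.
inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logit = model(**inputs).logits.squeeze()  # raw relevance logit from the [CLS] head
    score = torch.sigmoid(logit).item()       # map to (0, 1)

print(f"relevance score: {score:.3f}")
```

Note that the tokenizer handles the special tokens and segment IDs, so the query and document only need to be passed as two separate strings.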
Cross-encoders are typically fine-tuned from pretrained language models (BERT, RoBERTa, DeBERTa) using binary cross-entropy or margin-based ranking losses on labeled query-document pairs. The input format is '[CLS] query [SEP] document [SEP]' with a maximum total length of 512 tokens (or longer with extended-context models). Inference requires one forward pass per query-document pair, making it O(n) per query, where n is the number of candidate documents. This cost restricts cross-encoders to reranking roughly 100-1000 candidates retrieved by a faster first-stage method, as sketched below.
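A sketch of that two-stage pattern, using the CrossEncoder wrapper from the sentence-transformers library to rerank candidates; the checkpoint name and the hard-coded candidate list stand in for the output of a real first-stage retriever such as BM25 or a bi-encoder.

```python
# Two-stage retrieval sketch: fast first stage produces candidates,
# the cross-encoder reranks them (illustrative checkpoint and data).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

query = "effects of caffeine on sleep"
candidates = [  # e.g., top-k passages from BM25 or a bi-encoder
    "Caffeine blocks adenosine receptors and can delay sleep onset.",
    "Coffee is one of the most traded commodities in the world.",
    "Moderate caffeine intake late in the day reduces total sleep time.",
]

# One forward pass per (query, candidate) pair: cost is linear in the candidate count.
scores = reranker.predict([(query, passage) for passage in candidates])

# Sort candidates by descending relevance score.
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for passage, score in reranked:
    print(f"{score:.3f}  {passage}")
```

Because every candidate requires its own forward pass, the reranking budget (here, three passages; in practice, hundreds) directly determines latency, which is why the first stage must narrow the collection down before the cross-encoder is applied.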