
    What Is a Cross-Encoder

    Cross-Encoder - Joint query-document encoding model for precise relevance scoring

    A cross-encoder is a neural model that jointly processes a query and document together through a single transformer encoder to produce a relevance score. By allowing full attention between query and document tokens, cross-encoders achieve the highest accuracy in relevance ranking but at a significant computational cost that limits their use to reranking small candidate sets.

    How It Works

    A cross-encoder takes the concatenation of a query and a document as input, separated by a special token (e.g., [SEP]). The combined input is processed by a transformer encoder (typically BERT), and a classification head on top of the [CLS] token outputs a relevance logit, which a sigmoid usually maps to a score between 0 and 1. Because every self-attention layer sees the query and document tokens together, the model captures fine-grained interactions between specific query terms and document passages. This full query-document attention is what gives cross-encoders their superior accuracy.
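
    The input layout and score mapping described above can be sketched in plain Python. This is a minimal illustration, not a real model: the helper names `build_pair_input` and `logit_to_score` are hypothetical, the tokens are pre-split words rather than real subword tokens, and the sketch assumes the classification head emits a raw logit that a sigmoid turns into a 0-to-1 score.

```python
import math

def build_pair_input(query_tokens, doc_tokens):
    """Concatenate query and document tokens in the BERT-style pair layout."""
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    # Segment ids mark which side each token came from:
    # 0 for [CLS] + query + first [SEP], 1 for document + final [SEP].
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids

def logit_to_score(logit):
    """Numerically stable sigmoid: maps a raw logit into the (0, 1) range."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)

# Hypothetical example pair (word-level tokens for readability).
tokens, segment_ids = build_pair_input(
    ["what", "is", "bm25"],
    ["bm25", "is", "a", "ranking", "function"],
)
print(tokens)
print(segment_ids)
```

    In a real cross-encoder the tokenizer builds this layout for you, and the transformer attends over the whole sequence at once, which is where the query-document interaction comes from.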

    Technical Details

    Cross-encoders are typically fine-tuned from pretrained language models (BERT, RoBERTa, DeBERTa) using binary cross-entropy or margin-based ranking losses on labeled query-document pairs. For BERT-style models the input format is '[CLS] query [SEP] document [SEP]' (other tokenizers use their own special tokens), with a maximum total length of 512 tokens (or longer with extended-context models). Inference requires one forward pass per query-document pair, so the cost is O(n) forward passes per query, where n is the number of candidate documents, and document representations cannot be precomputed ahead of query time. This cost restricts cross-encoders to reranking roughly 100-1000 candidates retrieved by a faster first-stage method.
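
    The two loss families mentioned above can be written out for a single training example. This is a sketch under stated assumptions: the function names are hypothetical, the model output is assumed to be a raw logit, and a real training loop would use a deep learning framework's batched loss instead of scalar Python.

```python
import math

def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bce_loss(logit, label):
    """Binary cross-entropy for one (query, document) pair; label is 0 or 1."""
    # -log(sigmoid(l)) = softplus(-l) and -log(1 - sigmoid(l)) = softplus(l)
    return label * softplus(-logit) + (1 - label) * softplus(logit)

def margin_ranking_loss(pos_logit, neg_logit, margin=1.0):
    """Hinge loss: the relevant document should outscore the irrelevant one by `margin`."""
    return max(0.0, margin - (pos_logit - neg_logit))

# Pointwise: a relevant pair scored with a confident positive logit has low loss.
print(bce_loss(5.0, 1), bce_loss(-5.0, 1))
# Pairwise: no loss once the positive beats the negative by at least the margin.
print(margin_ranking_loss(2.0, 0.5), margin_ranking_loss(0.2, 0.5))
```

    The pointwise loss needs only a relevance label per pair, while the margin loss needs a positive and a negative document for the same query.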

    Best Practices

    • Use cross-encoders exclusively as rerankers on top of a fast first-stage retriever (BM25, dense retrieval)
    • Limit reranking to the top 100-500 candidates to keep latency under control
    • Fine-tune on domain-specific query-document pairs for the best relevance accuracy
    • Use distillation from cross-encoders to improve the quality of your first-stage bi-encoder
    • Consider DeBERTa-v3 as the backbone, which often ranks more accurately than BERT at a similar model size
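
    The retrieve-then-rerank pattern from the list above can be sketched as follows. The names `rerank` and `toy_overlap_score` are hypothetical; the toy scorer only counts shared terms and stands in for a real cross-encoder forward pass so the example runs without a model.

```python
def rerank(query, candidates, score_fn, top_k=100):
    """Rerank the first `top_k` first-stage candidates with a pairwise scorer."""
    # Candidates are assumed to arrive sorted by first-stage score (BM25, dense).
    shortlist = candidates[:top_k]
    # One scorer call per (query, document) pair: cost grows linearly with top_k.
    scored = [(score_fn(query, doc), doc) for doc in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

def toy_overlap_score(query, doc):
    """Stand-in scorer: fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

first_stage = [
    "dense retrieval with bi-encoders",
    "cross-encoder reranking explained",
    "cooking pasta at home",
]
result = rerank("cross-encoder reranking", first_stage, toy_overlap_score, top_k=2)
print(result)
```

    Capping `top_k` is the main latency control: anything outside the shortlist is never scored, so first-stage recall still matters.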

    Common Pitfalls

    • Trying to use a cross-encoder for initial retrieval over a full corpus, which is prohibitively slow
    • Feeding documents longer than the model's context window without chunking them, so the tokenizer silently truncates the tail and relevant content is lost
    • Using a cross-encoder trained on one domain (e.g., web search) for a very different domain (e.g., medical) without fine-tuning
    • Ignoring the latency impact of reranking too many candidates in a production setting
    • Treating raw cross-encoder scores as calibrated probabilities without applying a calibration step
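
    One common way to handle the long-document pitfall above is to split the document into overlapping windows, score each window, and keep the best score. The sketch below assumes that max aggregation suits the task; `chunk_tokens`, `score_long_document`, the window sizes, and the toy scorer are all hypothetical.

```python
def chunk_tokens(tokens, window=400, stride=200):
    """Split a token list into overlapping windows that together cover every token."""
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return chunks

def score_long_document(query, doc_tokens, score_fn, window=400, stride=200):
    """Score each chunk separately and keep the highest score."""
    chunks = chunk_tokens(doc_tokens, window, stride)
    return max(score_fn(query, chunk) for chunk in chunks)

# Toy scorer: share of chunk tokens equal to the query term (stand-in for a model).
def toy_score(query, chunk):
    return chunk.count(query) / max(len(chunk), 1)

doc = ["filler"] * 9 + ["answer"]  # the relevant token sits at the very end
truncated = toy_score("answer", doc[:4])  # what plain truncation would see
chunked = score_long_document("answer", doc, toy_score, window=4, stride=2)
print(truncated, chunked)
```

    Chunking multiplies the number of forward passes per document, so it trades latency for coverage.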

    Advanced Tips

    • Use cross-encoder scores as soft labels to distill knowledge into faster bi-encoder or ColBERT models
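    • Implement batched GPU inference so that reranking 100+ candidates fits a tight latency budget (under 50 ms is a realistic target for small models and short passages, but it depends on model size, sequence length, and hardware)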
    • Apply cross-encoders for data labeling and evaluation when ground-truth relevance labels are expensive
    • Explore lightweight cross-encoder architectures (TinyBERT, MiniLM) for lower-latency reranking
    • Use ensemble cross-encoders (averaging scores from multiple models) for the highest reranking accuracy
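
    Batching and score averaging from the tips above can be combined in a short sketch. It assumes each model is exposed as a function that scores a whole batch of (query, document) pairs at once and that all models produce scores on a comparable scale; `batched`, `ensemble_scores`, and the two toy scorers are hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def ensemble_scores(pairs, batch_scorers, batch_size=32):
    """Average the scores of several batch scorers over all (query, doc) pairs."""
    totals = [0.0] * len(pairs)
    for scorer in batch_scorers:
        offset = 0
        for batch in batched(pairs, batch_size):
            # One call per batch rather than per pair, as a GPU model would do.
            for i, score in enumerate(scorer(batch)):
                totals[offset + i] += score
            offset += len(batch)
    return [total / len(batch_scorers) for total in totals]

# Toy stand-ins for two cross-encoders.
def scorer_a(batch):
    return [float(len(doc)) for _, doc in batch]

def scorer_b(batch):
    return [0.0 for _ in batch]

pairs = [("q", "ab"), ("q", "abcd"), ("q", "a")]
averaged = ensemble_scores(pairs, [scorer_a, scorer_b], batch_size=2)
print(averaged)
```

    If the models emit scores on different scales, normalizing them or averaging ranks instead of raw scores is usually safer than a plain mean.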