Why Rerank At All: Recall First, Precision Second
A first-stage retriever — dense vectors, sparse BM25, or a hybrid of the two — has one job: do not lose the right answer. It scans millions of candidates and returns a few hundred, optimizing for recall. Its scoring is necessarily cheap, because it has to be applied to the whole corpus: a dot product between a query vector and each document vector, or a sum of term weights from an inverted index. That cheapness is also its limitation. A dense bi-encoder compresses an entire document into one fixed vector *before it has seen the query*, so it cannot attend to the specific phrase the query cares about. BM25 matches tokens but is blind to meaning. Either way, the top of the first-stage list is approximately right but rarely precisely ordered.
Reranking is the second stage, and it has the opposite job: take the small candidate set the retriever returned — the top 50, 100, or 200 — and reorder it for precision, so the genuinely best items rise to the very top. Because it runs over hundreds of candidates instead of millions, a reranker can afford to be far more expensive per item. This is the two-stage retrieve → rerank pattern, and it is the backbone of almost every serious retrieval system. The first stage trades precision for the recall and speed it needs at corpus scale; the second stage spends a much larger compute budget on a tiny slice to buy back the precision. For the broader staged pipeline an agent runs, see multi-stage retrieval.
This guide is about the *most* expensive and most capable end of the reranking spectrum: rerankers that consider the whole candidate list jointly and, increasingly, that reason before they rank. They are mechanically different from the pointwise cross-encoders covered in cross-encoder reranking — that guide is required background, and this guide deliberately contrasts against it rather than repeating it.
The Reranker Spectrum: Pointwise, Pairwise, Listwise
There are three fundamentally different ways to turn a query and a set of candidates into an ordering. They differ in what unit the model looks at when it makes a decision, and that single difference drives everything else — accuracy, latency, and the kinds of relevance signal each can capture.
Pointwise
A pointwise reranker scores each document independently: feed it (query, document₁), get a relevance score; feed it (query, document₂), get another score; sort by score. The cross-encoder is the canonical pointwise reranker — it concatenates the query and one document into a single sequence, runs full cross-attention so every query token can attend to every document token, and emits one number. This is genuinely more precise than a bi-encoder because the query and document interact during computation, but the model never sees more than one candidate at a time. Each score is an absolute judgment made in isolation.
| Property | Pointwise (cross-encoder) |
| --- | --- |
| Input unit | one (query, document) pair |
| Output | one absolute relevance score per document |
| Calls per query | N (one per candidate) — embarrassingly parallel |
| Sees other candidates? | No |
| Captures inter-document signal? | No |
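As a minimal sketch of the pointwise pattern summarized above: each candidate is scored in isolation and the list is sorted by score. The `toy_overlap_score` function is a hypothetical stand-in for a real cross-encoder (it only counts shared terms); the point is the control flow, not the scorer.

```python
from typing import Callable, List, Tuple


def pointwise_rerank(
    query: str,
    docs: List[str],
    score: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair independently, then sort by score."""
    # Each call sees exactly one document; no candidate influences another.
    scored = [(doc, score(query, doc)) for doc in docs]
    # sorted() is stable, so tied documents keep their first-stage order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_overlap_score(query: str, doc: str) -> float:
    """Hypothetical scorer: fraction of query terms that appear in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)


if __name__ == "__main__":
    docs = ["revoke a key", "rotate a key without downtime", "rate limits"]
    for doc, s in pointwise_rerank("rotate key without downtime", docs, toy_overlap_score):
        print(f"{s:.2f}  {doc}")
```

Because every score is computed independently, the list comprehension could be replaced by a batched or parallel call without changing the result.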
Pairwise
A pairwise reranker takes two documents and answers a comparison: *given this query, is A more relevant than B?* You then aggregate the pairwise verdicts into a full order (a sort, a tournament, or a Bradley–Terry-style aggregation). Pairwise framing often matches how preference data is actually collected — "this beat that" — and it sidesteps the hard problem of producing a calibrated absolute score. Its cost is the number of comparisons: a naive all-pairs scheme is O(N²), so practical pairwise rerankers use sorting-style schemes that need far fewer comparisons, or use pairwise prompts only to break ties near the top.
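One way to picture the sorting-style aggregation is to hand the pairwise verdict to an ordinary comparison sort, which needs on the order of N log N comparisons rather than all N² pairs. This is only a sketch: `prefers` is a hypothetical callable standing in for the pairwise model, and a real model's verdicts may be inconsistent (A beats B, B beats C, C beats A), in which case a comparison sort still returns an order but not a uniquely justified one.

```python
from functools import cmp_to_key
from typing import Callable, List


def pairwise_rerank(docs: List[str], prefers: Callable[[str, str], bool]) -> List[str]:
    """Order docs using only 'is a more relevant than b?' verdicts."""

    def compare(a: str, b: str) -> int:
        if prefers(a, b):
            return -1  # a should come before b
        if prefers(b, a):
            return 1   # b should come before a
        return 0       # no preference either way

    return sorted(docs, key=cmp_to_key(compare))


if __name__ == "__main__":
    # Hypothetical hidden relevance used to simulate the pairwise model.
    relevance = {"a": 1, "b": 3, "c": 2}
    print(pairwise_rerank(["a", "b", "c"], lambda x, y: relevance[x] > relevance[y]))
```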
Listwise
A listwise reranker looks at the entire candidate set at once and produces an ordering over all of it in one shot. Instead of N independent scores, it emits a single permutation: "the right order is document 3, then 7, then 1, then …". This is the regime where LLM rerankers live, and it is qualitatively different from the other two.
Why Listwise Can See What Pointwise Cannot
A pointwise scorer is structurally incapable of using inter-document information, because it only ever sees one document. But relevance is often *relative*, and the comparison is exactly the signal that resolves the hard cases:
The cost of this power is that the model can no longer parallelize over candidates — by definition it must hold the whole list in one context. That tension between *seeing everything* and *fitting everything* is the central engineering problem of LLM listwise reranking.
LLM Listwise Rerankers: The Prompt Is the Ranking Function
The RankGPT line of work made a deceptively simple observation: a sufficiently capable LLM, given a query and a numbered list of passages, can just be asked to output the list in relevance order. No fine-tuning required — the ranking function is the prompt.
The mechanism, end to end:
1. Number the candidates. Take the first-stage results and label them with identifiers: `[1]`, `[2]`, `[3]`, … each followed by its text (or, for multimodal content, a textual description / caption / transcript span).
2. Prompt for a permutation. Instruct the model: *"Rank these passages by their relevance to the query. Output only the identifiers in descending order of relevance."*
3. Parse the permutation. The model emits something like `[3] > [1] > [7] > [2] > …`. You parse that ordering and reorder your candidate list accordingly. The model never emits scores; it emits a *permutation*, which is precisely the listwise output.
A minimal prompt skeleton looks like this:
Query: "how do I rotate an expired API key without downtime?"
Passages:
[1] To delete an API key, open Settings > Keys and click Revoke...
[2] Key rotation is zero-downtime: create a new key, deploy it, then revoke the old key...
[3] API keys expire after 90 days by default; you can change the TTL in...
[4] Rate limits are enforced per API key across all endpoints...
Rank the passages by relevance to the query.
Output only identifiers, most relevant first, e.g. [2] > [4] > [1] > [3]
The model reasons over all four together and returns, say, `[2] > [3] > [1] > [4]` — promoting the zero-downtime-rotation passage above the generic expiry note, a distinction a pointwise scorer staring at each passage alone could easily miss.
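The number, prompt, and parse steps can be sketched in a few lines. The function names here are illustrative, not from any library, and the repair policy in `parse_permutation` (drop duplicates and out-of-range identifiers, append anything the model omitted in its original order) is one reasonable assumption among several. The LLM call itself is left out; the parser takes whatever string the model returned.

```python
import re
from typing import List


def build_prompt(query: str, passages: List[str]) -> str:
    """Number the passages and ask for identifiers only, most relevant first."""
    lines = [f'Query: "{query}"', "Passages:"]
    lines += [f"[{i}] {text}" for i, text in enumerate(passages, start=1)]
    lines.append("Rank the passages by relevance to the query.")
    lines.append("Output only identifiers, most relevant first, e.g. [2] > [4] > [1] > [3]")
    return "\n".join(lines)


def parse_permutation(output: str, n: int) -> List[int]:
    """Return 0-based indices in ranked order, repairing a malformed output."""
    seen, order = set(), []
    for match in re.findall(r"\[(\d+)\]", output):
        idx = int(match) - 1  # prompt identifiers are 1-based
        if 0 <= idx < n and idx not in seen:
            seen.add(idx)
            order.append(idx)
    # Anything the model forgot keeps its first-stage order at the tail.
    order += [i for i in range(n) if i not in seen]
    return order


if __name__ == "__main__":
    passages = ["delete a key", "zero-downtime rotation", "key expiry", "rate limits"]
    print(build_prompt("rotate an expired API key", passages))
    print(parse_permutation("[2] > [3] > [1] > [4]", len(passages)))
```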
Sliding Window: Reranking More Candidates Than Fit in Context
The whole point of listwise is to see every candidate at once — but you cannot fit 100 passages into a prompt and expect reliable ranking, both because of context limits and because LLM ranking quality degrades as the list grows. The standard fix is a sliding window that ranks the list in overlapping chunks, from the bottom up.
Suppose you have 100 candidates, a window size of 20, and a step of 10:
1. Take the last window (candidates 81–100) and rerank those 20.
2. Slide the window up by the step: candidates 71–90. Crucially this window *overlaps* the previous one, so the best items that bubbled up from 81–100 are re-compared against the new entries.
3. Keep sliding toward the top. Each pass lets a strong candidate from deep in the list "bubble up" one window at a time, like a bubble sort whose comparison operator is an LLM.
candidates: [1 ............................................. 100]
pass 1 window: [81 ......... 100] rerank
pass 2 window: [71 ........ 90] rerank (overlap 81-90)
pass 3 window: [61 ....... 80] rerank (overlap 71-80)
...
final window: [1 ........ 20] rerank
Bottom-up sliding is the right direction because you care most about getting the top of the final list right: the last window the model sees is the head of the list, and the overlap guarantees that anything excellent buried at the bottom has a path to climb all the way up. The tradeoff is that you now make multiple LLM calls per query (one per window), which is why window size, step, and how deep you slide are all budget knobs.
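A minimal sketch of the bottom-up sliding window, assuming a hypothetical `rank_window` callable that takes one window of candidates and returns the same candidates reordered (in practice, an LLM call plus permutation parsing). With 100 candidates, a window of 20, and a step of 10, this loop makes 9 calls.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def sliding_window_rerank(
    items: List[T],
    rank_window: Callable[[List[T]], List[T]],
    window: int = 20,
    step: int = 10,
) -> List[T]:
    """Rerank bottom-up in overlapping windows, writing each result back in place."""
    if not 0 < step <= window:
        raise ValueError("need 0 < step <= window so consecutive windows overlap or touch")
    ranked = list(items)
    end = len(ranked)
    while True:
        start = max(0, end - window)
        # Rerank the current slice; its best items land at the slice's head,
        # where the next (higher) window will pick them up again.
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:
            break  # the head of the list was the last window ranked
        end -= step
    return ranked


if __name__ == "__main__":
    # Toy ranker: larger numbers are "more relevant".
    print(sliding_window_rerank(list(range(10)), lambda w: sorted(w, reverse=True), window=4, step=2))
```

Note that only `window - step` items can be carried upward per pass, so a single sweep reliably fixes the very top of the list rather than producing a full sort.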
Permutation Generation Is Fragile: Input-Order Sensitivity
There is a subtle, important failure mode unique to listwise LLM rerankers: the output is sensitive to the input order of the candidates. Present the same passages in a different order and a real model will sometimes return a different ranking. This is a form of positional bias — the model has a mild tendency to favor items it saw first (or last) regardless of relevance, and it can anchor on the order you happened to feed it.
Two practical mitigations:
We will quantify these biases in the evaluation section — they are not hypothetical, and a reranker that ignores them can *lose* to the first-stage list it was supposed to improve.
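To make the order sensitivity concrete, the sketch below shows one way to damp it, offered as an illustration rather than as the specific mitigations this guide refers to: rank several shuffled copies of the same candidates and aggregate by summed position. `rank_list` is a hypothetical callable that returns positions into the list it was given, best first. The cost is `runs` times the LLM calls of a single pass.

```python
import random
from typing import Callable, List, Sequence


def shuffle_aggregate_rerank(
    items: Sequence[str],
    rank_list: Callable[[List[str]], List[int]],
    runs: int = 5,
    seed: int = 0,
) -> List[int]:
    """Return indices into items, ordered by summed rank across shuffled runs."""
    rng = random.Random(seed)
    rank_sums = [0] * len(items)
    for _ in range(runs):
        shuffled = list(range(len(items)))
        rng.shuffle(shuffled)  # a different presentation order each run
        perm = rank_list([items[i] for i in shuffled])
        for position, local_idx in enumerate(perm):
            # Map the position in the shuffled list back to the original index.
            rank_sums[shuffled[local_idx]] += position
    # Lower summed rank is better; ties fall back to first-stage order.
    return sorted(range(len(items)), key=lambda i: (rank_sums[i], i))


if __name__ == "__main__":
    # Toy order-independent ranker: longer strings are "more relevant".
    longest_first = lambda docs: sorted(range(len(docs)), key=lambda j: -len(docs[j]))
    print(shuffle_aggregate_rerank(["a", "ccc", "bb"], longest_first))
```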
Reasoning Rerankers: Emit a Trace, Then Rank
The 2026 generation of rerankers pushes listwise ranking one step further: reasoning rerankers emit an explicit reasoning trace *before* committing to a ranking. Instead of jumping straight to `[2] > [3] > ...`, the model first writes out *why* — "The query asks specifically about zero-downtime rotation. Passage [2] describes exactly that. Passage [3] is about expiry, related but not the asked-for procedure. Passage [1] is about deletion, the opposite of what's wanted..." — and only then produces the permutation. The reasoning conditions the ranking, the way chain-of-thought conditions a final answer.
Why this helps on hard queries: many relevance judgments require multi-step inference (the query implies a constraint the passage only partially satisfies; two passages are both on-topic but only one matches an implicit requirement). Forcing the model to articulate the comparison before ranking gives it the working space to get those judgments right, and the trace is auditable — you can see *why* a document was promoted.
Two families to know:
The catch is that reasoning is not free: the trace is generated tokens, so a reasoning reranker spends even more compute per query than a plain listwise one. That cost is justified only when the query is genuinely hard — which is the budget argument of the next section.
Cost and Latency: When the Big Reranker Is Worth It
Listwise LLM rerankers and reasoning rerankers are the most accurate rerankers available, and also the slowest and most expensive. The cost structure is fundamentally different from a pointwise cross-encoder:
| | Pointwise cross-encoder | Listwise LLM reranker | Reasoning reranker |
| --- | --- | --- | --- |
| Compute per query | N parallel forward passes | 1–several sequential LLM generations (sliding windows) | LLM generation plus reasoning tokens |
| Latency | low, parallelizable | high, generation is sequential | highest |
| Cost driver | model size × N | (tokens in + tokens out) × windows | (tokens in + tokens out + trace tokens) × windows |
| Typical accuracy | strong | stronger on hard/relative queries | strongest on ambiguous queries |
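A back-of-the-envelope estimate helps when comparing these columns. The sketch below counts sliding-window calls and tokens for a listwise pass; the per-passage, prompt-overhead, and per-identifier token counts are illustrative assumptions, not measurements, and the totals are an upper bound because the final window can be shorter than the rest.

```python
import math
from typing import Dict


def num_windows(n_candidates: int, window: int = 20, step: int = 10) -> int:
    """Number of LLM calls a bottom-up sliding window makes."""
    if n_candidates <= window:
        return 1
    return math.ceil((n_candidates - window) / step) + 1


def estimate_tokens(
    n_candidates: int,
    window: int = 20,
    step: int = 10,
    tokens_per_passage: int = 150,   # assumed average passage length
    prompt_overhead: int = 100,      # assumed instruction + query tokens
    tokens_per_identifier: int = 4,  # assumed cost of emitting "[12] > "
) -> Dict[str, int]:
    calls = num_windows(n_candidates, window, step)
    per_window = min(window, n_candidates)
    return {
        "calls": calls,
        "tokens_in": calls * (per_window * tokens_per_passage + prompt_overhead),
        "tokens_out": calls * per_window * tokens_per_identifier,
    }


if __name__ == "__main__":
    print(estimate_tokens(100))
```

A reasoning reranker adds its trace to `tokens_out` on every call, which is why it sits in the most expensive column.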
Distillation: Buy LLM Quality at Cross-Encoder Speed
The standard way out of this cost is distillation: use the slow, expensive LLM listwise reranker as a *teacher* to generate high-quality ranking labels (permutations, or pairwise preferences derived from them) over your queries, then train a small, fast student — usually a pointwise cross-encoder — to reproduce the teacher's orderings. The student runs at cross-encoder latency and cost but inherits much of the teacher's ranking quality, because it learned from the teacher's listwise judgments rather than from raw clicks or sparse labels. RankGPT itself was shown to distill effectively into far smaller specialized rerankers. For the general technique of compressing a large model's behavior into a cheaper one, see embedding fine-tuning and distillation.
Distillation reframes the decision. You do not always choose between "cheap reranker" and "LLM reranker" *at query time* — you can pay the LLM cost once, offline, to bake its judgment into a cheap reranker you run on every query. The LLM reranker then stays in the loop only for the residual hard cases.
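The label-generation half of distillation can be sketched simply: a teacher permutation over N passages implies N(N-1)/2 pairwise preferences, which is one common shape of training data for a student reranker. The function name and the `(query, better, worse)` tuple layout are assumptions for illustration; the student training loop is out of scope here.

```python
from itertools import combinations
from typing import List, Tuple


def permutation_to_pairs(query: str, ranked_docs: List[str]) -> List[Tuple[str, str, str]]:
    """Expand a teacher ranking (best first) into (query, better, worse) triples."""
    # combinations() preserves input order, so the first element of each pair
    # is always the one the teacher ranked higher.
    return [(query, better, worse) for better, worse in combinations(ranked_docs, 2)]


if __name__ == "__main__":
    teacher_order = ["zero-downtime rotation", "key expiry", "delete a key"]
    for triple in permutation_to_pairs("rotate an expired API key", teacher_order):
        print(triple)
```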
Budget-Aware Staging
When you do keep the expensive reranker online, the right pattern is to stage by budget: run the cheap reranker over the full candidate set, then escalate only the queries (or only the top slice of candidates) that warrant the LLM. A reasoning reranker over the top 20 of an already-cross-encoder-reranked list costs far less than running it over the raw 200, and captures most of the benefit. Allocating reranking spend per query, rather than spending uniformly, is exactly the discipline covered in budget-aware multi-vector retrieval; the same budget logic governs reranker selection. Late-interaction models like ColBERT occupy a useful middle ground — token-level matching that is more precise than a bi-encoder and cheaper than an LLM reranker; see late-interaction retrieval.
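The escalate-only-the-top-slice pattern can be sketched as a two-step cascade. Both rankers are hypothetical callables that return their input reordered; `escalate_top=20` mirrors the top-20 example in the paragraph above.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")
Ranker = Callable[[List[T]], List[T]]


def staged_rerank(
    candidates: List[T],
    cheap_rank: Ranker,
    expensive_rank: Ranker,
    escalate_top: int = 20,
) -> List[T]:
    """Cheap reranker over everything, expensive reranker over the head only."""
    cheap_order = cheap_rank(candidates)
    head, tail = cheap_order[:escalate_top], cheap_order[escalate_top:]
    # The tail keeps the cheap reranker's order; only the head pays LLM cost.
    return expensive_rank(head) + tail


if __name__ == "__main__":
    # Toy rankers: the cheap one sorts descending, the expensive one re-sorts the head ascending.
    print(staged_rerank(list(range(10)), lambda c: sorted(c, reverse=True), sorted, escalate_top=3))
```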
Evaluation and Failure Modes
A reranker is only worth its cost if it actually improves the order, and the way to know is to measure rerank lift: compute a ranking metric on the first-stage list, then on the reranked list, and look at the delta. The standard metric is nDCG@k (normalized discounted cumulative gain at cutoff k), which rewards putting highly relevant items near the top and discounts gains deeper in the list — exactly the precision-at-the-top that reranking targets. Report nDCG@10 (and Recall@k to confirm the reranker did not drop a good item it was handed) before and after. A reranker that does not move nDCG@10 is pure cost. For the full measurement discipline, see evaluating multimodal retrieval.
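Rerank lift is straightforward to compute once each result has a graded relevance label. The sketch below uses the linear-gain form of DCG (some evaluations use an exponential gain of 2^rel - 1 instead) and assumes the input lists hold the relevance labels of the results in ranked order. The ideal ordering is computed from the same list, which is appropriate when the reranker only permutes a fixed candidate set.

```python
import math
from typing import List


def dcg_at_k(rels: List[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))


def ndcg_at_k(rels: List[float], k: int) -> float:
    """DCG normalized by the best achievable DCG for the same labels."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


def rerank_lift(before: List[float], after: List[float], k: int = 10) -> float:
    """Positive when the reranked order beats the first-stage order."""
    return ndcg_at_k(after, k) - ndcg_at_k(before, k)


if __name__ == "__main__":
    first_stage = [0, 1, 0, 3]  # the best item (label 3) sits at rank 4
    reranked = [3, 1, 0, 0]     # same items, best item promoted to rank 1
    print(f"lift in nDCG@10: {rerank_lift(first_stage, reranked):+.3f}")
```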
The failure modes specific to listwise LLM rerankers:
The Agent Angle: Pick the Reranker Per Query
An AI agent searching unstructured content does not have to commit to one reranker for every query — and it shouldn't. Reranking cost should scale with query difficulty, and an agent is in a position to decide difficulty per query:
The agent can route on cheap signals: query length and specificity, the score gap at the top of the first-stage list (a flat distribution of near-tied scores means the order is uncertain and reranking will help; a sharp drop-off means the top is already clear), or an explicit difficulty classifier. This per-query selection is the reranking instance of the broader control-plane idea — spend compute where it changes the answer.
The loop closes when the agent feeds rerank outcomes back as feedback. Every reranked list is a hypothesis; what the agent does next (which result actually grounded the answer, which tool call succeeded) is the verdict. Logging those outcomes, with positions, turns rerank decisions into training signal — both for distilling a better cheap reranker and for learning *when* the expensive one was worth it. That is the subject of retrieval feedback loops. For how hybrid first-stage scores are fused before reranking, see hybrid search fusion and RRF.
The mental model for an agent's reranking decision:
1. Run the recall stage: dense/sparse/hybrid retrieval returns the top-k candidates and their scores.
2. Assess difficulty: query specificity and the top-of-list score gap estimate whether the order is already trustworthy.
3. Select a reranker by budget: none / cheap cross-encoder for easy queries; listwise or reasoning reranker for hard ones.
4. Validate the permutation: every candidate accounted for exactly once; fall back to first-stage order otherwise.
5. Measure lift: nDCG@k before vs after, so a reranker that regresses gets caught.
6. Log the outcome: feed what actually worked back, to distill cheaper rerankers and tune the routing.
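The difficulty, selection, and validation steps of this mental model can be sketched as two small functions. The reranker labels and the 0.05 gap threshold are illustrative assumptions (the threshold depends entirely on the score scale of the first stage), and the validation here is strict: anything that is not an exact permutation falls back to first-stage order.

```python
from typing import List, Sequence


def select_reranker(top_scores: Sequence[float], gap_threshold: float = 0.05) -> str:
    """Route on the top-of-list score gap: flat scores mean an uncertain order."""
    if len(top_scores) < 2:
        return "none"  # nothing to reorder
    gap = top_scores[0] - top_scores[-1]
    return "listwise" if gap < gap_threshold else "cross_encoder"


def validated_order(permutation: Sequence[int], n: int) -> List[int]:
    """Accept the reranker's order only if every candidate appears exactly once."""
    if sorted(permutation) == list(range(n)):
        return list(permutation)
    return list(range(n))  # fall back to first-stage order


if __name__ == "__main__":
    print(select_reranker([0.90, 0.89, 0.88]))  # tightly bunched scores
    print(select_reranker([0.90, 0.50]))        # sharp drop-off
    print(validated_order([2, 0, 1], 3))        # valid permutation
    print(validated_order([2, 2, 1], 3))        # duplicate: falls back
```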
Mapping This to Mixpeek
Mixpeek expresses retrieve→rerank as a multi-stage retriever: an ordered list of stages where an early recall stage casts a wide net and a later rerank stage sharpens the top. The recall stage is a `feature_search` over your indexed features; the rerank stage runs a reranker model (a pointwise cross-encoder like Qwen3-Reranker-8B or jina-reranker-v3) over only the candidates the recall stage returned. You compose the stages once; the agent just executes the retriever per query.
pip install mixpeek
Define a two-stage retriever — wide recall, then precision rerank over the top candidates only:
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# A multi-stage retriever: stage 1 recalls broadly, stage 2 reranks the top.
retriever = mx.retrievers.create(
    namespace="support_kb",
    retriever_name="kb_two_stage",
    stages=[
        {
            # Stage 1: RECALL. Hybrid feature search, tuned for recall:
            # return a generous candidate pool, do not over-filter here.
            "stage_name": "recall",
            "stage_type": "feature_search",
            "parameters": {
                "query_field": "text",
                "limit": 100,  # candidate pool handed to the reranker
            },
        },
        {
            # Stage 2: RERANK. Precision-sort the recall candidates.
            # The reranker only ever sees these 100, never the full corpus.
            "stage_name": "rerank",
            "stage_type": "rerank",
            "parameters": {
                "model": "Qwen/Qwen3-Reranker-8B",
                "rerank_field": "text",
                "top_k": 10,  # final list the agent acts on
            },
        },
    ],
)
An agent that wants to spend reranking budget only where it pays off can choose the retriever per query, using the top-of-list score gap from a cheap pass to decide whether the precision stage is worth it:
# Cheap recall-only pass first.
shortlist = mx.retrievers.execute(
    retriever_id="kb_recall_only",  # stage 1 only
    query=user_query,
    return_fields=["document_id", "score", "text"],
)["results"]

# Heuristic: if the top scores are tightly bunched, the order is uncertain
# and the rerank stage will help. If the scores drop off sharply,
# the answer is already clear, so skip the expensive stage.
top = [r["score"] for r in shortlist[:5]]
# Guard: with fewer than two results there is no gap to measure, and an
# empty shortlist would make top[0] raise IndexError.
# The 0.05 threshold is illustrative; tune it to your score scale.
order_is_uncertain = len(top) > 1 and (top[0] - top[-1]) < 0.05

execution = mx.retrievers.execute(
    retriever_id="kb_two_stage" if order_is_uncertain else "kb_recall_only",
    query=user_query,
    return_fields=["document_id", "score", "text"],
)
results = execution["results"]  # ranked list; index == final position
Because the rerank stage runs only over the recall stage's output, you get listwise/cross-encoder precision at the top without paying reranker cost over the whole corpus — the two-stage budget split made concrete. To configure the features and extractors the recall stage searches over, see extractors and MVS; for which models slot into the rerank stage, browse the curated list of hybrid search engines and multimodal RAG frameworks. For cost planning across stages and full API details, see pricing and the docs.