Reasoning Rerankers: How Listwise LLM Rerankers Reorder Retrieval Results

Why Rerank At All: Recall First, Precision Second

A first-stage retriever — dense vectors, sparse BM25, or a hybrid of the two — has one job: do not lose the right answer. It scans millions of candidates and returns a few hundred, optimizing for recall. Its scoring is necessarily cheap, because it has to be applied to the whole corpus: a dot product between a query vector and each document vector, or a sum of term weights from an inverted index. That cheapness is also its limitation. A dense bi-encoder compresses an entire document into one fixed vector *before it has seen the query*, so it cannot attend to the specific phrase the query cares about. BM25 matches tokens but is blind to meaning. Either way, the top of the first-stage list is approximately right but rarely precisely ordered.

Reranking is the second stage, and it has the opposite job: take the small candidate set the retriever returned — the top 50, 100, or 200 — and reorder it for precision, so the genuinely best items rise to the very top. Because it runs over hundreds of candidates instead of millions, a reranker can afford to be far more expensive per item. This is the two-stage retrieve → rerank pattern, and it is the backbone of almost every serious retrieval system. The first stage trades precision for the recall and speed it needs at corpus scale; the second stage spends a much larger compute budget on a tiny slice to buy back the precision. For the broader staged pipeline an agent runs, see multi-stage retrieval.

This guide is about the *most* expensive and most capable end of the reranking spectrum: rerankers that consider the whole candidate list jointly and, increasingly, that reason before they rank. They are mechanically different from the pointwise cross-encoders covered in cross-encoder reranking — that guide is required background, and this guide deliberately contrasts against it rather than repeating it.

The Reranker Spectrum: Pointwise, Pairwise, Listwise

There are three fundamentally different ways to turn a query and a set of candidates into an ordering. They differ in what unit the model looks at when it makes a decision, and that single difference drives everything else — accuracy, latency, and the kinds of relevance signal each can capture.

Pointwise

A pointwise reranker scores each document independently: feed it (query, document₁), get a relevance score; feed it (query, document₂), get another score; sort by score. The cross-encoder is the canonical pointwise reranker — it concatenates the query and one document into a single sequence, runs full cross-attention so every query token can attend to every document token, and emits one number. This is genuinely more precise than a bi-encoder because the query and document interact during computation, but the model never sees more than one candidate at a time. Each score is an absolute judgment made in isolation.

Property	Pointwise (cross-encoder)
Input unit	one (query, document) pair
Output	one absolute relevance score per document
Calls per query	N (one per candidate) — embarrassingly parallel
Sees other candidates?	No
Captures inter-document signal?	No

Pairwise

A pairwise reranker takes two documents and answers a comparison: *given this query, is A more relevant than B?* You then aggregate the pairwise verdicts into a full order (a sort, a tournament, or a Bradley–Terry-style aggregation). Pairwise framing often matches how preference data is actually collected — "this beat that" — and it sidesteps the hard problem of producing a calibrated absolute score. Its cost is the number of comparisons: a naive all-pairs scheme is O(N²), so practical pairwise rerankers use sorting-style schemes that need far fewer comparisons, or use pairwise prompts only to break ties near the top.
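As a rough illustration of the sorting-style aggregation described above, the sketch below plugs a pairwise judge into Python's built-in sort and counts how many comparisons it actually makes. The names `pairwise_rerank`, `prefer`, and `toy_prefer` are hypothetical; in a real system `prefer` would be a model call, and a noisy judge can give inconsistent verdicts that a plain sort does not correct.

```python
from functools import cmp_to_key

def pairwise_rerank(query, docs, prefer):
    """Order docs with a pairwise judge. Returns (ordered_docs, comparisons_made)."""
    calls = 0

    def compare(a, b):
        nonlocal calls
        calls += 1
        # Negative means "a sorts before b", i.e. a is judged more relevant.
        return -1 if prefer(query, a, b) else 1

    ordered = sorted(docs, key=cmp_to_key(compare))
    return ordered, calls

def toy_prefer(query, a, b):
    # Hypothetical stand-in judge: the passage sharing more query terms wins.
    terms = set(query.lower().split())
    overlap = lambda text: len(terms & set(text.lower().split()))
    return overlap(a) > overlap(b)

docs = ["rate limits per key", "rotate key without downtime", "key expiry defaults"]
ordered, calls = pairwise_rerank("rotate api key without downtime", docs, toy_prefer)
print(ordered[0], calls)
```

Because the sort needs on the order of N log N comparisons rather than all N(N-1)/2 pairs, the number of judge calls grows much more slowly than the naive all-pairs scheme.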

Listwise

A listwise reranker looks at the entire candidate set at once and produces an ordering over all of it in one shot. Instead of N independent scores, it emits a single permutation: "the right order is document 3, then 7, then 1, then …". This is the regime where LLM rerankers live, and it is qualitatively different from the other two.

Why Listwise Can See What Pointwise Cannot

A pointwise scorer is structurally incapable of using inter-document information, because it only ever sees one document. But relevance is often *relative*, and the comparison is exactly the signal that resolves the hard cases:

Redundancy. Three candidates say nearly the same thing. A pointwise model scores all three high and they cluster at the top, crowding out a fourth document that adds genuinely new information. A listwise model sees all three together and can recognize "I already have this fact; the novel one should rank higher." This is the same intuition behind diversity-aware retrieval (MMR/DPP), surfacing naturally inside the reranker.

Relative specificity. Two documents both mention "Q3 EMEA revenue." Seen alone, each looks relevant. Seen side by side, one is a passing reference and the other is the actual breakdown — and the model can only make that judgment by comparing them.

Calibration drift. Absolute pointwise scores are notoriously hard to compare across queries and documents (see calibrating similarity scores). A listwise model never has to commit to an absolute number; it only has to get the *order* right, which is what the metric actually rewards.

The cost of this power is that the model can no longer parallelize over candidates — by definition it must hold the whole list in one context. That tension between *seeing everything* and *fitting everything* is the central engineering problem of LLM listwise reranking.

LLM Listwise Rerankers: The Prompt Is the Ranking Function

The RankGPT line of work made a deceptively simple observation: a sufficiently capable LLM, given a query and a numbered list of passages, can just be asked to output the list in relevance order. No fine-tuning required — the ranking function is the prompt.

The mechanism, end to end:

1. Number the candidates. Take the first-stage results and label them with identifiers: [1], [2], [3], … each followed by its text (or, for multimodal content, a textual description / caption / transcript span).
2. Prompt for a permutation. Instruct the model: *"Rank these passages by their relevance to the query. Output only the identifiers in descending order of relevance."*
3. Parse the permutation. The model emits something like [3] > [1] > [7] > [2] > …. You parse that ordering and reorder your candidate list accordingly. The model never emits scores; it emits a *permutation*, which is precisely the listwise output.

A minimal prompt skeleton looks like this:

Query: "how do I rotate an expired API key without downtime?"

Passages:
[1] To delete an API key, open Settings > Keys and click Revoke...
[2] Key rotation is zero-downtime: create a new key, deploy it, then revoke the old key...
[3] API keys expire after 90 days by default; you can change the TTL in...
[4] Rate limits are enforced per API key across all endpoints...

Rank the passages by relevance to the query.
Output only identifiers, most relevant first, e.g. [2] > [4] > [1] > [3]

The model reasons over all four together and returns, say, [2] > [3] > [1] > [4] — promoting the zero-downtime-rotation passage above the generic expiry note, a distinction a pointwise scorer staring at each passage alone could easily miss.
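The number, prompt, and parse steps can be sketched in a few lines. This is a minimal illustration, not RankGPT's actual code: `build_prompt` and `parse_permutation` are hypothetical helper names, the LLM call itself is left out, and the parser returns the raw identifiers without checking that they form a valid permutation.

```python
import re

def build_prompt(query, passages):
    """Number the passages from 1 and wrap them in a ranking instruction."""
    lines = [f'Query: "{query}"', "", "Passages:"]
    lines += [f"[{i}] {text}" for i, text in enumerate(passages, start=1)]
    lines += [
        "",
        "Rank the passages by relevance to the query.",
        "Output only identifiers, most relevant first, e.g. [2] > [1] > [3]",
    ]
    return "\n".join(lines)

def parse_permutation(model_output):
    """Extract bracketed identifiers in the order the model emitted them."""
    return [int(m) for m in re.findall(r"\[(\d+)\]", model_output)]

passages = [
    "To delete an API key, open Settings > Keys and click Revoke...",
    "Key rotation is zero-downtime: create a new key, deploy it, then revoke the old key...",
]
prompt = build_prompt("how do I rotate an expired API key without downtime?", passages)
order = parse_permutation("[2] > [1]")       # stand-in for the model's reply
reranked = [passages[i - 1] for i in order]  # identifiers are 1-based
```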

Sliding Window: Reranking More Candidates Than Fit in Context

The whole point of listwise is to see every candidate at once — but you cannot fit 100 passages into a prompt and expect reliable ranking, both because of context limits and because LLM ranking quality degrades as the list grows. The standard fix is a sliding window that ranks the list in overlapping chunks, from the bottom up.

Suppose you have 100 candidates, a window size of 20, and a step of 10:

1. Take the last window (candidates 81–100) and rerank those 20.
2. Slide the window up by the step: candidates 71–90. Crucially this window *overlaps* the previous one, so the best items that bubbled up from 81–100 are re-compared against the new entries.
3. Keep sliding toward the top. Each pass lets a strong candidate from deep in the list "bubble up" one window at a time, like a bubble sort whose comparison operator is an LLM.

candidates:  [1 ............................................. 100]
pass 1 window:                                  [81 ......... 100]   rerank
pass 2 window:                           [71 ........ 90]            rerank (overlap 81-90)
pass 3 window:                    [61 ....... 80]                    rerank (overlap 71-80)
...
final window: [1 ........ 20]                                        rerank

Bottom-up sliding is the right direction because you care most about getting the top of the final list right: the last window the model sees is the head of the list, and the overlap gives anything excellent buried at the bottom a path to climb all the way up, provided it lands in the overlapping top (window minus step) positions of each window it passes through. The tradeoff is that you now make multiple LLM calls per query (one per window), which is why window size, step, and how deep you slide are all budget knobs.
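A minimal sketch of the bottom-up pass follows, assuming a hypothetical `rank_window` callable that takes a list of candidates and returns the same candidates reordered (in practice, one LLM call). The function name and signature are illustrative only.

```python
def sliding_window_rerank(candidates, rank_window, window=20, step=10):
    """Rerank bottom-up in overlapping windows. Returns (reordered_list, llm_calls)."""
    if not 0 < step <= window:
        raise ValueError("step must be between 1 and window")
    items = list(candidates)
    if not items:
        return items, 0
    end = len(items)
    calls = 0
    while True:
        start = max(0, end - window)
        # Reorder only this slice; items above it are untouched until later passes.
        items[start:end] = rank_window(items[start:end])
        calls += 1
        if start == 0:
            break
        end -= step
    return items, calls

# Worst case for the retriever: the best items (highest numbers) start at the bottom.
worst_case = list(range(100))
oracle = lambda chunk: sorted(chunk, reverse=True)  # stand-in for the LLM
reranked, calls = sliding_window_rerank(worst_case, oracle)
print(reranked[:10], calls)
```

With 100 candidates, a window of 20, and a step of 10, this makes nine calls. Only the head of the list ends up fully ordered; items further down are left partially sorted, which is acceptable when the metric only looks at the top.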

Permutation Generation Is Fragile: Input-Order Sensitivity

There is a subtle, important failure mode unique to listwise LLM rerankers: the output is sensitive to the input order of the candidates. Present the same passages in a different order and a real model will sometimes return a different ranking. This is a form of positional bias — the model has a mild tendency to favor items it saw first (or last) regardless of relevance, and it can anchor on the order you happened to feed it.

Two practical mitigations:

Order randomization / multiple passes. Shuffle the input order, run the rerank a few times, and aggregate the permutations (e.g. by average rank or a Borda count). This averages out position-induced noise at the cost of more calls.

Don't feed it a degenerate order. Feeding the candidates in first-stage score order can amplify anchoring (the model just trusts the retriever); feeding a neutral or shuffled order, combined with the sliding window's overlap, reduces the chance the model simply rubber-stamps the input. Robust pipelines treat input order as a hyperparameter, not an afterthought.
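The shuffle-and-aggregate mitigation can be sketched as below. `rerank` is a hypothetical callable that takes document ids in some input order and returns them ranked; the Borda scoring here (n points for first place down to 1 for last) is one common choice among several.

```python
import random

def borda_aggregate(rankings):
    """Combine several rankings of the same ids; higher total points rank first."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + (n - position)
    # Ties are broken by id so the result is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))

def shuffled_rerank(doc_ids, rerank, passes=3, seed=0):
    """Run the reranker over several shuffled input orders and aggregate."""
    rng = random.Random(seed)
    rankings = []
    for _ in range(passes):
        order = list(doc_ids)
        rng.shuffle(order)
        rankings.append(rerank(order))
    return borda_aggregate(rankings)

final_order = borda_aggregate([[1, 2, 3], [2, 1, 3], [1, 3, 2]])
```

Each extra pass is another full rerank call, so the number of passes is itself a budget knob.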

The evaluation section below covers how to detect and measure these biases. They are not hypothetical, and a reranker that ignores them can *lose* to the first-stage list it was supposed to improve.

Reasoning Rerankers: Emit a Trace, Then Rank

The 2026 generation of rerankers pushes listwise ranking one step further: reasoning rerankers emit an explicit reasoning trace *before* committing to a ranking. Instead of jumping straight to [2] > [3] > ..., the model first writes out *why* — "The query asks specifically about zero-downtime rotation. Passage [2] describes exactly that. Passage [3] is about expiry, related but not the asked-for procedure. Passage [1] is about deletion, the opposite of what's wanted..." — and only then produces the permutation. The reasoning conditions the ranking, the way chain-of-thought conditions a final answer.

Why this helps on hard queries: many relevance judgments require multi-step inference (the query implies a constraint the passage only partially satisfies; two passages are both on-topic but only one matches an implicit requirement). Forcing the model to articulate the comparison before ranking gives it the working space to get those judgments right, and the trace is auditable — you can see *why* a document was promoted.

Two families to know:

Qwen3-Reranker (Qwen/Qwen3-Reranker-8B) is an instruction-tuned reranker that supports task conditioning: you tell it *what makes a document relevant for your use case*, not just "rank by relevance." At the time of writing it is reported at or near the top of the MTEB-R reranking benchmark. Note that it operates as a cross-encoder scorer with instruction conditioning rather than a pure permutation generator; the field blends these designs, and "reasoning reranker" describes a capability (reason-then-rank, task-conditioned) more than one fixed architecture.

Nemotron-style reasoning rerankers — decoder-based rerankers trained to produce a relevance judgment with an accompanying rationale, trading latency for accuracy on ambiguous queries.

The catch is that reasoning is not free: the trace is generated tokens, so a reasoning reranker spends even more compute per query than a plain listwise one. That cost is justified only when the query is genuinely hard — which is the budget argument of the next section.

Cost and Latency: When the Big Reranker Is Worth It

Listwise LLM rerankers and reasoning rerankers are the most accurate rerankers available, and also the slowest and most expensive. The cost structure is fundamentally different from a pointwise cross-encoder:

Property	Pointwise cross-encoder	Listwise LLM reranker	Reasoning reranker
Compute per query	N parallel forward passes	1–several sequential LLM generations (sliding windows)	LLM generation plus reasoning tokens
Latency	low, parallelizable	high, generation is sequential	highest
Cost driver	model size × N	(tokens in + tokens out) × windows	tokens in + tokens out + trace tokens
Typical accuracy	strong	stronger on hard/relative queries	strongest on ambiguous queries

The decisive constraints are that LLM generation is autoregressive and sequential (you cannot parallelize the tokens of one permutation), and that you pay for both the candidates you feed in *and* every token the model emits. A reasoning trace can be longer than the ranking it justifies.
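To make the cost structure concrete, here is a back-of-the-envelope token estimate for a sliding-window listwise rerank. All numbers (150 tokens per passage, 100 tokens of prompt overhead, 80 ranking tokens, a 400-token trace) are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
import math

def n_windows(n_candidates, window, step):
    """Number of sliding-window passes needed to cover the list bottom-up."""
    if n_candidates <= 0:
        return 0
    if n_candidates <= window:
        return 1
    return math.ceil((n_candidates - window) / step) + 1

def listwise_token_budget(n_candidates, window, step, tokens_per_passage,
                          prompt_overhead, ranking_tokens, trace_tokens=0):
    """Upper-bound token counts: every window is assumed to be full."""
    calls = n_windows(n_candidates, window, step)
    per_call_in = prompt_overhead + min(window, n_candidates) * tokens_per_passage
    per_call_out = ranking_tokens + trace_tokens
    return {
        "calls": calls,
        "tokens_in": calls * per_call_in,
        "tokens_out": calls * per_call_out,
    }

plain = listwise_token_budget(100, 20, 10, 150, 100, 80)
reasoning = listwise_token_budget(100, 20, 10, 150, 100, 80, trace_tokens=400)
print(plain)
print(reasoning)
```

Under these assumptions the input side is identical for both, while the reasoning trace multiplies the output tokens sixfold, and output tokens are the part that is generated sequentially.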

Distillation: Buy LLM Quality at Cross-Encoder Speed

The standard way out of this cost is distillation: use the slow, expensive LLM listwise reranker as a *teacher* to generate high-quality ranking labels (permutations, or pairwise preferences derived from them) over your queries, then train a small, fast student — usually a pointwise cross-encoder — to reproduce the teacher's orderings. The student runs at cross-encoder latency and cost but inherits much of the teacher's ranking quality, because it learned from the teacher's listwise judgments rather than from raw clicks or sparse labels. RankGPT itself was shown to distill effectively into far smaller specialized rerankers. For the general technique of compressing a large model's behavior into a cheaper one, see embedding fine-tuning and distillation.

Distillation reframes the decision. You do not always choose between "cheap reranker" and "LLM reranker" *at query time* — you can pay the LLM cost once, offline, to bake its judgment into a cheap reranker you run on every query. The LLM reranker then stays in the loop only for the residual hard cases.
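One simple way to turn a teacher permutation into student training data is to expand it into pairwise preferences, as mentioned above. The sketch below shows only that data-preparation step; `permutation_to_pairs` is a hypothetical helper, and the choice of loss and student model is left open.

```python
from itertools import combinations

def permutation_to_pairs(query, ranked_passages):
    """Expand a teacher ranking (best first) into (query, better, worse) triples."""
    # combinations() preserves input order, so the first element of each pair
    # is always the passage the teacher ranked higher.
    return [(query, better, worse)
            for better, worse in combinations(ranked_passages, 2)]

teacher_ranking = ["rotation guide", "expiry note", "revoke howto", "rate limits"]
pairs = permutation_to_pairs("rotate api key", teacher_ranking)
print(len(pairs))
```

A ranking of n passages yields n(n-1)/2 preference pairs, so a handful of teacher calls can produce a sizeable training set.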

Budget-Aware Staging

When you do keep the expensive reranker online, the right pattern is to stage by budget: run the cheap reranker over the full candidate set, then escalate only the queries (or only the top slice of candidates) that warrant the LLM. A reasoning reranker over the top 20 of an already-cross-encoder-reranked list costs far less than running it over the raw 200, and captures most of the benefit. Allocating reranking spend per query, rather than spending uniformly, is exactly the discipline covered in budget-aware multi-vector retrieval; the same budget logic governs reranker selection. Late-interaction models like ColBERT occupy a useful middle ground — token-level matching that is more precise than a bi-encoder and cheaper than an LLM reranker; see late-interaction retrieval.
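The staging pattern can be sketched as follows, assuming two hypothetical callables: `cheap_score(query, doc)` returning a number (for example a cross-encoder score) and `expensive_rerank(query, docs)` returning the same docs reordered (for example a reasoning reranker).

```python
def staged_rerank(query, candidates, cheap_score, expensive_rerank, escalate_top=20):
    """Cheap rerank over everything, expensive rerank over the head only."""
    by_cheap = sorted(candidates, key=lambda d: cheap_score(query, d), reverse=True)
    head, tail = by_cheap[:escalate_top], by_cheap[escalate_top:]
    # The tail keeps its cheap-reranker order; only the head pays the LLM cost.
    return expensive_rerank(query, head) + tail

# Stand-ins so the sketch runs: score by value, and an "expensive" stage
# that happens to reverse the head.
demo = staged_rerank(
    "q",
    list(range(10)),
    cheap_score=lambda q, d: d,
    expensive_rerank=lambda q, head: list(reversed(head)),
    escalate_top=3,
)
```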

Evaluation and Failure Modes

A reranker is only worth its cost if it actually improves the order, and the way to know is to measure rerank lift: compute a ranking metric on the first-stage list, then on the reranked list, and look at the delta. The standard metric is nDCG@k (normalized discounted cumulative gain at cutoff k), which rewards putting highly relevant items near the top and discounts gains deeper in the list — exactly the precision-at-the-top that reranking targets. Report nDCG@10 (and Recall@k to confirm the reranker did not drop a good item it was handed) before and after. A reranker that does not move nDCG@10 is pure cost. For the full measurement discipline, see evaluating multimodal retrieval.
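A minimal sketch of the lift measurement, assuming graded relevance labels are available for the candidates and using the linear-gain form of DCG (some evaluations use 2^rel - 1 instead). The ideal ordering is taken from the labels passed in, so pass the labels for the full candidate list both before and after reranking to keep the two scores comparable.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the best achievable DCG for these labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def rerank_lift(before, after, k=10):
    """Positive means the reranker improved the order; negative means it regressed."""
    return ndcg_at_k(after, k) - ndcg_at_k(before, k)

# Relevance labels in list order: first-stage order vs reranked order.
first_stage = [0, 3, 2, 1]
reranked = [3, 2, 1, 0]
lift = rerank_lift(first_stage, reranked, k=4)
```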

The failure modes specific to listwise LLM rerankers:

Positional bias. As covered above, the model can favor items by their input position. Detect it by reranking the same list under several input permutations and measuring how much the output order changes; a stable reranker barely moves. Mitigate with shuffling and aggregation.

List-length effects. Ranking quality degrades as the candidate list grows — too many items in one window and the model loses the thread. This is *why* the sliding window exists; tune window size to where quality holds, not to the maximum the context allows.

Hallucinated rankings. A generative reranker emits text, and text can be malformed: it can drop an identifier, invent an identifier that was not in the input ([27] when you only gave it 20), or repeat one. You must validate the permutation — every input id appears exactly once — and have a deterministic fallback (e.g. keep the first-stage order for any item the model omitted) rather than trusting the raw output. A reranker that silently drops candidates is worse than no reranker.

Regression below baseline. All of the above can combine so the reranked list is *worse* than the first-stage list. The before/after nDCG comparison is what catches this; never ship a reranker without it.
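The validation-and-fallback step for hallucinated rankings can be sketched as below. `repair_permutation` is a hypothetical helper: it drops invented and repeated identifiers, then appends anything the model omitted in first-stage order.

```python
def is_valid_permutation(model_ids, input_ids):
    """True only if every input id appears exactly once and nothing else does."""
    return len(model_ids) == len(input_ids) and sorted(model_ids) == sorted(input_ids)

def repair_permutation(model_ids, input_ids):
    """Return a full permutation of input_ids, trusting the model where possible."""
    valid = set(input_ids)
    seen = set()
    cleaned = []
    for doc_id in model_ids:
        if doc_id in valid and doc_id not in seen:  # skip invented and repeated ids
            cleaned.append(doc_id)
            seen.add(doc_id)
    # Deterministic fallback: omitted items keep their first-stage order.
    cleaned += [doc_id for doc_id in input_ids if doc_id not in seen]
    return cleaned

first_stage_ids = [1, 2, 3, 4]
raw = [3, 27, 1, 3]  # invented id 27, repeated id 3, missing ids 2 and 4
fixed = repair_permutation(raw, first_stage_ids)
```

Logging how often `is_valid_permutation` fails is a cheap health signal for the reranker prompt.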

The Agent Angle: Pick the Reranker Per Query

An AI agent searching unstructured content does not have to commit to one reranker for every query — and it shouldn't. Reranking cost should scale with query difficulty, and an agent is in a position to decide difficulty per query:

Easy, unambiguous query ("invoice #4471") — the first-stage hybrid list is probably already right. Skip the LLM reranker; a cheap cross-encoder, or no rerank at all, is fine.

Hard, relative, or ambiguous query ("which of our incidents had a root cause similar to last month's outage?") — this needs inter-document comparison and possibly multi-step reasoning. Escalate to a listwise or reasoning reranker over the top candidates.

The agent can route on cheap signals: query length and specificity, the score gap at the top of the first-stage list (a flat distribution of near-tied scores means the order is uncertain and reranking will help; a sharp drop-off means the top is already clear), or an explicit difficulty classifier. This per-query selection is the reranking instance of the broader control-plane idea — spend compute where it changes the answer.
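As a rough sketch of score-gap routing, the function below picks a reranker tier from the spread of the top first-stage scores. The thresholds and tier names are hypothetical and depend entirely on the score scale of your retriever; query length and a difficulty classifier could be added as further signals.

```python
def choose_reranker(scores, flat_gap=0.05, sharp_gap=0.2, top_n=5):
    """Pick a reranker tier from the spread of the top first-stage scores."""
    top = sorted(scores, reverse=True)[:top_n]
    if len(top) < 2:
        return "none"           # nothing to reorder
    gap = top[0] - top[-1]
    if gap >= sharp_gap:
        return "none"           # sharp drop-off: the top is already clear
    if gap < flat_gap:
        return "listwise"       # near-tied scores: the order is uncertain
    return "cross_encoder"      # in between: a cheap rerank is enough

tier = choose_reranker([0.80, 0.79, 0.78])
```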

The loop closes when the agent feeds rerank outcomes back as feedback. Every reranked list is a hypothesis; what the agent does next (which result actually grounded the answer, which tool call succeeded) is the verdict. Logging those outcomes, with positions, turns rerank decisions into training signal — both for distilling a better cheap reranker and for learning *when* the expensive one was worth it. That is the subject of retrieval feedback loops. For how hybrid first-stage scores are fused before reranking, see hybrid search fusion and RRF.

The mental model for an agent's reranking decision:

1. Run the recall stage: dense/sparse/hybrid retrieval returns the top-k candidates and their scores.
2. Assess difficulty: query specificity and the top-of-list score gap estimate whether the order is already trustworthy.
3. Select a reranker by budget: none / cheap cross-encoder for easy queries; listwise or reasoning reranker for hard ones.
4. Validate the permutation: every candidate accounted for exactly once; fall back to first-stage order otherwise.
5. Measure lift: nDCG@k before vs after, so a reranker that regresses gets caught.
6. Log the outcome: feed what actually worked back, to distill cheaper rerankers and tune the routing.

Mapping This to Mixpeek

Mixpeek expresses retrieve→rerank as a multi-stage retriever: an ordered list of stages where an early recall stage casts a wide net and a later rerank stage sharpens the top. The recall stage is a feature_search over your indexed features; the rerank stage runs a reranker model (such as Qwen3-Reranker-8B or jina-reranker-v3) over only the candidates the recall stage returned. You compose the stages once; the agent just executes the retriever per query.

pip install mixpeek

Define a two-stage retriever — wide recall, then precision rerank over the top candidates only:

from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# A multi-stage retriever: stage 1 recalls broadly, stage 2 reranks the top.
retriever = mx.retrievers.create(
    namespace="support_kb",
    retriever_name="kb_two_stage",
    stages=[
        {
            # Stage 1 — RECALL. Hybrid feature search, tuned for recall:
            # return a generous candidate pool, do not over-filter here.
            "stage_name": "recall",
            "stage_type": "feature_search",
            "parameters": {
                "query_field": "text",
                "limit": 100,            # candidate pool handed to the reranker
            },
        },
        {
            # Stage 2 — RERANK. Precision-sort the recall candidates.
            # The reranker only ever sees these 100, never the full corpus.
            "stage_name": "rerank",
            "stage_type": "rerank",
            "parameters": {
                "model": "Qwen/Qwen3-Reranker-8B",
                "rerank_field": "text",
                "top_k": 10,             # final list the agent acts on
            },
        },
    ],
)

An agent that wants to spend reranking budget only where it pays off can choose the retriever per query, using the top-of-list score gap from a cheap pass to decide whether the precision stage is worth it:

# Cheap recall-only pass first.
shortlist = mx.retrievers.execute(
    retriever_id="kb_recall_only",   # stage 1 only
    query=user_query,
    return_fields=["document_id", "score", "text"],
)["results"]

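# Heuristic: if the top scores are tightly bunched, the order is uncertain
# and the rerank stage is likely to help. If the scores drop off sharply
# across the top 5, the answer is already clear, so skip the expensive stage.
# The 0.05 threshold is illustrative; tune it to your retriever's score scale.
top = [r["score"] for r in shortlist[:5]]
order_is_uncertain = len(top) >= 2 and (top[0] - top[-1]) < 0.05

if order_is_uncertain:
    # Escalate: run the two-stage retriever for a precision rerank.
    execution = mx.retrievers.execute(
        retriever_id="kb_two_stage",
        query=user_query,
        return_fields=["document_id", "score", "text"],
    )
    results = execution["results"]
else:
    # The recall order is already trustworthy; reuse it instead of re-querying.
    results = shortlist

# results is the ranked list: index == final position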

Because the rerank stage runs only over the recall stage's output, you get listwise/cross-encoder precision at the top without paying reranker cost over the whole corpus — the two-stage budget split made concrete. To configure the features and extractors the recall stage searches over, see extractors and MVS; for which models slot into the rerank stage, browse the curated list of hybrid search engines and multimodal RAG frameworks. For cost planning across stages and full API details, see pricing and the docs.

Production Checklist

Keep the two stages honest: tune the recall stage for recall (a generous candidate pool), tune the rerank stage for precision (small top_k).

Start with a pointwise cross-encoder reranker; reach for a listwise/reasoning reranker only when relative or ambiguous queries demand inter-document comparison.

If you run an LLM listwise reranker over a long list, use a bottom-up sliding window with overlap; tune window size to where ranking quality holds, not to the context limit.

Treat input order as a hyperparameter — shuffle and aggregate across a few passes to suppress positional bias.

Always validate the permutation: every candidate appears exactly once; fall back to first-stage order for anything the model drops or hallucinates.

Measure rerank lift with nDCG@k before vs after (and Recall@k to confirm nothing good was dropped); never ship a reranker that does not move the metric.

Distill the expensive reranker into a cheap cross-encoder offline so most queries get LLM-quality order at cross-encoder cost.

For agents, route reranker choice by query difficulty and the top-of-list score gap, and log rerank outcomes to close the feedback loop.

Key Takeaways

First-stage retrieval optimizes recall over millions; reranking optimizes precision over the top-k candidates — the retrieve→rerank pattern lets each stage spend its compute where it counts.

The reranker spectrum runs pointwise (score each document alone — cross-encoders) → pairwise (compare pairs) → listwise (order the whole set at once); only listwise can use inter-document signal like redundancy and relative specificity.

LLM listwise rerankers (RankGPT-style) make the prompt the ranking function: feed a numbered candidate list, get back a permutation; the sliding window reranks more candidates than fit in context by bubbling strong items up from the bottom.

Reasoning rerankers either emit a rationale before ranking (Nemotron-style) or condition the relevance judgment on task instructions (Qwen3-Reranker), helping on ambiguous queries at the cost of extra compute per query.

LLM rerankers are accurate but slow and expensive; distillation bakes their judgment into a cheap cross-encoder offline, and budget-aware staging keeps the expensive reranker online only for hard queries.

Watch positional bias, list-length degradation, and hallucinated/dropped identifiers; validate every permutation and measure nDCG@k lift before vs after, or a reranker can quietly do worse than the list it was handed.

An agent should pick the reranker per query — cheap for easy queries, reasoning reranker for hard ones — and feed rerank outcomes back as feedback to distill better rerankers and tune the routing.

Related Resources

Cross-Encoder Reranking -- the pointwise reranker this guide contrasts against

Multi-Stage Retrieval: How Agents Search Unstructured Data -- the staged pipeline reranking lives in

Late-Interaction Retrieval -- token-level matching between bi-encoders and LLM rerankers

Budget-Aware Multi-Vector Retrieval -- allocating compute per query, including reranker choice

Embedding Fine-Tuning and Distillation -- compressing a slow teacher into a fast student

Evaluating Multimodal Retrieval -- nDCG@k and the measurement discipline behind rerank lift

Calibrating Similarity Scores -- why absolute pointwise scores are hard to compare

Diversity-Aware Retrieval (MMR/DPP) -- the redundancy signal listwise rerankers can capture

Hybrid Search Fusion: RRF and Score Normalization -- fusing first-stage scores before reranking

Retrieval Feedback Loops -- feeding rerank outcomes back as training signal

Qwen3-Reranker-8B -- instruction-tuned reasoning reranker

Jina Reranker v3 -- a reranker that slots into the rerank stage

Best Hybrid Search Engines -- where staged ranking lives

Best Multimodal RAG Frameworks -- end-to-end systems that retrieve and rerank