The Problem: A Ranked List Is a Hypothesis, Not an Answer
Every time a retrieval system returns a ranked list, it is making a claim: *these ten items, in this order, are the best response to this query.* That claim is a hypothesis. The embeddings, the BM25 scores, the fusion weights, the reranker — all of it is a model's best guess about relevance. None of it is ground truth.
The only ground truth about whether retrieval was actually good is what happens after the list is shown. Did the user click the third result and stop, satisfied? Did the agent's downstream tool call succeed because the retrieved chunk contained the right fact? Did someone scroll past all ten and reformulate the query in frustration? Interactions are the corpus's verdict on the ranker's hypothesis.
A retrieval system that never looks at this verdict is flying blind. It will rank the same way for a query whether its top result delights every user or annoys every user, because nothing connects the outcome back to the ranking. A feedback loop is the machinery that closes that gap: log what was shown, log what happened, and feed the difference back into how the system ranks next time.
This guide explains the real mechanics of doing that correctly. It is harder than "clicked equals good," because the act of showing a list distorts the very signal you want to learn from. We will cover the signals available, the biases that corrupt them, the click models that formalize those biases, the counterfactual learning-to-rank techniques that debias them, how to close the loop in practice, and why AI agents make feedback loops more powerful — and more dangerous — than classic human search.
Implicit vs Explicit Feedback
Feedback comes in two flavors, and a serious system uses both.
Explicit feedback is a deliberate judgment: a thumbs up, a star rating, a "this answer was helpful" button, a human-labeled relevance grade. It is high quality and unambiguous, but expensive and sparse. People rarely rate things, and the ones who do are a biased sample (very happy or very angry). You cannot build a training set of millions of explicit labels from production traffic alone.
Implicit feedback is behavior you observe as a side effect of normal use, where the user never intended to send a signal:
| Signal | What it suggests | Caveat |
| --- | --- | --- |
| Click | The result looked relevant from its snippet | Could be clickbait; says nothing about the result's actual content |
| Dwell time | Long dwell suggests the content satisfied the need | Long dwell can also mean confusing content the user struggled with |
| Scroll depth / skip | Skipping high-ranked items suggests they looked irrelevant | The user may simply not have examined them |
| Reformulation | A new query right after suggests the results failed | Could be the user exploring a related sub-question |
| Downstream task success | The retrieved item enabled the next step to succeed | Hardest to attribute, but the most valuable signal |
Why "Clicked = Relevant" Is Wrong
The seductive shortcut is to treat a click as a positive label and a non-click as a negative label, then train a ranker to predict clicks. This fails because clicks are produced by the *interaction between* relevance and presentation, not by relevance alone. Three biases dominate.
Position Bias
Items shown higher in a list get more clicks regardless of their relevance, simply because users examine the top of a list more than the bottom. This is the single largest distortion in implicit feedback. The item at rank 1 might get 30% of clicks and the item at rank 10 might get 2% even if they are equally relevant, purely because far fewer people ever looked at rank 10.
If you train on raw clicks, you will conclude that whatever you happened to rank first is the most relevant — which is circular. The ranker reinforces its own past decisions: it ranked something first, it got clicks because it was first, you learn "this is great," you rank it first again. Position bias turns a feedback loop into a self-fulfilling prophecy.
Presentation and Trust Bias
Users trust the system. They assume the top results are the best, so they click them *because* they are at the top, not because the snippet was more compelling. They also click results with richer presentation — a thumbnail, a more detailed snippet, a recognizable brand — independent of relevance. The presentation of a result is a confound mixed into every click.
Selection Bias
You only observe feedback on what you actually showed. If a genuinely perfect result sat at rank 50 and you only displayed the top 10, it received zero clicks — not because it is bad, but because it was never given a chance. Your logs are a biased sample of the universe of possible results: they over-represent what your current ranker already favors and are silent about everything it suppressed. Training naively on this sample bakes the current ranker's blind spots into the next one.
The combined effect: raw click counts measure your ranker's past behavior at least as much as they measure relevance. To learn relevance, you must model and remove the behavior.
Click Models: Formalizing How Clicks Happen
Click models are probabilistic descriptions of the user's click decision. They separate *did the user examine this position* from *was the item relevant*, so you can reason about each independently. Two foundational models matter.
The Position-Based Model (PBM)
The PBM rests on the examination hypothesis: a result is clicked only if it is both *examined* and *relevant*, and whether it is examined depends only on its rank, not its content.
```text
P(click | item d at rank k) = P(examine | rank k) × P(relevant | item d)
```

Call `P(examine | rank k)` the propensity at rank k, written `p_k`. It captures how likely a user is to even look at position k, and it falls off steeply with depth. `P(relevant | item d)` is the relevance term, the quantity the ranker is actually trying to learn.
The power of the PBM is that if you can *estimate* `p_k` independently, you can divide it out and recover an unbiased estimate of relevance from biased clicks. That is the entire idea behind counterfactual LTR, below.
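To make the "divide it out" step concrete, here is a minimal sketch of the PBM in Python. The propensity values and relevance numbers are invented for illustration, and the function names are hypothetical rather than part of any library.

```python
# Illustrative examination propensities p_k for ranks 1..5.
# These are assumed values, not measurements from a real system.
PROPENSITIES = [1.0, 0.6, 0.4, 0.25, 0.15]


def pbm_click_probability(relevance: float, rank: int) -> float:
    """P(click | d at rank) = P(examine | rank) * P(relevant | d)."""
    return PROPENSITIES[rank - 1] * relevance


def recover_relevance(click_rate: float, rank: int) -> float:
    """Divide the observed click rate by the propensity of the rank."""
    return click_rate / PROPENSITIES[rank - 1]


# Two equally relevant items shown at different ranks.
ctr_top = pbm_click_probability(relevance=0.5, rank=1)
ctr_deep = pbm_click_probability(relevance=0.5, rank=5)

# Raw click rates differ because of position alone, but dividing out p_k
# gives the same relevance estimate for both items.
est_top = recover_relevance(ctr_top, rank=1)
est_deep = recover_relevance(ctr_deep, rank=5)
```

Under these assumed propensities the deep item earns far fewer clicks than the top item, yet both recover the same relevance once the propensity is divided out. In practice the hard part is obtaining trustworthy values for `PROPENSITIES`.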
How do you estimate `p_k`? The cleanest way is a small amount of deliberate result randomization (e.g., randomly swapping pairs of adjacent positions for a slice of traffic): if you sometimes show the same item at different ranks, the difference in its click rate across ranks reveals the examination curve. There are also intervention-free estimators (EM-based) that jointly fit propensities and relevance from logs, at the cost of stronger assumptions.
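As a rough sketch of the randomization idea, assume a slice of traffic where results were shuffled uniformly across positions (a simpler intervention than the adjacent-pair swaps mentioned above). Under that assumption the average relevance is the same at every rank, so the click-through rate at rank k relative to rank 1 estimates `p_k / p_1`. The log format and function name here are hypothetical.

```python
from collections import defaultdict


def estimate_relative_propensities(logs):
    """Estimate p_k / p_1 from uniformly randomized impressions.

    `logs` is an iterable of (rank, clicked) tuples with 1-based ranks.
    Assumes items were shuffled uniformly across ranks, so differences in
    click rate between ranks come from examination, not relevance.
    """
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for rank, clicked in logs:
        impressions[rank] += 1
        clicks[rank] += int(clicked)

    ctr = {rank: clicks[rank] / impressions[rank] for rank in impressions}
    if ctr.get(1, 0.0) == 0.0:
        raise ValueError("need clicks at rank 1 to normalize propensities")
    return {rank: ctr[rank] / ctr[1] for rank in sorted(ctr)}


# Toy randomized log: 10 impressions per rank with 5, 3, and 1 clicks.
toy_logs = (
    [(1, True)] * 5 + [(1, False)] * 5
    + [(2, True)] * 3 + [(2, False)] * 7
    + [(3, True)] * 1 + [(3, False)] * 9
)
relative_p = estimate_relative_propensities(toy_logs)
```

Only ratios are identifiable this way, which is sufficient for inverse-propensity weighting because a constant factor on every weight does not change which ranker minimizes the loss. Real logs need far more than ten impressions per rank for the estimates to be stable.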
The Cascade Model
The PBM assumes examination depends only on rank. The cascade model assumes something more behavioral: the user reads the list top-to-bottom, examines each item in turn, and stops as soon as they click something satisfying. Examination of rank k therefore depends on not having been satisfied by ranks 1..k-1.
```text
P(examine rank 1) = 1
P(examine rank k) = P(examine rank k-1) × (1 − P(click and satisfied at rank k-1))
```
The cascade model explains a pattern the PBM cannot: a non-click on a high-ranked item is strong evidence of irrelevance, because the user almost certainly examined it (they had to pass it to reach what they clicked). It naturally produces the intuition that "the items above the clicked item were seen and rejected," which gives you free negative signals, not just positives. Its weakness is that it only cleanly models a single click; richer variants such as the dependent click model (DCM) and the dynamic Bayesian network model (DBN) extend it to multiple clicks and, in DBN's case, to a distinction between what attracts a click and whether the user was satisfied after clicking.
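The cascade recursion is short enough to sketch directly. The snippet below is an illustration under one reading of the formula: each stop probability is taken as conditional on the user having examined that rank. The function names and the input values are hypothetical.

```python
def cascade_examination(stop_probs):
    """Examination probability per rank under the cascade model.

    `stop_probs[i]` is the probability that the user clicks and is
    satisfied at rank i + 1, given that they examined it (an assumption
    about how to read the recursion).
    """
    examine = []
    p_examine = 1.0  # the top rank is always examined
    for p_stop in stop_probs:
        examine.append(p_examine)
        # The user continues only if they were not satisfied here.
        p_examine *= 1.0 - p_stop
    return examine


def skipped_above(clicked_rank):
    """1-based ranks the cascade model treats as examined and rejected."""
    return list(range(1, clicked_rank))


examination = cascade_examination([0.5, 0.5, 0.5, 0.5])
negatives = skipped_above(clicked_rank=4)
```

With a 50% stop probability at every rank, examination halves at each step, and a click at rank 4 yields ranks 1 through 3 as inferred negatives.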
The practical takeaway: you do not have to pick the "true" model. You pick the one whose assumptions best match your interface (a long scrolling feed leans PBM; a short top-of-list answer leans cascade), use it to assign propensities and inferred labels to your logged interactions, and feed those into training instead of raw clicks.
Counterfactual Learning-to-Rank: Debiasing With Propensities
The central trick of unbiased learning-to-rank is inverse-propensity weighting (IPW). If a click on an item at rank k is partly an accident of that item being at an easy-to-examine position, then weight that click by the inverse of its examination propensity. Clicks that happened in hard-to-see positions count for *more* (they overcame low examination probability, so the item must be quite relevant); clicks at the very top count for *less* (they were partly handed to the item by position).
For a ranker that we want to learn, the IPW-corrected empirical loss over logged clicks is:
```text
L_IPW(ranker) = sum over logged clicks (query q, clicked item d at rank k) of
                ( loss(ranker, q, d) / p_k )
```
Dividing each clicked example by `p_k` (its examination propensity) makes the expected loss an unbiased estimate of the loss you would have measured if every item had been examined equally. In other words, IPW lets you train on biased logs as if they were unbiased, *provided your propensity estimates are good and every shown item had a nonzero chance of being examined.* That nonzero-propensity requirement is why a little randomization is so valuable: it guarantees no position is invisible.
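A minimal sketch of the IPW-corrected loss follows. It assumes a pointwise logistic loss on the ranker's score for each clicked item, which is only one of several possible choices, and it adds a propensity floor to keep rare deep clicks from dominating. The function name, the floor value, and the example numbers are assumptions for illustration.

```python
import math


def ipw_loss(clicked_examples, propensities, floor=0.05):
    """Sum of per-click losses, each divided by its examination propensity.

    `clicked_examples` is a list of (score, rank) pairs: the ranker's score
    for a clicked item and the 1-based rank where it was shown.
    `propensities[k - 1]` is p_k. `floor` clips tiny propensities, trading
    a little bias for lower variance.
    """
    total = 0.0
    for score, rank in clicked_examples:
        p_k = max(propensities[rank - 1], floor)
        # Pointwise logistic loss: small when the clicked item scores high.
        loss = math.log1p(math.exp(-score))
        total += loss / p_k
    return total


propensities = [1.0, 0.5, 0.25]   # assumed values, for illustration
clicks = [(0.0, 1), (0.0, 3)]     # same score, clicked at ranks 1 and 3
total = ipw_loss(clicks, propensities)
```

Here the click at rank 3 contributes four times as much loss as the click at rank 1, because its propensity is a quarter as large. That is the "clicks in hard-to-see positions count for more" effect expressed in code.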
This is "counterfactual" because you are estimating what *would* have happened under a different ranking than the one that produced your logs. You are reusing data collected by an old policy (the ranker that generated the logs) to evaluate and improve a new policy — the same logic as off-policy evaluation in reinforcement learning.
Offline vs Online Learning-to-Rank
With debiased labels in hand, how do you actually fit a ranking function? Two regimes.
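Offline LTR trains a model on a fixed batch of logged, propensity-weighted interactions. The classic formulations differ by what unit they optimize: pointwise methods score each item independently, pairwise methods (such as RankNet) learn which of two items should rank higher, and listwise methods (such as LambdaMART) optimize a metric over the whole ranked list.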
Offline LTR is stable and reproducible but always one retraining cycle behind reality.
Online LTR updates ranking behavior continuously from a live stream of interactions, rather than in periodic batches. It ranges from full online algorithms that perturb-and-learn from each interaction, to far lighter-weight adjustments — nudging fusion weights, applying per-query boosts, or running a multi-armed bandit that learns which ranking variant earns the best outcomes. Online methods adapt fast and handle drift, but need guardrails (exploration limits, propensity floors) so they do not chase noise or amplify their own biases.
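One of the lighter-weight options mentioned above, a multi-armed bandit over ranking variants, can be sketched as an epsilon-greedy policy. This is an illustrative toy, not a production algorithm: the class name and variant names are hypothetical, and the reward is assumed to be an outcome signal that has already been debiased.

```python
import random


class EpsilonGreedyRanker:
    """Pick among ranking variants by observed reward, with capped exploration."""

    def __init__(self, variants, epsilon=0.1, seed=None):
        self.variants = list(variants)
        self.epsilon = epsilon  # exploration limit
        self.counts = {v: 0 for v in self.variants}
        self.rewards = {v: 0.0 for v in self.variants}
        self._rng = random.Random(seed)

    def mean_reward(self, variant):
        n = self.counts[variant]
        return self.rewards[variant] / n if n else 0.0

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best mean.
        if self._rng.random() < self.epsilon:
            return self._rng.choice(self.variants)
        return max(self.variants, key=self.mean_reward)

    def update(self, variant, reward):
        self.counts[variant] += 1
        self.rewards[variant] += reward


bandit = EpsilonGreedyRanker(["bm25_heavy", "vector_heavy"], epsilon=0.0)
bandit.update("bm25_heavy", 0.0)    # e.g. downstream task failed
bandit.update("vector_heavy", 1.0)  # e.g. downstream task succeeded
chosen = bandit.select()
```

Setting `epsilon` to zero, as in the usage lines, disables exploration and is only useful for demonstration. A live system would keep a small positive value so that every variant continues to receive some traffic.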
Most mature systems combine both: a heavyweight reranker retrained offline on debiased logs, with a lightweight online layer that adapts fusion weights or boosts within and across sessions between retrainings.
Closing the Loop in Practice
The algorithms only work if you log the right things. The single most common reason a feedback loop fails is incomplete logging — you cannot debias position if you never recorded position.
What to Log
For every result a user or agent interacts with, capture:
1. The exact query: the text and any filters, ideally a snapshot, because the same string can mean different things under different filters.
2. The result set as shown, with positions. Position is not optional. Without the rank of each item you cannot estimate or apply propensities, and your debiasing is dead on arrival.
3. An execution identifier that ties the interaction back to the specific retrieval that produced the list, so you know exactly which ranker version, which stages, and which scores generated what the user saw.
4. The original retrieval score of the interacted item, so you can relate model confidence to observed outcomes (see calibrating similarity scores).
5. The interaction itself: type (impression, click, dwell, positive/negative feedback, downstream success) and a timestamp.
6. Session and user identifiers where available, so within-session adaptation and per-user personalization are possible.
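The checklist above maps naturally onto a single log record. The dataclass below is one possible shape, offered as a sketch: every field name is an assumption chosen for illustration, and positions are taken to be 0-based indexes into the shown list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class InteractionRecord:
    """One logged interaction; field names are illustrative."""

    query_text: str
    query_filters: dict       # snapshot of filters active at query time
    shown_item_ids: list      # result set as shown, in ranked order
    item_id: str              # the item that was interacted with
    position: int             # 0-based rank of that item in the shown list
    execution_id: str         # ties back to the retrieval that produced the list
    retrieval_score: float    # original score of the interacted item
    interaction_type: str     # e.g. "impression", "click", "dwell"
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    session_id: Optional[str] = None
    user_id: Optional[str] = None


record = InteractionRecord(
    query_text="rotate api key",
    query_filters={"product": "billing"},
    shown_item_ids=["doc_7", "doc_2", "doc_9"],
    item_id="doc_2",
    position=1,
    execution_id="exec_123",
    retrieval_score=0.71,
    interaction_type="click",
)
```

Storing the full shown list alongside the interacted item is what later makes it possible to infer skipped-above negatives and to apply propensities by rank.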
Building Training Pairs
From well-formed logs, the cascade intuition gives you supervised pairs almost for free. For a click at rank k, every examined-but-not-clicked item above it (ranks 1..k-1, which the cascade model says were seen) becomes a negative relative to the clicked item: *"clicked item should rank above this skipped-above item."* Weight each pair by inverse propensity. Feed the pairs to a pairwise learner like RankNet, or aggregate into listwise targets for LambdaMART.
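Pair construction can be sketched in a few lines. The version below weights every pair by the inverse propensity of the clicked position, which is one common choice rather than the only one, and clips small propensities with a floor. The function name, the floor, and the toy inputs are assumptions.

```python
def build_training_pairs(shown_ids, clicked_rank, propensities, floor=0.05):
    """Turn one click into IPW-weighted (preferred, rejected, weight) pairs.

    `shown_ids` is the ranked list as shown; `clicked_rank` is 1-based.
    Items above the click are treated as examined and rejected, following
    the cascade intuition. Each pair is weighted by 1 / p_k of the clicked
    position, with p_k clipped at `floor`.
    """
    clicked_id = shown_ids[clicked_rank - 1]
    weight = 1.0 / max(propensities[clicked_rank - 1], floor)
    return [
        (clicked_id, skipped_id, weight)
        for skipped_id in shown_ids[: clicked_rank - 1]
    ]


pairs = build_training_pairs(
    shown_ids=["a", "b", "c", "d"],
    clicked_rank=3,
    propensities=[1.0, 0.5, 0.25, 0.125],
)
```

A click at rank 3 produces two pairs, with the clicked item preferred over each of the two items above it. A click at rank 1 produces no pairs, since nothing was skipped.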
Retraining vs Lightweight Online Adjustment
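You have two levers, and they operate on different timescales: periodic retraining of the reranker on the accumulated, debiased logs, which is slow but can change ranking substantially, and lightweight online adjustment of fusion weights or per-query boosts, which reacts between retrainings.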
Evaluation: Do Not Trust CTR Alone
Here is the trap that closes the loop incorrectly: you change the ranker, click-through rate goes up, you declare victory. But CTR can rise simply because you moved clickbait higher or because position bias rewarded a reshuffle — without any real relevance gain. Two disciplined approaches:
The rule: a single up-and-to-the-right CTR chart proves almost nothing. Outcome metrics plus bias-aware comparison prove something.
The Agent Angle: Richer Feedback, Sharper Loops
Everything above was developed for human web search. AI agents change the economics of feedback loops in three important ways.
Agents generate dense, structured, automatic feedback. A human emits a sparse, ambiguous click. An agent emits a *verifiable outcome*: the retrieved chunk either contained the fact that let the next tool call succeed, or it did not; the answer either grounded out against a source, or it hallucinated and got rejected by a checker. These are stronger labels than clicks — closer to explicit relevance judgments — and the agent can log them itself, with no human in the loop. An agentic system can therefore build a high-quality training set from its own operation, continuously. This is why feedback loops are *more* powerful for agentic retrieval than for human search: the supervision is richer and self-generating. For how agents issue and reason about retrieval differently, see agentic retrieval and multi-stage retrieval.
Agents make the control plane learnable. Because an agent's choices (which stages to run, how to fuse, how aggressively to rerank) are explicit and logged, outcome feedback can tune the *policy*, not just the relevance model. This is the natural home of online adjustment — the retrieval control plane can learn its own fusion weights and stage budgets from downstream task success.
But the dangers are sharper too. Feedback loops amplify whatever they reward. Two guardrails matter especially for agents:
The mental model for an agentic feedback loop:
1. Retrieve a ranked list (a hypothesis) with positions recorded.
2. Act on it: the agent uses results and produces a verifiable outcome.
3. Log the interaction tied to the exact result, position, score, and execution.
4. Debias the logs with a click model and inverse-propensity weighting.
5. Learn: retrain the reranker offline and/or adjust fusion weights online.
6. Guard: handle cold start, debias for position, explore, and diversify so the loop improves relevance instead of amplifying its own bias.
Mapping This to Mixpeek
Mixpeek closes this loop with a first-class retriever interactions capability: you log an interaction against the exact result a retriever returned, and that signal feeds the retriever's learned-fusion ranking. The two halves of the loop — *execute a retriever* and *log what happened* — are explicitly linked by an execution identifier and the result's feature id, so position, score, and ranker version are all preserved for debiasing.
```bash
pip install mixpeek
```
The agent runs a retriever, acts on the results, then logs the outcome of each result it used. Position and the execution id are carried through automatically so the feedback can be debiased downstream:
```python
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# 1. Execute a retriever. The response carries an execution_id that ties
#    every logged interaction back to this exact ranked list.
execution = mx.retrievers.execute(
    retriever_id="ret_support_kb",
    query="how do I rotate an expired API key without downtime?",
    return_fields=["document_id", "score", "feature_uri", "text"],
)

results = execution["results"]  # ranked list: index == position
execution_id = execution["execution_id"]

# 2. The agent acts on the list and observes a *verifiable* outcome:
#    which retrieved chunk actually grounded the answer it returned.
grounding_chunk = pick_chunk_that_grounded_the_answer(results)  # your logic
grounding_rank = results.index(grounding_chunk)

# 3. Log the outcome tied to the exact result. position, document_score,
#    feature_uri and execution_id are what make the signal debiasable and
#    usable by learned fusion.
mx.interactions.create(
    feature_id=grounding_chunk["document_id"],
    interaction_type=["positive_feedback"],  # a list: multiple signals allowed
    position=grounding_rank,
    retriever_id="ret_support_kb",
    execution_id=execution_id,
    document_score=grounding_chunk["score"],
    feature_uri=grounding_chunk["feature_uri"],  # required for fusion weight learning
    session_id="agent_sess_42",
    query_snapshot={"text": "how do I rotate an expired API key without downtime?"},
)

# 4. Log the items ranked ABOVE the one that grounded the answer as
#    examined-but-rejected: the cascade-model negatives that teach the
#    reranker the grounding chunk should have ranked higher.
for rank, item in enumerate(results[:grounding_rank]):
    mx.interactions.create(
        feature_id=item["document_id"],
        interaction_type=["skip"],
        position=rank,
        retriever_id="ret_support_kb",
        execution_id=execution_id,
        document_score=item["score"],
        feature_uri=item["feature_uri"],
        session_id="agent_sess_42",
    )
```
Because every interaction records `position`, `document_score`, and the `execution_id`, the platform has exactly what counterfactual LTR needs: which item was shown where, with what confidence, by which retriever execution. The `feature_uri` lets the signal flow back into the retriever's learned fusion so the weights blending its stages adjust toward what actually grounded answers (the lightweight online half of the loop), while the accumulated, position-aware log is the training set for periodic reranker retraining (the heavyweight half).
To backfill historical logs in one shot rather than one call at a time, send them in bulk; to tune which signals count and how strongly, configure the retriever's reward mapping. For where the retrieved scores come from and how to normalize them before fusion, see hybrid search fusion; for the reranker that consumes the learned labels, see cross-encoder reranking. To configure the features and extractors whose ids you log against, see extractors and MVS; for the data sources powering ranking, browse the curated lists of hybrid search engines, multimodal RAG frameworks, and multimodal embedding models. For cost planning and full API details, see pricing and the docs.