Retrieval Feedback Loops: Learning to Rank from Clicks, Outcomes, and Agent Interactions

The retrieval feedback flywheel: retrieve, interact, learn, improve — clicks, skips, dwell time, and conversions become fusion weights, reranker training pairs, and index partitioning.

The Problem: A Ranked List Is a Hypothesis, Not an Answer

Every time a retrieval system returns a ranked list, it is making a claim: *these ten items, in this order, are the best response to this query.* That claim is a hypothesis. The embeddings, the BM25 scores, the fusion weights, the reranker — all of it is a model's best guess about relevance. None of it is ground truth.

The only ground truth about whether retrieval was actually good is what happens after the list is shown. Did the user click the third result and stop, satisfied? Did the agent's downstream tool call succeed because the retrieved chunk contained the right fact? Did someone scroll past all ten and reformulate the query in frustration? Interactions are the users' (and agents') verdict on the ranker's hypothesis.

A retrieval system that never looks at this verdict is flying blind. It will rank the same way for a query whether its top result delights every user or annoys every user, because nothing connects the outcome back to the ranking. A feedback loop is the machinery that closes that gap: log what was shown, log what happened, and feed the difference back into how the system ranks next time.

This guide explains the real mechanics of doing that correctly. It is harder than "clicked equals good," because the act of showing a list distorts the very signal you want to learn from. We will cover the signals available, the biases that corrupt them, the click models that formalize those biases, the counterfactual learning-to-rank techniques that debias them, how to close the loop in practice, and why AI agents make feedback loops more powerful — and more dangerous — than classic human search.

Implicit vs Explicit Feedback

Feedback comes in two flavors, and a serious system uses both.

Explicit feedback is a deliberate judgment: a thumbs up, a star rating, a "this answer was helpful" button, a human-labeled relevance grade. It is high quality and unambiguous, but expensive and sparse. People rarely rate things, and the ones who do are a biased sample (very happy or very angry). You cannot build a training set of millions of explicit labels from production traffic alone.

Implicit feedback is behavior you observe as a side effect of normal use, where the user never intended to send a signal:

| Signal | What it suggests | Caveat |
| --- | --- | --- |
| Click | The result looked relevant from its snippet | Could be clickbait; says nothing about the result's actual content |
| Dwell time | Long dwell suggests the content satisfied the need | Long dwell can also mean confusing content the user struggled with |
| Scroll depth / skip | Skipping high-ranked items suggests they looked irrelevant | The user may simply not have examined them |
| Reformulation | A new query right after suggests the results failed | Could be the user exploring a related sub-question |
| Downstream task success | The retrieved item enabled the next step to succeed | Hardest to attribute, but the most valuable signal |

Implicit feedback is abundant and free, which is why it powers most production learning-to-rank. But it is noisy and biased in ways that, if ignored, will teach your ranker the wrong lesson. The rest of this guide is largely about not getting fooled.

Why "Clicked = Relevant" Is Wrong

The seductive shortcut is to treat a click as a positive label and a non-click as a negative label, then train a ranker to predict clicks. This fails because clicks are produced by the *interaction between* relevance and presentation, not by relevance alone. Three biases dominate.

Position Bias

Items shown higher in a list get more clicks regardless of their relevance, simply because users examine the top of a list more than the bottom. This is the single largest distortion in implicit feedback. The item at rank 1 might get 30% of clicks and the item at rank 10 might get 2% even if they are equally relevant, purely because far fewer people ever looked at rank 10.

If you train on raw clicks, you will conclude that whatever you happened to rank first is the most relevant — which is circular. The ranker reinforces its own past decisions: it ranked something first, it got clicks because it was first, you learn "this is great," you rank it first again. Position bias turns a feedback loop into a self-fulfilling prophecy.
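A minimal sketch of that self-fulfilling loop, under stated assumptions: the examination probabilities and relevance values are made-up numbers, a click is assumed to require both examining and liking an item, and expected click counts are used instead of sampled clicks so the run is deterministic. The names `EXAMINE`, `RELEVANCE`, and `run_naive_loop` are hypothetical.

```python
# Hypothetical numbers, chosen only to illustrate the loop.
EXAMINE = [1.0, 0.3]               # assumed P(examine | rank) for ranks 1 and 2
RELEVANCE = {"A": 0.5, "B": 0.6}   # B is truly the better item


def run_naive_loop(rounds: int = 50) -> list:
    ranking = ["A", "B"]           # the initial ranker happened to put A first
    clicks = {"A": 0.0, "B": 0.0}
    impressions = {"A": 0, "B": 0}
    for _ in range(rounds):
        for rank, item in enumerate(ranking):
            # Expected clicks: examined AND relevant.
            clicks[item] += EXAMINE[rank] * RELEVANCE[item]
            impressions[item] += 1
        # Naive update: re-rank by raw click-through rate.
        ranking = sorted(
            ranking, key=lambda d: clicks[d] / impressions[d], reverse=True
        )
    return ranking


final_ranking = run_naive_loop()
print(final_ranking)  # A stays first although B is more relevant
```

In this toy setup A's click-through rate settles at 0.5 while B's stays at 0.18, so the naive ranker never discovers that B is better.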

Presentation and Trust Bias

Users trust the system. They assume the top results are the best, so they click them *because* they are at the top, not because the snippet was more compelling. They also click results with richer presentation — a thumbnail, a more detailed snippet, a recognizable brand — independent of relevance. The presentation of a result is a confound mixed into every click.

Selection Bias

You only observe feedback on what you actually showed. If a genuinely perfect result sat at rank 50 and you only displayed the top 10, it received zero clicks — not because it is bad, but because it was never given a chance. Your logs are a biased sample of the universe of possible results: they over-represent what your current ranker already favors and are silent about everything it suppressed. Training naively on this sample bakes the current ranker's blind spots into the next one.

The combined effect: raw click counts measure your ranker's past behavior at least as much as they measure relevance. To learn relevance, you must model and remove the behavior.

Click Models: Formalizing How Clicks Happen

Click models are probabilistic descriptions of the user's click decision. They separate *did the user examine this position* from *was the item relevant*, so you can reason about each independently. Two foundational models matter.

The Position-Based Model (PBM)

The PBM rests on the examination hypothesis: a result is clicked only if it is both *examined* and *relevant*, and whether it is examined depends only on its rank, not its content.

```text
P(click | item d at rank k) = P(examine | rank k) × P(relevant | d)
```

Call `P(examine | rank k)` the propensity at rank k, written `p_k`. It captures how likely a user is to even look at position k, and it falls off steeply with depth. `P(relevant | d)` is the quantity you actually want to learn. The model says clicks are a *product* of the two, so a low click rate is ambiguous: it could be low relevance, or it could be a deep position the user never examined.

The power of the PBM is that if you can *estimate* `p_k` independently, you can divide it out and recover an unbiased estimate of relevance from biased clicks. That is the entire idea behind counterfactual LTR, below.
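As a worked example of dividing the propensity out, the sketch below uses invented click-through rates and propensities for three ranks. The variable names and all numbers are assumptions for illustration only.

```python
# Assumed examination propensities p_k and observed click-through rates.
propensity = {1: 1.0, 5: 0.25, 10: 0.0625}
observed_ctr = {1: 0.30, 5: 0.075, 10: 0.01875}

# Under the PBM, CTR = p_k * relevance, so relevance = CTR / p_k.
relevance_estimate = {k: observed_ctr[k] / propensity[k] for k in observed_ctr}

for rank, rel in sorted(relevance_estimate.items()):
    print(f"rank {rank}: CTR={observed_ctr[rank]:.5f}  estimated relevance={rel:.2f}")
```

The three click-through rates differ by a factor of sixteen, yet after the correction all three items come out equally relevant in this example.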

How do you estimate `p_k`? The cleanest way is a small amount of deliberate result randomization (e.g., randomly swapping pairs of adjacent positions for a slice of traffic): if you sometimes show the same item at different ranks, the difference in its click rate across ranks reveals the examination curve. There are also intervention-free estimators (EM-based) that jointly fit propensities and relevance from logs, at the cost of stronger assumptions.
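One way the swap idea could be turned into numbers is sketched below. It assumes the PBM holds and that, for each adjacent pair of ranks, you have measured the click-through rate of the same randomly swapped items at the upper and at the lower rank. Because the items are the same, relevance cancels in the ratio and only the propensity ratio remains. The function name and the sample rates are hypothetical.

```python
def propensities_from_swaps(swap_ctr: list) -> list:
    """Chain adjacent-rank CTR ratios into a propensity curve.

    swap_ctr[i] is (ctr_at_upper_rank, ctr_at_lower_rank) for the items
    randomly swapped between rank i+1 and rank i+2. Rank 1 is normalized
    to propensity 1.0, so the result is relative to the top position.
    """
    props = [1.0]
    for ctr_upper, ctr_lower in swap_ctr:
        if ctr_upper <= 0:
            raise ValueError("need a positive CTR at the upper rank")
        # CTR ratio of the same items at two ranks equals p_lower / p_upper.
        props.append(props[-1] * ctr_lower / ctr_upper)
    return props


# Illustrative measurements for the swaps (rank 1 <-> 2) and (rank 2 <-> 3).
curve = propensities_from_swaps([(0.20, 0.10), (0.12, 0.09)])
print(curve)
```

Only ratios are identified this way, which is enough for inverse-propensity weighting because a constant rescaling of every `p_k` rescales every weight equally.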

The Cascade Model

The PBM assumes examination depends only on rank. The cascade model assumes something more behavioral: the user reads the list top-to-bottom, examines each item in turn, and stops as soon as they click something satisfying. Examination of rank k therefore depends on not having been satisfied by ranks 1..k-1.

```text
P(examine rank 1) = 1
P(examine rank k) = P(examine rank k-1) × (1 − P(click and satisfied at rank k-1 | examined rank k-1))
```

The cascade model explains a pattern the PBM cannot: a non-click on a high-ranked item is strong evidence of irrelevance, because the user almost certainly examined it (they had to pass it to reach what they clicked). It naturally produces the intuition that "the items above the clicked item were seen and rejected", which gives you free negative signals, not just positives. Its weakness is that it only cleanly models a single click; richer variants (DCM, DBN) extend it to multiple clicks, and DBN additionally separates perceived relevance (what attracts the click) from post-click satisfaction.
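The recurrence can be traced with a few lines of code. This is a sketch assuming each item has a known probability of being clicked and satisfying the user once examined; the function name and the sample probabilities are illustrative.

```python
def cascade_examination(click_given_examined: list) -> list:
    """Return P(examine rank k) for k = 1..n under the cascade model.

    click_given_examined[i] is the assumed probability that the item at
    rank i+1 is clicked (and satisfies the user) once it is examined.
    """
    examination = []
    p_reach = 1.0  # the first rank is always examined
    for p_click in click_given_examined:
        examination.append(p_reach)
        # The user continues only if this item did not satisfy them.
        p_reach *= 1.0 - p_click
    return examination


print(cascade_examination([0.5, 0.2, 0.4]))  # examination decays as 1.0, 0.5, 0.4
```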

The practical takeaway: you do not have to pick the "true" model. You pick the one whose assumptions best match your interface (a long scrolling feed leans PBM; a short top-of-list answer leans cascade), use it to assign propensities and inferred labels to your logged interactions, and feed those into training instead of raw clicks.

Counterfactual Learning-to-Rank: Debiasing With Propensities

The central trick of unbiased learning-to-rank is inverse-propensity weighting (IPW). If a click on an item at rank k is partly an accident of that item being at an easy-to-examine position, then weight that click by the inverse of its examination propensity. Clicks that happened in hard-to-see positions count for *more* (they overcame low examination probability, so the item must be quite relevant); clicks at the very top count for *less* (they were partly handed to the item by position).

For a ranker that we want to learn, the IPW-corrected empirical loss over logged clicks is:

```text
L_IPW(ranker) = sum over logged clicks (query q, clicked item d at rank k) of
                ( loss(ranker, q, d) / p_k )
```

Dividing each clicked example by `p_k` (its examination propensity) makes the expected loss an unbiased estimate of the loss you would have measured if every item had been examined equally. In other words, IPW lets you train on biased logs as if they were unbiased, *provided your propensity estimates are good and every shown item had a nonzero chance of being examined.* That nonzero-propensity requirement is why a little randomization is so valuable: it guarantees no position is invisible.
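A minimal sketch of the IPW loss, with two assumptions that go beyond the formula: the per-example loss is a logistic loss on the ranker's score for the clicked item, and propensities are clipped at a floor, a common variance-control step that trades a little bias for stability. The names `ipw_loss`, `score_fn`, and `floor` are hypothetical.

```python
import math


def ipw_loss(logged_clicks, score_fn, propensity, floor=0.05):
    """Sum of per-click losses, each divided by its examination propensity.

    logged_clicks: iterable of (query, item, rank) for clicked results.
    score_fn:      callable (query, item) -> real-valued ranker score.
    propensity:    mapping rank -> estimated P(examine | rank).
    floor:         assumed lower bound on propensities to limit variance.
    """
    total = 0.0
    for query, item, rank in logged_clicks:
        p_k = max(propensity[rank], floor)
        # Logistic loss for a positive example: log(1 + exp(-score)).
        example_loss = math.log1p(math.exp(-score_fn(query, item)))
        total += example_loss / p_k
    return total


clicks = [("q1", "doc_a", 1), ("q2", "doc_b", 3)]
props = {1: 1.0, 2: 0.5, 3: 0.25}
# With an untrained scorer (score 0 everywhere) the rank-3 click weighs 4x.
print(ipw_loss(clicks, lambda q, d: 0.0, props))
```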

This is "counterfactual" because you are estimating what *would* have happened under a different ranking than the one that produced your logs. You are reusing data collected by an old policy (the ranker that generated the logs) to evaluate and improve a new policy — the same logic as off-policy evaluation in reinforcement learning.

Offline vs Online Learning-to-Rank

With debiased labels in hand, how do you actually fit a ranking function? Two regimes.

Offline LTR trains a model on a fixed batch of logged, propensity-weighted interactions. The classic formulations differ by what unit they optimize:

Pointwise — predict an absolute relevance score per item (regression/classification). Simple, but ignores that ranking is about *relative* order.

Pairwise — learn from pairs: item A should rank above item B. RankNet is the canonical example, using a probabilistic pairwise loss. This matches click data well, since clicks naturally generate "clicked beat skipped-above-it" pairs.

Listwise — optimize a loss defined over the whole ranked list, directly targeting a ranking metric like NDCG. LambdaMART (gradient-boosted trees with LambdaRank gradients) has been the long-running production workhorse; listwise neural and, more recently, LLM-based listwise rankers push the same idea further.

Offline LTR is stable and reproducible but always one retraining cycle behind reality.
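To make the pairwise formulation concrete, here is a sketch of a RankNet-style loss for one preference pair. It models the probability that the preferred item outranks the other as a sigmoid of the score difference and takes the cross-entropy against the target "preferred wins". The optional `weight` argument is an assumption, added so an inverse-propensity weight can be plugged in.

```python
import math


def ranknet_pair_loss(score_preferred: float, score_other: float,
                      weight: float = 1.0) -> float:
    """Weighted cross-entropy for one pair under a sigmoid preference model.

    P(preferred beats other) = sigmoid(score_preferred - score_other),
    so the loss is log(1 + exp(-(score_preferred - score_other))).
    """
    diff = score_preferred - score_other
    return weight * math.log1p(math.exp(-diff))


print(ranknet_pair_loss(2.0, 0.0))  # small: the model already agrees
print(ranknet_pair_loss(0.0, 2.0))  # large: the model has the pair inverted
```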

Online LTR updates ranking behavior continuously from a live stream of interactions, rather than in periodic batches. It ranges from full online algorithms that perturb-and-learn from each interaction, to far lighter-weight adjustments — nudging fusion weights, applying per-query boosts, or running a multi-armed bandit that learns which ranking variant earns the best outcomes. Online methods adapt fast and handle drift, but need guardrails (exploration limits, propensity floors) so they do not chase noise or amplify their own biases.
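The bandit option is the easiest of these to sketch. The example below is an epsilon-greedy bandit over named ranking variants, which is one simple choice among many (Thompson sampling and UCB are common alternatives). The class name, the variant names, and the reward convention of 1.0 for a successful outcome are all assumptions.

```python
import random


class EpsilonGreedyVariants:
    """Pick a ranking variant per request and learn from observed outcomes."""

    def __init__(self, variants, epsilon=0.1, seed=0):
        self.variants = list(variants)
        self.epsilon = epsilon            # exploration limit (a guardrail)
        self.rng = random.Random(seed)
        self.pulls = {v: 0 for v in self.variants}
        self.mean_reward = {v: 0.0 for v in self.variants}

    def choose(self) -> str:
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.variants)                       # explore
        return max(self.variants, key=lambda v: self.mean_reward[v])    # exploit

    def update(self, variant: str, reward: float) -> None:
        self.pulls[variant] += 1
        n = self.pulls[variant]
        # Incremental running mean of the reward for this variant.
        self.mean_reward[variant] += (reward - self.mean_reward[variant]) / n


bandit = EpsilonGreedyVariants(["dense_heavy", "sparse_heavy"], epsilon=0.1)
variant = bandit.choose()
bandit.update(variant, reward=1.0)  # e.g. the downstream task succeeded
```

Keeping `epsilon` small and fixed is one way to bound how much traffic the online layer is allowed to spend on exploration.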

Most mature systems combine both: a heavyweight reranker retrained offline on debiased logs, with a lightweight online layer that adapts fusion weights or boosts within and across sessions between retrainings.

Closing the Loop in Practice

The algorithms only work if you log the right things. The single most common reason a feedback loop fails is incomplete logging — you cannot debias position if you never recorded position.

What to Log

For every result a user or agent interacts with, capture:

1. The exact query: the text and any filters, ideally a snapshot, because the same string can mean different things under different filters.
2. The result set as shown, with positions. Position is not optional. Without the rank of each item you cannot estimate or apply propensities, and your debiasing is dead on arrival.
3. An execution identifier that ties the interaction back to the specific retrieval that produced the list, so you know exactly which ranker version, which stages, and which scores generated what the user saw.
4. The original retrieval score of the interacted item, so you can relate model confidence to observed outcomes (see calibrating similarity scores).
5. The interaction itself: type (impression, click, dwell, positive/negative feedback, downstream success) and a timestamp.
6. Session and user identifiers where available, so within-session adaptation and per-user personalization are possible.
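The six items above could be captured in a single record type. The sketch below is one possible shape, not a prescribed schema: every field name is hypothetical, and the consistency check in `__post_init__` is an added assumption meant to catch logs where the position and the shown list disagree.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class InteractionRecord:
    query_text: str                 # 1. query snapshot
    filters: dict                   # 1. filters applied with the query
    shown_item_ids: tuple           # 2. result set in display order
    execution_id: str               # 3. which retrieval produced the list
    ranker_version: str             # 3. which ranker generated it
    item_id: str                    # the item that was interacted with
    position: int                   # 2. zero-based rank of that item
    retrieval_score: float          # 4. original score of the item
    interaction_type: str           # 5. e.g. "click", "skip", "task_success"
    timestamp: float = field(default_factory=time.time)  # 5.
    session_id: Optional[str] = None                     # 6.
    user_id: Optional[str] = None                        # 6.

    def __post_init__(self):
        if not 0 <= self.position < len(self.shown_item_ids):
            raise ValueError("position must index into shown_item_ids")
        if self.shown_item_ids[self.position] != self.item_id:
            raise ValueError("item_id must match the item shown at position")


record = InteractionRecord(
    query_text="rotate api key", filters={"lang": "en"},
    shown_item_ids=("doc_a", "doc_b", "doc_c"),
    execution_id="exec_123", ranker_version="v7",
    item_id="doc_b", position=1, retrieval_score=0.82,
    interaction_type="click",
)
```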

Building Training Pairs

From well-formed logs, the cascade intuition gives you supervised pairs almost for free. For a click at rank k, every examined-but-not-clicked item above it (ranks 1..k-1, which the cascade model says were seen) becomes a negative relative to the clicked item: *"clicked item should rank above this skipped-above item."* Weight each pair by inverse propensity. Feed the pairs to a pairwise learner like RankNet, or aggregate into listwise targets for LambdaMART.
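A sketch of that pair construction follows. The text does not say which propensity the pair weight should use, so this example assumes the inverse propensity of the clicked item's rank; the function name and the sample propensities are hypothetical.

```python
def build_pairs(shown, clicked_index, propensity):
    """Turn one click into (preferred, rejected, weight) training pairs.

    shown:         item ids in display order (index 0 is the top result).
    clicked_index: zero-based position of the clicked item.
    propensity:    per-position examination propensities, same indexing.
    """
    clicked = shown[clicked_index]
    # Assumption: weight by the inverse propensity of the clicked position.
    weight = 1.0 / propensity[clicked_index]
    # Cascade intuition: everything above the click was seen and rejected.
    return [(clicked, skipped, weight) for skipped in shown[:clicked_index]]


pairs = build_pairs(["a", "b", "c", "d"], clicked_index=2,
                    propensity=[1.0, 0.5, 0.25, 0.125])
print(pairs)  # [('c', 'a', 4.0), ('c', 'b', 4.0)]
```

Items below the click produce no pairs here, because the cascade model gives no evidence that they were examined.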

Retraining vs Lightweight Online Adjustment

You have two levers, and they operate on different timescales:

Periodic reranker retraining. Accumulate debiased interactions, retrain the cross-encoder or LTR model on a schedule, validate offline, and ship. High capacity, slow cadence. This is where the heavy relevance learning happens; see cross-encoder reranking for the model that typically consumes these labels.

Lightweight online adjustment. Between retrainings, adapt cheap parameters in real time: the fusion weights that blend dense and sparse scores (see hybrid search fusion and RRF), per-query or per-segment boosts, or a bandit over ranking variants. Low capacity, fast cadence, immediate response to drift.
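As an illustration of the lightweight lever, the sketch below nudges a single fusion weight `w` in the blend `w * dense + (1 - w) * sparse` using one gradient step on a pairwise logistic loss. This is only one possible update rule, not a description of any particular product's learned fusion; the function name, the learning rate, and the two-score setup are assumptions.

```python
import math


def nudge_fusion_weight(w, preferred, rejected, lr=0.1):
    """One online step on the dense-vs-sparse blend weight.

    preferred / rejected: (dense_score, sparse_score) of the item that
    earned a positive outcome and of an item that was skipped above it.
    """
    dense_gap = preferred[0] - rejected[0]
    sparse_gap = preferred[1] - rejected[1]
    # Fused score difference between the preferred and the rejected item.
    diff = w * dense_gap + (1.0 - w) * sparse_gap
    # Gradient of log(1 + exp(-diff)) with respect to w.
    grad = -(1.0 / (1.0 + math.exp(diff))) * (dense_gap - sparse_gap)
    # Keep the weight a valid convex-blend coefficient.
    return min(1.0, max(0.0, w - lr * grad))


# Dense scores separated the pair correctly, sparse scores did not,
# so the weight moves toward the dense stage.
print(nudge_fusion_weight(0.5, preferred=(0.9, 0.2), rejected=(0.4, 0.6)))
```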

Evaluation: Do Not Trust CTR Alone

Here is the trap that closes the loop incorrectly: you change the ranker, click-through rate goes up, you declare victory. But CTR can rise simply because you moved clickbait higher or because position bias rewarded a reshuffle — without any real relevance gain. Two disciplined approaches:

Interleaving and A/B tests. Interleaving mixes results from two rankers into one list and attributes clicks to whichever ranker contributed the clicked item, controlling for position far better than comparing two separate lists. A/B testing splits traffic and compares outcome metrics (task success, not just clicks) with statistical rigor.

Offline counterfactual estimators. Before risking live traffic, use IPW-based estimators on logged data to predict how a candidate ranker *would* perform. This is the offline mirror of online interleaving and lets you reject bad rankers cheaply. For the broader measurement discipline, see evaluating multimodal retrieval.
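The interleaving approach above can be sketched with team-draft interleaving, one common variant: the two rankers take turns contributing their best not-yet-shown item, with coin flips breaking ties over who picks next, and each click is credited to the ranker whose pick it was. Function names are hypothetical.

```python
import random


def team_draft_interleave(list_a, list_b, k, rng):
    """Return (interleaved list, item -> 'A'/'B' team assignment)."""
    sources = {"A": list_a, "B": list_b}
    picks = {"A": 0, "B": 0}
    team, interleaved = {}, []
    while len(interleaved) < k:
        # The ranker with fewer picks goes next; ties are a coin flip.
        if picks["A"] != picks["B"]:
            side = "A" if picks["A"] < picks["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        candidate = next((d for d in sources[side] if d not in team), None)
        if candidate is None:
            # This ranker has nothing new; let the other one fill the slot.
            side = "B" if side == "A" else "A"
            candidate = next((d for d in sources[side] if d not in team), None)
            if candidate is None:
                break
        team[candidate] = side
        picks[side] += 1
        interleaved.append(candidate)
    return interleaved, team


def credit_clicks(clicked_items, team):
    """Count clicks per ranker; the ranker with more credited clicks wins."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins


shown, team = team_draft_interleave(
    ["x", "y", "z"], ["y", "w", "x"], k=4, rng=random.Random(0)
)
print(shown, credit_clicks(["z"], team))
```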

The rule: a single up-and-to-the-right CTR chart proves almost nothing. Outcome metrics plus bias-aware comparison prove something.
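For the offline half of a bias-aware comparison, here is a sketch of an IPW estimate of a DCG-style metric for a candidate ranker, computed purely from logged clicks. It assumes the position-based model, one-based ranks, known propensities, and that the candidate ranks every clicked item; the function name and all numbers are illustrative.

```python
import math


def ipw_dcg_estimate(logged_clicks, new_rank_of, propensity):
    """Estimate a candidate ranker's average discounted gain per query.

    logged_clicks: list of (query, clicked_item, logged_rank), ranks from 1.
    new_rank_of:   callable (query, item) -> rank under the candidate ranker.
    propensity:    mapping logged_rank -> P(examine | rank) under the old one.
    """
    queries = {query for query, _, _ in logged_clicks}
    total = 0.0
    for query, item, logged_rank in logged_clicks:
        # A click seen at a rarely examined rank stands in for more relevance.
        relevance_credit = 1.0 / propensity[logged_rank]
        discount = 1.0 / math.log2(1 + new_rank_of(query, item))
        total += relevance_credit * discount
    return total / len(queries)


propensity = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.25}
logged_clicks = [("q1", "a", 1), ("q1", "b", 4)]
logged_order = {"a": 1, "b": 4}
candidate_order = {"b": 1, "a": 2}   # the candidate promotes item b

logged_value = ipw_dcg_estimate(logged_clicks, lambda q, d: logged_order[d], propensity)
candidate_value = ipw_dcg_estimate(logged_clicks, lambda q, d: candidate_order[d], propensity)
print(logged_value, candidate_value)
```

In this toy log the candidate scores higher because it promotes an item that earned a click despite sitting at a rarely examined rank.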

The Agent Angle: Richer Feedback, Sharper Loops

Everything above was developed for human web search. AI agents change the economics of feedback loops in three important ways.

Agents generate dense, structured, automatic feedback. A human emits a sparse, ambiguous click. An agent emits a *verifiable outcome*: the retrieved chunk either contained the fact that let the next tool call succeed, or it did not; the answer either grounded out against a source, or it hallucinated and got rejected by a checker. These are stronger labels than clicks — closer to explicit relevance judgments — and the agent can log them itself, with no human in the loop. An agentic system can therefore build a high-quality training set from its own operation, continuously. This is why feedback loops are *more* powerful for agentic retrieval than for human search: the supervision is richer and self-generating. For how agents issue and reason about retrieval differently, see agentic retrieval and multi-stage retrieval.

Agents make the control plane learnable. Because an agent's choices (which stages to run, how to fuse, how aggressively to rerank) are explicit and logged, outcome feedback can tune the *policy*, not just the relevance model. This is the natural home of online adjustment — the retrieval control plane can learn its own fusion weights and stage budgets from downstream task success.

But the dangers are sharper too. Feedback loops amplify whatever they reward. Two guardrails matter especially for agents:

Cold start. A brand-new query, corpus, or retriever has no interaction history, so the loop has nothing to learn from and the system must fall back to its prior (the base embedding/fusion ranking) without pretending it has evidence. Do not let an empty-history ranker behave as if it were confident.

Popularity / rich-get-richer amplification. If the loop boosts whatever already gets interactions, it starves the long tail and entrenches early winners — the position-bias self-fulfilling prophecy at corpus scale. Counter it with the propensity debiasing above, with exploration (occasionally show and learn from lower-ranked candidates), and with diversity-aware ranking so the loop does not collapse onto a few items; see diversity-aware retrieval (MMR/DPP).
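Both guardrails can be sketched in a few lines. The first function shrinks a feedback-derived score toward the prior in proportion to how little evidence exists, so an empty history returns the prior unchanged. The second occasionally gives the last visible slot to a candidate from below the cutoff. The names, the `prior_strength` default, and the choice of the last slot are assumptions, not a recommended configuration.

```python
import random


def blended_score(prior_score, feedback_score, n_interactions, prior_strength=20.0):
    """Trust feedback only as far as the interaction count justifies."""
    trust = n_interactions / (n_interactions + prior_strength)
    return (1.0 - trust) * prior_score + trust * feedback_score


def with_exploration(ranked, k, epsilon, rng):
    """Return (shown list, explored flag); may swap a tail item into slot k."""
    shown = list(ranked[:k])
    tail = ranked[k:]
    if shown and tail and rng.random() < epsilon:
        shown[-1] = rng.choice(tail)   # give a long-tail item a chance
        return shown, True
    return shown, False


print(blended_score(0.4, 0.9, n_interactions=0))    # cold start: the prior, 0.4
print(blended_score(0.4, 0.9, n_interactions=980))  # mostly feedback
print(with_exploration(["a", "b", "c", "d", "e"], k=3, epsilon=0.1,
                       rng=random.Random(0)))
```

Logging the `explored` flag alongside the interaction keeps the randomization visible to later propensity estimation.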

The mental model for an agentic feedback loop:

1. Retrieve a ranked list (a hypothesis) with positions recorded.
2. Act on it: the agent uses results and produces a verifiable outcome.
3. Log the interaction tied to the exact result, position, score, and execution.
4. Debias the logs with a click model and inverse-propensity weighting.
5. Learn: retrain the reranker offline and/or adjust fusion weights online.
6. Guard: handle cold start, debias for position, explore, and diversify so the loop improves relevance instead of amplifying its own bias.

Mapping This to Mixpeek

Mixpeek closes this loop with a first-class retriever interactions capability: you log an interaction against the exact result a retriever returned, and that signal feeds the retriever's learned-fusion ranking. The two halves of the loop — *execute a retriever* and *log what happened* — are explicitly linked by an execution identifier and the result's feature id, so position, score, and ranker version are all preserved for debiasing.

```bash
pip install mixpeek
```

The agent runs a retriever, acts on the results, then logs the outcome of each result it used. Position and the execution id travel with every logged interaction so the feedback can be debiased downstream:

```python
from mixpeek import Mixpeek

mx = Mixpeek(api_key="YOUR_API_KEY")

# 1. Execute a retriever. The response carries an execution_id that ties
#    every logged interaction back to this exact ranked list.
execution = mx.retrievers.execute(
    retriever_id="ret_support_kb",
    query="how do I rotate an expired API key without downtime?",
    return_fields=["document_id", "score", "feature_uri", "text"],
)

results = execution["results"]  # ranked list; index == position
execution_id = execution["execution_id"]

# 2. The agent acts on the list and observes a *verifiable* outcome:
#    which retrieved chunk actually grounded the answer it returned.
grounding_chunk = pick_chunk_that_grounded_the_answer(results)  # your logic
grounding_rank = results.index(grounding_chunk)

# 3. Log the outcome tied to the exact result. position, document_score,
#    feature_uri and execution_id are what make the signal debiasable and
#    usable by learned fusion.
mx.interactions.create(
    feature_id=grounding_chunk["document_id"],
    interaction_type=["positive_feedback"],  # a list; multiple signals allowed
    position=grounding_rank,
    retriever_id="ret_support_kb",
    execution_id=execution_id,
    document_score=grounding_chunk["score"],
    feature_uri=grounding_chunk["feature_uri"],  # required for fusion weight learning
    session_id="agent_sess_42",
    query_snapshot={"text": "how do I rotate an expired API key without downtime?"},
)

# 4. Log the items ranked ABOVE the one that grounded the answer as
#    examined-but-rejected: the cascade-model negatives that teach the
#    reranker the grounding chunk should have ranked higher.
for rank, item in enumerate(results[:grounding_rank]):
    mx.interactions.create(
        feature_id=item["document_id"],
        interaction_type=["skip"],
        position=rank,
        retriever_id="ret_support_kb",
        execution_id=execution_id,
        document_score=item["score"],
        feature_uri=item["feature_uri"],
        session_id="agent_sess_42",
    )
```

Because every interaction records `position`, `document_score`, and the `execution_id`, the platform has exactly what counterfactual LTR needs: which item was shown where, with what confidence, by which retriever execution. The `feature_uri` lets the signal flow back into the retriever's learned fusion so the weights blending its stages adjust toward what actually grounded answers (the lightweight online half of the loop), while the accumulated, position-aware log is the training set for periodic reranker retraining, the heavyweight half.

To backfill historical logs in one shot rather than one call at a time, send them in bulk; to tune which signals count and how strongly, configure the retriever's reward mapping. For where the retrieved scores come from and how to normalize them before fusion, see hybrid search fusion; for the reranker that consumes the learned labels, see cross-encoder reranking. To configure the features and extractors whose ids you log against, see extractors and MVS; for the data sources powering ranking, browse the curated lists of hybrid search engines, multimodal RAG frameworks, and multimodal embedding models. For cost planning and full API details, see pricing and the docs.

Production Checklist

Log position for every shown result — without rank you cannot debias, and the whole loop collapses to circular reinforcement.

Tie each interaction to the exact retrieval execution (and ranker version), the original score, and the query snapshot.

Choose a click model that matches your interface (PBM for long lists, cascade for top-of-list answers) and estimate examination propensities, ideally with a little result randomization.

Apply inverse-propensity weighting before training; ensure no shown position has zero examination probability.

Use cascade-model negatives (examined-but-skipped items above a positive) to build training pairs, not just positives.

Split the loop into slow offline reranker retraining and fast online fusion-weight / boost adjustment.

Evaluate with interleaving, A/B on outcome metrics, and offline counterfactual estimators — never CTR alone.

For agents, log verifiable downstream outcomes (grounding/tool success), not just clicks — they are stronger labels.

Guard against cold start (fall back to the prior) and popularity amplification (debias, explore, diversify).

Key Takeaways

A ranked list is a hypothesis; interactions are the only ground truth about whether retrieval was good, and a system that ignores them cannot improve.

Raw clicks are corrupted by position bias, presentation/trust bias, and selection bias — "clicked = relevant" trains a ranker to reinforce its own past decisions.

Click models (the position-based model's examination hypothesis, the cascade model's read-until-satisfied behavior) separate examination from relevance so the bias can be removed.

Counterfactual LTR uses inverse-propensity weighting to recover unbiased relevance from biased logs, turning offline logged data into training signal for pointwise/pairwise (RankNet) or listwise (LambdaMART, LLM rerankers) learners, complemented by fast online adjustment.

Closing the loop in practice means logging query, positions, execution, and outcome; building debiased training pairs; retraining periodically while adjusting fusion weights online; and evaluating with interleaving and counterfactual estimators rather than CTR.

Agents make feedback loops sharper: they emit dense, verifiable, self-logged outcomes — but the same loop amplifies bias, so cold-start fallbacks, propensity debiasing, exploration, and diversity are mandatory guardrails.

Related Resources

Cross-Encoder Reranking -- the reranker that typically consumes learned relevance labels

Calibrating Similarity Scores -- turning raw scores into trustworthy confidences

Evaluating Multimodal Retrieval -- the measurement discipline behind any loop

Hybrid Search Fusion: RRF and Score Normalization -- the fusion weights an online loop tunes

Multi-Stage Retrieval: How Agents Search Unstructured Data -- the staged pipeline feedback flows through

Agentic Retrieval: How Agents Search Differently -- why agent feedback is richer than human clicks

Retrieval Control Planes for AI Agents -- the learnable policy layer

Diversity-Aware Retrieval (MMR/DPP) -- countering popularity amplification

Best Hybrid Search Engines -- where ranking and feedback live

Best Multimodal RAG Frameworks -- end-to-end systems that close the loop

Best Multimodal Embedding Models -- the base ranking the loop refines