The Failure Nobody Looks At
When retrieval returns the wrong evidence, the reflex is to blame the index, the embedding model, or the reranker. But a large share of misses happen earlier, in the gap between the words the agent produced and the words the corpus was indexed with. The query "did the speaker walk back the Q3 guidance" will not land near a transcript chunk that says "we are revising our full-year outlook downward," even though they mean the same thing. No amount of reranking saves a first stage that never surfaced the right candidate.
This gap is called the vocabulary mismatch problem, and it predates embeddings: lexical search has fought it for decades and dense search only narrows it. The query side of the pipeline, everything that happens between the raw question and the vector or term that actually hits the index, is the cheapest place to fix retrieval, because it runs once per query instead of once per document. That stage is query transformation, and for an agent it is the difference between a question and a search.
For an agent reading unstructured content the problem is sharper than in plain text RAG, for three reasons:
A query transformation pipeline is a small sequence of steps that takes the raw input and produces one or more clean, routed, retrieval-ready queries. Here is the anatomy.
Step 1: Classify the Query
Before transforming anything, decide what kind of query this is, because the right transform depends on it. A lightweight classifier, whether a small LLM call or just a few heuristics, sorts the query into a handful of types: navigational or filter, lookup or exact, semantic, compound or multimodal, and reasoning trace.
The point of classification is to avoid applying an expensive transform where it hurts. Running HyDE on a navigational query that just needs a filter wastes a model call and adds noise. Routing is the first transform, and the cheapest one.
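The heuristic end of that spectrum can be sketched in a few lines. This is a minimal illustration, not a recommended rule set: the cue patterns, the `TRACE_MIN_WORDS` threshold, and the label strings are all assumptions made for the example, and a production router would be tuned against real query logs.

```python
import re

TRACE_MIN_WORDS = 40  # assumption: traces run far longer than typed queries

def classify(query: str) -> str:
    """Return one of: reasoning_trace, navigational, lookup, compound, semantic."""
    text = query.strip()
    words = text.split()
    # First-person narration or sheer length suggests an agent trace.
    if len(words) >= TRACE_MIN_WORDS or re.search(
        r"\b(I already|Now I need|so I can)\b", text
    ):
        return "reasoning_trace"
    # key:value constraints look like a metadata filter, not a search.
    if re.search(r"\b\w+:\S+", text):
        return "navigational"
    # Quoted strings or ID-like tokens call for exact matching.
    if re.search(r'"[^"]+"', text) or re.search(r"\b[A-Z]+-\d+\b", text):
        return "lookup"
    # Several conditions joined together need decomposition.
    if len(words) >= 8 and re.search(r"\b(and|where)\b", text, re.I):
        return "compound"
    # Default: treat it as a vocabulary-gap problem.
    return "semantic"
```

The checks are ordered from most to least specific, so a long trace that happens to contain a quoted string is still routed to intent extraction first.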
Step 2: Extract Intent From the Trace
When the input is an agent reasoning trace, the search need is a small fraction of the tokens. Embedding the whole trace dilutes it: the dense vector is dominated by accumulated context, and lexical search matches on terms left over from earlier steps instead of the current need. Extract the current search intent first, then transform only that.
    raw trace: "I already confirmed the index is HNSW and the data is 10M
                vectors. Now I need to know the memory footprint of that
                configuration so I can size the node."

    intent:    "HNSW index memory footprint at 10M vectors"
A 1B-parameter model is plenty for this, and many agent frameworks now emit a structured "current search intent" field alongside the trace so you skip the call entirely. Everything downstream operates on the intent, not the raw trace.
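A rough sketch of that preference order follows: use the structured field when the framework provides one, and only fall back to a model call when it does not. The field name `search_intent`, the `trace` key, and the `summarize` callable are hypothetical stand-ins, not the schema of any particular agent framework.

```python
from typing import Callable, Mapping

def search_intent(step: Mapping[str, str], summarize: Callable[[str], str]) -> str:
    """Return the text to transform: the structured intent if present, else a summary."""
    # Hypothetical field name; check what your framework actually emits.
    intent = step.get("search_intent", "").strip()
    if intent:
        return intent  # no model call needed
    # Fallback: ask a small model (passed in as `summarize`) to compress the trace.
    return summarize(step["trace"]).strip()
```

Passing the summarizer in as a callable keeps the extraction step testable without a model, and makes it easy to swap the small model later.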
Step 3: Expand or Rewrite
Once you have a clean intent, you can close the vocabulary gap. Two families of techniques, with an important tradeoff between them.
Pseudo-relevance feedback (PRF) is the classic, model-free option. Run the query, take the top few results, harvest the terms or vectors that appear in them, and add them back into a second, expanded query. The assumption is that the first-pass top results are roughly relevant, so their vocabulary is the corpus's way of saying what the query meant. PRF is cheap and grounded in the actual corpus, but it has a known failure mode called query drift: if the first pass returns off-topic results, the expansion amplifies the error.
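To make the loop concrete, here is a toy term-based version of PRF over an in-memory corpus. It is a sketch under simplifying assumptions: the first pass is plain term overlap rather than BM25, the stopword list is a tiny illustrative one, and `top_n` and `add_terms` are arbitrary defaults.

```python
from collections import Counter

# Tiny illustrative stopword list; a real system would use its analyzer's list.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "we", "our", "in", "on", "for", "did"}

def tokenize(text: str) -> list[str]:
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return [t for t in cleaned.split() if t not in STOPWORDS]

def first_pass(query: str, corpus: list[str], top_n: int) -> list[str]:
    """Rank documents by how many distinct query terms they contain."""
    q_terms = set(tokenize(query))
    scored = [(len(q_terms & set(tokenize(doc))), doc) for doc in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Dropping zero-overlap documents limits, but does not prevent, query drift.
    return [doc for score, doc in ranked[:top_n] if score > 0]

def expand_with_prf(query: str, corpus: list[str], top_n: int = 2, add_terms: int = 3) -> str:
    """Append the most frequent new terms from the first-pass top results."""
    q_terms = set(tokenize(query))
    counts = Counter(
        term
        for doc in first_pass(query, corpus, top_n)
        for term in tokenize(doc)
        if term not in q_terms
    )
    extra = [term for term, _ in counts.most_common(add_terms)]
    return " ".join([query, *extra])
```

When the first pass finds nothing, the function returns the query unchanged, which is the safe behavior: no feedback is better than feedback from irrelevant documents.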
HyDE (Hypothetical Document Embeddings) flips the direction. Instead of embedding the question, you have a small model write a hypothetical answer to it, then embed that. The intuition is that answers live nearer to real documents in embedding space than questions do, so a fabricated answer is a better probe than the literal query.
    query:       "did the speaker walk back the Q3 guidance"
    HyDE answer: "During the call, management revised the full-year
                  outlook downward and lowered Q3 revenue guidance,
                  citing softer demand."

    embed THIS, not the question, and search with it.
The factual accuracy of the hypothetical answer does not matter; only its shape and vocabulary do, because you discard it after embedding. HyDE shines on semantic queries with a clear answer vocabulary and struggles on queries the small model knows nothing about (it can hallucinate a misleading probe). PRF needs no model but trusts the first pass; HyDE needs a model but does not. Many systems run both and fuse, which is the next step.
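A minimal sketch of the HyDE probe might look like the following. The `generate` and `embed` callables are placeholders for whatever small model and embedding model you use, and the prompt wording is an assumption for illustration, not a prescribed template.

```python
from typing import Callable, Sequence

Vector = Sequence[float]

def hyde_probe(
    query: str,
    generate: Callable[[str], str],   # placeholder for a small LLM call
    embed: Callable[[str], Vector],   # placeholder for your embedding model
) -> Vector:
    """Embed a hypothetical answer to the query instead of the query itself."""
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {query}\n"
        "Passage:"
    )
    hypothetical = generate(prompt)
    if not hypothetical.strip():
        # If the model returns nothing usable, fall back to the literal query.
        return embed(query)
    # The generated text is discarded after this point; only its vector is used.
    return embed(hypothetical)
```

The returned vector goes to the dense index exactly as a normal query vector would, so HyDE slots in without touching the retriever.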
Step 4: Fan Out and Fuse With Reciprocal Rank Fusion
A single phrasing of a query is one sample of a noisy distribution. Generate several paraphrases, or several decomposed sub-queries, run them in parallel, and merge the ranked lists. This multi-query fan-out is robust precisely because the variations disagree: an item that ranks well across several phrasings is more likely truly relevant than one that spikes on a single lucky phrasing.
The merge step needs a method that does not depend on each list's raw scores being comparable, because a lexical list and a dense list produce scores on totally different scales. Reciprocal Rank Fusion (RRF) solves this by throwing away the scores and using only the rank position:
    rrf_score(d) = sum over each result list L of
                   1 / (k + rank of d in L)

    k is a small constant, commonly 60, that dampens the
    influence of the very top ranks.
Each list contributes a vote that shrinks as the document's rank in that list gets worse. A document ranked first in three of five lists beats one ranked first in one list and absent from the rest. RRF is parameter-light and scale-free, and it is exactly the same machinery that fuses a dense list with a lexical (BM25) list in hybrid search, which is why it doubles as the merge step for multi-query fan-out and for cross-modality fusion. The same fusion that combines paraphrases combines modalities.
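The formula translates almost directly into code. The sketch below assumes each result list is an ordered list of document IDs, best first, with ranks starting at 1; the function name `rrf_fuse` is chosen for the example.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge ranked lists of document IDs with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        # Ranks are 1-based; a document absent from a list simply earns no vote.
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, fusing `[["a", "b", "c"], ["b", "a"], ["b", "d"]]` puts `b` first, because it appears in all three lists, ahead of `a`, which appears in two. Raw scores never enter the calculation, so the lists can come from BM25, a dense index, or different paraphrases without any normalization.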
Step 5: Route to the Right Modality
For a multimodal corpus, the most consequential transform is deciding which index the query should hit. A query about spoken claims belongs on the transcript index; a query about visual style belongs on the image-embedding index; a query about a price on screen belongs on the OCR index. Sending every query to every index is wasteful and noisy; sending it to the wrong one returns confident garbage.
Modality routing can be a classifier ("this query is about audio content") or, for compound queries, a decomposition that sends each sub-query to its natural modality and then fuses with RRF:
    compound query: "the demo where the bottle is shown and the
                     narrator says free returns"

    sub-query A -> visual index    : "bottle shown in demo"
    sub-query B -> transcript index: "narrator says free returns"

    fuse the two ranked lists with RRF; the clip that ranks
    well on BOTH rises to the top.
This is the agent-perception version of query transformation: the transform does not just clean the words, it picks the sense organ. A clip that satisfies both conditions appears in both lists and wins the fusion; a clip that only shows the bottle, or only mentions returns, ranks lower because it only earns one set of votes.
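The routing-plus-fusion step can be sketched as follows, assuming the decomposition has already produced `(modality, text)` pairs. The `indexes` mapping and its searcher callables are hypothetical stand-ins for per-modality retrievers; the fusion is plain RRF with the common `k = 60`.

```python
from collections import defaultdict
from typing import Callable

Searcher = Callable[[str], list[str]]  # query text -> ranked clip IDs, best first

def search_compound(
    sub_queries: list[tuple[str, str]],   # (modality, text) pairs from decomposition
    indexes: dict[str, Searcher],         # hypothetical per-modality searchers
    k: int = 60,
) -> list[str]:
    """Send each sub-query to its own index, then fuse the ranked lists with RRF."""
    scores: dict[str, float] = defaultdict(float)
    for modality, text in sub_queries:
        if modality not in indexes:
            raise KeyError(f"no index registered for modality {modality!r}")
        for rank, clip_id in enumerate(indexes[modality](text), start=1):
            scores[clip_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Failing loudly on an unknown modality is a deliberate choice in this sketch: silently skipping a sub-query would drop one of the conditions and return clips that satisfy only part of the request.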
Choosing Transforms
| Query type | Best transform | Why |
| --- | --- | --- |
| Navigational / filter | Route to metadata filter, no vector search | Exact constraints, vector search only adds noise |
| Lookup / exact | Light lexical, minimal rewrite | Precision matters, expansion hurts |
| Semantic | HyDE or expansion | Vocabulary gap is the bottleneck |
| Compound / multimodal | Decompose, route per sub-query, RRF | One vector cannot satisfy several conditions |
| Reasoning trace | Intent extraction first, then the above | Search need is buried in context |
Evaluating the Transform Itself
The trap is to evaluate only end-to-end answer quality, which hides where the win or loss came from. Isolate the query stage and measure it directly: candidate-stage recall and added latency, broken out by query type.
Walk each transform from off to on and keep only the ones where candidate recall improves more than latency degrades for the query types they target.
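Candidate recall per query type is simple to compute once you have labeled relevant items for a set of queries. The sketch below assumes each run is a dict with `qtype`, `retrieved`, and `relevant` keys; that record shape is an assumption for the example, not a standard format.

```python
from collections import defaultdict

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k candidates."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def recall_by_type(runs: list[dict], k: int) -> dict[str, float]:
    """Average candidate recall@k for each query type."""
    per_type: dict[str, list[float]] = defaultdict(list)
    for run in runs:
        per_type[run["qtype"]].append(recall_at_k(run["retrieved"], run["relevant"], k))
    return {qtype: sum(vals) / len(vals) for qtype, vals in per_type.items()}
```

Run it once with a transform disabled and once with it enabled on the same query set, then compare the two dicts type by type alongside the latency you measured for each configuration.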
Doing This in Mixpeek
In Mixpeek, query transformation lives in front of the retriever stages. You route by classifying the query, optionally rewrite it or fan it out, and let the retriever fuse the results. That includes a hybrid dense-plus-lexical (BM25) first stage whose lists are merged with RRF, the same fusion that merges your paraphrases or your per-modality sub-queries.
    from mixpeek import Mixpeek

    client = Mixpeek(api_key="mxp_sk_...")

    # A retriever whose first stage already fuses dense + lexical with RRF.
    retriever = client.retrievers.create(
        namespace_id="ns_video",
        collection_ids=["col_transcripts", "col_keyframes"],
        retriever_name="agent-search-fused",
        stages=[
            # Hybrid first stage: dense + BM25, merged by reciprocal rank fusion.
            {
                "stage_name": "hybrid_search",
                "parameters": {"fusion": "rrf", "rrf_k": 60, "top_k": 200},
            },
            # Second stage: rerank the 200 fused candidates down to 20.
            {"stage_name": "rerank", "parameters": {"top_k": 20}},
        ],
    )

    # Application-side query transformation: classify, then route + fan out.
    # classify, to_filter, and expand_or_decompose are helpers you supply;
    # they are not defined in this snippet.
    def transform_and_search(raw_query: str):
        qtype = classify(raw_query)  # your small classifier
        if qtype == "navigational":
            # Skip vector search; push constraints into a filter instead.
            return client.retrievers.execute(
                retriever_id=retriever.retriever_id,
                inputs={"filters": to_filter(raw_query)},
                top_k=20,
            )
        # Semantic / compound: fan out into paraphrases or sub-queries,
        # run each, and let the retriever's RRF do the merge.
        queries = expand_or_decompose(raw_query)  # HyDE, paraphrase, or split
        return client.retrievers.execute(
            retriever_id=retriever.retriever_id,
            inputs={"text": queries},  # multiple probes, one fused result
            top_k=20,
        )
Treat the classifier, the expansion strategy, and the fan-out width as versioned config, because each one changes the candidate pool an agent reasons over. Measure candidate-stage recall per query type before and after a change. For the dense half of the hybrid stage, use the Models page to pick the embedding that matches the modality each sub-query is routed to.