Multi-Stage Retrieval: How AI Agents Search Unstructured Data at Scale

Why Single-Stage Search Fails

The simplest search pipeline is one stage: encode a query, find the nearest vectors, return results. This works for demos and prototypes. It falls apart in production for three reasons:

Recall-precision tradeoff. A single embedding model cannot simultaneously optimize for broad recall (finding everything relevant) and precise ranking (putting the best results first). Models trained for recall produce embeddings that cluster related items loosely: great for not missing anything, terrible for ranking. Models trained for precision produce tight clusters: great for ranking, terrible if the query is even slightly different from the indexed text.

Modality mismatch. An agent searching a media library might need to find "the scene where the CEO discusses Q3 revenue." This query spans multiple modalities: video frames (the CEO's face), audio (their speech), and text (the transcript mentioning "Q3 revenue"). No single embedding model captures all three.

Heterogeneous quality signals. Relevance is not one-dimensional. A video clip might match visually (correct scene), match textually (correct transcript), but be from the wrong time period. Or match the transcript perfectly but come from a different speaker. A single similarity score cannot express these independent quality dimensions.

Multi-stage retrieval solves all three by decomposing search into a pipeline where each stage handles one concern.

The Retrieval Funnel

Multi-stage retrieval follows a funnel pattern: each stage reduces the candidate set while increasing the quality of ranking.

Stage 1: Candidate Generation (broad recall)
  Input:  full corpus (millions of items)
  Output: top-1000 candidates
  Method: ANN vector search, BM25, or both

Stage 2: Feature Filtering (structured constraints)
  Input:  1000 candidates
  Output: 200-500 candidates
  Method: metadata filters, attribute matching

Stage 3: Cross-Encoder Reranking (precise scoring)
  Input:  200-500 candidates
  Output: top-50 candidates
  Method: cross-encoder that scores (query, document) pairs

Stage 4: LLM Validation (semantic verification)
  Input:  top-50 candidates
  Output: top-10 results
  Method: LLM judges relevance with full context

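Each stage is roughly 10-100x more expensive per item than the previous one, but processes far fewer items, which keeps the total cost bounded. Stage 1 touches the entire corpus at negligible per-item cost. In practice the expensive late stages (cross-encoder and LLM) still dominate end-to-end latency, even though they see only a few hundred candidates.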

Stage 1: Candidate Generation

The first stage's only job is recall: do not miss any potentially relevant item. Precision does not matter here because later stages will fix the ranking. This is the most important constraint in the pipeline: if a relevant item is not retrieved in Stage 1, no later stage can recover it.

Dense Retrieval (Vector Search)

Encode the query with the same embedding model used to index the corpus. Find the top-K nearest neighbors using approximate nearest neighbor (ANN) search.

query_embedding = model.encode(query_text)
candidates = vector_index.search(query_embedding, top_k=1000)

Strengths: Captures semantic similarity ("automobile" matches "car"), works across modalities if the embedding model supports it (CLIP for text-to-image, CLAP for text-to-audio).

Weaknesses: Misses exact keyword matches. If someone searches for "patent #US12345678," a semantic embedding might not retrieve it because the model has no special representation for patent numbers.
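As a point of reference, the exact search that ANN approximates can be sketched in a few lines. This is a minimal brute-force version over a toy in-memory index. The three-dimensional vectors and the names cosine_similarity and exact_top_k are illustrative assumptions, and a real system would use an ANN library instead of scanning every vector.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def exact_top_k(query_vec, index, k):
    """Score every vector in the index and return the k most similar IDs."""
    scored = ((cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in index.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]

# Toy embeddings; in practice these come from the embedding model
index = {
    "car_review": [0.9, 0.1, 0.0],
    "auto_manual": [0.8, 0.2, 0.1],
    "recipe": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]  # stand-in for model.encode("automobile")
top = exact_top_k(query_vec, index, k=2)
```

ANN indexes trade a small amount of this exactness for sub-linear search time, which is why Stage 1 recall should be measured rather than assumed.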

Sparse Retrieval (BM25 / Keyword Search)

Classic term-frequency search. No embeddings: just count how often query terms appear in documents, weighted by inverse document frequency.

Strengths: Well suited to exact matches, entity names, product IDs, and technical terms. If a query term appears verbatim in a document (and survives tokenization unchanged), BM25 will match it.

Weaknesses: No semantic understanding. "automobile" does not match "car."
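The scoring described above can be sketched as a small BM25 function. This is a simplified version that assumes whitespace tokenization and the common defaults k1=1.5 and b=0.75; production engines add stemming, stop words, and an inverted index. The function name and sample documents are illustrative.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # term absent: contributes nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "the car would not start",
    "automobile repair manual",
    "patent US12345678 filing",
]
car_scores = bm25_scores("car", docs)
patent_scores = bm25_scores("US12345678", docs)
```

The query "car" scores zero against the "automobile" document, while the patent number matches only the document that contains it, which mirrors the strengths and weaknesses listed above.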

Hybrid Retrieval

Run both dense and sparse retrieval in parallel, then merge the results. This is the production standard for Stage 1.

dense_results = vector_index.search(query_embedding, top_k=1000)
sparse_results = bm25_index.search(query_text, top_k=1000)

# Merge with reciprocal rank fusion
candidates = reciprocal_rank_fusion([dense_results, sparse_results], k=60)

Hybrid retrieval consistently outperforms either method alone because they have complementary failure modes.

Reciprocal Rank Fusion (RRF)

RRF is the standard algorithm for merging ranked lists from different retrieval methods. It is simple, effective, and requires no training.

For each document d that appears in any of the ranked lists, compute:

RRF_score(d) = Σ_r  1 / (k + rank_r(d))

Where:

The sum is over all ranked lists r where d appears

rank_r(d) is the rank of d in list r (1-indexed)

k is a constant (typically 60) that dampens the influence of high-ranked items

Why RRF works: It converts absolute scores (which are not comparable across models) into rank-based scores (which are). A document ranked #1 by BM25 gets 1/(60+1) = 0.0164. The same document ranked #5 by vector search gets 1/(60+5) = 0.0154. Its RRF score is 0.0318. A document ranked #1 by vector search but not in BM25's top-1000 gets only 0.0164.

Key property: RRF is unsupervised. It requires no relevance labels or training data, which makes it the default choice for combining retrieval signals in new domains where you have no labeled data yet.
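The formula above translates almost directly into code. The following is a minimal sketch that assumes each ranked list is a plain list of document IDs, best first; the search results in the earlier snippets may be richer objects, in which case you would extract their IDs before fusing. The document IDs are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs; returns (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):  # ranks are 1-indexed
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID so output is deterministic
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# doc_a is ranked #1 by BM25 and #5 by vector search, as in the worked example
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_d", "doc_e", "doc_f", "doc_g", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Here doc_a ends up first with a fused score of about 0.0318, ahead of doc_d, which is ranked #1 by vector search but appears in only one list.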

Alternatives to RRF

Linear combination: score = α * dense_score + (1-α) * sparse_score. Requires normalizing scores to the same range and tuning α. Can outperform RRF when you have validation data to optimize α.

Learned fusion: A small neural network that takes multiple retrieval scores as input and produces a final relevance score. Requires training data. Outperforms RRF when you have sufficient labeled queries, but overfits when you do not.
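For comparison with RRF, the linear combination can be sketched as follows. This version assumes min-max normalization, which is only one of several reasonable choices, and treats a document missing from one list as scoring 0 there. The function names, scores, and α value are illustrative.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def linear_fusion(dense_scores, sparse_scores, alpha=0.5):
    """score = alpha * dense + (1 - alpha) * sparse, on normalized scores."""
    dense = min_max_normalize(dense_scores)
    sparse = min_max_normalize(sparse_scores)
    fused = {
        doc_id: alpha * dense.get(doc_id, 0.0) + (1 - alpha) * sparse.get(doc_id, 0.0)
        for doc_id in set(dense) | set(sparse)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Cosine similarities and BM25 scores live on different scales
dense_scores = {"doc_a": 0.82, "doc_b": 0.74, "doc_c": 0.60}
sparse_scores = {"doc_a": 12.0, "doc_c": 18.0, "doc_d": 6.0}
ranked = linear_fusion(dense_scores, sparse_scores, alpha=0.7)
```

Unlike RRF, the result depends on both the normalization scheme and α, which is why this approach pays off mainly when you have validation queries to tune against.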

Stage 2: Feature Filtering

After candidate generation, apply structured filters to remove items that cannot possibly be relevant. This stage is cheap (metadata lookup, not model inference) and can dramatically reduce the candidate set.

Common Filters for Multimodal Data

Temporal filters: Only return results from the last 30 days, or from a specific date range.

Speaker filters: For audio/video, only return segments where a specific speaker is talking. This requires speaker diarization during indexing.

Object presence filters: Only return frames that contain a specific detected object class. This requires object detection during indexing.

Modality filters: Only return video results, or only audio results, depending on the query.

Confidence filters: Only return items where the extraction confidence exceeded a threshold. If OCR extracted text with 40% confidence, it might be noise.

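from datetime import datetime, timedelta, timezone

# Cutoff for "the last week"; assumes created_at is a timezone-aware datetime
seven_days_ago = datetime.now(timezone.utc) - timedelta(days=7)

# Filter to video segments from the last week with the CEO speaking
filtered = [
    c for c in candidates
    if c.metadata["created_at"] > seven_days_ago
    and "CEO" in c.metadata.get("speakers", [])
    and c.metadata["media_type"] == "video"
]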

Why Not Filter First?

A common question: why not apply filters before vector search to reduce the index size?

Pre-filtering (filter then search) is often faster, but with an ANN index it can produce worse results. The index is built over the full corpus, so the filtered subset may have different nearest-neighbor properties than the full index, and restrictive filters can degrade recall. You also lose visibility into items that are semantically relevant but fail the filter.

Post-filtering (search then filter) guarantees you find the most semantically relevant items first, then apply hard constraints. The cost is that some of your top-K candidates will be filtered out, reducing the effective candidate set. The standard workaround is to over-retrieve in Stage 1 (top-2000 instead of top-1000) to compensate for filter attrition.
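One way to handle filter attrition without guessing a fixed over-retrieval factor is to widen the search only when too few candidates survive. The sketch below assumes a hypothetical search_fn(k) that returns the top-k candidates and a predicate that applies the metadata filter; the doubling policy and the limits are arbitrary illustration values.

```python
def retrieve_with_post_filter(search_fn, predicate, target=1000, initial_k=2000, max_k=16000):
    """Search, then filter; double k if too few candidates survive the filter."""
    k = initial_k
    while True:
        candidates = search_fn(k)
        kept = [c for c in candidates if predicate(c)]
        # Stop when enough survive, the corpus is exhausted, or the budget is hit
        if len(kept) >= target or len(candidates) < k or k >= max_k:
            return kept[:target]
        k = min(k * 2, max_k)

# Toy corpus: items are integers already in relevance order; 10% pass the filter
corpus = list(range(10_000))
calls = []

def fake_search(k):
    calls.append(k)
    return corpus[:k]

result = retrieve_with_post_filter(fake_search, lambda item: item % 10 == 0, target=300)
```

In this toy run the first search (k=2000) leaves only 200 survivors, so the function retries with k=4000 and returns 300 items. Each retry costs another Stage 1 query, so a well-chosen initial_k still matters.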

Stage 3: Cross-Encoder Reranking

Dense retrieval in Stage 1 uses a bi-encoder: the query and document are encoded independently, then compared with a dot product or cosine similarity. This is fast (pre-compute document embeddings, only encode the query at search time) but limited: the model cannot compare the query and document at the token level.

A cross-encoder takes the query and document as a single input and produces a relevance score. The query and document tokens attend to each other through every transformer layer, enabling much richer comparison.

Bi-encoder:    encode(query) · encode(document) → score
Cross-encoder: encode(query + document) → score

Cross-encoders are 100-1000x slower than bi-encoders because they cannot pre-compute document representations: every (query, document) pair requires a full forward pass. But they are dramatically more accurate, especially for nuanced relevance judgments.

When Reranking Helps Most

Reranking provides the largest improvement when:

1. The query is ambiguous. "Apple" could mean the fruit, the company, or the record label. A bi-encoder returns results for all interpretations. A cross-encoder, seeing the full query context, can disambiguate.

2. Relevance requires reasoning. "Videos where someone demonstrates the product but does not mention the price" requires understanding negation, which bi-encoders handle poorly.

3. The domain is specialized. In legal or medical search, relevance depends on specific terminology and context that general-purpose embeddings may not capture. A cross-encoder fine-tuned on domain data outperforms a general bi-encoder.

Multimodal Reranking

For multimodal pipelines, the reranker must handle mixed-modality inputs. Recent models like Qwen3-VL-Reranker accept (text query, image/video document) pairs and score them directly. This is a significant upgrade over text-only reranking, which requires converting visual content to text (via captioning) before scoring.

# Text-only reranking (loses visual information)
for candidate in candidates:
    text_repr = candidate.caption + " " + candidate.transcript
    score = text_reranker.score(query, text_repr)

# Multimodal reranking (preserves visual information)
for candidate in candidates:
    score = mm_reranker.score(
        query=query_text,
        image=candidate.thumbnail,
        text=candidate.transcript
    )

Stage 4: LLM Validation

The final stage uses a large language model to verify relevance with full context awareness. This is the most expensive stage but processes only the top 20-50 candidates.

LLM-as-Judge

The LLM reads the query and the candidate's full context (transcript, caption, metadata) and produces a structured relevance judgment:

prompt = f"""Given this search query: "{query}"

And this candidate result:
- Type: {candidate.media_type}
- Transcript: {candidate.transcript[:500]}
- Scene description: {candidate.caption}
- Detected objects: {candidate.objects}
- Speaker: {candidate.speaker}

Rate the relevance on a scale of 1-5:
1 = completely irrelevant
2 = tangentially related
3 = somewhat relevant
4 = highly relevant
5 = perfect match

Also explain in one sentence why."""

judgment = llm.generate(prompt)
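The free-text judgment still has to be turned into something the pipeline can act on. A minimal parsing sketch follows; it assumes the model's reply puts the rating before the explanation, which the prompt above does not strictly guarantee, so unparseable replies are returned as None. The Judgment type and parse_judgment name are illustrative.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Judgment:
    rating: int   # 1 (irrelevant) to 5 (perfect match)
    reason: str   # the model's one-sentence explanation

def parse_judgment(text: str) -> Optional[Judgment]:
    """Take the first standalone digit 1-5 as the rating and the rest as the reason."""
    match = re.search(r"\b([1-5])\b", text)
    if match is None:
        return None  # caller decides: retry the LLM call or drop the candidate
    reason = text[match.end():].strip().lstrip(".:-) ").strip()
    return Judgment(rating=int(match.group(1)), reason=reason)

parsed = parse_judgment("Rating: 4\nThe transcript discusses Q3 revenue.")
keep = parsed is not None and parsed.rating >= 4  # threshold is an assumption
```

Asking the model for a fixed output format (for example JSON) makes this step more robust than pattern matching on prose.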

When to Use LLM Validation

LLM validation is valuable when:

False positives are costly. In compliance search ("find all instances of insider trading discussion"), a false positive triggers unnecessary legal review. LLM validation catches subtle non-matches that embedding similarity misses.

The query requires reasoning. "Find meetings where we agreed to a timeline but did not assign an owner" requires understanding commitment speech acts and their absence: well beyond what embedding similarity can capture.

The user expects explanations. The LLM can explain why each result is relevant, which builds trust in the search system.

Self-RAG: The Agent Decides

In Self-RAG (Self-Reflective Retrieval-Augmented Generation), the agent itself decides whether to retrieve, what to retrieve, and whether the retrieved results are useful. The agent:

1. Evaluates whether retrieval is needed for the current query

2. If yes, generates a retrieval query (which may differ from the user's original question)

3. Evaluates each retrieved result for relevance

4. Decides whether to use the results or try a different retrieval strategy

This turns the retrieval pipeline from a fixed sequence into an adaptive loop controlled by the agent's reasoning.
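The control flow of that loop can be sketched with plain callables. This only illustrates the four steps listed above: the decision functions (needs_retrieval, rewrite_query, is_relevant) are hypothetical stand-ins for calls to the agent's model, and the retry limit is an arbitrary choice.

```python
def adaptive_retrieve(question, needs_retrieval, rewrite_query, retrieve, is_relevant,
                      max_attempts=3):
    """Retrieve only if needed, and retry with a new query when nothing useful comes back."""
    if not needs_retrieval(question):           # step 1: is retrieval needed at all?
        return []
    for attempt in range(max_attempts):
        query = rewrite_query(question, attempt)    # step 2: generate a retrieval query
        relevant = [r for r in retrieve(query) if is_relevant(question, r)]  # step 3
        if relevant:                            # step 4: use the results or try again
            return relevant
    return []

# Stubs standing in for the agent's model calls and the search backend
corpus = {"memory leak": ["clip_12"], "heap growth": ["clip_40", "clip_41"]}
queries_tried = []

def retrieve(query):
    queries_tried.append(query)
    return corpus.get(query, [])

results = adaptive_retrieve(
    "why does the service run out of memory?",
    needs_retrieval=lambda q: True,
    rewrite_query=lambda q, attempt: ["oom crash", "heap growth"][attempt],
    retrieve=retrieve,
    is_relevant=lambda q, r: r.startswith("clip_"),
    max_attempts=2,
)
```

In this run the first rewritten query finds nothing, so the loop tries a second formulation and returns its results.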

Composing Multimodal Pipelines

In a multimodal corpus, different features require different retrieval strategies. The composition pattern: run specialized retrieval for each modality, then fuse the results.

Example: Video Archive Search

Query: "the part where the engineer explains the memory leak"

Pipeline:
  ┌─ Visual: CLIP search on frame embeddings         → top-200
  ├─ Audio:  CLAP search on audio embeddings          → top-200
  ├─ Text:   BM25 + BGE search on transcripts         → top-200
  └─ Filter: speaker_role = "engineer"                → apply to all

  → RRF fusion across visual + audio + text results   → top-100
  → Cross-encoder rerank on (query, transcript+frame) → top-20
  → LLM validation with full context                  → top-5

Each modality contributes unique signal:

Visual finds frames showing someone at a whiteboard or screen

Audio finds segments with technical explanation tone

Text finds transcript mentions of "memory leak"

Filter restricts to segments where the speaker has an engineer role

RRF fusion gives the highest score to segments that rank well across all three modalities. A segment where someone says "memory leak" while pointing at a code snippet will outrank a segment that only mentions it in passing.

Cascading vs. Parallel Composition

Parallel composition (shown above) runs all retrieval stages independently, then fuses. This maximizes recall because each stage can find results the others miss.

Cascading composition runs stages sequentially: the output of one stage becomes the input to the next. This is more efficient but risks losing results if an early stage has low recall for certain query types.

Parallel:   A ──┐
            B ──┼── Fusion → Rerank → Results
            C ──┘

Cascading:  A → B → C → Rerank → Results

Rule of thumb: Use parallel composition when your retrieval stages search different modalities or use different algorithms. Use cascading when each stage progressively refines the same candidate set.
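The two composition styles can be sketched side by side. The version below assumes each retriever is a callable that takes a query and returns a ranked list of segment IDs, and uses a thread pool for the parallel case; the stub retrievers and the function names are illustrative only.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def rrf_ids(ranked_lists, k=60):
    """Reciprocal rank fusion over lists of IDs; returns IDs, best first."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def parallel_compose(retrievers, query, top_k=100):
    """Run every retriever on the same query concurrently, then fuse."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        ranked_lists = list(pool.map(lambda retrieve: retrieve(query), retrievers))
    return rrf_ids(ranked_lists)[:top_k]

def cascade_compose(stages, candidates):
    """Feed each stage's output into the next stage."""
    for stage in stages:
        candidates = stage(candidates)
    return candidates

# Stub retrievers standing in for visual, audio, and text search
visual = lambda query: ["seg_3", "seg_1", "seg_7"]
audio = lambda query: ["seg_1", "seg_9"]
text = lambda query: ["seg_1", "seg_3"]
fused = parallel_compose([visual, audio, text], "memory leak")

# Stub cascade: keep even IDs, then keep the two largest
stages = [lambda c: [x for x in c if x % 2 == 0], lambda c: sorted(c, reverse=True)[:2]]
refined = cascade_compose(stages, list(range(10)))
```

In the parallel case seg_1 wins because all three retrievers return it, even though only two rank it first. In the cascade, anything the first stage drops is gone for good, which is the recall risk described above.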

Latency Budgets and Practical Tradeoffs

Production search systems have latency constraints: typically 200-500ms for interactive search, 2-5 seconds for batch or agent-driven search.

Typical Stage Latencies

Stage	Items Processed	Per-Item Cost	Total Latency
ANN vector search	10M → 1000	<0.001ms	5-20ms
BM25 keyword search	10M → 1000	<0.001ms	10-30ms
RRF fusion	2000 → 1000	<0.01ms	1-5ms
Metadata filtering	1000 → 300	<0.01ms	1-5ms
Cross-encoder rerank	300 → 50	5-15ms	1.5-4.5s
LLM validation	50 → 10	200-500ms	2-5s
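A quick arithmetic check shows how the late-stage totals relate to the per-item costs. The sketch below multiplies item counts by per-item cost, which models strictly sequential processing. The concurrency level of 5 for the LLM stage is an inference from the table's numbers, not something stated in it.

```python
def sequential_latency_ms(items, per_item_ms_range):
    """Total latency range if items are scored one at a time."""
    low, high = per_item_ms_range
    return items * low, items * high

rerank_ms = sequential_latency_ms(300, (5, 15))   # matches the table's 1.5-4.5s
llm_ms = sequential_latency_ms(50, (200, 500))    # 10-25s if fully sequential

# Assumption: about 5 LLM requests in flight at once reproduces the table's 2-5s
concurrency = 5
llm_concurrent_ms = tuple(ms / concurrency for ms in llm_ms)
```

The reranking row is consistent with one-at-a-time scoring, while the LLM row only works out if calls overlap, so the achievable latency for Stage 4 depends heavily on how many requests you can run in parallel.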

The bottleneck is almost always the cross-encoder and LLM stages. Three optimizations:

1. Reduce candidates aggressively. Going from 300 to 100 cross-encoder inputs cuts reranking time by about 67%.

2. Batch processing. Cross-encoders and LLMs process batches efficiently on GPUs. Scoring 50 pairs in one batch is much faster than 50 individual calls.

3. Early termination. If the top result has a score far above the second result, skip remaining validations.
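Batching and early termination can be sketched independently of any particular model. The code below assumes a hypothetical score_batch(query, docs) callable that scores a whole batch in one model call; the toy scorer, batch size, and margin are placeholders chosen for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def rerank_in_batches(query, candidates, score_batch, batch_size=32, top_k=50):
    """Score candidates batch by batch and return the top_k (doc, score) pairs."""
    scored = []
    for batch in batched(candidates, batch_size):
        scores = score_batch(query, batch)  # one model call per batch
        scored.extend(zip(batch, scores))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def should_stop_early(scored, margin=0.3):
    """True if the best score leads the runner-up by at least `margin`."""
    return len(scored) >= 2 and scored[0][1] - scored[1][1] >= margin

# Toy scorer: fraction of query terms present in the document
batch_sizes = []

def toy_score_batch(query, batch):
    batch_sizes.append(len(batch))
    terms = set(query.lower().split())
    return [len(terms & set(doc.lower().split())) / len(terms) for doc in batch]

docs = ["memory leak in the cache layer"] + [f"unrelated note {i}" for i in range(69)]
top = rerank_in_batches("memory leak", docs, toy_score_batch, batch_size=32, top_k=5)
skip_validation = should_stop_early(top)
```

Seventy candidates are scored in three calls instead of seventy, and because the top result leads by a wide margin, the pipeline could skip LLM validation for this query. The right margin depends on the reranker's score distribution and should be tuned on real queries.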

When to Skip Stages

Not every query needs every stage:

High-confidence queries: If Stage 1 returns a result with similarity > 0.95, skip reranking: the embedding model is confident.

Exact match queries: If the query is a specific ID or exact phrase, BM25 alone suffices; skip vector search entirely.

Simple filters: If the query is just "all videos from last Tuesday," skip all retrieval stages and use a metadata query.

An agent-driven pipeline can dynamically choose which stages to run based on query analysis.
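A rule-based router is the simplest version of this idea. The sketch below covers the three cases listed above using hypothetical heuristics: the ID pattern, the metadata-only pattern, and the stage names are all assumptions, and an agent-driven system would replace the regexes with model-based query analysis.

```python
import re

# Hypothetical heuristics; real routing rules depend on your corpus and ID formats
ID_PATTERN = re.compile(r"\b[A-Z]{2}\d{6,}\b")  # e.g. "US12345678"
METADATA_ONLY = re.compile(r"^all (videos|audio|images) from\b", re.IGNORECASE)
FULL_PIPELINE = ["dense", "bm25", "filter", "rerank", "llm_validation"]

def choose_stages(query):
    """Pick which stages to run based on the shape of the query."""
    query = query.strip()
    is_quoted = len(query) >= 2 and query[0] == query[-1] == '"'
    if ID_PATTERN.search(query) or is_quoted:
        return ["bm25"]                 # exact match: keyword search alone
    if METADATA_ONLY.match(query):
        return ["metadata_query"]       # simple filter: no retrieval needed
    return list(FULL_PIPELINE)

def drop_rerank_if_confident(stages, top_similarity, threshold=0.95):
    """Skip reranking when Stage 1's best match is already very close."""
    if top_similarity > threshold:
        return [stage for stage in stages if stage != "rerank"]
    return stages

id_plan = choose_stages("patent #US12345678")
filter_plan = choose_stages("all videos from last Tuesday")
semantic_plan = choose_stages("discussions about our patent strategy")
```

The similarity threshold is model-specific, since different embedding models produce differently distributed scores, so 0.95 should be treated as a starting point rather than a constant.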

Evaluation: Measuring Pipeline Quality

Stage-Level Metrics

Each stage should be evaluated independently:

Stage 1 (recall): Recall@K: what fraction of relevant items are in the top-K candidates? Aim for Recall@1000 > 0.95.

Stage 2 (filtering): Precision after filtering: what fraction of remaining candidates are relevant? Also measure filter attrition: if filtering removes 90% of candidates, your top-K from Stage 1 might be too small.

Stage 3 (reranking): NDCG@20: are the best results ranked highest? Compare NDCG before and after reranking to measure the reranker's contribution.

Stage 4 (validation): Precision@10: in the final result set, what fraction are truly relevant? This is the metric users see.
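The set-based stage metrics are short enough to implement directly. This sketch assumes binary relevance labels given as a set of relevant IDs per query, and uses the conventional definition of Precision@K that divides by K even when fewer than K results are returned.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k slots filled with relevant items."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def filter_attrition(count_before, count_after):
    """Fraction of candidates removed by the filtering stage."""
    return 1 - count_after / count_before

ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
stage1_recall = recall_at_k(ranked, relevant, k=5)        # "f" was never retrieved
final_precision = precision_at_k(ranked, relevant, k=5)
attrition = filter_attrition(1000, 300)
```

Here recall is 2/3 because one relevant item never entered the candidate set, which is exactly the kind of loss that no later stage can repair.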

End-to-End Metrics

Mean Reciprocal Rank (MRR): Average of 1/rank of the first relevant result. For navigational queries ("find the specific meeting about X"), MRR is the most important metric.

NDCG@K: Normalized discounted cumulative gain. For exploratory queries ("find all discussions about memory management"), NDCG captures the quality of the entire ranked list, not just the top result.

Latency at P95: The 95th percentile latency. Multi-stage pipelines have high variance: some queries hit every stage, others terminate early. P95 captures the slow tail of the user experience, which an average would hide.
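The three end-to-end metrics can be sketched as follows. The NDCG version assumes graded relevance supplied as a {doc_id: gain} mapping with linear gains, and the percentile uses the nearest-rank method; both are common conventions but not the only ones, so results may differ slightly from a given evaluation library.

```python
import math

def mean_reciprocal_rank(rankings, relevant_sets):
    """Average of 1/rank of the first relevant result, over all queries."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(rankings)

def ndcg_at_k(ranked_ids, gains, k):
    """gains maps doc_id -> graded relevance; missing IDs count as 0."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))
    actual = dcg([gains.get(doc_id, 0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

mrr = mean_reciprocal_rank([["a", "b", "c"], ["x", "y", "z"]], [{"b"}, {"x"}])
gains = {"a": 1, "b": 3, "c": 0}
ndcg_actual = ndcg_at_k(["a", "b", "c"], gains, k=3)
ndcg_ideal = ndcg_at_k(["b", "a", "c"], gains, k=3)
p95_ms = percentile(list(range(1, 101)), 95)
```

The first relevant result sits at rank 2 for one query and rank 1 for the other, giving an MRR of 0.75. Swapping "a" and "b" into the ideal order raises NDCG to 1.0.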

Common Pitfalls

Optimizing stages independently. Improving Stage 1 recall might hurt Stage 3 reranking by flooding it with marginally relevant candidates. Always measure end-to-end metrics after changing any stage.

Over-relying on the LLM stage. If your LLM validation is doing most of the ranking work, your earlier stages are not doing their job. The pipeline should converge on good results progressively: each stage should improve the ranking measurably.

Using the same model for retrieval and reranking. A bi-encoder trained for broad recall makes a poor reranker. A cross-encoder trained for precision makes a poor candidate generator (too slow). Use different models for different stages.

Not measuring recall at Stage 1. If Stage 1 misses relevant items, no later stage can recover them. Measure recall separately from the full pipeline, and invest in recall improvements first.

Fixed pipelines for all queries. A keyword query ("patent #US12345678") and a semantic query ("discussions about our patent strategy") need completely different pipeline configurations. Agent-driven query analysis that routes to the right pipeline configuration is worth the engineering investment.