Why Single-Stage Search Fails
The simplest search pipeline is one stage: encode a query, find the nearest vectors, return results. This works for demos and prototypes. It falls apart in production for three reasons:
Recall-precision tradeoff. A single embedding model cannot simultaneously optimize for broad recall (finding everything relevant) and precise ranking (putting the best results first). Models trained for recall produce embeddings that cluster related items loosely — great for not missing anything, terrible for ranking. Models trained for precision produce tight clusters — great for ranking, terrible if the query is even slightly different from the indexed text.
Modality mismatch. An agent searching a media library might need to find "the scene where the CEO discusses Q3 revenue." This query spans multiple modalities: video frames (the CEO's face), audio (their speech), and text (the transcript mentioning "Q3 revenue"). No single embedding model captures all three.
Heterogeneous quality signals. Relevance is not one-dimensional. A video clip might match visually (correct scene), match textually (correct transcript), but be from the wrong time period. Or match the transcript perfectly but come from a different speaker. A single similarity score cannot express these independent quality dimensions.
Multi-stage retrieval solves all three by decomposing search into a pipeline where each stage handles one concern.
The Retrieval Funnel
Multi-stage retrieval follows a funnel pattern: each stage reduces the candidate set while increasing the quality of ranking.
Stage 1: Candidate Generation (broad recall)
Input: full corpus (millions of items)
Output: top-1000 candidates
Method: ANN vector search, BM25, or both
Stage 2: Feature Filtering (structured constraints)
Input: 1000 candidates
Output: 200-500 candidates
Method: metadata filters, attribute matching
Stage 3: Cross-Encoder Reranking (precise scoring)
Input: 200-500 candidates
Output: top-50 candidates
Method: cross-encoder that scores (query, document) pairs
Stage 4: LLM Validation (semantic verification)
Input: top-50 candidates
Output: top-10 results
Method: LLM judges relevance with full context
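Each stage is 10-100x more expensive per item than the previous one, but processes far fewer items. That keeps the expensive models confined to a short list: Stage 1 does cheap work over the whole corpus, while Stage 4 does expensive work over a few dozen candidates. In practice the later stages still account for most of the latency, which is why the candidate counts passed into them matter so much.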
Stage 1: Candidate Generation
The first stage's only job is recall: do not miss any potentially relevant item. Precision does not matter here because later stages will fix the ranking. Recall at this stage is the ceiling for the whole pipeline: if a relevant item is not retrieved in Stage 1, no later stage can recover it.
Dense Retrieval (Vector Search)
Encode the query with the same embedding model used to index the corpus. Find the top-K nearest neighbors using approximate nearest neighbor (ANN) search.
query_embedding = model.encode(query_text)
candidates = vector_index.search(query_embedding, top_k=1000)
Strengths: Captures semantic similarity ("automobile" matches "car"), works across modalities if the embedding model supports it (CLIP for text-to-image, CLAP for text-to-audio).
Weaknesses: Misses exact keyword matches. If someone searches for "patent #US12345678," a semantic embedding might not retrieve it because the model has no special representation for patent numbers.
Sparse Retrieval (BM25 / Keyword Search)
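Classic term-frequency search with no embeddings: BM25 scores a document by how often the query terms appear in it, weighted by inverse document frequency and normalized for document length.
Strengths: Well suited to exact matches, entity names, product IDs, and technical terms. If a query term appears verbatim in a document, sparse retrieval will match it, provided the tokenizer keeps the term intact.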
Weaknesses: No semantic understanding. "automobile" does not match "car."
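To make the scoring concrete, here is a minimal BM25 sketch in plain Python. It is an illustration rather than a production index: the whitespace `tokenize` helper, the `k1=1.5` and `b=0.75` defaults, and the non-negative IDF variant are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real systems use an analyzer.
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # a term that is absent contributes nothing
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "the car would not start this morning",
    "automobile sales rose in the third quarter",
    "patent #US12345678 covers the battery design",
]
car_scores = bm25_scores("car", docs)
patent_scores = bm25_scores("#US12345678", docs)
```

With this toy corpus, the query "car" scores only the first document (the "automobile" document gets zero), while the patent number is matched exactly in the third.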
Hybrid Retrieval
Run both dense and sparse retrieval in parallel, then merge the results. This is the production standard for Stage 1.
dense_results = vector_index.search(query_embedding, top_k=1000)
sparse_results = bm25_index.search(query_text, top_k=1000)
# Merge with reciprocal rank fusion
candidates = reciprocal_rank_fusion([dense_results, sparse_results], k=60)
Hybrid retrieval consistently outperforms either method alone because they have complementary failure modes.
Reciprocal Rank Fusion (RRF)
RRF is the standard algorithm for merging ranked lists from different retrieval methods. It is simple, effective, and requires no training.
For each document d that appears in any of the ranked lists, compute:
RRF_score(d) = Σ_r 1 / (k + rank_r(d))
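Where:
rank_r(d) is the 1-based position of d in ranked list r; a list that does not contain d contributes nothing to the sum.
k is a smoothing constant (60 in the examples here) that dampens the gap between adjacent top ranks.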
Why RRF works: It converts absolute scores (which are not comparable across models) into rank-based scores (which are). A document ranked #1 by BM25 gets 1/(60+1) = 0.0164. The same document ranked #5 by vector search gets 1/(60+5) = 0.0154. Its RRF score is 0.0318. A document ranked #1 by vector search but not in BM25's top-1000 gets only 0.0164.
Key property: RRF is unsupervised — it requires no relevance labels or training data. This makes it the default choice for combining retrieval signals in new domains where you have no labeled data yet.
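The formula translates almost directly into code. The sketch below assumes each ranked list is a plain list of document IDs in best-first order; the function name matches the call used earlier, but the input shape and the sample IDs are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); absent documents add nothing.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# doc_a is ranked #1 by BM25 and #5 by vector search, as in the worked example.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_d", "doc_e", "doc_f", "doc_g", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Here `doc_a` comes out on top with a fused score of about 0.0318, ahead of `doc_d`, which is ranked #1 by vector search only and scores about 0.0164.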
Alternatives to RRF
Linear combination: `score = α * dense_score + (1-α) * sparse_score`. Requires normalizing scores to the same range and tuning α. Can outperform RRF when you have validation data to optimize α.
Learned fusion: A small neural network that takes multiple retrieval scores as input and produces a final relevance score. Requires training data. Outperforms RRF when you have sufficient labeled queries, but overfits when you do not.
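For comparison, a linear combination can be sketched as follows. Min-max normalization is one common way to bring scores into the same range, not the only one; the `{doc_id: score}` input shape, the sample scores, and `alpha=0.5` are assumptions for illustration.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def linear_fusion(dense_scores, sparse_scores, alpha=0.5):
    """score = alpha * dense + (1 - alpha) * sparse, after normalization."""
    dense = min_max_normalize(dense_scores)
    sparse = min_max_normalize(sparse_scores)
    fused = {
        # A document missing from one list gets 0 for that component.
        doc_id: alpha * dense.get(doc_id, 0.0) + (1 - alpha) * sparse.get(doc_id, 0.0)
        for doc_id in dense.keys() | sparse.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


dense_scores = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}   # cosine similarities
sparse_scores = {"doc_b": 12.0, "doc_c": 7.0, "doc_d": 2.0}    # BM25 scores
ranking = linear_fusion(dense_scores, sparse_scores, alpha=0.5)
```

Unlike RRF, the outcome depends on the raw score distributions and on α, which is why this approach needs validation data to tune.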
Stage 2: Feature Filtering
After candidate generation, apply structured filters to remove items that cannot possibly be relevant. This stage is cheap (metadata lookup, not model inference) and can dramatically reduce the candidate set.
Common Filters for Multimodal Data
Temporal filters: Only return results from the last 30 days, or from a specific date range.
Speaker filters: For audio/video, only return segments where a specific speaker is talking. This requires speaker diarization during indexing.
Object presence filters: Only return frames that contain a specific detected object class. This requires object detection during indexing.
Modality filters: Only return video results, or only audio results, depending on the query.
Confidence filters: Only return items where the extraction confidence exceeded a threshold. If OCR extracted text with 40% confidence, it might be noise.
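from datetime import datetime, timedelta

# Assumes created_at is stored as a naive datetime in the local timezone
seven_days_ago = datetime.now() - timedelta(days=7)

# Filter to video segments from the last week with the CEO speaking
filtered = [
    c for c in candidates
    if c.metadata["created_at"] > seven_days_ago
    and "CEO" in c.metadata.get("speakers", [])
    and c.metadata["media_type"] == "video"
]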
Why Not Filter First?
A common question: why not apply filters before vector search to reduce the index size?
Pre-filtering (filter, then search) is faster but often produces worse results with ANN indexes. An index built over the full corpus can have different nearest-neighbor properties when the search is restricted to a filtered subset, so recall within that subset may drop. Pre-filtering also commits to the hard constraint before any ranking happens, so you never see which semantically relevant items the filter excluded.
Post-filtering (search then filter) guarantees you find the most semantically relevant items first, then apply hard constraints. The cost is that some of your top-K candidates will be filtered out, reducing the effective candidate set. The standard workaround is to over-retrieve in Stage 1 (top-2000 instead of top-1000) to compensate for filter attrition.
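One way to handle filter attrition is to over-retrieve and widen the search only when too few candidates survive. The sketch below assumes a hypothetical `search_fn(top_k)` callable that returns candidates in relevance order; the overfetch factor and retry limit are illustrative defaults, not recommendations.

```python
def search_with_post_filter(search_fn, keep, wanted, overfetch=2.0, max_rounds=3):
    """Retrieve candidates, filter them, and widen the search if too few survive."""
    top_k = int(wanted * overfetch)
    survivors = []
    for _ in range(max_rounds):
        candidates = search_fn(top_k)
        survivors = [c for c in candidates if keep(c)]
        # Stop when enough survive, or when the index has nothing more to return.
        if len(survivors) >= wanted or len(candidates) < top_k:
            break
        top_k *= 2
    return survivors[:wanted]


# Toy stand-in for a vector index: items 0..99, already in relevance order.
corpus = list(range(100))


def fake_search(top_k):
    return corpus[:top_k]


def is_video(item):
    return item % 3 == 0  # pretend one item in three is a video


results = search_with_post_filter(fake_search, is_video, wanted=10)
```

In this toy run the first round (top-20) leaves only 7 survivors, so the search widens to top-40 and returns the 10 best items that pass the filter.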
Stage 3: Cross-Encoder Reranking
Dense retrieval in Stage 1 uses a bi-encoder: the query and document are encoded independently, then compared with a dot product or cosine similarity. This is fast (document embeddings are pre-computed, so only the query is encoded at search time) but limited: the model cannot compare the query and document at the token level.
A cross-encoder takes the query and document as a single input and produces a relevance score. The query and document tokens attend to each other through every transformer layer, enabling much richer comparison.
Bi-encoder: encode(query) · encode(document) → score
Cross-encoder: encode(query + document) → score
Cross-encoders are 100-1000x slower than bi-encoders because they cannot pre-compute document representations — every (query, document) pair requires a full forward pass. But they are dramatically more accurate, especially for nuanced relevance judgments.
When Reranking Helps Most
Reranking provides the largest improvement when:
1. The query is ambiguous. "Apple" could mean the fruit, the company, or the record label. A bi-encoder returns results for all interpretations. A cross-encoder, seeing the full query context, can disambiguate.
2. Relevance requires reasoning. "Videos where someone demonstrates the product but does not mention the price" requires understanding negation, which bi-encoders handle poorly.
3. The domain is specialized. In legal or medical search, relevance depends on specific terminology and context that general-purpose embeddings may not capture. A cross-encoder fine-tuned on domain data outperforms a general bi-encoder.
Multimodal Reranking
For multimodal pipelines, the reranker must handle mixed-modality inputs. Recent models like Qwen3-VL-Reranker accept (text query, image/video document) pairs and score them directly. This is a significant upgrade over text-only reranking, which requires converting visual content to text (via captioning) before scoring.
# Text-only reranking (loses visual information)
for candidate in candidates:
    text_repr = candidate.caption + " " + candidate.transcript
    score = text_reranker.score(query, text_repr)

# Multimodal reranking (preserves visual information)
for candidate in candidates:
    score = mm_reranker.score(
        query=query_text,
        image=candidate.thumbnail,
        text=candidate.transcript
    )
Stage 4: LLM Validation
The final stage uses a large language model to verify relevance with full context awareness. This is the most expensive stage but processes only the top 20-50 candidates.
LLM-as-Judge
The LLM reads the query and the candidate's full context (transcript, caption, metadata) and produces a structured relevance judgment:
prompt = f"""Given this search query: "{query}"
And this candidate result:
Type: {candidate.media_type}
Transcript: {candidate.transcript[:500]}
Scene description: {candidate.caption}
Detected objects: {candidate.objects}
Speaker: {candidate.speaker}
Rate the relevance on a scale of 1-5:
1 = completely irrelevant
2 = tangentially related
3 = somewhat relevant
4 = highly relevant
5 = perfect match
Also explain in one sentence why."""
judgment = llm.generate(prompt)
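The judgment comes back as free text, so the rating has to be parsed before it can drive ranking. The sketch below takes the first standalone digit from 1 to 5 as the rating, which is an assumption about how the model formats its answer; the `min_rating=4` cutoff and the sample judgments are also illustrative.

```python
import re


def parse_rating(judgment):
    """Return the first standalone digit 1-5 in the judgment, or None."""
    match = re.search(r"\b([1-5])\b", judgment)
    return int(match.group(1)) if match else None


def keep_relevant(judged, min_rating=4):
    """judged: list of (candidate_id, judgment_text) pairs."""
    kept = []
    for candidate_id, text in judged:
        rating = parse_rating(text)
        # Unparseable judgments are dropped rather than guessed at.
        if rating is not None and rating >= min_rating:
            kept.append((candidate_id, rating))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


judged = [
    ("clip_1", "Rating: 4. The CEO discusses Q3 revenue in this segment."),
    ("clip_2", "2 - only tangentially related to the query."),
    ("clip_3", "5: this is the exact scene requested."),
    ("clip_4", "Unable to judge from the transcript."),
]
final = keep_relevant(judged)
```

Asking the model for a fixed output format (for example, JSON with a `rating` field) would make parsing more robust than this regex; the free-text version is shown because it matches the prompt above.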
When to Use LLM Validation
LLM validation is valuable when:
Self-RAG: The Agent Decides
In Self-RAG (Self-Reflective Retrieval-Augmented Generation), the agent itself decides whether to retrieve, what to retrieve, and whether the retrieved results are useful. The agent:
1. Evaluates whether retrieval is needed for the current query
2. If yes, generates a retrieval query (which may differ from the user's original question)
3. Evaluates each retrieved result for relevance
4. Decides whether to use the results or try a different retrieval strategy
This turns the retrieval pipeline from a fixed sequence into an adaptive loop controlled by the agent's reasoning.
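The four steps can be sketched as a small control loop. This mirrors the description above rather than any specific Self-RAG implementation: the four callables are hypothetical hooks that an agent would back with model calls, and the toy versions below exist only to make the example runnable.

```python
def adaptive_retrieve(question, needs_retrieval, rewrite_query, retrieve,
                      is_relevant, max_attempts=3):
    """Retrieve only if needed, and retry with a new query if nothing is relevant."""
    if not needs_retrieval(question):          # step 1
        return []
    tried = []
    for _ in range(max_attempts):
        query = rewrite_query(question, tried)  # step 2
        tried.append(query)
        relevant = [doc for doc in retrieve(query) if is_relevant(question, doc)]  # step 3
        if relevant:                            # step 4: use the results...
            return relevant
        # ...or loop and try a different query.
    return []


# Toy stand-ins for the agent's decisions (assumptions, not real model calls).
def needs_retrieval(question):
    return not question.startswith("what is 2 + 2")


def rewrite_query(question, tried):
    # First attempt: the raw question; afterwards: a shorter keyword query.
    return question if not tried else "memory leak"


def retrieve(query):
    index = {"memory leak": ["segment_17", "segment_42"]}
    return index.get(query, [])


def is_relevant(question, doc):
    return doc != "segment_42"


result = adaptive_retrieve("where does the engineer explain the leak?",
                           needs_retrieval, rewrite_query, retrieve, is_relevant)
skipped = adaptive_retrieve("what is 2 + 2", needs_retrieval, rewrite_query,
                            retrieve, is_relevant)
```

In the toy run the raw question retrieves nothing, the rewritten keyword query retrieves two segments, and one of them passes the relevance check.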
Composing Multimodal Pipelines
In a multimodal corpus, different features require different retrieval strategies. The composition pattern: run specialized retrieval for each modality, then fuse the results.
Example: Video Archive Search
Query: "the part where the engineer explains the memory leak"
Pipeline:
┌─ Visual: CLIP search on frame embeddings → top-200
├─ Audio: CLAP search on audio embeddings → top-200
├─ Text: BM25 + BGE search on transcripts → top-200
└─ Filter: speaker_role = "engineer" → apply to all
→ RRF fusion across visual + audio + text results → top-100
→ Cross-encoder rerank on (query, transcript+frame) → top-20
→ LLM validation with full context → top-5
Each modality contributes unique signal:
RRF fusion gives the highest score to segments that rank well across all three modalities. A segment where someone says "memory leak" while pointing at a code snippet will outrank a segment that only mentions it in passing.
Cascading vs. Parallel Composition
Parallel composition (shown above) runs all retrieval stages independently, then fuses. This maximizes recall because each stage can find results the others miss.
Cascading composition runs stages sequentially — the output of one stage becomes the input to the next. This is more efficient but risks losing results if an early stage has low recall for certain query types.
Parallel:   A ──┐
            B ──┼── Fusion → Rerank → Results
            C ──┘
Cascading:  A → B → C → Rerank → Results
Rule of thumb: Use parallel composition when your retrieval stages search different modalities or use different algorithms. Use cascading when each stage progressively refines the same candidate set.
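The two shapes differ mainly in control flow, which a short sketch makes visible. The retriever and stage functions here are hypothetical placeholders; a thread pool is used because retrieval calls are typically I/O-bound, which is an assumption about the deployment.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(retrievers, query):
    """Run independent retrievers concurrently; return one result list per retriever."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = [pool.submit(retriever, query) for retriever in retrievers]
        return [future.result() for future in futures]  # order matches `retrievers`


def run_cascade(stages, candidates):
    """Feed each stage's output into the next stage."""
    for stage in stages:
        candidates = stage(candidates)
    return candidates


# Toy retrievers standing in for visual, audio, and text search.
def visual(query):
    return ["v1", "v2"]


def audio(query):
    return ["a1"]


def text(query):
    return ["t1", "t2"]


per_modality = run_parallel([visual, audio, text], "memory leak")

# Toy cascade: keep even items, then keep the first two.
narrowed = run_cascade(
    [lambda items: [i for i in items if i % 2 == 0], lambda items: items[:2]],
    list(range(10)),
)
```

The parallel results would then go to a fusion step, while the cascade never sees an item that an earlier stage dropped.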
Latency Budgets and Practical Tradeoffs
Production search systems have latency constraints — typically 200-500ms for interactive search, 2-5 seconds for batch or agent-driven search.
Typical Stage Latencies
| Stage | Items Processed | Per-Item Cost | Total Latency |
|---|---|---|---|
| ANN vector search | 10M → 1000 | <0.001ms | 5-20ms |
| BM25 keyword search | 10M → 1000 | <0.001ms | 10-30ms |
| RRF fusion | 2000 → 1000 | <0.01ms | 1-5ms |
| Metadata filtering | 1000 → 300 | <0.01ms | 1-5ms |
| Cross-encoder rerank | 300 → 50 | 5-15ms | 1.5-4.5s |
| LLM validation | 50 → 10 | 200-500ms | 2-5s (calls run concurrently; 10-25s if sequential) |
1. Reduce candidates aggressively. Going from 300 to 100 cross-encoder inputs cuts reranking time by 67%.
2. Batch processing. Cross-encoders and LLMs process batches efficiently on GPUs. Scoring 50 pairs in one batch is much faster than 50 individual calls.
3. Early termination. If the top result has a score far above the second result, skip remaining validations.
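The arithmetic behind these numbers is simple enough to check in a few lines. The sketch below uses the low-end per-item costs from the table; the `concurrent_ms` model (perfectly parallel calls with no overhead) and the concurrency of 10 are idealized assumptions.

```python
import math


def sequential_ms(items, per_item_ms):
    """Latency when items are scored one after another."""
    return items * per_item_ms


def concurrent_ms(items, per_item_ms, concurrency):
    """Idealized latency when `concurrency` items are processed at once."""
    return math.ceil(items / concurrency) * per_item_ms


rerank_300 = sequential_ms(300, 5)      # 1500 ms: low end of the table's rerank row
rerank_100 = sequential_ms(100, 5)      # 500 ms after cutting candidates to 100
saving = 1 - rerank_100 / rerank_300    # fraction of reranking time saved

llm_sequential = sequential_ms(50, 200)                  # 50 calls back to back
llm_concurrent = concurrent_ms(50, 200, concurrency=10)  # 5 waves of 10 calls
```

The saving works out to two-thirds, matching the 67% figure, and the LLM stage only fits a 2-5 second budget when its calls overlap.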
When to Skip Stages
Not every query needs every stage:
An agent-driven pipeline can dynamically choose which stages to run based on query analysis.
Evaluation: Measuring Pipeline Quality
Stage-Level Metrics
Each stage should be evaluated independently:
Stage 1 (recall): Recall@K — what fraction of relevant items are in the top-K candidates? Aim for Recall@1000 > 0.95.
Stage 2 (filtering): Precision after filtering — what fraction of remaining candidates are relevant? Also measure filter attrition: if filtering removes 90% of candidates, your top-K from Stage 1 might be too small.
Stage 3 (reranking): NDCG@20 — are the best results ranked highest? Compare NDCG before and after reranking to measure the reranker's contribution.
Stage 4 (validation): Precision@10 — in the final result set, what fraction are truly relevant? This is the metric users see.
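These stage-level metrics are short functions over a ranked list and a set of known-relevant IDs. The sketch below assumes binary relevance labels; dividing precision by the number of returned items (rather than by K) when fewer than K come back is a design choice, not a standard.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k items that are relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)


def filter_attrition(count_before, count_after):
    """Fraction of candidates removed by the filtering stage."""
    return 1 - count_after / count_before


ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "z"}   # "z" was never retrieved

stage1_recall = recall_at_k(ranked, relevant, k=5)
final_precision = precision_at_k(ranked, relevant, k=2)
attrition = filter_attrition(1000, 300)
```

Because "z" never enters the candidate list, recall is capped at 2/3 here no matter how good the later stages are.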
End-to-End Metrics
Mean Reciprocal Rank (MRR): Average of 1/rank of the first relevant result. For navigational queries ("find the specific meeting about X"), MRR is the most important metric.
NDCG@K: Normalized discounted cumulative gain. For exploratory queries ("find all discussions about memory management"), NDCG captures the quality of the entire ranked list, not just the top result.
Latency at P95: The 95th percentile latency. Multi-stage pipelines have high variance: some queries hit every stage, others terminate early. P95 captures the slow tail that 1 in 20 queries experiences, which an average would hide.
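The three end-to-end metrics can be computed as follows. This is a minimal sketch: NDCG uses the linear-gain form with a log2 discount, and P95 uses the nearest-rank method; other conventions exist and give slightly different numbers.

```python
import math


def mean_reciprocal_rank(rankings, relevant_sets):
    """Average of 1/rank of the first relevant result (0 if none is found)."""
    total = 0.0
    for ranked_ids, relevant in zip(rankings, relevant_sets):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


def ndcg_at_k(ranked_ids, gains, k):
    """gains: {doc_id: graded relevance}; IDs missing from gains count as 0."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    actual = dcg([gains.get(doc_id, 0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


def p95(latencies_ms):
    """Nearest-rank 95th percentile."""
    ordered = sorted(latencies_ms)
    index = math.ceil(95 * len(ordered) / 100) - 1
    return ordered[index]


mrr = mean_reciprocal_rank([["a", "b", "c"], ["x", "y", "z"]], [{"b"}, {"x"}])
gains = {"a": 1, "b": 3, "c": 2}
ndcg_actual = ndcg_at_k(["a", "b", "c"], gains, k=3)
ndcg_ideal = ndcg_at_k(["b", "c", "a"], gains, k=3)
tail = p95(list(range(1, 101)))
```

In this example the first query finds its relevant item at rank 2 and the second at rank 1, so MRR is 0.75; the mis-ordered list scores an NDCG of roughly 0.82 against 1.0 for the ideal ordering.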
Common Pitfalls
Optimizing stages independently. Improving Stage 1 recall might hurt Stage 3 reranking by flooding it with marginally relevant candidates. Always measure end-to-end metrics after changing any stage.
Over-relying on the LLM stage. If your LLM validation is doing most of the ranking work, your earlier stages are not doing their job. The pipeline should converge on good results progressively — each stage should improve the ranking measurably.
Using the same model for retrieval and reranking. A bi-encoder trained for broad recall makes a poor reranker. A cross-encoder trained for precision makes a poor candidate generator (too slow). Use different models for different stages.
Not measuring recall at Stage 1. If Stage 1 misses relevant items, no later stage can recover them. Measure recall separately from the full pipeline, and invest in recall improvements first.
Fixed pipelines for all queries. A keyword query ("patent #US12345678") and a semantic query ("discussions about our patent strategy") need completely different pipeline configurations. Agent-driven query analysis that routes to the right pipeline configuration is worth the engineering investment.
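A very small router illustrates the idea. The regular expression and the two configurations below are assumptions for illustration only; a real system would likely use a classifier or the agent's own query analysis rather than a single pattern.

```python
import re

# Assumption: identifiers look like two or more capitals followed by four or more digits.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}\d{4,}\b")


def choose_pipeline(query):
    """Pick a pipeline configuration from a cheap look at the query."""
    if ID_PATTERN.search(query):
        # Exact-identifier lookup: keyword search alone, no expensive stages.
        return {"retrievers": ["bm25"], "rerank": False, "llm_validation": False}
    # Semantic query: hybrid retrieval plus the precision stages.
    return {"retrievers": ["dense", "bm25"], "rerank": True, "llm_validation": True}


keyword_config = choose_pipeline("patent #US12345678")
semantic_config = choose_pipeline("discussions about our patent strategy")
```

The keyword query is routed to BM25 alone, while the semantic query gets the full hybrid pipeline with reranking and validation.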