    Agent Perception
    22 min read
    Updated 2026-05-11

    Multi-Stage Retrieval: How AI Agents Search Unstructured Data at Scale

    A deep technical guide to multi-stage retrieval pipeline design. Covers the recall-precision tradeoff, candidate generation strategies, reciprocal rank fusion, cross-encoder reranking, LLM-as-judge validation, and how to compose retrieval stages for multimodal search across video, image, audio, and documents.

    Retrieval
    Search Pipelines
    Agent Architecture
    Multimodal Search

    Why Single-Stage Search Fails



    The simplest search pipeline is one stage: encode a query, find the nearest vectors, return results. This works for demos and prototypes. It falls apart in production for three reasons:

    Recall-precision tradeoff. A single embedding model cannot simultaneously optimize for broad recall (finding everything relevant) and precise ranking (putting the best results first). Models trained for recall produce embeddings that cluster related items loosely — great for not missing anything, terrible for ranking. Models trained for precision produce tight clusters — great for ranking, terrible if the query is even slightly different from the indexed text.

    Modality mismatch. An agent searching a media library might need to find "the scene where the CEO discusses Q3 revenue." This query spans multiple modalities: video frames (the CEO's face), audio (their speech), and text (the transcript mentioning "Q3 revenue"). No single embedding model captures all three.

    Heterogeneous quality signals. Relevance is not one-dimensional. A video clip might match visually (correct scene), match textually (correct transcript), but be from the wrong time period. Or match the transcript perfectly but come from a different speaker. A single similarity score cannot express these independent quality dimensions.

    Multi-stage retrieval solves all three by decomposing search into a pipeline where each stage handles one concern.

    The Retrieval Funnel



    Multi-stage retrieval follows a funnel pattern: each stage reduces the candidate set while increasing the quality of ranking.

    Stage 1: Candidate Generation (broad recall)
      Input:  full corpus (millions of items)
      Output: top-1000 candidates
      Method: ANN vector search, BM25, or both

    Stage 2: Feature Filtering (structured constraints)
      Input:  1000 candidates
      Output: 200-500 candidates
      Method: metadata filters, attribute matching

    Stage 3: Cross-Encoder Reranking (precise scoring)
      Input:  200-500 candidates
      Output: top-50 candidates
      Method: cross-encoder that scores (query, document) pairs

    Stage 4: LLM Validation (semantic verification)
      Input:  top-50 candidates
      Output: top-10 results
      Method: LLM judges relevance with full context


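    Each stage is far more expensive per item than the previous one (often 10-100x) but processes far fewer items, which is what makes the expensive models affordable at all. The cheap stages touch millions of items; the expensive ones touch tens or hundreds. Even so, the reranking and LLM stages usually account for most of the end-to-end latency, as the stage latency table later in this guide shows.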

    Stage 1: Candidate Generation



    The first stage's only job is recall — do not miss any potentially relevant item. Precision does not matter here because later stages will fix the ranking. This is the most important design decision: if a relevant item is not retrieved in Stage 1, no later stage can recover it.

    Dense Retrieval (Vector Search)



    Encode the query with the same embedding model used to index the corpus. Find the top-K nearest neighbors using approximate nearest neighbor (ANN) search.

    query_embedding = model.encode(query_text)
    candidates = vector_index.search(query_embedding, top_k=1000)
    


    Strengths: Captures semantic similarity ("automobile" matches "car"), works across modalities if the embedding model supports it (CLIP for text-to-image, CLAP for text-to-audio).

    Weaknesses: Misses exact keyword matches. If someone searches for "patent #US12345678," a semantic embedding might not retrieve it because the model has no special representation for patent numbers.
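
    To make the nearest-neighbor step concrete, here is a minimal sketch that replaces the ANN index with an exact brute-force scan over a toy corpus. The three-dimensional vectors and the `exact_top_k` helper are illustrative assumptions, not part of any particular library; a production system would use an ANN index to avoid scanning every vector.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exact_top_k(query_vec, doc_vecs, top_k):
    """Return the top_k (doc_id, similarity) pairs by scanning every document."""
    scored = ((doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


# Toy embeddings (made up for illustration): related items point the same way.
doc_vecs = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
results = exact_top_k([1.0, 0.0, 0.0], doc_vecs, top_k=2)
```

    An ANN index returns approximately the same neighbors without the full scan, trading a small amount of recall for a large speedup.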

    Sparse Retrieval (BM25 / Keyword Search)



    Classic term-frequency search. No embeddings — just count how often query terms appear in documents, weighted by inverse document frequency.

    Strengths: Perfect for exact matches, entity names, product IDs, technical terms. Zero false negatives for exact query terms.

    Weaknesses: No semantic understanding. "automobile" does not match "car."
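
    The scoring idea can be sketched in a few lines. The version below is a simplified BM25 with whitespace tokenization and commonly used default parameters (`k1=1.5`, `b=0.75`); the function name and the toy documents are assumptions for illustration, and real engines add proper tokenization and inverted indexes.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a simplified BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = ["the car is fast", "the automobile is fast", "patent US12345678 filed"]
car_scores = bm25_scores("car", docs)
id_scores = bm25_scores("US12345678", docs)
```

    Here `car_scores` is non-zero only for the first document (the "automobile" document scores zero), while `id_scores` is non-zero only for the patent document, which mirrors the strengths and weaknesses listed above.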

    Hybrid Retrieval



    Run both dense and sparse retrieval in parallel, then merge the results. This is the production standard for Stage 1.

    dense_results = vector_index.search(query_embedding, top_k=1000)
    sparse_results = bm25_index.search(query_text, top_k=1000)

    # Merge with reciprocal rank fusion
    candidates = reciprocal_rank_fusion([dense_results, sparse_results], k=60)


    Hybrid retrieval consistently outperforms either method alone because they have complementary failure modes.

    Reciprocal Rank Fusion (RRF)



    RRF is the standard algorithm for merging ranked lists from different retrieval methods. It is simple, effective, and requires no training.

    For each document d that appears in any of the ranked lists, compute:

    RRF_score(d) = Σ_r  1 / (k + rank_r(d))
    


    Where:
    - The sum is over all ranked lists r where d appears
    - `rank_r(d)` is the rank of d in list r (1-indexed)
    - k is a constant (typically 60) that dampens the influence of high-ranked items

    Why RRF works: It converts absolute scores (which are not comparable across models) into rank-based scores (which are). A document ranked #1 by BM25 gets 1/(60+1) = 0.0164. The same document ranked #5 by vector search gets 1/(60+5) = 0.0154. Its RRF score is 0.0318. A document ranked #1 by vector search but not in BM25's top-1000 gets only 0.0164.

    Key property: RRF is unsupervised — it requires no relevance labels or training data. This makes it the default choice for combining retrieval signals in new domains where you have no labeled data yet.
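
    A minimal sketch of the formula follows. It assumes each ranked list is a plain list of document IDs ordered best-first, which may differ from the result objects a real index returns; the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs; returns (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# doc_a is #1 in BM25 and #5 in vector search; doc_d is #1 in vector search only.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_d", "doc_e", "doc_f", "doc_g", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], k=60)
```

    With these inputs, `doc_a` scores 1/61 + 1/65 ≈ 0.0318 and ranks first, ahead of `doc_d` at 1/61 ≈ 0.0164, matching the worked numbers above.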

    Alternatives to RRF



    Linear combination: `score = α * dense_score + (1-α) * sparse_score`. Requires normalizing scores to the same range and tuning α. Can outperform RRF when you have validation data to optimize α.
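
    As a rough sketch of the linear combination, the code below min-max normalizes each score set to [0, 1] before mixing. Min-max is only one possible normalization, and the helper names and example scores are assumptions for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def linear_fusion(dense_scores, sparse_scores, alpha=0.5):
    """score = alpha * dense + (1 - alpha) * sparse, on normalized scores."""
    dense_n, sparse_n = min_max(dense_scores), min_max(sparse_scores)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


dense = {"a": 0.82, "b": 0.75, "c": 0.40}  # cosine similarities
sparse = {"b": 12.0, "c": 3.0, "d": 9.0}   # BM25 scores on a different scale
fused_linear = linear_fusion(dense, sparse, alpha=0.5)
```

    Document `b` wins here because it scores well in both lists. Unlike RRF, the outcome depends on the raw score distributions, which is why α needs validation data.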

    Learned fusion: A small neural network that takes multiple retrieval scores as input and produces a final relevance score. Requires training data. Outperforms RRF when you have sufficient labeled queries, but overfits when you do not.

    Stage 2: Feature Filtering



    After candidate generation, apply structured filters to remove items that cannot possibly be relevant. This stage is cheap (metadata lookup, not model inference) and can dramatically reduce the candidate set.

    Common Filters for Multimodal Data



    Temporal filters: Only return results from the last 30 days, or from a specific date range.

    Speaker filters: For audio/video, only return segments where a specific speaker is talking. This requires speaker diarization during indexing.

    Object presence filters: Only return frames that contain a specific detected object class. This requires object detection during indexing.

    Modality filters: Only return video results, or only audio results, depending on the query.

    Confidence filters: Only return items where the extraction confidence exceeded a threshold. If OCR extracted text with 40% confidence, it might be noise.

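    from datetime import datetime, timedelta, timezone

    # Assumes created_at is stored as a timezone-aware datetime
    seven_days_ago = datetime.now(timezone.utc) - timedelta(days=7)

    # Filter to video segments from the last week with the CEO speaking
    filtered = [
        c for c in candidates
        if c.metadata["created_at"] > seven_days_ago
        and "CEO" in c.metadata.get("speakers", [])
        and c.metadata["media_type"] == "video"
    ]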
    


    Why Not Filter First?



    A common question: why not apply filters before vector search to reduce the index size?

    Pre-filtering (filter then search) shrinks the search space, but it can degrade results with approximate indexes. An ANN index built over the full corpus may have different nearest-neighbor properties when restricted to a filtered subset, and hard filters applied up front leave no way to inspect items that are semantically relevant but narrowly fail a constraint.

    Post-filtering (search then filter) guarantees you find the most semantically relevant items first, then apply hard constraints. The cost is that some of your top-K candidates will be filtered out, reducing the effective candidate set. The standard workaround is to over-retrieve in Stage 1 (top-2000 instead of top-1000) to compensate for filter attrition.
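
    One way to automate the over-retrieval workaround is to widen K until enough candidates survive the filter. This adaptive variant is an assumption layered on the fixed "top-2000" advice above; `search_fn` and `keep_fn` are hypothetical callables standing in for the Stage 1 search and the Stage 2 filter.

```python
def search_with_post_filter(search_fn, keep_fn, wanted, initial_k=1000, max_k=8000):
    """Double K until `wanted` candidates pass the filter or a limit is hit."""
    k = initial_k
    while True:
        candidates = search_fn(k)
        survivors = [c for c in candidates if keep_fn(c)]
        corpus_exhausted = len(candidates) < k
        if len(survivors) >= wanted or k >= max_k or corpus_exhausted:
            return survivors[:wanted]
        k *= 2  # over-retrieve to compensate for filter attrition


# Toy setup: 5000 items, and the filter keeps only every tenth one.
calls = []
corpus = list(range(5000))


def fake_search(k):
    calls.append(k)
    return corpus[:k]


result = search_with_post_filter(fake_search, lambda c: c % 10 == 0, wanted=300)
```

    With a filter that keeps 10% of items, the sketch searches at K = 1000, 2000, and 4000 before it has 300 survivors. Each retry costs another Stage 1 query, so a well-chosen initial K is still worthwhile.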

    Stage 3: Cross-Encoder Reranking



    Dense retrieval in Stage 1 uses a bi-encoder: the query and document are encoded independently, then compared with a dot product or cosine similarity. This is fast (pre-compute document embeddings, only encode the query at search time) but limited: the model cannot compare the query and document at the token level.

    A cross-encoder takes the query and document as a single input and produces a relevance score. The query and document tokens attend to each other through every transformer layer, enabling much richer comparison.

    Bi-encoder:    encode(query) · encode(document) → score
    Cross-encoder: encode(query + document) → score
    


    Cross-encoders are 100-1000x slower than bi-encoders because they cannot pre-compute document representations — every (query, document) pair requires a full forward pass. But they are dramatically more accurate, especially for nuanced relevance judgments.
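
    The reranking loop itself is simple regardless of which model does the scoring. The sketch below takes the scorer as a callable over (query, document) pairs and processes candidates in batches; `toy_overlap_scorer` is a deliberately crude stand-in for a real cross-encoder, used only so the example runs without model weights.

```python
def rerank(query, candidates, score_pairs, top_n=50, batch_size=32):
    """Score (query, doc) pairs in batches and return the top_n by score."""
    scored = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        scores = score_pairs([(query, doc) for doc in batch])
        scored.extend(zip(batch, scores))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_scorer(pairs):
    """Stand-in scorer: fraction of query tokens that appear in the document."""
    results = []
    for query, doc in pairs:
        q_tokens = set(query.lower().split())
        d_tokens = set(doc.lower().split())
        results.append(len(q_tokens & d_tokens) / max(len(q_tokens), 1))
    return results


docs = [
    "how to fix a memory leak",
    "quarterly revenue report",
    "memory usage dashboard",
]
top = rerank("memory leak fix", docs, toy_overlap_scorer, top_n=2, batch_size=2)
```

    Swapping in a real model only requires a `score_pairs` function with the same shape: a list of pairs in, a list of scores out.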

    When Reranking Helps Most



    Reranking provides the largest improvement when:

    1. The query is ambiguous. "Apple" could mean the fruit, the company, or the record label. A bi-encoder returns results for all interpretations. A cross-encoder, seeing the full query context, can disambiguate.

    2. Relevance requires reasoning. "Videos where someone demonstrates the product but does not mention the price" requires understanding negation, which bi-encoders handle poorly.

    3. The domain is specialized. In legal or medical search, relevance depends on specific terminology and context that general-purpose embeddings may not capture. A cross-encoder fine-tuned on domain data outperforms a general bi-encoder.

    Multimodal Reranking



    For multimodal pipelines, the reranker must handle mixed-modality inputs. Recent models like Qwen3-VL-Reranker accept (text query, image/video document) pairs and score them directly. This is a significant upgrade over text-only reranking, which requires converting visual content to text (via captioning) before scoring.

    # Text-only reranking (loses visual information)
    for candidate in candidates:
        text_repr = candidate.caption + " " + candidate.transcript
        score = text_reranker.score(query, text_repr)

    # Multimodal reranking (preserves visual information)
    for candidate in candidates:
        score = mm_reranker.score(
            query=query_text,
            image=candidate.thumbnail,
            text=candidate.transcript,
        )


    Stage 4: LLM Validation



    The final stage uses a large language model to verify relevance with full context awareness. This is the most expensive stage but processes only the top 20-50 candidates.

    LLM-as-Judge



    The LLM reads the query and the candidate's full context (transcript, caption, metadata) and produces a structured relevance judgment:

    prompt = f"""Given this search query: "{query}"

    And this candidate result:
    - Type: {candidate.media_type}
    - Transcript: {candidate.transcript[:500]}
    - Scene description: {candidate.caption}
    - Detected objects: {candidate.objects}
    - Speaker: {candidate.speaker}

    Rate the relevance on a scale of 1-5:
    1 = completely irrelevant
    2 = tangentially related
    3 = somewhat relevant
    4 = highly relevant
    5 = perfect match

    Also explain in one sentence why."""

    judgment = llm.generate(prompt)
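
    The judgment comes back as free text, so the pipeline needs to turn it into a number before it can filter on it. The sketch below assumes the model was asked to reply in the form "Rating: N" followed by its one-sentence reason; that reply format, the regular expression, and the helper name are all assumptions, and a real deployment might prefer a structured output mode if the LLM API offers one.

```python
import re

# Assumed reply format: "Rating: 4" followed by a one-sentence explanation.
RATING_RE = re.compile(r"rating\s*[:=]?\s*([1-5])\b", re.IGNORECASE)


def parse_judgment(text):
    """Return (rating, reason), or None if no 1-5 rating can be found."""
    match = RATING_RE.search(text)
    if match is None:
        return None  # treat unparseable replies as "no judgment", not as a score
    rating = int(match.group(1))
    reason = text[match.end():].strip()
    return rating, reason


parsed = parse_judgment("Rating: 4\nThe transcript discusses Q3 revenue.")
unparsed = parse_judgment("I cannot decide.")
out_of_range = parse_judgment("Rating: 45")
```

    Returning `None` for malformed replies keeps parsing failures visible, so the caller can retry or drop the candidate instead of silently assigning a score.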


    When to Use LLM Validation



    LLM validation is valuable when:

    1. False positives are costly. In compliance search ("find all instances of insider trading discussion"), a false positive triggers unnecessary legal review. LLM validation catches subtle non-matches that embedding similarity misses.

    2. The query requires reasoning. "Find meetings where we agreed to a timeline but did not assign an owner" requires understanding commitment speech acts and their absence, well beyond what embedding similarity can capture.

    3. The user expects explanations. The LLM can explain why each result is relevant, which builds trust in the search system.

    Self-RAG: The Agent Decides



    In Self-RAG (Self-Reflective Retrieval-Augmented Generation), the agent itself decides whether to retrieve, what to retrieve, and whether the retrieved results are useful. The agent:

    1. Evaluates whether retrieval is needed for the current query

    2. If yes, generates a retrieval query (which may differ from the user's original question)

    3. Evaluates each retrieved result for relevance

    4. Decides whether to use the results or try a different retrieval strategy

    This turns the retrieval pipeline from a fixed sequence into an adaptive loop controlled by the agent's reasoning.
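
    The control flow of such a loop can be sketched with plain callables standing in for the agent's decisions. Everything here is hypothetical scaffolding: `needs_retrieval`, `rewrite_query`, `retrieve`, and `is_relevant` would each be backed by a model or a search pipeline in practice, and this is a simplification rather than the published Self-RAG method.

```python
def adaptive_retrieve(question, needs_retrieval, rewrite_query, retrieve,
                      is_relevant, max_attempts=3):
    """Retrieve only if needed, and retry with a new query if nothing useful returns."""
    if not needs_retrieval(question):
        return []  # step 1: the agent decides retrieval is unnecessary
    for attempt in range(max_attempts):
        query = rewrite_query(question, attempt)       # step 2: craft a query
        results = retrieve(query)
        kept = [r for r in results if is_relevant(question, r)]  # step 3: judge
        if kept:
            return kept                                # step 4: accept the results
    return []                                          # step 4: give up after retries


# Toy stand-ins: the first query finds nothing, the second finds two segments.
queries = ["slow app", "memory leak"]
index = {"memory leak": ["seg-12", "seg-40"]}

found = adaptive_retrieve(
    "Where does the engineer explain the memory leak?",
    needs_retrieval=lambda q: True,
    rewrite_query=lambda q, attempt: queries[attempt],
    retrieve=lambda query: index.get(query, []),
    is_relevant=lambda q, r: r != "seg-40",
    max_attempts=2,
)
```

    The `max_attempts` cap matters in practice: without it, a query with no good answer in the corpus would loop indefinitely.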

    Composing Multimodal Pipelines



    In a multimodal corpus, different features require different retrieval strategies. The composition pattern: run specialized retrieval for each modality, then fuse the results.

    Example: Video Archive Search



    Query: "the part where the engineer explains the memory leak"

    Pipeline:
      ┌─ Visual: CLIP search on frame embeddings   → top-200
      ├─ Audio:  CLAP search on audio embeddings   → top-200
      ├─ Text:   BM25 + BGE search on transcripts  → top-200
      └─ Filter: speaker_role = "engineer"         → apply to all

      → RRF fusion across visual + audio + text results   → top-100
      → Cross-encoder rerank on (query, transcript+frame) → top-20
      → LLM validation with full context                  → top-5


    Each modality contributes unique signal:
    - Visual finds frames showing someone at a whiteboard or screen
    - Audio finds segments with technical explanation tone
    - Text finds transcript mentions of "memory leak"
    - Filter restricts to segments where the speaker has an engineer role

    RRF fusion gives the highest score to segments that rank well across all three modalities. A segment where someone says "memory leak" while pointing at a code snippet will outrank a segment that only mentions it in passing.

    Cascading vs. Parallel Composition



    Parallel composition (shown above) runs all retrieval stages independently, then fuses. This maximizes recall because each stage can find results the others miss.

    Cascading composition runs stages sequentially — the output of one stage becomes the input to the next. This is more efficient but risks losing results if an early stage has low recall for certain query types.

    Parallel:   A ──┐
                B ──┼── Fusion → Rerank → Results
                C ──┘

    Cascading: A → B → C → Rerank → Results


    Rule of thumb: Use parallel composition when your retrieval stages search different modalities or use different algorithms. Use cascading when each stage progressively refines the same candidate set.

    Latency Budgets and Practical Tradeoffs



    Production search systems have latency constraints — typically 200-500ms for interactive search, 2-5 seconds for batch or agent-driven search.

    Typical Stage Latencies



    Stage                  Items Processed   Per-Item Cost   Total Latency
    ANN vector search      10M → 1000        <0.001ms        5-20ms
    BM25 keyword search    10M → 1000        <0.001ms        10-30ms
    RRF fusion             2000 → 1000       <0.01ms         1-5ms
    Metadata filtering     1000 → 300        <0.01ms         1-5ms
    Cross-encoder rerank   300 → 50          5-15ms          1.5-4.5s
    LLM validation         50 → 10           200-500ms       2-5s

    The bottleneck is almost always the cross-encoder and LLM stages. Three optimizations:

    1. Reduce candidates aggressively. Going from 300 to 100 cross-encoder inputs cuts reranking time by about 67%.

    2. Batch processing. Cross-encoders and LLMs process batches efficiently on GPUs. Scoring 50 pairs in one batch is much faster than 50 individual calls.

    3. Early termination. If the top result has a score far above the second result, skip remaining validations.
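
    The arithmetic behind the first optimization is easy to check. The sketch below assumes strictly sequential scoring at the upper-bound per-item cost from the table (15ms); batching on a GPU would change the absolute numbers but not the proportional saving. The helper names are illustrative.

```python
def sequential_latency_ms(num_items, per_item_ms):
    """Total latency when items are scored one after another."""
    return num_items * per_item_ms


def max_items_within_budget(budget_ms, per_item_ms):
    """How many items fit in a latency budget at a given per-item cost."""
    return int(budget_ms // per_item_ms)


baseline_ms = sequential_latency_ms(300, 15)   # 300 rerank inputs at 15 ms each
reduced_ms = sequential_latency_ms(100, 15)    # after cutting to 100 inputs
saving = 1 - reduced_ms / baseline_ms          # fraction of reranking time saved

# How many sequential cross-encoder calls fit in a 500 ms interactive budget?
affordable = max_items_within_budget(500, 15)
```

    At 15ms per pair, a 500ms interactive budget leaves room for only about 33 sequential cross-encoder calls, which is why interactive systems lean heavily on batching or on much smaller rerank sets.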

    When to Skip Stages



    Not every query needs every stage:

    - High-confidence queries: If Stage 1 returns a result with similarity > 0.95, skip reranking; the embedding model is confident.
    - Exact match queries: If the query is a specific ID or exact phrase, BM25 alone suffices; skip vector search entirely.
    - Simple filters: If the query is just "all videos from last Tuesday," skip all retrieval stages and use a metadata query.

    An agent-driven pipeline can dynamically choose which stages to run based on query analysis.
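
    A rule-based router is the simplest version of this idea. The sketch below covers two of the three cases above with hand-written heuristics; the regular expressions, stage names, and the 0.95 threshold default are assumptions for illustration, and detecting pure metadata queries is left out because it usually needs a query parser or an LLM.

```python
import re

# Hypothetical heuristics: an ID-like token (e.g. US12345678) or a fully quoted phrase.
ID_LIKE = re.compile(r"\b[A-Z]{2,}\d{4,}\b")
QUOTED = re.compile(r'^".+"$')

FULL_PIPELINE = ["dense", "bm25", "rrf", "filter", "rerank", "llm_validation"]


def plan_stages(query):
    """Pick which stages to run before any retrieval has happened."""
    q = query.strip()
    if ID_LIKE.search(q) or QUOTED.match(q):
        return ["bm25"]  # exact-match query: keyword search alone suffices
    return list(FULL_PIPELINE)


def needs_rerank(top_similarity, threshold=0.95):
    """After Stage 1: skip reranking when the best match is already very confident."""
    return top_similarity <= threshold


keyword_plan = plan_stages("patent #US12345678")
semantic_plan = plan_stages("discussions about our patent strategy")
```

    Heuristics like these are cheap and predictable; an agent can replace `plan_stages` with a model call once there are enough logged queries to show where the rules misroute.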

    Evaluation: Measuring Pipeline Quality



    Stage-Level Metrics



    Each stage should be evaluated independently:

    Stage 1 (recall): Recall@K — what fraction of relevant items are in the top-K candidates? Aim for Recall@1000 > 0.95.

    Stage 2 (filtering): Precision after filtering — what fraction of remaining candidates are relevant? Also measure filter attrition: if filtering removes 90% of candidates, your top-K from Stage 1 might be too small.

    Stage 3 (reranking): NDCG@20 — are the best results ranked highest? Compare NDCG before and after reranking to measure the reranker's contribution.

    Stage 4 (validation): Precision@10 — in the final result set, what fraction are truly relevant? This is the metric users see.

    End-to-End Metrics



    Mean Reciprocal Rank (MRR): Average of 1/rank of the first relevant result. For navigational queries ("find the specific meeting about X"), MRR is the most important metric.

    NDCG@K: Normalized discounted cumulative gain. For exploratory queries ("find all discussions about memory management"), NDCG captures the quality of the entire ranked list, not just the top result.

    Latency at P95: The 95th percentile latency. Multi-stage pipelines have high variance: some queries hit every stage, others terminate early. P95 captures the slow tail of the user experience that an average would hide.
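
    The ranking metrics used throughout this section are short enough to implement directly. The sketch below uses the standard definitions with graded gains for NDCG; the function names and the toy ranking are assumptions for illustration. MRR is the mean of `reciprocal_rank` over a set of queries.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gains, k):
    """NDCG@k with graded relevance; `gains` maps doc_id to a relevance grade."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]
gains = {"d1": 3, "d2": 1, "d9": 2}  # d9 is relevant but was never retrieved
relevant = set(gains)

recall = recall_at_k(ranked, relevant, k=3)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, gains, k=4)
```

    In this example the missing `d9` lowers both recall and NDCG, which illustrates why Stage 1 recall bounds every downstream metric.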

    Common Pitfalls



    Optimizing stages independently. Improving Stage 1 recall might hurt Stage 3 reranking by flooding it with marginally relevant candidates. Always measure end-to-end metrics after changing any stage.

    Over-relying on the LLM stage. If your LLM validation is doing most of the ranking work, your earlier stages are not doing their job. The pipeline should converge on good results progressively — each stage should improve the ranking measurably.

    Using the same model for retrieval and reranking. A bi-encoder trained for broad recall makes a poor reranker. A cross-encoder trained for precision makes a poor candidate generator (too slow). Use different models for different stages.

    Not measuring recall at Stage 1. If Stage 1 misses relevant items, no later stage can recover them. Measure recall separately from the full pipeline, and invest in recall improvements first.

    Fixed pipelines for all queries. A keyword query ("patent #US12345678") and a semantic query ("discussions about our patent strategy") need completely different pipeline configurations. Agent-driven query analysis that routes to the right pipeline configuration is worth the engineering investment.
