# Retriever Stages

> The core of the warehouse query layer: composable retriever stages for multi-stage search pipelines

Retriever stages are the building blocks of search pipelines in Mixpeek. Each stage performs a specific operation on the document set, allowing you to compose complex retrieval workflows from simple, reusable components. Where a traditional database returns rows, the warehouse query layer returns ranked, enriched, and fused multimodal results assembled stage by stage.

## Stage Categories

Stages are organized into six categories based on how they transform the document set:

**Filter**: Reduce the document set by matching criteria. Outputs a subset of input documents.
**Stages**: feature\_search, attribute\_filter, llm\_filter, agent\_search, query\_expand

**Sort**: Reorder documents by relevance or field values. Same documents, different order.
**Stages**: sort\_relevance, sort\_attribute, mmr, rerank, score\_normalize

**Reduce**: Collapse results into aggregated values. Produces a single value or smaller set from the input.
**Stages**: aggregate, temporal, sample, summarize, limit, deduplicate, moment\_group, score\_threshold

**Group**: Reshape results by bucketing documents into logical groups or clusters.
**Stages**: group\_by, cluster

**Apply**: Transform or restructure documents. May reshape fields, create new documents, or call external services.
**Stages**: json\_transform, rag\_prepare, external\_web\_search, api\_call, sql\_lookup, cross\_compare, web\_scrape, unwind, code\_execution

**Enrich**: Add knowledge to documents using AI models, taxonomies, or cross-collection joins.
**Stages**: llm\_enrich, taxonomy\_enrich, document\_enrich, agentic\_enrich

## All Stages

### Filter Stages

| Stage | Description |
| --- | --- |
| [Feature Search](/retrieval/stages/feature-search) | Search by vector similarity using multimodal embeddings |
| [Attribute Filter](/retrieval/stages/attribute-filter) | Filter by metadata fields with boolean logic (AND/OR/NOT) |
| [LLM Filter](/retrieval/stages/llm-filter) | Semantic filtering using LLM-based evaluation |
| [Agent Search](/retrieval/stages/agent-search) | LLM-driven multi-step retrieval with iterative reasoning and tool orchestration |
| [Query Expand](/retrieval/stages/query-expand) | LLM-powered query expansion with RRF result fusion |

### Sort Stages

| Stage | Description |
| --- | --- |
| [Sort Relevance](/retrieval/stages/sort-relevance) | Reorder by relevance scores |
| [Sort Attribute](/retrieval/stages/sort-attribute) | Order by any metadata field (dates, price, etc.) |
| [MMR](/retrieval/stages/mmr) | Diversify results with Maximal Marginal Relevance |
| [Rerank](/retrieval/stages/rerank) | Re-score with cross-encoder models (e.g., BGE reranker) |
| [Score Normalize](/retrieval/stages/score-normalize) | Rescale scores to a common range for consistent comparison |

### Reduce Stages

| Stage | Description |
| --- | --- |
| [Aggregate](/retrieval/stages/aggregate) | Compute COUNT, SUM, AVG, percentile, stddev, frequency, correlation on results |
| [Temporal](/retrieval/stages/temporal) | Group by time windows (hour/day/week/month/quarter/year) with drift detection |
| [Sample](/retrieval/stages/sample) | Random or stratified sampling of results |
| [Summarize](/retrieval/stages/summarize) | Condense documents into an LLM-generated summary |
| [Limit](/retrieval/stages/limit) | Truncate results to a maximum count with optional offset |
| [Deduplicate](/retrieval/stages/deduplicate) | Remove duplicate documents by field or content similarity |
| [Moment Group](/retrieval/stages/moment-group) | Merge contiguous temporal intervals into consolidated video moments |
| [Score Threshold](/retrieval/stages/score-threshold) | Drop results below an absolute score; return none when nothing qualifies |

### Group Stages

| Stage | Description |
| --- | --- |
| [Group By](/retrieval/stages/group-by) | Group documents by field value (decompose/recompose) |
| [Cluster](/retrieval/stages/cluster) | Discover themes via embedding-based clustering |

### Apply Stages

| Stage | Description |
| --- | --- |
| [JSON Transform](/retrieval/stages/json-transform) | Reshape documents using Jinja2 templates |
| [RAG Prepare](/retrieval/stages/rag-prepare) | Format for LLM context with token management and citations |
| [External Web Search](/retrieval/stages/external-web-search) | Augment with Exa AI-native web search |
| [API Call](/retrieval/stages/api-call) | Enrich with external REST API responses |
| [SQL Lookup](/retrieval/stages/sql-lookup) | Join with PostgreSQL/Snowflake data |
| [Cross Compare](/retrieval/stages/cross-compare) | Multi-tier cross-collection matching with classification |
| [Web Scrape](/retrieval/stages/web-scrape) | Extract full page content from URLs |
| [Unwind](/retrieval/stages/unwind) | Decompose array fields into separate documents |
| [Code Execution](/retrieval/stages/code-execution) | Execute Python/TypeScript/JavaScript in sandboxes |

### Enrich Stages

| Stage | Description |
| --- | --- |
| [LLM Enrich](/retrieval/stages/llm-enrich) | Generate new fields with LLM prompts |
| [Taxonomy Enrich](/retrieval/stages/taxonomy-enrich) | Classify documents against taxonomy nodes |
| [Document Enrich](/retrieval/stages/document-enrich) | Cross-collection joins (LEFT JOIN) |
| [Agentic Enrich](/retrieval/stages/agentic-enrich) | Multi-turn agent with tool access for complex classification |

## Pipeline Patterns

### Basic RAG Pipeline

```json
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
            "top_k": 50
          }
        ],
        "final_top_k": 50
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "query": "{{INPUT.query}}",
        "document_field": "content",
        "top_k": 10
      }
    }
  },
  {
    "stage_name": "rag_prepare",
    "stage_type": "apply",
    "config": {
      "stage_id": "rag_prepare",
      "parameters": {
        "max_tokens": 8000,
        "output_mode": "single_context"
      }
    }
  }
]
```

### E-Commerce Search

```json
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
            "top_k": 100
          }
        ],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "attribute_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "AND": [
          {"field": "metadata.in_stock", "operator": "eq", "value": true},
          {"field": "metadata.price", "operator": "lte", "value": "{{INPUT.max_price}}"}
        ]
      }
    }
  },
  {
    "stage_name": "sort_attribute",
    "stage_type": "sort",
    "config": {
      "stage_id": "sort_attribute",
      "parameters": {
        "field": "metadata.{{INPUT.sort_by}}",
        "direction": "{{INPUT.sort_order}}"
      }
    }
  }
]
```

### Research Assistant

```json
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
            "top_k": 100
          }
        ],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "external_web_search",
    "stage_type": "apply",
    "config": {
      "stage_id": "external_web_search",
      "parameters": {
        "query": "{{INPUT.query}}",
        "num_results": 10,
        "category": "research_paper"
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "query": "{{INPUT.query}}",
        "top_k": 15
      }
    }
  },
  {
    "stage_name": "summarize",
    "stage_type": "reduce",
    "config": {
      "stage_id": "summarize",
      "parameters": {
        "provider": "google",
        "model_name": "gemini-2.5-flash-lite",
        "prompt": "Synthesize findings on: {{INPUT.query}}"
      }
    }
  }
]
```

### Enriched Document Retrieval

```json
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
            "top_k": 20
          }
        ],
        "final_top_k": 20
      }
    }
  },
  {
    "stage_name": "document_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "document_enrich",
      "parameters": {
        "target_collection_id": "col_users",
        "source_field": "metadata.author_id",
        "target_field": "user_id",
        "output_field": "author"
      }
    }
  },
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o-mini",
        "prompt": "Extract key topics and entities from: {{DOC.content}}",
        "output_field": "analysis"
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "query": "{{INPUT.query}}",
        "top_k": 10
      }
    }
  }
]
```

## Stage Selection Guide

| Goal | Recommended Stage |
| --- | --- |
| Find semantically similar documents | feature\_search |
| Filter by metadata fields | attribute\_filter |
| Filter by content meaning | llm\_filter |
| Improve recall with query variations | query\_expand |
| Get best relevance ranking | rerank |
| Order by price/date/rating | sort\_attribute |
| Re-sort by relevance scores | sort\_relevance |
| Diversify results | mmr |
| Normalize scores across sources | score\_normalize |
| Suppress weak results / "no good results" | score\_threshold |
| Truncate to top-N results | limit |
| Remove duplicate results | deduplicate |
| Expand array fields to documents | unwind |
| Answer questions from docs | summarize |
| Compute statistics on results | aggregate |
| Find themes in results | cluster |
| Locate moments/scenes in video | moment\_group |
| Group by category/author | group\_by |
| Random/stratified sampling | sample |
| Add external API data | api\_call |
| Add database data | sql\_lookup |
| Join Mixpeek collections | document\_enrich |
| Classify documents | taxonomy\_enrich |
| Complex multi-step classification | agentic\_enrich |
| Generate new fields with LLM | llm\_enrich |
| Transform document structure | json\_transform |
| Prepare for LLM context | rag\_prepare |
| Custom code transformations | code\_execution |
| Add web search results | external\_web\_search |
| Extract URL content | web\_scrape |

## Performance Considerations

| Stage | Typical Latency | Cost |
| --- | --- | --- |
| feature\_search | 5-50ms | Index storage |
| attribute\_filter | \< 5ms | Free |
| llm\_filter | 200-500ms | LLM API |
| query\_expand | 300-800ms | LLM API |
| rerank | 50-100ms | Inference |
| sort\_attribute | \< 5ms | Free |
| sort\_relevance | \< 5ms | Free |
| mmr | 10-50ms | Free |
| score\_normalize | \< 1ms | Free |
| score\_threshold | \< 1ms | Free |
| limit | \< 1ms | Free |
| deduplicate | 5-50ms | Free |
| unwind | \< 5ms | Free |
| summarize | 500-2000ms | LLM API |
| aggregate | 5-50ms | Free |
| cluster | 50-200ms | Inference |
| moment\_group | 5-50ms | Free |
| group\_by | 5-20ms | Free |
| sample | \< 5ms | Free |
| llm\_enrich | 300-800ms | LLM API |
| agentic\_enrich | 2-30s | LLM API (multi-turn) |
| api\_call | 50-500ms | External API |
| sql\_lookup | 10-100ms | Database |
| code\_execution | 5-50ms | Free |
| rag\_prepare | \< 10ms | Free |
| json\_transform | \< 5ms | Free |
| external\_web\_search | 100-500ms | Exa API |
| taxonomy\_enrich | 20-100ms | Inference |
| document\_enrich | 10-50ms | Database |
| web\_scrape | 500-5000ms | External |

Order stages efficiently: cheap operations (filters, sorts) before expensive ones (rerank, LLM calls). This reduces the document count before costly processing.
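
For a rough per-request budget, the typical ranges in the latency table can be summed over a stage sequence. The sketch below is illustrative only: `TYPICAL_LATENCY_MS` and `estimate_latency` are hypothetical names rather than part of the Mixpeek API, stages are assumed to run one after another, and `< N ms` entries are treated as 0 to N ms.

```python
# Typical latency ranges in milliseconds, copied from the latency table.
# "< N ms" entries are represented as (0, N); this is an assumption.
TYPICAL_LATENCY_MS = {
    "feature_search": (5, 50),
    "attribute_filter": (0, 5),
    "llm_filter": (200, 500),
    "rerank": (50, 100),
    "sort_attribute": (0, 5),
    "summarize": (500, 2000),
    "llm_enrich": (300, 800),
    "document_enrich": (10, 50),
    "external_web_search": (100, 500),
    "rag_prepare": (0, 10),
}


def estimate_latency(stage_ids):
    """Sum per-stage (low, high) ranges, assuming sequential execution."""
    low = sum(TYPICAL_LATENCY_MS[s][0] for s in stage_ids)
    high = sum(TYPICAL_LATENCY_MS[s][1] for s in stage_ids)
    return low, high


basic_rag = ["feature_search", "rerank", "rag_prepare"]
research = ["feature_search", "external_web_search", "rerank", "summarize"]

print(estimate_latency(basic_rag))  # (55, 160)
print(estimate_latency(research))   # (655, 2650)
```

In the research sequence the `summarize` stage accounts for most of the range, which is consistent with narrowing the document set before LLM-backed stages run.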
## Template Variables

All stages support template variables for dynamic configuration:

| Variable | Description |
| --- | --- |
| `{{INPUT.*}}` | Input parameters from retriever call |
| `{{DOC.*}}` | Document fields (in APPLY, ENRICH, and GROUP stages) |
| `{{CONTEXT.*}}` | Pipeline context (index, citations) |

```json
{
  "stage_name": "attribute_filter",
  "stage_type": "filter",
  "config": {
    "stage_id": "attribute_filter",
    "parameters": {
      "field": "metadata.tenant_id",
      "operator": "eq",
      "value": "{{INPUT.tenant_id}}"
    }
  }
}
```
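
To show what `{{INPUT.*}}` substitution means for a stage configuration, here is a minimal Python sketch of a resolver. It is not Mixpeek's implementation: `resolve_templates` is a hypothetical helper, only the `INPUT` namespace is handled, and keeping the input's original type when a value is exactly one placeholder is an assumption of this sketch.

```python
import re

# Matches placeholders such as {{INPUT.tenant_id}}.
# Only the INPUT namespace is handled in this sketch.
_PLACEHOLDER = re.compile(r"\{\{INPUT\.(\w+)\}\}")


def resolve_templates(value, inputs):
    """Recursively substitute {{INPUT.*}} placeholders in a stage config."""
    if isinstance(value, dict):
        return {k: resolve_templates(v, inputs) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_templates(v, inputs) for v in value]
    if isinstance(value, str):
        whole = _PLACEHOLDER.fullmatch(value)
        if whole:
            # A value that is exactly one placeholder keeps the input's type
            # (assumption: a numeric max_price stays a number).
            return inputs[whole.group(1)]
        # Placeholders embedded in longer strings are interpolated as text.
        return _PLACEHOLDER.sub(lambda m: str(inputs[m.group(1)]), value)
    return value


stage = {
    "stage_name": "attribute_filter",
    "stage_type": "filter",
    "config": {
        "stage_id": "attribute_filter",
        "parameters": {
            "field": "metadata.tenant_id",
            "operator": "eq",
            "value": "{{INPUT.tenant_id}}",
        },
    },
}

resolved = resolve_templates(stage, {"tenant_id": "acme"})
print(resolved["config"]["parameters"]["value"])  # acme
```

A missing input raises `KeyError` in this sketch; how the retriever itself treats unresolved variables is not specified here.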