Retriever stages are the building blocks of search pipelines in Mixpeek. Each stage performs a specific operation on the document set, allowing you to compose complex retrieval workflows from simple, reusable components. Where a traditional database returns rows, the warehouse query layer returns ranked, enriched, and fused multimodal results assembled stage by stage.
Stage Categories
Stages are organized into six categories based on how they transform the document set:

FILTER
Reduce the document set by matching criteria. Outputs a subset of input documents.
Stages: feature_search, attribute_filter, llm_filter, agent_search, query_expand

SORT
Reorder documents by relevance or field values. Same documents, different order.
Stages: sort_relevance, sort_attribute, mmr, rerank, score_normalize

REDUCE
Collapse results into aggregated values. Produces a single value or smaller set from the input.
Stages: aggregate, temporal, sample, summarize, limit, deduplicate

GROUP
Reshape results by bucketing documents into logical groups or clusters.
Stages: group_by, cluster

APPLY
Transform or restructure documents. May reshape fields, create new documents, or call external services.
Stages: json_transform, rag_prepare, external_web_search, api_call, sql_lookup, cross_compare, web_scrape, unwind, code_execution

ENRICH
Add knowledge to documents using AI models, taxonomies, or cross-collection joins.
Stages: llm_enrich, taxonomy_enrich, document_enrich, agentic_enrich
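
The category lists above can be captured as a small lookup table, which is handy for validating a pipeline definition before sending it. This is a client-side sketch only: the stage and category names are taken from the lists above, while `STAGE_CATEGORIES` and `category_of` are hypothetical helpers, not part of any Mixpeek SDK.

```python
# Stage names and categories copied from the category lists above.
STAGE_CATEGORIES = {
    "FILTER": ["feature_search", "attribute_filter", "llm_filter",
               "agent_search", "query_expand"],
    "SORT": ["sort_relevance", "sort_attribute", "mmr", "rerank",
             "score_normalize"],
    "REDUCE": ["aggregate", "temporal", "sample", "summarize", "limit",
               "deduplicate"],
    "GROUP": ["group_by", "cluster"],
    "APPLY": ["json_transform", "rag_prepare", "external_web_search",
              "api_call", "sql_lookup", "cross_compare", "web_scrape",
              "unwind", "code_execution"],
    "ENRICH": ["llm_enrich", "taxonomy_enrich", "document_enrich",
               "agentic_enrich"],
}

# Inverted index: stage name -> category.
CATEGORY_OF = {
    stage: category
    for category, stages in STAGE_CATEGORIES.items()
    for stage in stages
}


def category_of(stage: str) -> str:
    """Return the category of a stage name (hypothetical helper)."""
    try:
        return CATEGORY_OF[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None
```

For example, `category_of("rerank")` returns `"SORT"`, and an unrecognized name raises `ValueError`.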
All Stages
Filter Stages
| Stage | Description |
|---|---|
| Feature Search | Search by vector similarity using multimodal embeddings |
| Attribute Filter | Filter by metadata fields with boolean logic (AND/OR/NOT) |
| LLM Filter | Semantic filtering using LLM-based evaluation |
| Agent Search | LLM-driven multi-step retrieval with iterative reasoning and tool orchestration |
| Query Expand | LLM-powered query expansion with RRF result fusion |
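
The RRF (Reciprocal Rank Fusion) mentioned for Query Expand merges several ranked lists by summing `1 / (k + rank)` for each document across lists. The sketch below shows the general technique, not Mixpeek's implementation; the function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the alphabetical tie-break are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (assumed value; 60 is a common choice).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Three result lists, e.g. one per expanded query variation.
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "a"], ["b", "a", "c"]])
```

Here `b` ranks first or second in every list, so it ends up ahead of `a` and `c` in the fused order.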
Sort Stages
| Stage | Description |
|---|---|
| Sort Relevance | Reorder by relevance scores |
| Sort Attribute | Order by any metadata field (dates, price, etc.) |
| MMR | Diversify results with Maximal Marginal Relevance |
| Rerank | Re-score with cross-encoder models (e.g., BGE reranker) |
| Score Normalize | Rescale scores to a common range for consistent comparison |
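
To make the MMR row concrete, here is a minimal pure-Python sketch of Maximal Marginal Relevance: each step picks the document that maximizes `lambda * relevance - (1 - lambda) * redundancy`, where redundancy is the highest similarity to anything already selected. It illustrates the general algorithm under assumed names (`mmr`, `lambda_`, cosine similarity over plain lists) and says nothing about how the `mmr` stage is implemented or configured.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, top_k, lambda_=0.5):
    """Select up to top_k document IDs, trading relevance against redundancy.

    doc_vecs: dict mapping document ID -> embedding vector.
    lambda_: 1.0 means pure relevance, 0.0 means pure diversity.
    """
    remaining = list(doc_vecs)
    selected = []
    while remaining and len(selected) < top_k:
        def mmr_score(doc_id):
            relevance = cosine(query_vec, doc_vecs[doc_id])
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


docs = {"a": [1.0, 0.9], "b": [1.0, 0.8], "c": [0.0, 1.0]}
diverse = mmr([1.0, 1.0], docs, top_k=2, lambda_=0.5)
relevance_only = mmr([1.0, 1.0], docs, top_k=2, lambda_=1.0)
```

With `lambda_=0.5` the near-duplicate `b` is skipped in favor of `c`; with `lambda_=1.0` the selection falls back to plain relevance order.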
Reduce Stages
| Stage | Description |
|---|---|
| Aggregate | Compute COUNT, SUM, AVG, percentile, stddev, frequency, correlation on results |
| Temporal | Group by time windows (hour/day/week/month/quarter/year) with drift detection |
| Sample | Random or stratified sampling of results |
| Summarize | Condense documents into an LLM-generated summary |
| Limit | Truncate results to a maximum count with optional offset |
| Deduplicate | Remove duplicate documents by field or content similarity |
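Group Stages
| Stage | Description |
|---|---|
| Group By | Bucket documents into groups by a field value (e.g., category, author) |
| Cluster | Group documents into clusters to surface themes in the results |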
Apply Stages
| Stage | Description |
|---|---|
| JSON Transform | Reshape documents using Jinja2 templates |
| RAG Prepare | Format for LLM context with token management and citations |
| External Web Search | Augment with Exa AI-native web search |
| API Call | Enrich with external REST API responses |
| SQL Lookup | Join with PostgreSQL/Snowflake data |
| Cross Compare | Multi-tier cross-collection matching with classification |
| Web Scrape | Extract full page content from URLs |
| Unwind | Decompose array fields into separate documents |
| Code Execution | Execute Python/TypeScript/JavaScript in sandboxes |
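
As an illustration of the Unwind row, the sketch below decomposes an array field so that each element becomes its own document, with all other fields copied. The function `unwind` and its handling of edge cases (non-list values pass through unchanged, empty lists yield no documents) are assumptions made for this example, not documented behavior of the stage.

```python
def unwind(documents, field):
    """Emit one document per element of `field` (hypothetical helper)."""
    out = []
    for doc in documents:
        values = doc.get(field)
        if not isinstance(values, list):
            # Assumption: documents without an array in `field` pass through.
            out.append(dict(doc))
            continue
        for value in values:
            # Copy the document and replace the array with a single element.
            out.append({**doc, field: value})
    return out


unwound = unwind(
    [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": "c"}],
    "tags",
)
```

The first input document becomes two output documents, one per tag, while the second is passed through as-is.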
Enrich Stages
| Stage | Description |
|---|---|
| LLM Enrich | Generate new fields with LLM prompts |
| Taxonomy Enrich | Classify documents against taxonomy nodes |
| Document Enrich | Cross-collection joins (LEFT JOIN) |
| Agentic Enrich | Multi-turn agent with tool access for complex classification |
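
The LEFT JOIN semantics noted for Document Enrich mean every input document is kept, whether or not a match exists in the other collection. A rough sketch of that behavior in plain Python follows; `left_join`, the `joined_` prefix, and the first-match-wins simplification are all assumptions for illustration (a SQL LEFT JOIN would emit one row per match).

```python
def left_join(documents, lookup_docs, local_key, foreign_key, prefix="joined_"):
    """Attach fields from the first matching lookup document, keeping all inputs."""
    index = {}
    for row in lookup_docs:
        if foreign_key in row:
            index.setdefault(row[foreign_key], row)  # first match wins

    out = []
    for doc in documents:
        enriched = dict(doc)
        match = index.get(doc.get(local_key))
        if match is not None:
            for key, value in match.items():
                if key != foreign_key:
                    enriched[prefix + key] = value
        out.append(enriched)  # unmatched documents are kept unchanged
    return out


products = [{"id": 1, "sku": "A"}, {"id": 2, "sku": "Z"}]
inventory = [{"sku": "A", "stock": 3}]
joined = left_join(products, inventory, local_key="sku", foreign_key="sku")
```

The document with `sku` `"A"` gains a `joined_stock` field; the one with `sku` `"Z"` has no match and is returned untouched.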
Pipeline Patterns
Basic RAG Pipeline
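
A minimal sketch of this pattern chains a vector search, a reranker, and context preparation. Only the stage names come from this page; the dictionary shape (`stage`, `parameters`) and every parameter name and value below are hypothetical placeholders, so check the stage reference for the actual request schema.

```python
# Hypothetical stage list: retrieve broadly, rerank, then format for the LLM.
basic_rag_stages = [
    {"stage": "feature_search",
     "parameters": {"query": "{{INPUT.query}}", "limit": 50}},
    {"stage": "rerank",
     "parameters": {"query": "{{INPUT.query}}", "top_k": 10}},
    {"stage": "rag_prepare",
     "parameters": {"max_tokens": 4000, "include_citations": True}},
]

basic_rag_order = [s["stage"] for s in basic_rag_stages]
```

Retrieving more candidates than needed and letting `rerank` narrow them down is a common design choice, since the reranker only scores what the first stage returns.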
E-Commerce Search
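
One way this pattern might be composed: semantic search, a metadata filter driven by caller input, ordering by a field, and truncation. The stage names are from this page; the config keys, field names (`price`, `rating`), and the filter expression format are assumptions for illustration only.

```python
# Hypothetical stage list for product search with a price ceiling.
ecommerce_stages = [
    {"stage": "feature_search",
     "parameters": {"query": "{{INPUT.query}}", "limit": 200}},
    {"stage": "attribute_filter",
     "parameters": {"field": "price", "operator": "lte",
                    "value": "{{INPUT.max_price}}"}},
    {"stage": "sort_attribute",
     "parameters": {"field": "rating", "direction": "desc"}},
    {"stage": "limit",
     "parameters": {"limit": 20, "offset": 0}},
]

ecommerce_order = [s["stage"] for s in ecommerce_stages]
```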
Research Assistant
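
A possible shape for this pattern combines query expansion, web augmentation, reranking, and a final summary. As with the other sketches, only the stage names are taken from this page; the config keys and parameter values are hypothetical.

```python
# Hypothetical stage list: broaden recall, add web results, rank, summarize.
research_stages = [
    {"stage": "query_expand",
     "parameters": {"query": "{{INPUT.question}}", "num_variations": 3}},
    {"stage": "external_web_search",
     "parameters": {"query": "{{INPUT.question}}", "num_results": 5}},
    {"stage": "rerank",
     "parameters": {"query": "{{INPUT.question}}", "top_k": 15}},
    {"stage": "summarize",
     "parameters": {"prompt": "Answer the question using the documents."}},
]

research_order = [s["stage"] for s in research_stages]
```

This pipeline stacks three stages backed by an LLM or external API, so it trades latency for answer quality.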
Enriched Document Retrieval
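
This pattern could be sketched as a search followed by successive enrichment stages. The stage names come from this page; the config keys, the collection and taxonomy identifiers, and the output field name are placeholders invented for the example.

```python
# Hypothetical stage list: retrieve, join, classify, then generate a field.
enriched_stages = [
    {"stage": "feature_search",
     "parameters": {"query": "{{INPUT.query}}", "limit": 25}},
    {"stage": "document_enrich",
     "parameters": {"collection": "example_collection",   # placeholder ID
                    "local_key": "sku", "foreign_key": "sku"}},
    {"stage": "taxonomy_enrich",
     "parameters": {"taxonomy": "example_taxonomy"}},      # placeholder ID
    {"stage": "llm_enrich",
     "parameters": {"prompt": "Write a one-line summary of {{DOC.title}}.",
                    "output_field": "summary"}},
]

enriched_order = [s["stage"] for s in enriched_stages]
```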
Stage Selection Guide
| Goal | Recommended Stage |
|---|---|
| Find semantically similar documents | feature_search |
| Filter by metadata fields | attribute_filter |
| Filter by content meaning | llm_filter |
| Improve recall with query variations | query_expand |
| Get best relevance ranking | rerank |
| Order by price/date/rating | sort_attribute |
| Re-sort by relevance scores | sort_relevance |
| Diversify results | mmr |
| Normalize scores across sources | score_normalize |
| Truncate to top-N results | limit |
| Remove duplicate results | deduplicate |
| Expand array fields to documents | unwind |
| Answer questions from docs | summarize |
| Compute statistics on results | aggregate |
| Find themes in results | cluster |
| Group by category/author | group_by |
| Random/stratified sampling | sample |
| Add external API data | api_call |
| Add database data | sql_lookup |
| Join Mixpeek collections | document_enrich |
| Classify documents | taxonomy_enrich |
| Complex multi-step classification | agentic_enrich |
| Generate new fields with LLM | llm_enrich |
| Transform document structure | json_transform |
| Prepare for LLM context | rag_prepare |
| Custom code transformations | code_execution |
| Add web search results | external_web_search |
| Extract URL content | web_scrape |
Performance Considerations
| Stage | Typical Latency | Cost |
|---|---|---|
| feature_search | 5-50ms | Index storage |
| attribute_filter | < 5ms | Free |
| llm_filter | 200-500ms | LLM API |
| query_expand | 300-800ms | LLM API |
| rerank | 50-100ms | Inference |
| sort_attribute | < 5ms | Free |
| sort_relevance | < 5ms | Free |
| mmr | 10-50ms | Free |
| score_normalize | < 1ms | Free |
| limit | < 1ms | Free |
| deduplicate | 5-50ms | Free |
| unwind | < 5ms | Free |
| summarize | 500-2000ms | LLM API |
| aggregate | 5-50ms | Free |
| cluster | 50-200ms | Inference |
| group_by | 5-20ms | Free |
| sample | < 5ms | Free |
| llm_enrich | 300-800ms | LLM API |
| agentic_enrich | 2-30s | LLM API (multi-turn) |
| api_call | 50-500ms | External API |
| sql_lookup | 10-100ms | Database |
| code_execution | 5-50ms | Free |
| rag_prepare | < 10ms | Free |
| json_transform | < 5ms | Free |
| external_web_search | 100-500ms | Exa API |
| taxonomy_enrich | 20-100ms | Inference |
| document_enrich | 10-50ms | Database |
| web_scrape | 500-5000ms | External |
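
The typical latencies above can be used for a rough pipeline budget. The sketch below sums the low and high ends of each range for a subset of stages; it assumes stages run strictly one after another, treats "< N ms" as the range 0 to N, and ignores network and queueing overhead. `estimate_latency_ms` is a hypothetical helper, not a Mixpeek API.

```python
# Typical latency ranges in milliseconds, copied from the table above.
# "< N ms" entries are represented as (0, N).
TYPICAL_LATENCY_MS = {
    "feature_search": (5, 50),
    "attribute_filter": (0, 5),
    "llm_filter": (200, 500),
    "query_expand": (300, 800),
    "rerank": (50, 100),
    "limit": (0, 1),
    "summarize": (500, 2000),
    "rag_prepare": (0, 10),
}


def estimate_latency_ms(stages):
    """Return (low, high) total latency, assuming sequential execution."""
    low = sum(TYPICAL_LATENCY_MS[name][0] for name in stages)
    high = sum(TYPICAL_LATENCY_MS[name][1] for name in stages)
    return low, high


budget = estimate_latency_ms(["feature_search", "rerank", "rag_prepare"])
```

For these three stages the estimate is 55 to 160 ms, whereas swapping `rag_prepare` for `summarize` pushes the upper bound above two seconds.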
Template Variables
All stages support template variables for dynamic configuration:

| Variable | Description |
|---|---|
| {{INPUT.*}} | Input parameters from retriever call |
| {{DOC.*}} | Document fields (in APPLY, ENRICH, and GROUP stages) |
| {{CONTEXT.*}} | Pipeline context (index, citations) |
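
To show how such placeholders resolve, here is a small stand-in renderer that substitutes `{{INPUT.*}}`, `{{DOC.*}}`, and `{{CONTEXT.*}}` references from plain dictionaries. It is an illustration only: the real template engine, its syntax rules, and its error handling are not described on this page, and `render` is a hypothetical function.

```python
import re

# Matches {{INPUT.a}}, {{DOC.a.b}}, {{CONTEXT.index}}, with optional spaces.
PLACEHOLDER = re.compile(r"\{\{\s*(INPUT|DOC|CONTEXT)\.([A-Za-z0-9_.]+)\s*\}\}")


def render(template, namespaces):
    """Replace placeholders using nested dict lookups (hypothetical helper).

    namespaces: dict such as {"INPUT": {...}, "DOC": {...}, "CONTEXT": {...}}.
    Raises KeyError if a referenced namespace or field is missing.
    """
    def resolve(match):
        value = namespaces[match.group(1)]
        for part in match.group(2).split("."):
            value = value[part]
        return str(value)

    return PLACEHOLDER.sub(resolve, template)


rendered = render(
    "price <= {{INPUT.max_price}} and brand = {{DOC.brand}}",
    {"INPUT": {"max_price": 100}, "DOC": {"brand": "Acme"}},
)
```

The rendered string is `price <= 100 and brand = Acme`; text without placeholders is returned unchanged.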

