# Retriever Stages

> The core of the warehouse query layer: composable retriever stages for multi-stage search pipelines

Retriever stages are the building blocks of search pipelines in Mixpeek. Each stage performs a specific operation on the document set, allowing you to compose complex retrieval workflows from simple, reusable components. Where a traditional database returns rows, the warehouse query layer returns ranked, enriched, and fused multimodal results assembled stage by stage.

## Stage Categories

Stages are organized into six categories based on how they transform the document set:

<CardGroup cols={2}>
  <Card title="FILTER" icon="filter">
    Reduce the document set by matching criteria. Outputs a subset of input documents.

    **Stages**: feature\_search, attribute\_filter, llm\_filter, agent\_search, query\_expand
  </Card>

  <Card title="SORT" icon="arrow-up-arrow-down">
    Reorder documents by relevance or field values. Same documents, different order.

    **Stages**: sort\_relevance, sort\_attribute, mmr, rerank, score\_normalize
  </Card>

  <Card title="REDUCE" icon="compress">
    Collapse, trim, or aggregate results. Produces a single value or a smaller set from the input.

    **Stages**: aggregate, temporal, sample, summarize, limit, deduplicate, moment\_group, score\_threshold
  </Card>

  <Card title="GROUP" icon="layer-group">
    Reshape results by bucketing documents into logical groups or clusters.

    **Stages**: group\_by, cluster
  </Card>

  <Card title="APPLY" icon="wand-magic-sparkles">
    Transform or restructure documents. May reshape fields, create new documents, or call external services.

    **Stages**: json\_transform, rag\_prepare, external\_web\_search, api\_call, sql\_lookup, cross\_compare, web\_scrape, unwind, code\_execution
  </Card>

  <Card title="ENRICH" icon="sparkles">
    Add knowledge to documents using AI models, taxonomies, or cross-collection joins.

    **Stages**: llm\_enrich, taxonomy\_enrich, document\_enrich, agentic\_enrich
  </Card>
</CardGroup>
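
The category lists above can be turned into a lookup table for checking that a pipeline definition pairs each stage with the right `stage_type`. This is a minimal local sketch, not part of any Mixpeek SDK: `STAGE_CATEGORIES`, `CATEGORY_OF`, and `check_stage_types` are hypothetical names, and the stage envelope (`stage_type`, `config.stage_id`) is taken from the pipeline examples on this page.

```python
# Category membership copied from the cards above.
STAGE_CATEGORIES = {
    "filter": ["feature_search", "attribute_filter", "llm_filter",
               "agent_search", "query_expand"],
    "sort": ["sort_relevance", "sort_attribute", "mmr", "rerank",
             "score_normalize"],
    "reduce": ["aggregate", "temporal", "sample", "summarize", "limit",
               "deduplicate", "moment_group", "score_threshold"],
    "group": ["group_by", "cluster"],
    "apply": ["json_transform", "rag_prepare", "external_web_search",
              "api_call", "sql_lookup", "cross_compare", "web_scrape",
              "unwind", "code_execution"],
    "enrich": ["llm_enrich", "taxonomy_enrich", "document_enrich",
               "agentic_enrich"],
}

# Invert the table: stage_id -> category.
CATEGORY_OF = {
    stage_id: category
    for category, stage_ids in STAGE_CATEGORIES.items()
    for stage_id in stage_ids
}


def check_stage_types(pipeline):
    """Return a list of problems; an empty list means every stage_type matches."""
    problems = []
    for i, stage in enumerate(pipeline):
        stage_id = stage["config"]["stage_id"]
        expected = CATEGORY_OF.get(stage_id)
        if expected is None:
            problems.append(f"stage {i}: unknown stage_id {stage_id!r}")
        elif stage["stage_type"] != expected:
            problems.append(
                f"stage {i}: {stage_id} belongs to {expected!r}, "
                f"not {stage['stage_type']!r}"
            )
    return problems
```

Running a check like this before submitting a pipeline catches typos such as declaring `rerank` as a `filter` stage.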

## All Stages

### Filter Stages

| Stage                                                  | Description                                                                     |
| ------------------------------------------------------ | ------------------------------------------------------------------------------- |
| [Feature Search](/retrieval/stages/feature-search)     | Search by vector similarity using multimodal embeddings                         |
| [Attribute Filter](/retrieval/stages/attribute-filter) | Filter by metadata fields with boolean logic (AND/OR/NOT)                       |
| [LLM Filter](/retrieval/stages/llm-filter)             | Semantic filtering using LLM-based evaluation                                   |
| [Agent Search](/retrieval/stages/agent-search)         | LLM-driven multi-step retrieval with iterative reasoning and tool orchestration |
| [Query Expand](/retrieval/stages/query-expand)         | LLM-powered query expansion with RRF result fusion                              |
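
The Query Expand row mentions RRF (Reciprocal Rank Fusion). The sketch below illustrates the general technique only; it is not Mixpeek's implementation. The function name `rrf_fuse` is hypothetical, and the constant `k = 60` is a commonly used default, assumed here rather than taken from this page. Each document earns `1 / (k + rank)` from every ranked list it appears in, and the sums decide the fused order.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    k is an assumed smoothing constant; larger values flatten the
    difference between top and lower ranks.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# One ranked list per query variation.
fused = rrf_fuse([
    ["a", "b", "c"],
    ["b", "c", "d"],
    ["b", "a"],
])
print(fused)  # ['b', 'a', 'c', 'd']
```

Because RRF uses ranks rather than raw scores, it can combine lists whose scores are not on a comparable scale.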

### Sort Stages

| Stage                                                | Description                                                |
| ---------------------------------------------------- | ---------------------------------------------------------- |
| [Sort Relevance](/retrieval/stages/sort-relevance)   | Reorder by relevance scores                                |
| [Sort Attribute](/retrieval/stages/sort-attribute)   | Order by any metadata field (dates, price, etc.)           |
| [MMR](/retrieval/stages/mmr)                         | Diversify results with Maximal Marginal Relevance          |
| [Rerank](/retrieval/stages/rerank)                   | Re-score with cross-encoder models (e.g., BGE reranker)    |
| [Score Normalize](/retrieval/stages/score-normalize) | Rescale scores to a common range for consistent comparison |
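
To make the MMR row concrete, here is a minimal sketch of Maximal Marginal Relevance over plain Python lists. It shows the general algorithm, not the stage's actual code or parameters: `mmr`, `cosine`, and the `lambda_` trade-off weight are hypothetical names. At each step the candidate with the best balance of query relevance and dissimilarity to already selected documents is picked.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, top_k, lambda_=0.7):
    """Select up to top_k document IDs from doc_vecs (a dict of id -> vector).

    lambda_ = 1.0 ranks by relevance only; lower values favor diversity.
    """
    remaining = list(doc_vecs)
    selected = []
    while remaining and len(selected) < top_k:

        def mmr_score(doc_id):
            relevance = cosine(query_vec, doc_vecs[doc_id])
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


docs = {"a": [1.0, 0.9], "b": [1.0, 0.8], "c": [0.2, 1.0]}
print(mmr([1.0, 1.0], docs, top_k=2, lambda_=0.5))  # ['a', 'c']
print(mmr([1.0, 1.0], docs, top_k=2, lambda_=1.0))  # ['a', 'b']
```

In the example, `b` is nearly identical to `a`, so the diversity-weighted run skips it in favor of `c`.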

### Reduce Stages

| Stage                                                | Description                                                                    |
| ---------------------------------------------------- | ------------------------------------------------------------------------------ |
| [Aggregate](/retrieval/stages/aggregate)             | Compute COUNT, SUM, AVG, percentile, stddev, frequency, correlation on results |
| [Temporal](/retrieval/stages/temporal)               | Group by time windows (hour/day/week/month/quarter/year) with drift detection  |
| [Sample](/retrieval/stages/sample)                   | Random or stratified sampling of results                                       |
| [Summarize](/retrieval/stages/summarize)             | Condense documents into an LLM-generated summary                               |
| [Limit](/retrieval/stages/limit)                     | Truncate results to a maximum count with optional offset                       |
| [Deduplicate](/retrieval/stages/deduplicate)         | Remove duplicate documents by field or content similarity                      |
| [Moment Group](/retrieval/stages/moment-group)       | Merge contiguous temporal intervals into consolidated video moments            |
| [Score Threshold](/retrieval/stages/score-threshold) | Drop results below an absolute score; return none when nothing qualifies       |

### Group Stages

| Stage                                  | Description                                          |
| -------------------------------------- | ---------------------------------------------------- |
| [Group By](/retrieval/stages/group-by) | Group documents by field value (decompose/recompose) |
| [Cluster](/retrieval/stages/cluster)   | Discover themes via embedding-based clustering       |

### Apply Stages

| Stage                                                        | Description                                                |
| ------------------------------------------------------------ | ---------------------------------------------------------- |
| [JSON Transform](/retrieval/stages/json-transform)           | Reshape documents using Jinja2 templates                   |
| [RAG Prepare](/retrieval/stages/rag-prepare)                 | Format for LLM context with token management and citations |
| [External Web Search](/retrieval/stages/external-web-search) | Augment with Exa AI-native web search                      |
| [API Call](/retrieval/stages/api-call)                       | Enrich with external REST API responses                    |
| [SQL Lookup](/retrieval/stages/sql-lookup)                   | Join with PostgreSQL/Snowflake data                        |
| [Cross Compare](/retrieval/stages/cross-compare)             | Multi-tier cross-collection matching with classification   |
| [Web Scrape](/retrieval/stages/web-scrape)                   | Extract full page content from URLs                        |
| [Unwind](/retrieval/stages/unwind)                           | Decompose array fields into separate documents             |
| [Code Execution](/retrieval/stages/code-execution)           | Execute Python/TypeScript/JavaScript in sandboxes          |
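
As an illustration of what the Unwind row means by decomposing array fields, the sketch below expands each element of a list-valued field into its own document. The function name `unwind` and the edge-case handling are assumptions: non-list values pass through unchanged, and documents with an empty list produce no output. The real stage may behave differently.

```python
import copy


def unwind(docs, field):
    """Emit one document per element of docs[i][field] when it is a list."""
    out = []
    for doc in docs:
        values = doc.get(field)
        if not isinstance(values, list):
            out.append(doc)  # nothing to expand
            continue
        for value in values:
            child = copy.deepcopy(doc)  # avoid sharing nested state
            child[field] = value
            out.append(child)
    return out


docs = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": "x"}]
print(unwind(docs, "tags"))
# [{'id': 1, 'tags': 'a'}, {'id': 1, 'tags': 'b'}, {'id': 2, 'tags': 'x'}]
```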

### Enrich Stages

| Stage                                                | Description                                                  |
| ---------------------------------------------------- | ------------------------------------------------------------ |
| [LLM Enrich](/retrieval/stages/llm-enrich)           | Generate new fields with LLM prompts                         |
| [Taxonomy Enrich](/retrieval/stages/taxonomy-enrich) | Classify documents against taxonomy nodes                    |
| [Document Enrich](/retrieval/stages/document-enrich) | Join documents with another collection (LEFT JOIN semantics) |
| [Agentic Enrich](/retrieval/stages/agentic-enrich)   | Multi-turn agent with tool access for complex classification |

## Pipeline Patterns

### Basic RAG Pipeline

```json
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "value": "{{INPUT.query}}"}, "top_k": 50}],
        "final_top_k": 50
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "query": "{{INPUT.query}}",
        "document_field": "content",
        "top_k": 10
      }
    }
  },
  {
    "stage_name": "rag_prepare",
    "stage_type": "apply",
    "config": {
      "stage_id": "rag_prepare",
      "parameters": {
        "max_tokens": 8000,
        "output_mode": "single_context"
      }
    }
  }
]
```
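
Every stage in the pipeline above shares the same envelope: `stage_name`, `stage_type`, and a `config` holding `stage_id` and `parameters`. If you build pipelines programmatically, a small helper can remove the repetition. The `stage` function below is a hypothetical convenience, not an SDK call, and it assumes `stage_name` and `stage_id` are identical, as they are in every example on this page.

```python
import json

FEATURE_URI = "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"


def stage(stage_id, stage_type, **parameters):
    """Wrap parameters in the stage envelope used throughout this page."""
    return {
        "stage_name": stage_id,  # assumed to match stage_id
        "stage_type": stage_type,
        "config": {"stage_id": stage_id, "parameters": parameters},
    }


basic_rag = [
    stage(
        "feature_search",
        "filter",
        searches=[{
            "feature_uri": FEATURE_URI,
            "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
            "top_k": 50,
        }],
        final_top_k=50,
    ),
    stage(
        "rerank",
        "sort",
        inference_name="BAAI__bge_reranker_v2_m3",
        query="{{INPUT.query}}",
        document_field="content",
        top_k=10,
    ),
    stage("rag_prepare", "apply", max_tokens=8000, output_mode="single_context"),
]

# Serialize to the JSON array shown above.
print(json.dumps(basic_rag, indent=2))
```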

### E-Commerce Search

```json
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "value": "{{INPUT.query}}"}, "top_k": 100}],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "attribute_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "AND": [
          {"field": "metadata.in_stock", "operator": "eq", "value": true},
          {"field": "metadata.price", "operator": "lte", "value": "{{INPUT.max_price}}"}
        ]
      }
    }
  },
  {
    "stage_name": "sort_attribute",
    "stage_type": "sort",
    "config": {
      "stage_id": "sort_attribute",
      "parameters": {
        "field": "metadata.{{INPUT.sort_by}}",
        "direction": "{{INPUT.sort_order}}"
      }
    }
  }
]
```
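
To reason about what the `attribute_filter` clause in this pipeline selects, it can help to evaluate the same conditions locally against sample documents. The sketch below is an approximation for testing only: `matches` and `get_path` are hypothetical helpers, only the `eq` and `lte` operators shown above are implemented, and the `OR` and `NOT` shapes are assumed to mirror `AND`.

```python
import operator

# Only the two operators that appear in the example above.
OPERATORS = {"eq": operator.eq, "lte": operator.le}


def get_path(doc, dotted):
    """Resolve a dotted path such as 'metadata.price'; None if missing."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches(doc, clause):
    """Evaluate a filter clause against one document."""
    if "AND" in clause:
        return all(matches(doc, c) for c in clause["AND"])
    if "OR" in clause:  # assumed shape: list of clauses
        return any(matches(doc, c) for c in clause["OR"])
    if "NOT" in clause:  # assumed shape: a single clause
        return not matches(doc, clause["NOT"])
    actual = get_path(doc, clause["field"])
    if actual is None:
        return False  # missing fields never match in this sketch
    return OPERATORS[clause["operator"]](actual, clause["value"])


products = [
    {"id": "p1", "metadata": {"in_stock": True, "price": 30}},
    {"id": "p2", "metadata": {"in_stock": False, "price": 20}},
    {"id": "p3", "metadata": {"in_stock": True, "price": 80}},
]
clause = {
    "AND": [
        {"field": "metadata.in_stock", "operator": "eq", "value": True},
        {"field": "metadata.price", "operator": "lte", "value": 50},
    ]
}
print([p["id"] for p in products if matches(p, clause)])  # ['p1']
```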

### Research Assistant

```json
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "value": "{{INPUT.query}}"}, "top_k": 100}],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "external_web_search",
    "stage_type": "apply",
    "config": {
      "stage_id": "external_web_search",
      "parameters": {
        "query": "{{INPUT.query}}",
        "num_results": 10,
        "category": "research_paper"
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "query": "{{INPUT.query}}",
        "top_k": 15
      }
    }
  },
  {
    "stage_name": "summarize",
    "stage_type": "reduce",
    "config": {
      "stage_id": "summarize",
      "parameters": {
        "provider": "google",
        "model_name": "gemini-2.5-flash-lite",
        "prompt": "Synthesize findings on: {{INPUT.query}}"
      }
    }
  }
]
```

### Enriched Document Retrieval

```json
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "value": "{{INPUT.query}}"}, "top_k": 20}],
        "final_top_k": 20
      }
    }
  },
  {
    "stage_name": "document_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "document_enrich",
      "parameters": {
        "target_collection_id": "col_users",
        "source_field": "metadata.author_id",
        "target_field": "user_id",
        "output_field": "author"
      }
    }
  },
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o-mini",
        "prompt": "Extract key topics and entities from: {{DOC.content}}",
        "output_field": "analysis"
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "query": "{{INPUT.query}}",
        "top_k": 10
      }
    }
  }
]
```
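
The `document_enrich` stage in this pipeline is described elsewhere on this page as a cross-collection LEFT JOIN. The sketch below shows what those semantics could look like in memory, using the same `source_field`, `target_field`, and `output_field` parameter names. `left_join` and `get_path` are hypothetical helpers, and the choices to attach `None` on a miss and to take the first matching target are assumptions.

```python
def get_path(doc, dotted):
    """Resolve a dotted path such as 'metadata.author_id'; None if missing."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def left_join(docs, target_docs, source_field, target_field, output_field):
    """Attach the matching target document to each source document.

    Every source document is kept (LEFT JOIN); unmatched ones get None.
    """
    index = {}
    for target in target_docs:
        index.setdefault(get_path(target, target_field), target)  # first wins
    enriched = []
    for doc in docs:
        key = get_path(doc, source_field)
        match = index.get(key) if key is not None else None
        enriched.append({**doc, output_field: match})
    return enriched


articles = [
    {"id": "d1", "metadata": {"author_id": "u1"}},
    {"id": "d2", "metadata": {"author_id": "u9"}},
]
users = [{"user_id": "u1", "name": "Ada"}]

joined = left_join(articles, users, "metadata.author_id", "user_id", "author")
print([(d["id"], d["author"]) for d in joined])
# [('d1', {'user_id': 'u1', 'name': 'Ada'}), ('d2', None)]
```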

## Stage Selection Guide

| Goal                                      | Recommended Stage     |
| ----------------------------------------- | --------------------- |
| Find semantically similar documents       | feature\_search       |
| Filter by metadata fields                 | attribute\_filter     |
| Filter by content meaning                 | llm\_filter           |
| Improve recall with query variations      | query\_expand         |
| Get best relevance ranking                | rerank                |
| Order by price/date/rating                | sort\_attribute       |
| Re-sort by relevance scores               | sort\_relevance       |
| Diversify results                         | mmr                   |
| Normalize scores across sources           | score\_normalize      |
| Suppress weak results / "no good results" | score\_threshold      |
| Truncate to top-N results                 | limit                 |
| Remove duplicate results                  | deduplicate           |
| Expand array fields to documents          | unwind                |
| Answer questions from docs                | summarize             |
| Compute statistics on results             | aggregate             |
| Find themes in results                    | cluster               |
| Locate moments/scenes in video            | moment\_group         |
| Group by category/author                  | group\_by             |
| Random/stratified sampling                | sample                |
| Add external API data                     | api\_call             |
| Add database data                         | sql\_lookup           |
| Join Mixpeek collections                  | document\_enrich      |
| Classify documents                        | taxonomy\_enrich      |
| Complex multi-step classification         | agentic\_enrich       |
| Generate new fields with LLM              | llm\_enrich           |
| Transform document structure              | json\_transform       |
| Prepare for LLM context                   | rag\_prepare          |
| Custom code transformations               | code\_execution       |
| Add web search results                    | external\_web\_search |
| Extract URL content                       | web\_scrape           |

## Performance Considerations

| Stage                 | Typical Latency | Cost                 |
| --------------------- | --------------- | -------------------- |
| feature\_search       | 5-50ms          | Index storage        |
| attribute\_filter     | \< 5ms          | Free                 |
| llm\_filter           | 200-500ms       | LLM API              |
| query\_expand         | 300-800ms       | LLM API              |
| rerank                | 50-100ms        | Inference            |
| sort\_attribute       | \< 5ms          | Free                 |
| sort\_relevance       | \< 5ms          | Free                 |
| mmr                   | 10-50ms         | Free                 |
| score\_normalize      | \< 1ms          | Free                 |
| score\_threshold      | \< 1ms          | Free                 |
| limit                 | \< 1ms          | Free                 |
| deduplicate           | 5-50ms          | Free                 |
| unwind                | \< 5ms          | Free                 |
| summarize             | 500-2000ms      | LLM API              |
| aggregate             | 5-50ms          | Free                 |
| cluster               | 50-200ms        | Inference            |
| moment\_group         | 5-50ms          | Free                 |
| group\_by             | 5-20ms          | Free                 |
| sample                | \< 5ms          | Free                 |
| llm\_enrich           | 300-800ms       | LLM API              |
| agentic\_enrich       | 2-30s           | LLM API (multi-turn) |
| api\_call             | 50-500ms        | External API         |
| sql\_lookup           | 10-100ms        | Database             |
| code\_execution       | 5-50ms          | Free                 |
| rag\_prepare          | \< 10ms         | Free                 |
| json\_transform       | \< 5ms          | Free                 |
| external\_web\_search | 100-500ms       | Exa API              |
| taxonomy\_enrich      | 20-100ms        | Inference            |
| document\_enrich      | 10-50ms         | Database             |
| web\_scrape           | 500-5000ms      | External             |

<Tip>
  Order stages by cost: run cheap operations such as attribute\_filter before expensive ones such as rerank and LLM-based stages, so that the costly stages process fewer documents.
</Tip>
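
One way to apply this advice is a small lint that flags metadata filters placed after expensive stages. The sketch below is illustrative only: the upper-bound latencies are copied from a few rows of the table above, while `EXPENSIVE_MS` and `late_attribute_filters` are hypothetical. It ignores legitimate cases where a filter depends on a field produced by an earlier enrichment stage.

```python
# Upper-bound latencies in ms, copied from a few rows of the table above.
MAX_LATENCY_MS = {
    "feature_search": 50,
    "attribute_filter": 5,
    "llm_filter": 500,
    "rerank": 100,
    "llm_enrich": 800,
    "summarize": 2000,
    "rag_prepare": 10,
}

EXPENSIVE_MS = 100  # hypothetical cutoff for "expensive"


def late_attribute_filters(stage_ids):
    """Return indexes of attribute_filter stages that follow an expensive stage."""
    flagged = []
    seen_expensive = False
    for i, stage_id in enumerate(stage_ids):
        if stage_id == "attribute_filter" and seen_expensive:
            flagged.append(i)
        if MAX_LATENCY_MS.get(stage_id, 0) >= EXPENSIVE_MS:
            seen_expensive = True
    return flagged


print(late_attribute_filters(["feature_search", "rerank", "attribute_filter"]))  # [2]
print(late_attribute_filters(["feature_search", "attribute_filter", "rerank"]))  # []
```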

## Template Variables

Stage parameters accept template variables for dynamic configuration. `{{DOC.*}}` is available only in the stage categories noted in the table:

| Variable        | Description                                          |
| --------------- | ---------------------------------------------------- |
| `{{INPUT.*}}`   | Input parameters from retriever call                 |
| `{{DOC.*}}`     | Document fields (in APPLY, ENRICH, and GROUP stages) |
| `{{CONTEXT.*}}` | Pipeline context (index, citations)                  |

```json
{
  "stage_name": "attribute_filter",
  "stage_type": "filter",
  "config": {
    "stage_id": "attribute_filter",
    "parameters": {
      "field": "metadata.tenant_id",
      "operator": "eq",
      "value": "{{INPUT.tenant_id}}"
    }
  }
}
```
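
For local testing, template resolution can be approximated with a small recursive renderer. This is a sketch of the idea, not Mixpeek's template engine: `render` and `lookup` are hypothetical names, and the rule that a string consisting of a single template keeps the value's original type (so a numeric `max_price` stays a number) is an assumption.

```python
import re

TEMPLATE = re.compile(r"\{\{\s*(INPUT|DOC|CONTEXT)\.([A-Za-z0-9_.]+)\s*\}\}")


def lookup(namespaces, namespace, path):
    """Walk a dotted path inside one namespace; raises KeyError if missing."""
    current = namespaces[namespace]
    for part in path.split("."):
        current = current[part]
    return current


def render(value, namespaces):
    """Recursively substitute template variables in dicts, lists, and strings."""
    if isinstance(value, dict):
        return {k: render(v, namespaces) for k, v in value.items()}
    if isinstance(value, list):
        return [render(v, namespaces) for v in value]
    if not isinstance(value, str):
        return value
    whole = TEMPLATE.fullmatch(value)
    if whole:
        # The string is exactly one template: keep the value's type.
        return lookup(namespaces, whole.group(1), whole.group(2))
    # Otherwise interpolate each template as text.
    return TEMPLATE.sub(
        lambda m: str(lookup(namespaces, m.group(1), m.group(2))), value
    )


parameters = {
    "field": "metadata.{{INPUT.sort_by}}",
    "value": "{{INPUT.max_price}}",
}
inputs = {"INPUT": {"sort_by": "price", "max_price": 50}}
print(render(parameters, inputs))
# {'field': 'metadata.price', 'value': 50}
```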
