Retriever Stages

Retriever stages are the building blocks of search pipelines in Mixpeek. Each stage performs a specific operation on the document set, allowing you to compose complex retrieval workflows from simple, reusable components. Where a traditional database returns rows, the warehouse query layer returns ranked, enriched, and fused multimodal results assembled stage by stage.

Stage Categories

Stages are organized into six categories based on how they transform the document set:

FILTER

Reduce the document set by matching criteria. Outputs a subset of input documents.

Stages: feature_search, attribute_filter, llm_filter, agent_search, query_expand

SORT

Reorder documents by relevance or field values. Same documents, different order.

Stages: sort_relevance, sort_attribute, mmr, rerank, score_normalize

REDUCE

Collapse results into aggregated values. Produces a single value or smaller set from the input.

Stages: aggregate, temporal, sample, summarize, limit, deduplicate, moment_group, score_threshold

GROUP

Reshape results by bucketing documents into logical groups or clusters.

Stages: group_by, cluster

APPLY

Transform or restructure documents. May reshape fields, create new documents, or call external services.

Stages: json_transform, rag_prepare, external_web_search, api_call, sql_lookup, cross_compare, web_scrape, unwind, code_execution

ENRICH

Add knowledge to documents using AI models, taxonomies, or cross-collection joins.

Stages: llm_enrich, taxonomy_enrich, document_enrich, agentic_enrich
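The category lists above can be captured as a lookup table and used to sanity-check a pipeline before sending it. The sketch below is a client-side convenience, not part of Mixpeek: `check_stage_types` is a hypothetical helper, and it assumes each stage's `stage_type` should be the lowercase name of the category that lists its `stage_id`, as in the pipeline examples on this page.

```python
# Stage names per category, copied from the lists on this page.
STAGE_CATEGORIES = {
    "filter": ["feature_search", "attribute_filter", "llm_filter",
               "agent_search", "query_expand"],
    "sort": ["sort_relevance", "sort_attribute", "mmr", "rerank",
             "score_normalize"],
    "reduce": ["aggregate", "temporal", "sample", "summarize", "limit",
               "deduplicate", "moment_group", "score_threshold"],
    "group": ["group_by", "cluster"],
    "apply": ["json_transform", "rag_prepare", "external_web_search",
              "api_call", "sql_lookup", "cross_compare", "web_scrape",
              "unwind", "code_execution"],
    "enrich": ["llm_enrich", "taxonomy_enrich", "document_enrich",
               "agentic_enrich"],
}

# Inverted view: stage_id -> category.
CATEGORY_OF = {
    stage_id: category
    for category, stage_ids in STAGE_CATEGORIES.items()
    for stage_id in stage_ids
}


def check_stage_types(pipeline):
    """Hypothetical helper: list mismatches between stage_id and stage_type."""
    problems = []
    for position, stage in enumerate(pipeline):
        stage_id = stage["config"]["stage_id"]
        expected = CATEGORY_OF.get(stage_id)
        if expected is None:
            problems.append(f"stage {position}: unknown stage_id {stage_id!r}")
        elif stage["stage_type"] != expected:
            problems.append(
                f"stage {position}: {stage_id!r} is a {expected!r} stage, "
                f"got {stage['stage_type']!r}"
            )
    return problems
```

An empty list means every stage is declared under the category that lists it.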

All Stages

Filter Stages

Stage	Description
Feature Search	Search by vector similarity using multimodal embeddings
Attribute Filter	Filter by metadata fields with boolean logic (AND/OR/NOT)
LLM Filter	Semantic filtering using LLM-based evaluation
Agent Search	LLM-driven multi-step retrieval with iterative reasoning and tool orchestration
Query Expand	LLM-powered query expansion with RRF result fusion
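To make the RRF fusion mentioned for Query Expand concrete, here is a minimal sketch of standard Reciprocal Rank Fusion over several ranked lists of document IDs. It illustrates the general technique only; how Mixpeek implements it internally is not described here, and the function name `rrf_fuse` and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    with ranks starting at 1. Higher fused scores come first; ties are
    broken by document ID so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# One ranked list per query variation (hypothetical document IDs).
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "a"]])
```

A document that ranks consistently well across query variations rises above one that ranks first in a single list only.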

Sort Stages

Stage	Description
Sort Relevance	Reorder by relevance scores
Sort Attribute	Order by any metadata field (dates, price, etc.)
MMR	Diversify results with Maximal Marginal Relevance
Rerank	Re-score with cross-encoder models (e.g., BGE reranker)
Score Normalize	Rescale scores to a common range for consistent comparison
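As an illustration of what the MMR stage does conceptually, the sketch below implements textbook Maximal Marginal Relevance over plain Python vectors. It is not Mixpeek's implementation: the function names, the `lambda_` trade-off parameter, and the use of cosine similarity are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, top_k, lambda_=0.5):
    """Return indices of up to top_k documents chosen by MMR.

    Each step picks the document maximizing
        lambda_ * sim(query, doc) - (1 - lambda_) * max sim(doc, selected)
    so lambda_=1.0 is pure relevance and lower values favor diversity.
    """
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < top_k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 is different.
docs = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]
diverse = mmr([1.0, 1.0], docs, top_k=2, lambda_=0.5)
relevant_only = mmr([1.0, 1.0], docs, top_k=2, lambda_=1.0)
```

With `lambda_=0.5` the second pick skips the near-duplicate in favor of the dissimilar document, while `lambda_=1.0` reduces to ordering by relevance.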

Reduce Stages

Stage	Description
Aggregate	Compute COUNT, SUM, AVG, percentile, stddev, frequency, correlation on results
Temporal	Group by time windows (hour/day/week/month/quarter/year) with drift detection
Sample	Random or stratified sampling of results
Summarize	Condense documents into an LLM-generated summary
Limit	Truncate results to a maximum count with optional offset
Deduplicate	Remove duplicate documents by field or content similarity
Moment Group	Merge contiguous temporal intervals into consolidated video moments
Score Threshold	Drop results below an absolute score; return none when nothing qualifies
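The Moment Group description ("merge contiguous temporal intervals") corresponds to the classic interval-merging pattern. The following is a rough sketch of that idea under stated assumptions: intervals are `(start, end)` pairs in seconds, and `max_gap` is a hypothetical tolerance for treating nearby intervals as contiguous. The stage's actual parameters and behavior are not shown on this page.

```python
def merge_moments(intervals, max_gap=0.0):
    """Merge (start, end) intervals that touch or lie within max_gap seconds."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous moment: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(moment) for moment in merged]


# Three matched video segments; the first two are back to back.
moments = merge_moments([(0, 5), (5, 10), (20, 25)])
```

Raising `max_gap` trades precision for fewer, longer moments.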

Group Stages

Stage	Description
Group By	Group documents by field value (decompose/recompose)
Cluster	Discover themes via embedding-based clustering

Apply Stages

Stage	Description
JSON Transform	Reshape documents using Jinja2 templates
RAG Prepare	Format for LLM context with token management and citations
External Web Search	Augment with Exa AI-native web search
API Call	Enrich with external REST API responses
SQL Lookup	Join with PostgreSQL/Snowflake data
Cross Compare	Multi-tier cross-collection matching with classification
Web Scrape	Extract full page content from URLs
Unwind	Decompose array fields into separate documents
Code Execution	Execute Python/TypeScript/JavaScript in sandboxes

Enrich Stages

Stage	Description
LLM Enrich	Generate new fields with LLM prompts
Taxonomy Enrich	Classify documents against taxonomy nodes
Document Enrich	Cross-collection joins (LEFT JOIN)
Agentic Enrich	Multi-turn agent with tool access for complex classification

Pipeline Patterns

Basic RAG Pipeline

[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "value": "{{INPUT.query}}"}, "top_k": 50}],
        "final_top_k": 50
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "query": "{{INPUT.query}}",
        "document_field": "content",
        "top_k": 10
      }
    }
  },
  {
    "stage_name": "rag_prepare",
    "stage_type": "apply",
    "config": {
      "stage_id": "rag_prepare",
      "parameters": {
        "max_tokens": 8000,
        "output_mode": "single_context"
      }
    }
  }
]
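Every stage in the pattern above shares the same envelope (`stage_name`, `stage_type`, and `config` with `stage_id` and `parameters`), so pipelines can be assembled programmatically. The sketch below rebuilds the Basic RAG pipeline with a small helper; `stage` is a hypothetical client-side function, not a Mixpeek SDK call, and it assumes `stage_name` and `stage_id` are the same string, as they are in the examples on this page.

```python
import json


def stage(stage_id, stage_type, **parameters):
    """Hypothetical helper: wrap parameters in the stage envelope used above."""
    return {
        "stage_name": stage_id,
        "stage_type": stage_type,
        "config": {"stage_id": stage_id, "parameters": parameters},
    }


basic_rag = [
    stage(
        "feature_search",
        "filter",
        searches=[{
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
            "top_k": 50,
        }],
        final_top_k=50,
    ),
    stage(
        "rerank",
        "sort",
        inference_name="BAAI__bge_reranker_v2_m3",
        query="{{INPUT.query}}",
        document_field="content",
        top_k=10,
    ),
    stage("rag_prepare", "apply", max_tokens=8000, output_mode="single_context"),
]

# Serialize to the JSON shape shown above.
payload = json.dumps(basic_rag, indent=2)
```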

E-Commerce Search

[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "value": "{{INPUT.query}}"}, "top_k": 100}],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "attribute_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "AND": [
          {"field": "metadata.in_stock", "operator": "eq", "value": true},
          {"field": "metadata.price", "operator": "lte", "value": "{{INPUT.max_price}}"}
        ]
      }
    }
  },
  {
    "stage_name": "sort_attribute",
    "stage_type": "sort",
    "config": {
      "stage_id": "sort_attribute",
      "parameters": {
        "field": "metadata.{{INPUT.sort_by}}",
        "direction": "{{INPUT.sort_order}}"
      }
    }
  }
]

Research Assistant

[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "value": "{{INPUT.query}}"}, "top_k": 100}],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "external_web_search",
    "stage_type": "apply",
    "config": {
      "stage_id": "external_web_search",
      "parameters": {
        "query": "{{INPUT.query}}",
        "num_results": 10,
        "category": "research_paper"
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "query": "{{INPUT.query}}",
        "top_k": 15
      }
    }
  },
  {
    "stage_name": "summarize",
    "stage_type": "reduce",
    "config": {
      "stage_id": "summarize",
      "parameters": {
        "provider": "google",
        "model_name": "gemini-2.5-flash-lite",
        "prompt": "Synthesize findings on: {{INPUT.query}}"
      }
    }
  }
]

Enriched Document Retrieval

[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "value": "{{INPUT.query}}"}, "top_k": 20}],
        "final_top_k": 20
      }
    }
  },
  {
    "stage_name": "document_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "document_enrich",
      "parameters": {
        "target_collection_id": "col_users",
        "source_field": "metadata.author_id",
        "target_field": "user_id",
        "output_field": "author"
      }
    }
  },
  {
    "stage_name": "llm_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o-mini",
        "prompt": "Extract key topics and entities from: {{DOC.content}}",
        "output_field": "analysis"
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "query": "{{INPUT.query}}",
        "top_k": 10
      }
    }
  }
]
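To clarify what the `document_enrich` parameters in this pattern express, here is a plain-Python sketch of LEFT JOIN semantics using the same four names (`source_field`, `target_field`, `output_field`, and a list standing in for the target collection). It is only an illustration of the join logic; the helper names and the choice to attach `None` when no match exists are assumptions, not documented Mixpeek behavior.

```python
def get_path(doc, dotted):
    """Read a dotted path such as 'metadata.author_id'; None if missing."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def left_join(docs, targets, source_field, target_field, output_field):
    """Attach the matching target record to each doc, keeping unmatched docs."""
    index = {target[target_field]: target for target in targets}
    enriched = []
    for doc in docs:
        copy = dict(doc)
        copy[output_field] = index.get(get_path(doc, source_field))
        enriched.append(copy)
    return enriched


# Hypothetical documents and user records.
docs = [
    {"id": 1, "metadata": {"author_id": "u1"}},
    {"id": 2, "metadata": {"author_id": "u9"}},
]
users = [{"user_id": "u1", "name": "Ada"}]
joined = left_join(docs, users, "metadata.author_id", "user_id", "author")
```

Every input document survives the join, which is what distinguishes a LEFT JOIN from an inner join.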

Stage Selection Guide

Goal	Recommended Stage
Find semantically similar documents	feature_search
Filter by metadata fields	attribute_filter
Filter by content meaning	llm_filter
Improve recall with query variations	query_expand
Get best relevance ranking	rerank
Order by price/date/rating	sort_attribute
Re-sort by relevance scores	sort_relevance
Diversify results	mmr
Normalize scores across sources	score_normalize
Suppress weak results / “no good results”	score_threshold
Truncate to top-N results	limit
Remove duplicate results	deduplicate
Expand array fields to documents	unwind
Answer questions from docs	summarize
Compute statistics on results	aggregate
Find themes in results	cluster
Locate moments/scenes in video	moment_group
Group by category/author	group_by
Random/stratified sampling	sample
Add external API data	api_call
Add database data	sql_lookup
Join Mixpeek collections	document_enrich
Classify documents	taxonomy_enrich
Complex multi-step classification	agentic_enrich
Generate new fields with LLM	llm_enrich
Transform document structure	json_transform
Prepare for LLM context	rag_prepare
Custom code transformations	code_execution
Add web search results	external_web_search
Extract URL content	web_scrape

Performance Considerations

Stage	Typical Latency	Cost
feature_search	5-50ms	Index storage
attribute_filter	< 5ms	Free
llm_filter	200-500ms	LLM API
query_expand	300-800ms	LLM API
rerank	50-100ms	Inference
sort_attribute	< 5ms	Free
sort_relevance	< 5ms	Free
mmr	10-50ms	Free
score_normalize	< 1ms	Free
score_threshold	< 1ms	Free
limit	< 1ms	Free
deduplicate	5-50ms	Free
unwind	< 5ms	Free
summarize	500-2000ms	LLM API
aggregate	5-50ms	Free
cluster	50-200ms	Inference
moment_group	5-50ms	Free
group_by	5-20ms	Free
sample	< 5ms	Free
llm_enrich	300-800ms	LLM API
agentic_enrich	2-30s	LLM API (multi-turn)
api_call	50-500ms	External API
sql_lookup	10-100ms	Database
code_execution	5-50ms	Free
rag_prepare	< 10ms	Free
json_transform	< 5ms	Free
external_web_search	100-500ms	Exa API
taxonomy_enrich	20-100ms	Inference
document_enrich	10-50ms	Database
web_scrape	500-5000ms	External

Order stages efficiently: run cheap operations (filters, sorts) before expensive ones (rerank, LLM calls), so that the costly stages process fewer documents.
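The latency table can be turned into a rough budget check. The sketch below sums the typical ranges for a pipeline and flags `attribute_filter` stages that run after an expensive stage. It assumes stages run sequentially and that the typical ranges simply add up, which is a simplification; the 200 ms threshold and both function names are hypothetical, and only a subset of the table is included.

```python
# Typical latency ranges in milliseconds, copied from the table above.
# "< N ms" entries are recorded as (0, N).
LATENCY_MS = {
    "feature_search": (5, 50),
    "attribute_filter": (0, 5),
    "llm_filter": (200, 500),
    "rerank": (50, 100),
    "summarize": (500, 2000),
    "llm_enrich": (300, 800),
    "rag_prepare": (0, 10),
}

EXPENSIVE_MS = 200  # hypothetical cutoff for "expensive" stages


def estimate_latency(stage_ids):
    """Return (low, high) total latency in ms, assuming sequential stages."""
    low = sum(LATENCY_MS[s][0] for s in stage_ids)
    high = sum(LATENCY_MS[s][1] for s in stage_ids)
    return low, high


def late_attribute_filters(stage_ids):
    """Return positions of attribute_filter stages that follow an expensive stage."""
    seen_expensive = False
    flagged = []
    for position, stage_id in enumerate(stage_ids):
        if stage_id == "attribute_filter" and seen_expensive:
            flagged.append(position)
        if LATENCY_MS[stage_id][1] >= EXPENSIVE_MS:
            seen_expensive = True
    return flagged


budget = estimate_latency(["feature_search", "rerank", "rag_prepare"])
```

A flagged position suggests moving that metadata filter earlier so the expensive stage sees fewer documents.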

Template Variables

All stages support template variables for dynamic configuration:

Variable	Description
`{{INPUT.*}}`	Input parameters from retriever call
`{{DOC.*}}`	Document fields (in APPLY, ENRICH, and GROUP stages)
`{{CONTEXT.*}}`	Pipeline context (index, citations)

{
  "stage_name": "attribute_filter",
  "stage_type": "filter",
  "config": {
    "stage_id": "attribute_filter",
    "parameters": {
      "field": "metadata.tenant_id",
      "operator": "eq",
      "value": "{{INPUT.tenant_id}}"
    }
  }
}
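To show how `{{INPUT.*}}` placeholders relate to the values passed in a retriever call, here is a simplified client-side sketch of template substitution over a stage config. It is not Mixpeek's template engine: the regex, the `resolve` function, and the rule that a string consisting of a single placeholder keeps the input's original type (so a numeric `max_price` stays numeric) are all assumptions made for illustration.

```python
import re

# Matches {{INPUT.name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*INPUT\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def resolve(value, inputs):
    """Recursively replace {{INPUT.*}} placeholders in a config structure."""
    if isinstance(value, dict):
        return {key: resolve(item, inputs) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve(item, inputs) for item in value]
    if isinstance(value, str):
        whole = PLACEHOLDER.fullmatch(value)
        if whole:
            # The string is exactly one placeholder: keep the input's type.
            return inputs[whole.group(1)]
        # Placeholder embedded in a longer string: substitute as text.
        return PLACEHOLDER.sub(lambda m: str(inputs[m.group(1)]), value)
    return value


parameters = {
    "field": "metadata.{{INPUT.sort_by}}",
    "value": "{{INPUT.max_price}}",
    "limit": 10,
}
resolved = resolve(parameters, {"sort_by": "price", "max_price": 50})
```

In this sketch a missing input raises `KeyError`, which surfaces a misconfigured call early instead of passing a literal placeholder downstream.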
