# Retrievers

> Compose stage-based search pipelines over your collections

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/mixpeek-retrievers.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=3fba235556bd21f93dd44694527ba255" alt="Mixpeek Retrievers" width="1200" height="950" data-path="assets/mixpeek-retrievers.svg" />
</Frame>

Retrievers combine feature-aware search stages, structured filters, enrichment joins, and optional LLM post-processing into a single executable pipeline. Each retriever has an input schema, a list of target collections, and an ordered list of stages that run sequentially, each stage receiving the output of the previous one.

<Note>
  **Multi-stage retrieval is what makes Mixpeek a warehouse, not a database.** No other system offers composable filter → sort → reduce → enrich → apply pipelines over multimodal data. This is the query language for unstructured data — the equivalent of SQL for embeddings across modalities. See the [Retrieval Cookbook](/retrieval/cookbook) for ready-to-copy pipeline configurations.
</Note>

## Anatomy of a Retriever

```json theme={null}
{
  "retriever_name": "product_search_v2",
  "description": "Product search with enrichment and transformation",
  "collection_identifiers": ["col_products"],
  "input_schema": {
    "query_text": { "type": "text", "required": true },
    "max_price": { "type": "number" }
  },
  "stages": [
    {
      "stage_name": "enrich_catalog",
      "stage_type": "enrich",
      "config": {
        "stage_id": "document_enrich",
        "parameters": {
          "target_collection_id": "col_catalog",
          "source_field": "metadata.product_id",
          "target_field": "product_id",
          "fields_to_merge": ["name", "price", "category"],
          "output_field": "catalog_data"
        }
      }
    },
    {
      "stage_name": "fetch_billing",
      "stage_type": "apply",
      "config": {
        "stage_id": "api_call",
        "parameters": {
          "url": "https://api.stripe.com/v1/customers/{{DOC.metadata.customer_id}}",
          "method": "GET",
          "allowed_domains": ["api.stripe.com"],
          "auth": {
            "type": "bearer",
            "secret_ref": "stripe_api_key"
          },
          "output_field": "metadata.billing",
          "on_error": "skip"
        }
      }
    },
    {
      "stage_name": "reshape",
      "stage_type": "apply",
      "config": {
        "stage_id": "json_transform",
        "parameters": {
          "template": "{\"id\": \"{{DOC.document_id}}\", \"title\": \"{{DOC.metadata.title}}\", \"price\": {{DOC.catalog_data.price}}}",
          "fail_on_error": false
        }
      }
    }
  ],
  "cache_config": {
    "enabled": true,
    "ttl_seconds": 300
  }
}
```

## Minimal Working Example

A minimal retriever runs a single feature search. Note the following fields:

* **`collection_identifiers`** is required at the retriever root level when using `feature_search` stages — the search needs to know which collections to query. (Accepts collection names or IDs.)
* **`input_schema`** is a flat map of input field name → type definition. It's required when your stages use `{{INPUT.*}}` template variables — Mixpeek validates inputs against this schema before execution.
* Each stage needs a **`stage_name`** and a **`config`** object containing the **`stage_id`** (e.g. `feature_search`) and its **`parameters`**. `stage_type` (the category, e.g. `filter`) is optional but recommended.

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/retrievers" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "basic_search",
    "collection_identifiers": ["col_my_collection"],
    "input_schema": {
      "query": { "type": "text", "required": true }
    },
    "stages": [
      {
        "stage_name": "search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
                "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
                "top_k": 100
              }
            ],
            "final_top_k": 20
          }
        }
      }
    ]
  }'
```

Execute it:

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/retrievers/<retriever_id>/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{"inputs": {"query": "search terms here"}, "limit": 10}'
```
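The input validation mentioned above can be approximated locally before sending a request. This is a minimal sketch, not Mixpeek's validator: `validate_inputs` and the `TYPE_CHECKS` mapping are hypothetical, and only the `text` and `number` types that appear on this page are covered.

```python
from typing import Any

# Assumed mapping from schema type names to Python checks; only "text" and
# "number" appear in the examples on this page.
TYPE_CHECKS = {
    "text": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def validate_inputs(input_schema: dict[str, dict[str, Any]], inputs: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the inputs pass."""
    problems = []
    for name, spec in input_schema.items():
        if name not in inputs:
            if spec.get("required", False):
                problems.append(f"missing required input: {name}")
            continue
        check = TYPE_CHECKS.get(spec["type"])
        if check is not None and not check(inputs[name]):
            problems.append(f"input {name!r} is not of type {spec['type']}")
    return problems


schema = {
    "query_text": {"type": "text", "required": True},
    "max_price": {"type": "number"},
}
ok = validate_inputs(schema, {"query_text": "wireless earbuds", "max_price": 150})
bad = validate_inputs(schema, {"max_price": "cheap"})
print(ok)
print(bad)
```

Running a check like this client-side gives faster feedback, but the server-side schema check remains the source of truth.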

## Stage Catalog

Stages are the building blocks of retriever pipelines. Each stage belongs to a **category** that defines its behavior:

| Category   | Behavior                                               | Example Use Cases                                               |
| ---------- | ------------------------------------------------------ | --------------------------------------------------------------- |
| **filter** | Reduce the number of documents while preserving schema | Attribute filters, semantic search, hybrid search               |
| **sort**   | Reorder documents without changing the set             | Attribute sort, score-based ordering, reranking                 |
| **reduce** | Collapse results into aggregated values                | Top-k selection, deduplication, sampling, summarization         |
| **group**  | Reshape results by bucketing into logical groups       | Group by field, semantic clustering                             |
| **apply**  | Transform or restructure documents                     | JSON transforms, API calls, code execution, web scrape          |
| **enrich** | Add knowledge to documents using AI or joins           | LLM enrichment, taxonomy classification, cross-collection joins |

<Note>
  Retrieve the live registry with `GET /v1/retrievers/stages`. Each entry includes `stage_id`, category, icon, and parameter schema so you can dynamically build configuration UIs or validations.

  **Live stages:** [https://api.mixpeek.com/v1/retrievers/stages](https://api.mixpeek.com/v1/retrievers/stages)
</Note>

<CodeGroup>
  ```bash theme={null}
  curl -s --request GET \
    --url "$MP_API_URL/v1/retrievers/stages" \
    --header "Authorization: Bearer $MP_API_KEY" \
    --header "X-Namespace: $MP_NAMESPACE"
  ```

  ```json theme={null}
  [
    {
      "stage_id": "api_call",
      "description": "Enrich documents with external API calls",
      "category": "apply",
      "icon": "external-link"
    },
    {
      "stage_id": "json_transform",
      "description": "Transform document structure using Jinja2 templates",
      "category": "apply",
      "icon": "code"
    },
    {
      "stage_id": "external_web_search",
      "description": "Search the web using Exa AI-native search",
      "category": "apply",
      "icon": "globe"
    },
    {
      "stage_id": "document_enrich",
      "description": "Join and enrich documents with data from another collection",
      "category": "enrich",
      "icon": "link"
    },
    {
      "stage_id": "cross_compare",
      "description": "Multi-tier cross-collection comparison with classification",
      "category": "apply",
      "icon": "code-compare"
    }
  ]
  ```
</CodeGroup>
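As an illustration of building a UI or validation layer on top of the registry, the sketch below groups a registry response by category. It works on a trimmed copy of the sample response above rather than a live call, and `group_by_category` is a hypothetical helper.

```python
import json
from collections import defaultdict

# Trimmed copy of the sample registry response shown above.
registry_json = """
[
  {"stage_id": "api_call", "category": "apply"},
  {"stage_id": "json_transform", "category": "apply"},
  {"stage_id": "external_web_search", "category": "apply"},
  {"stage_id": "document_enrich", "category": "enrich"},
  {"stage_id": "cross_compare", "category": "apply"}
]
"""


def group_by_category(entries: list[dict]) -> dict[str, list[str]]:
    """Map each category to its stage IDs, preserving response order."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["category"]].append(entry["stage_id"])
    return dict(grouped)


stages_by_category = group_by_category(json.loads(registry_json))
for category, stage_ids in stages_by_category.items():
    print(f"{category}: {', '.join(stage_ids)}")
```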

### Filter Stages

Filter stages reduce the document set while preserving the document schema. Use these at the start of your pipeline to narrow down candidates.

<Note>
  Use `GET /v1/retrievers/stages?category=filter` to retrieve the current list of filter stages and their parameter schemas.
</Note>

### Sort Stages

Sort stages reorder documents without changing the result set. Place these after filters to control ranking.

<Note>
  Use `GET /v1/retrievers/stages?category=sort` to retrieve the current list of sort stages and their parameter schemas.
</Note>

### Reduce Stages

Reduce stages collapse results into aggregated values. Use these for deduplication, sampling, or summarization.

<Note>
  Use `GET /v1/retrievers/stages?category=reduce` to retrieve the current list of reduce stages and their parameter schemas.
</Note>

### Group Stages

Group stages reshape results by bucketing documents into logical groups or clusters.

<Note>
  Use `GET /v1/retrievers/stages?category=group` to retrieve the current list of group stages and their parameter schemas.
</Note>

### Apply Stages

Apply stages transform or restructure documents. Use these to reshape output, call external services, or run custom code.

| Stage ID              | Description                                              | Transformation                     |
| --------------------- | -------------------------------------------------------- | ---------------------------------- |
| `cross_compare`       | Multi-tier comparison against a reference collection     | N → M (findings) or N → N (enrich) |
| `api_call`            | Enrich documents with external API calls                 | N → N                              |
| `json_transform`      | Transform document structure using Jinja2 templates      | N → N                              |
| `external_web_search` | Search the web using Exa AI-native search                | 0 → M (creates documents)          |
| `code_execution`      | Execute custom Python/TypeScript/JavaScript in sandboxes | N → M (custom logic)               |

### Enrich Stages

Enrich stages add knowledge to documents using AI models, taxonomies, or cross-collection joins.

| Stage ID          | Description                                      | Transformation                     |
| ----------------- | ------------------------------------------------ | ---------------------------------- |
| `llm_enrich`      | Generate new fields with LLM prompts             | N → N (with extracted data added)  |
| `taxonomy_enrich` | Classify documents against taxonomy nodes        | N → N (with taxonomy labels added) |
| `document_enrich` | Join documents with data from another collection | N → N (LEFT JOIN)                  |

***

## Stage Details

### document\_enrich

Joins documents with data from another collection, similar to a SQL LEFT JOIN. Each input document produces exactly one output document with added fields from the target collection.

**When to use:**

* Combine data from multiple collections (e.g., products + catalog info)
* Attach user profiles, metadata, or related entities
* Denormalize data at query time

**Parameters:**

| Parameter              | Required | Description                                                           |
| ---------------------- | -------- | --------------------------------------------------------------------- |
| `target_collection_id` | Yes      | Collection to join with                                               |
| `source_field`         | Yes\*    | Field in current documents to match                                   |
| `target_field`         | Yes\*    | Field in target collection to match against                           |
| `fields_to_merge`      | No       | Specific fields to merge (or entire document if omitted)              |
| `output_field`         | No       | Where to place enrichment (root or nested path)                       |
| `retriever_id`         | No       | Use an existing retriever for lookup instead of direct field matching |
| `retriever_config`     | No       | Anonymous retriever definition for complex lookups                    |
| `retriever_inputs`     | No       | Template inputs when using retriever-based enrichment                 |
| `strategy`             | No       | `enrich` (merge fields) or `append` (add as nested object)            |
| `allow_missing`        | No       | Keep documents without matches (default: true)                        |
| `when`                 | No       | Conditional filter for selective enrichment                           |
| `cache_behavior`       | No       | `auto`, `disabled`, or `aggressive`                                   |
| `cache_ttl_seconds`    | No       | Cache TTL in seconds                                                  |

\*Required for direct joins; not needed when using `retriever_id` or `retriever_config`.
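The LEFT JOIN behavior of a direct field join can be pictured with a small in-memory sketch. It is not the stage's implementation: `get_path` and `left_join` are hypothetical helpers, `output_field` is treated as a root-level key only, and unmatched documents are kept, mirroring `allow_missing: true`.

```python
from typing import Any


def get_path(doc: dict, path: str) -> Any:
    """Read a dot-separated path such as 'metadata.product_id'; None if absent."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def left_join(docs, targets, source_field, target_field, fields_to_merge, output_field):
    """Attach selected target fields to each doc; unmatched docs pass through."""
    index = {}
    for target in targets:
        key = get_path(target, target_field)
        if key is not None:
            index[key] = target
    joined = []
    for doc in docs:
        key = get_path(doc, source_field)
        match = index.get(key) if key is not None else None
        out = dict(doc)  # shallow copy so the input list is not mutated
        if match is not None:
            out[output_field] = {f: match.get(f) for f in fields_to_merge}
        joined.append(out)
    return joined


results = [
    {"document_id": "d1", "metadata": {"product_id": "p1"}},
    {"document_id": "d2", "metadata": {"product_id": "p9"}},  # no catalog match
]
catalog = [{"product_id": "p1", "name": "Earbuds", "price": 79, "category": "audio", "stock": 4}]

enriched = left_join(
    results, catalog, "metadata.product_id", "product_id",
    ["name", "price", "category"], "catalog_data",
)
print(enriched)
```

Every input document yields exactly one output document; only `d1` gains a `catalog_data` field, and `stock` is dropped because it is not listed in `fields_to_merge`.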

**Examples:**

<CodeGroup>
  ```json Direct Field Join theme={null}
  {
    "stage_name": "document_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "document_enrich",
      "parameters": {
        "target_collection_id": "col_products",
        "source_field": "metadata.product_id",
        "target_field": "product_id",
        "fields_to_merge": ["name", "price", "category"],
        "output_field": "product_data"
      }
    }
  }
  ```

  ```json Retriever-Based Enrichment theme={null}
  {
    "stage_name": "document_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "document_enrich",
      "parameters": {
        "target_collection_id": "col_similar_items",
        "retriever_id": "ret_find_similar_products",
        "retriever_inputs": {
          "query": "{{DOC.description}}",
          "category": "{{DOC.metadata.category}}"
        },
        "fields_to_merge": ["name", "price", "image_url"],
        "output_field": "similar_products",
        "strategy": "append"
      }
    }
  }
  ```

  ```json Conditional Enrichment theme={null}
  {
    "stage_name": "document_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "document_enrich",
      "parameters": {
        "target_collection_id": "col_specs",
        "source_field": "metadata.sku",
        "target_field": "sku",
        "fields_to_merge": ["specifications", "warranty"],
        "output_field": "metadata.technical_details",
        "when": {
          "field": "metadata.category",
          "operator": "eq",
          "value": "electronics"
        }
      }
    }
  }
  ```
</CodeGroup>

***

### api\_call

Enriches documents by calling external HTTP APIs. Enables integration with third-party services (Stripe, GitHub, weather APIs, etc.) to augment documents with real-time data.

<Warning>
  **Security**: This stage makes external HTTP requests. Always use `allowed_domains` to prevent SSRF attacks. Never store credentials directly—use `auth.secret_ref` to reference vault-stored secrets.
</Warning>

**Parameters:**

| Parameter           | Required | Description                                                             |
| ------------------- | -------- | ----------------------------------------------------------------------- |
| `url`               | Yes      | API endpoint URL (supports `{DOC.field}` and `{INPUT.field}` templates) |
| `allowed_domains`   | Yes      | Domain allowlist for SSRF protection (never use `*`)                    |
| `output_field`      | Yes      | Dot-path where API response should be stored                            |
| `method`            | No       | HTTP method: GET, POST, PUT, PATCH, DELETE (default: GET)               |
| `auth`              | No       | Authentication configuration (see below)                                |
| `headers`           | No       | Additional HTTP headers                                                 |
| `body`              | No       | Request body for POST/PUT/PATCH (JSON, supports templates)              |
| `timeout`           | No       | Request timeout in seconds (1-60, default: 10)                          |
| `max_response_size` | No       | Maximum response size in bytes (default: 10MB)                          |
| `response_path`     | No       | JSONPath to extract specific field from response                        |
| `rate_limit`        | No       | Rate limiting config (`requests_per_minute`, `requests_per_hour`)       |
| `when`              | No       | Conditional filter for selective enrichment                             |
| `on_error`          | No       | Error handling: `skip`, `remove`, or `raise` (default: skip)            |
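To see why `allowed_domains` matters, here is a minimal sketch of an allowlist check built on Python's standard `urllib.parse`. The function `is_allowed` is hypothetical, and two choices in it are assumptions rather than documented Mixpeek behavior: hosts must match exactly (no subdomain wildcards) and only `https` is accepted.

```python
from urllib.parse import urlsplit


def is_allowed(url: str, allowed_domains: list[str]) -> bool:
    """True only for https URLs whose host exactly matches an allowlisted domain."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return parts.scheme == "https" and host in {d.lower() for d in allowed_domains}


allowlist = ["api.stripe.com"]
print(is_allowed("https://api.stripe.com/v1/customers/cus_123", allowlist))
print(is_allowed("https://api.stripe.com.evil.example/v1", allowlist))
print(is_allowed("https://api.stripe.com@evil.example/", allowlist))
```

Comparing the parsed hostname, rather than searching the URL string for the domain, is what rejects the look-alike host and the userinfo trick in the last two calls.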

**Authentication Types:**

| Type            | Description                                   | Required Fields                                |
| --------------- | --------------------------------------------- | ---------------------------------------------- |
| `none`          | No authentication (public APIs)               | —                                              |
| `bearer`        | Bearer token (OAuth 2.0, JWT)                 | `secret_ref`                                   |
| `api_key`       | API key in header or query param              | `secret_ref`, `key`, `location` (header/query) |
| `basic`         | HTTP Basic Auth (username:password in secret) | `secret_ref`                                   |
| `custom_header` | Custom header with arbitrary name             | `secret_ref`, `key`                            |

**Examples:**

<CodeGroup>
  ```json Stripe Customer Lookup theme={null}
  {
    "stage_name": "api_call",
    "stage_type": "apply",
    "config": {
      "stage_id": "api_call",
      "parameters": {
        "url": "https://api.stripe.com/v1/customers/{DOC.metadata.stripe_id}",
        "method": "GET",
        "allowed_domains": ["api.stripe.com"],
        "auth": {
          "type": "bearer",
          "secret_ref": "stripe_api_key"
        },
        "output_field": "metadata.stripe_data",
        "timeout": 10,
        "on_error": "skip"
      }
    }
  }
  ```

  ```json GitHub API (Public) theme={null}
  {
    "stage_name": "api_call",
    "stage_type": "apply",
    "config": {
      "stage_id": "api_call",
      "parameters": {
        "url": "https://api.github.com/repos/{INPUT.owner}/{INPUT.repo}",
        "method": "GET",
        "allowed_domains": ["api.github.com"],
        "output_field": "metadata.github_info",
        "response_path": "$.stargazers_count"
      }
    }
  }
  ```

  ```json POST with API Key theme={null}
  {
    "stage_name": "api_call",
    "stage_type": "apply",
    "config": {
      "stage_id": "api_call",
      "parameters": {
        "url": "https://api.example.com/v1/analyze",
        "method": "POST",
        "allowed_domains": ["api.example.com"],
        "auth": {
          "type": "api_key",
          "key": "X-API-Key",
          "location": "header",
          "secret_ref": "example_api_key"
        },
        "headers": {
          "Content-Type": "application/json"
        },
        "body": {
          "text": "{DOC.text}",
          "language": "en"
        },
        "output_field": "metadata.analysis"
      }
    }
  }
  ```
</CodeGroup>

***

### json\_transform

Applies a Jinja2 template to each document, rendering the template with full document context and replacing the document with the parsed JSON output. Use this to reformat documents for external APIs or reshape data for downstream consumers.

**Parameters:**

| Parameter       | Required | Description                                                   |
| --------------- | -------- | ------------------------------------------------------------- |
| `template`      | Yes      | Jinja2 template string that must render to valid JSON         |
| `fail_on_error` | No       | Fail entire pipeline on transformation error (default: false) |

**Template Context:**

| Namespace             | Description                                           |
| --------------------- | ----------------------------------------------------- |
| `DOC` / `doc`         | Current document fields and metadata                  |
| `INPUT` / `inputs`    | Original query inputs from the search request         |
| `CONTEXT` / `context` | Execution context (namespace\_id, internal\_id, etc.) |
| `STAGE` / `stage`     | Current stage execution data                          |

**Examples:**

<CodeGroup>
  ```json Field Selection theme={null}
  {
    "stage_name": "json_transform",
    "stage_type": "apply",
    "config": {
      "stage_id": "json_transform",
      "parameters": {
        "template": "{\"id\": \"{{ DOC.document_id }}\", \"content\": \"{{ DOC.text }}\", \"score\": {{ DOC.score }}}"
      }
    }
  }
  ```

  ```json Conditional Fields theme={null}
  {
    "stage_name": "json_transform",
    "stage_type": "apply",
    "config": {
      "stage_id": "json_transform",
      "parameters": {
        "template": "{\"workflow_name\": \"process-asset\", \"inputs\": [{\"name\": \"id\", \"value\": \"{{ DOC.id }}\"}{% if DOC.asset_type == \"VIDEO\" %}, {\"name\": \"video\", \"value\": {\"src\": \"{{ DOC.url }}\"}}{% endif %}]}"
      }
    }
  }
  ```

  ```json Array Iteration theme={null}
  {
    "stage_name": "json_transform",
    "stage_type": "apply",
    "config": {
      "stage_id": "json_transform",
      "parameters": {
        "template": "{\"title\": \"{{ DOC.title }}\", \"tags\": [{% for tag in DOC.tags %}\"{{ tag }}\"{% if not loop.last %}, {% endif %}{% endfor %}]}"
      }
    }
  }
  ```

  ```json JSON Escaping theme={null}
  {
    "stage_name": "json_transform",
    "stage_type": "apply",
    "config": {
      "stage_id": "json_transform",
      "parameters": {
        "template": "{\"user_id\": \"{{ DOC.metadata.user_id }}\", \"raw_data\": {{ DOC.metadata.raw | tojson }}}"
      }
    }
  }
  ```
</CodeGroup>
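The reason the JSON Escaping example pipes the value through `tojson` can be shown without Jinja2. In the sketch below, Python's `json.dumps` stands in for the filter (an assumption made for illustration), and the sample values are made up.

```python
import json

raw = {"note": 'He said "hi"', "tags": ["a", "b"]}
user_id = "u_42"

# Naive interpolation: the dict is rendered with Python's repr, which is not JSON.
naive = '{"user_id": "%s", "raw_data": %s}' % (user_id, raw)

# Serializing each value first, which is the role a tojson-style filter plays.
safe = '{"user_id": %s, "raw_data": %s}' % (json.dumps(user_id), json.dumps(raw))


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(naive))
print(is_valid_json(safe))
```

Since the template must render to valid JSON, any nested object or string that may contain quotes is safer serialized than interpolated between literal quotes.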

***

### external\_web\_search

Performs AI-native web search using Exa's neural ranking system. Creates new documents from web search results, enabling retriever pipelines to incorporate real-time internet content.

<Note>
  This stage **creates** new documents (0 → M transformation) rather than enriching existing ones. Use it at the start of a pipeline or to augment internal results with external web sources.
</Note>

**Parameters:**

| Parameter              | Required | Description                                                                             |
| ---------------------- | -------- | --------------------------------------------------------------------------------------- |
| `query`                | Yes      | Search query (supports `{INPUT.field}` and `{DOC.field}` templates)                     |
| `num_results`          | No       | Number of results (1-100, default: 10)                                                  |
| `use_autoprompt`       | No       | Enable Exa's query enhancement (default: true)                                          |
| `start_published_date` | No       | Filter by publication date (YYYY-MM-DD format)                                          |
| `category`             | No       | Content category: `research paper`, `news`, `github`, `tweet`, `blog`, `company`, `pdf` |
| `include_text`         | No       | Include text snippets in results (default: true)                                        |

**Output Schema:**

Each result becomes a document with:

* `metadata.url` – Web page URL
* `metadata.title` – Page title
* `metadata.text` – Text snippet (if `include_text=true`)
* `metadata.published_date` – Publication date (if available)
* `metadata.author` – Author name (if available)
* `metadata.search_query` – Original query used
* `metadata.search_position` – 0-indexed position in results
* `score` – Exa relevance score

**Examples:**

<CodeGroup>
  ```json Basic Web Search theme={null}
  {
    "stage_name": "external_web_search",
    "stage_type": "apply",
    "config": {
      "stage_id": "external_web_search",
      "parameters": {
        "query": "{INPUT.query}",
        "num_results": 10,
        "include_text": true,
        "use_autoprompt": true
      }
    }
  }
  ```

  ```json Research Papers theme={null}
  {
    "stage_name": "external_web_search",
    "stage_type": "apply",
    "config": {
      "stage_id": "external_web_search",
      "parameters": {
        "query": "neural network architectures",
        "num_results": 20,
        "category": "research paper",
        "include_text": true
      }
    }
  }
  ```

  ```json Recent News theme={null}
  {
    "stage_name": "external_web_search",
    "stage_type": "apply",
    "config": {
      "stage_id": "external_web_search",
      "parameters": {
        "query": "{INPUT.company_name} latest product launches",
        "num_results": 5,
        "category": "news",
        "start_published_date": "2024-10-01",
        "include_text": true
      }
    }
  }
  ```
</CodeGroup>

***

### cross\_compare

Compares source documents against a reference collection using a cascading match strategy (exact → fuzzy → semantic → visual). Each comparison produces a classified finding with a score and confidence level. Ideal for drift detection, deduplication, and compliance checking.

<Note>
  This stage can output either **findings** (N → M, one document per comparison) or **enriched documents** (N → N, results attached as a field). Set `output_mode` to control this behavior.
</Note>

**Parameters:**

| Parameter                   | Required | Description                                                                     |
| --------------------------- | -------- | ------------------------------------------------------------------------------- |
| `reference_collection_id`   | Yes      | Collection containing reference documents to compare against                    |
| `source_field`              | No       | Field on source documents to extract comparison elements (default: `content`)   |
| `reference_field`           | No       | Field on reference documents containing comparison content (default: `content`) |
| `extraction_mode`           | No       | Element extraction: `raw`, `lines`, `labels`, `list` (default: `raw`)           |
| `match_tiers`               | No       | Ordered matching cascade (default: `["exact", "fuzzy"]`)                        |
| `fuzzy_threshold`           | No       | Minimum fuzzy score (default: 0.75)                                             |
| `semantic_threshold`        | No       | Minimum semantic similarity (default: 0.85)                                     |
| `visual_threshold`          | No       | Minimum visual similarity (default: 0.55)                                       |
| `classifications`           | No       | Score-to-label mapping rules (see below)                                        |
| `output_mode`               | No       | `findings` (N→M) or `enrich` (N→N) (default: `findings`)                        |
| `output_field`              | No       | Field name for enrich mode results (default: `comparison_results`)              |
| `include_visual_comparison` | No       | Enable DINOv2 + SigLIP visual comparison (default: false)                       |
| `reference_limit`           | No       | Max reference documents to fetch (default: 200)                                 |
| `source_doc_type_filter`    | No       | Only process source docs with this doc\_type                                    |

**Examples:**

<CodeGroup>
  ```json Content Drift Detection theme={null}
  {
    "stage_name": "cross_compare",
    "stage_type": "apply",
    "config": {
      "stage_id": "cross_compare",
      "parameters": {
        "reference_collection_id": "col_documentation",
        "source_field": "content",
        "reference_field": "content",
        "extraction_mode": "labels",
        "match_tiers": ["exact", "fuzzy", "semantic"],
        "include_visual_comparison": true,
        "classifications": [
          {"min_score": 0.95, "label": "current"},
          {"min_score": 0.75, "label": "needs_review"},
          {"min_score": 0.0, "label": "outdated"}
        ]
      }
    }
  }
  ```

  ```json Catalog Matching (Enrich Mode) theme={null}
  {
    "stage_name": "cross_compare",
    "stage_type": "apply",
    "config": {
      "stage_id": "cross_compare",
      "parameters": {
        "reference_collection_id": "col_internal_catalog",
        "source_field": "product_name",
        "reference_field": "product_name",
        "match_tiers": ["exact", "fuzzy"],
        "fuzzy_threshold": 0.80,
        "output_mode": "enrich",
        "output_field": "catalog_match",
        "classifications": [
          {"min_score": 0.95, "label": "exact_match"},
          {"min_score": 0.80, "label": "likely_match"},
          {"min_score": 0.0, "label": "no_match"}
        ]
      }
    }
  }
  ```

  ```json Deduplication Check theme={null}
  {
    "stage_name": "cross_compare",
    "stage_type": "apply",
    "config": {
      "stage_id": "cross_compare",
      "parameters": {
        "reference_collection_id": "col_existing_corpus",
        "source_field": "content",
        "reference_field": "content",
        "extraction_mode": "lines",
        "match_tiers": ["exact", "fuzzy", "semantic"],
        "semantic_threshold": 0.90,
        "output_mode": "enrich",
        "output_field": "duplication_analysis",
        "classifications": [
          {"min_score": 0.95, "label": "duplicate"},
          {"min_score": 0.80, "label": "near_duplicate"},
          {"min_score": 0.0, "label": "unique"}
        ]
      }
    }
  }
  ```
</CodeGroup>

***

Call `GET /v1/retrievers/stages` to retrieve the latest stage metadata and parameter schemas.

## Execution Lifecycle

1. **Validate Inputs** – Mixpeek enforces the retriever’s `input_schema`.
2. **Walk Stages** – Each stage receives the current working set, runs, and outputs a new set.
3. **Apply Pagination** – The selected pagination method (`offset`, `cursor`, `scroll`, or `keyset`) is applied after the final stage.
4. **Return Telemetry** – Responses include `stage_statistics`, `budget`, and optional presigned URLs.

Response headers include:

* `ETag` – cache validator; pair with `If-None-Match` for 304 responses.
* `Cache-Control` – TTL derived from `cache_config`.
* `X-Cache` – `HIT` or `MISS` for query-level caching.
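The stage walk, pagination, and telemetry steps of the lifecycle can be modeled in a few lines. The sketch below is a toy executor, not the Mixpeek engine: stages are plain Python callables, input validation is omitted, and `output_count` is a hypothetical statistic added for illustration.

```python
import time
from typing import Callable

Stage = Callable[[list[dict]], list[dict]]


def execute(stages: dict[str, Stage], candidates: list[dict], limit: int, offset: int = 0) -> dict:
    """Run stages in order, then paginate, recording per-stage timing and counts."""
    working_set = candidates
    stats = {}
    for name, stage in stages.items():  # dicts preserve insertion order
        started = time.perf_counter()
        working_set = stage(working_set)
        stats[name] = {
            "duration_ms": round((time.perf_counter() - started) * 1000, 3),
            "output_count": len(working_set),
        }
    # Pagination is applied only after the final stage has run.
    return {"documents": working_set[offset:offset + limit], "stage_statistics": stats}


pipeline = {
    "in_stock": lambda docs: [d for d in docs if d["stock"] > 0],      # filter
    "by_score": lambda docs: sorted(docs, key=lambda d: -d["score"]),  # sort
}
docs = [
    {"id": "a", "score": 0.4, "stock": 3},
    {"id": "b", "score": 0.9, "stock": 0},
    {"id": "c", "score": 0.7, "stock": 1},
]
response = execute(pipeline, docs, limit=1)
print([d["id"] for d in response["documents"]])
```

Document `b` has the highest score but is removed by the filter before sorting, which is why stage order matters.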

## Filters & Templates

Structured filters support comparison operators (`eq`, `gt`, `lte`, `in`, etc.) and logical composition (`AND`, `OR`, `NOT`).
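To make the semantics concrete, the sketch below evaluates a filter tree against a single document. The leaf shape (`field`, `operator`, `value`) follows the examples on this page; the nesting shape for `AND`, `OR`, and `NOT` is an assumption made for illustration, only the four operators named above are implemented, and `matches` is a hypothetical helper.

```python
import operator
from typing import Any

OPERATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "lte": operator.le,
    "in": lambda value, options: value in options,
}


def get_path(doc: dict, path: str) -> Any:
    """Read a dot-separated path; None if any segment is missing."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(doc: dict, node: dict) -> bool:
    """Evaluate a filter node: logical nodes recurse, leaf nodes compare one field."""
    if "AND" in node:
        return all(matches(doc, child) for child in node["AND"])
    if "OR" in node:
        return any(matches(doc, child) for child in node["OR"])
    if "NOT" in node:
        return not matches(doc, node["NOT"])
    value = get_path(doc, node["field"])
    if value is None:  # a missing field never satisfies a comparison
        return False
    return OPERATORS[node["operator"]](value, node["value"])


audio_under_150 = {
    "AND": [
        {"field": "metadata.category", "operator": "eq", "value": "audio"},
        {"field": "metadata.price", "operator": "lte", "value": 150},
    ]
}
doc = {"metadata": {"category": "audio", "price": 79}}
print(matches(doc, audio_under_150))
print(matches(doc, {"NOT": audio_under_150}))
```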

### Template Namespaces

Stages support dynamic configuration through template expressions using Jinja2 syntax. Both uppercase and lowercase namespace formats are supported and work identically:

| Namespace             | Description                                          | Examples                                                      |
| --------------------- | ---------------------------------------------------- | ------------------------------------------------------------- |
| `INPUT` / `inputs`    | User-provided query parameters and inputs            | `{{INPUT.query_text}}`, `{{inputs.max_price}}`                |
| `DOC` / `doc`         | Current document fields (for per-document logic)     | `{{DOC.metadata.category}}`, `{{doc.content_type}}`           |
| `CONTEXT` / `context` | Execution state (budget, timing, retriever metadata) | `{{CONTEXT.budget_remaining}}`, `{{context.time_elapsed_ms}}` |
| `STAGE` / `stage`     | Previous stage outputs (for cascading logic)         | `{{STAGE.hybrid_search.top_score}}`, `{{stage.filter.count}}` |

<Note>
  Mixed usage within the same stage is supported. For example, you can use `{{INPUT.query}}` alongside `{{context.budget_remaining}}` in the same configuration.
</Note>

**Conditional expressions:**

```json theme={null}
{
  "batch_size": "{{CONTEXT.budget_remaining > 50 ? 200 : 50}}",
  "field": "{{DOC.media_type == 'image' ? 'image_url' : 'video_url'}}"
}
```

**Templated batch size:**

```json theme={null}
{
  "batch_size": "{{20 * inputs.page_size}}"
}
```

## Retrievers & Caching

* **Query cache** – caches entire responses keyed by inputs, filters, pagination, and collection index signatures.
* **Stage cache** – reuses the outputs of expensive stages listed under `cache_stage_names`.
* **Inference cache** – the Engine deduplicates identical model calls.

Use `GET /v1/analytics/retrievers/{id}/cache-performance` to monitor hit rates and latency improvements.
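One way to picture the query cache key is as a hash over a canonical serialization of the request parts listed above. The sketch below is an assumption about how such a key could be derived, not Mixpeek's actual scheme; `query_cache_key` and the signature strings are hypothetical.

```python
import hashlib
import json


def query_cache_key(inputs: dict, filters: dict, pagination: dict, index_signatures: list[str]) -> str:
    """Hash the request parts into a stable key; dict key order does not matter."""
    payload = {
        "inputs": inputs,
        "filters": filters,
        "pagination": pagination,
        "index_signatures": sorted(index_signatures),
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


filters = {"field": "metadata.category", "operator": "eq", "value": "audio"}
key_a = query_cache_key({"query_text": "earbuds", "max_price": 150}, filters, {"limit": 10}, ["sig_v1"])
key_b = query_cache_key({"max_price": 150, "query_text": "earbuds"}, filters, {"limit": 10}, ["sig_v1"])
key_c = query_cache_key({"query_text": "earbuds", "max_price": 150}, filters, {"limit": 10}, ["sig_v2"])
print(key_a == key_b)
print(key_a == key_c)
```

Including the index signature in the key means that a change to a collection's index produces a different key, so stale responses are not served.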

## Pagination Options

| Method   | Use Case                                                |
| -------- | ------------------------------------------------------- |
| `offset` | Simple pagination, supports `limit` + `offset`          |
| `cursor` | Stable iteration over large result sets                 |
| `scroll` | Deep pagination for analytics workloads                 |
| `keyset` | High-performance paginated browsing (requires sort key) |

Specify the method in `pagination.method` when executing a retriever.
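As a rough illustration of `offset` pagination, the helper below generates one request body per page. It is a sketch under assumptions: only `pagination.method` is confirmed above, so placing `limit` and `offset` inside the `pagination` object is a guess, and `offset_page_requests` is a hypothetical helper rather than an SDK function.

```python
def offset_page_requests(inputs, page_size, pages):
    """Yield one execute-request body per page using offset pagination."""
    for page in range(pages):
        yield {
            "inputs": inputs,
            "pagination": {
                "method": "offset",
                # Assumption: limit/offset are nested under "pagination".
                "limit": page_size,
                "offset": page * page_size,
            },
        }


bodies = list(
    offset_page_requests({"query_text": "wireless earbuds"}, page_size=10, pages=3)
)
print([body["pagination"]["offset"] for body in bodies])  # [0, 10, 20]
```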

## Execute a Retriever

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/retrievers/<retriever_id>/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "query_text": "wireless earbuds",
      "max_price": 150
    },
    "filters": {
      "field": "metadata.category",
      "operator": "eq",
      "value": "audio"
    },
    "limit": 10,
    "return_urls": true,
    "return_vectors": false,
    "session_id": "sess_123"
  }'
```

Response snippet:

```json theme={null}
{
  "execution_id": "exec_b8f31e0c",
  "documents": [...],
  "stage_statistics": {
    "hybrid_search": { "duration_ms": 180, "cache_hit": true },
    "filter": { "duration_ms": 8 },
    "rerank": { "duration_ms": 120 }
  },
  "budget": {
    "credits_used": 12.4,
    "credits_limit": 100,
    "time_elapsed_ms": 310
  }
}
```
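The `stage_statistics` and `budget` blocks are convenient for spotting slow stages. The sketch below summarizes the response shown above; `summarize_execution` is a hypothetical helper, and it assumes every stage entry carries `duration_ms`, as in the snippet.

```python
def summarize_execution(response):
    """Condense stage timings and budget usage from an execute response."""
    stats = response["stage_statistics"]
    budget = response["budget"]
    return {
        "slowest_stage": max(stats, key=lambda name: stats[name]["duration_ms"]),
        "stage_time_ms": sum(stage["duration_ms"] for stage in stats.values()),
        # cache_hit is only present on some stages, so default to False.
        "cached_stages": [n for n, s in stats.items() if s.get("cache_hit", False)],
        "credits_used_fraction": round(
            budget["credits_used"] / budget["credits_limit"], 3
        ),
    }


response = {
    "execution_id": "exec_b8f31e0c",
    "documents": [],  # elided in the snippet above
    "stage_statistics": {
        "hybrid_search": {"duration_ms": 180, "cache_hit": True},
        "filter": {"duration_ms": 8},
        "rerank": {"duration_ms": 120},
    },
    "budget": {"credits_used": 12.4, "credits_limit": 100, "time_elapsed_ms": 310},
}

summary = summarize_execution(response)
print(summary["slowest_stage"], summary["stage_time_ms"])  # hybrid_search 308
```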

## Batch Execution

Execute a retriever against multiple inputs in a single request. The retriever is fetched and optimized once, then executed concurrently across all queries with bounded parallelism.

<CodeGroup>
  ```bash cURL theme={null}
  curl -X POST "$MP_API_URL/v1/retrievers/<retriever_id>/execute/batch" \
    -H "Authorization: Bearer $MP_API_KEY" \
    -H "X-Namespace: $MP_NAMESPACE" \
    -H "Content-Type: application/json" \
    -d '{
      "queries": [
        {"inputs": {"image": "https://example.com/suspect-1.jpg"}},
        {"inputs": {"image": "https://example.com/suspect-2.jpg"}},
        {"inputs": {"image": "https://example.com/suspect-3.jpg"}}
      ],
      "concurrency": 5
    }'
  ```

  ```python Python theme={null}
  # Assumes `client` is an already-initialized Mixpeek SDK client.
  results = client.retrievers.execute_batch(
      retriever_id="ret_abc123",
      queries=[
          {"inputs": {"image": "https://example.com/suspect-1.jpg"}},
          {"inputs": {"image": "https://example.com/suspect-2.jpg"}},
          {"inputs": {"image": "https://example.com/suspect-3.jpg"}},
      ],
      concurrency=5,
  )

  for result in results:
      print(f"Query {result.query_index}: {result.status}, {len(result.documents)} docs")
  ```
</CodeGroup>

| Parameter     | Type    | Default  | Description                                                          |
| ------------- | ------- | -------- | -------------------------------------------------------------------- |
| `queries`     | array   | required | 1–50 query objects, each with `inputs` and optional `filters`        |
| `concurrency` | integer | `5`      | Max parallel executions (1–20)                                       |
| `settings`    | object  | `null`   | Shared settings applied to every query (e.g., `limit`, `max_chunks`) |
| `stream`      | boolean | `false`  | Stream results via Server-Sent Events                                |

Response:

```json theme={null}
{
  "retriever_id": "ret_abc123",
  "total_queries": 3,
  "completed": 3,
  "failed": 0,
  "results": [
    {
      "query_index": 0,
      "status": "completed",
      "documents": [...],
      "execution_id": "exec_a1b2c3"
    }
  ],
  "total_duration_ms": 61200
}
```

### Streaming

For pipelines with LLM stages that take longer than a few seconds per query, use `"stream": true` to receive results as each query completes:

```bash theme={null}
curl -N -X POST "$MP_API_URL/v1/retrievers/<retriever_id>/execute/batch" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "queries": [
      {"inputs": {"image": "https://example.com/suspect-1.jpg"}},
      {"inputs": {"image": "https://example.com/suspect-2.jpg"}}
    ],
    "concurrency": 5,
    "stream": true
  }'
```

The response is an SSE stream:

```
: stream-start
: keepalive

data: {"event_type":"query_complete","query_index":0,"total_queries":2,"status":"completed","documents":[...]}

data: {"event_type":"query_complete","query_index":1,"total_queries":2,"status":"completed","documents":[...]}

data: {"event_type":"batch_complete","retriever_id":"ret_abc123","total_queries":2,"completed":2,"failed":0,"total_duration_ms":61200}
```

Keepalive comments (`: keepalive`) are sent every 15 seconds to keep the connection alive through proxies. Results can arrive out of order because each one is emitted as soon as its query finishes, so use `query_index` to match results to inputs.
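A client consuming this stream needs to skip comment lines and restore input order. The following is a minimal sketch using only the standard library. It assumes each event fits on a single `data:` line, as in the example above, and `parse_sse_events` and `collect_batch` are hypothetical helper names.

```python
import json


def parse_sse_events(lines):
    """Yield decoded JSON payloads from SSE lines, skipping comments and blanks."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(":"):
            continue  # blank separator or a comment such as ": keepalive"
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())


def collect_batch(lines):
    """Return (per-query results ordered by query_index, batch summary or None)."""
    results, summary = {}, None
    for event in parse_sse_events(lines):
        if event["event_type"] == "query_complete":
            results[event["query_index"]] = event
        elif event["event_type"] == "batch_complete":
            summary = event
    return [results[i] for i in sorted(results)], summary


# Sample stream in which query 1 finishes before query 0.
stream = [
    ": stream-start",
    ": keepalive",
    "",
    'data: {"event_type":"query_complete","query_index":1,"total_queries":2,"status":"completed","documents":[]}',
    "",
    'data: {"event_type":"query_complete","query_index":0,"total_queries":2,"status":"completed","documents":[]}',
    "",
    'data: {"event_type":"batch_complete","retriever_id":"ret_abc123","total_queries":2,"completed":2,"failed":0,"total_duration_ms":61200}',
]

ordered, summary = collect_batch(stream)
print([r["query_index"] for r in ordered], summary["completed"])  # [0, 1] 2
```

In a real client, `lines` would be the decoded line iterator of the HTTP response rather than a list.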

### When to use batch

| Scenario                                    | Approach                                                             |
| ------------------------------------------- | -------------------------------------------------------------------- |
| Single query                                | `/execute`                                                           |
| 2–50 queries, fast pipeline (under 10s)     | `/execute/batch`                                                     |
| 2–50 queries, LLM-heavy pipeline (over 10s) | `/execute/batch` with `"stream": true`                               |
| More than 50 queries                        | Split into batches of 50, call sequentially or from multiple threads |
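The table can be turned into a small planning helper on the client side. This is a sketch, not SDK functionality: `plan_requests` is a hypothetical name, and the `llm_heavy` flag stands in for your own judgment about whether the pipeline takes over 10 seconds per query.

```python
MAX_BATCH_SIZE = 50  # upper bound on `queries` from the parameter table


def plan_requests(queries, llm_heavy=False):
    """Map query objects to (endpoint suffix, request body) pairs."""
    if len(queries) == 1:
        return [("/execute", queries[0])]
    plans = []
    for start in range(0, len(queries), MAX_BATCH_SIZE):
        chunk = queries[start:start + MAX_BATCH_SIZE]
        # Stream only when individual queries are slow enough to benefit.
        plans.append(("/execute/batch", {"queries": chunk, "stream": llm_heavy}))
    return plans


queries = [
    {"inputs": {"image": f"https://example.com/img-{i}.jpg"}} for i in range(120)
]
plans = plan_requests(queries, llm_heavy=True)
print([len(body["queries"]) for _, body in plans])  # [50, 50, 20]
```

Each planned request can then be sent sequentially or handed to a thread pool.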

## Anonymous Retrievers

Anonymous retrievers let you execute a retriever pipeline in a single API call without persisting it. This is ideal for:

* **Prototyping and experimentation** – Test stage configurations quickly without cluttering your retriever list
* **Dynamic pipelines** – Build pipelines on-the-fly based on user context or application logic
* **One-off queries** – Run complex searches that don't need to be reused
* **CI/CD testing** – Validate retriever configurations in automated tests without creating permanent resources

<Tip>
  Use anonymous retrievers during development to iterate quickly, then promote working configurations to named retrievers for production use.
</Tip>

```bash theme={null}
curl -sS -X POST "$MP_API_URL/v1/retrievers/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_identifiers": ["col_products"],
    "input_schema": {
      "query_text": { "type": "text", "required": true }
    },
    "stages": [
      {
        "stage_name": "search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "searches": [
              {
                "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
                "query": { "input_mode": "text", "value": "{{INPUT.query_text}}" },
                "top_k": 50
              }
            ],
            "final_top_k": 50
          }
        }
      },
      {
        "stage_name": "rerank",
        "stage_type": "sort",
        "config": {
          "stage_id": "rerank",
          "parameters": {
            "inference_name": "BAAI__bge_reranker_v2_m3"
          }
        }
      }
    ],
    "inputs": {
      "query_text": "noise cancelling headphones"
    }
  }'
```

The response format is identical to that of named retriever execution. The key difference is that the retriever is not persisted, so no `retriever_id` is created or stored.
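For the dynamic-pipeline use case, the request body can be assembled in application code. The sketch below reuses the stage definitions from the example above and toggles the rerank stage per call. `build_anonymous_request` is a hypothetical helper, and the resulting body would be sent to `/v1/retrievers/execute`.

```python
import json

FEATURE_URI = "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"


def build_anonymous_request(query_text, collection, rerank=True, top_k=50):
    """Assemble an anonymous retriever body, optionally appending a rerank stage."""
    stages = [
        {
            "stage_name": "search",
            "stage_type": "filter",
            "config": {
                "stage_id": "feature_search",
                "parameters": {
                    "searches": [
                        {
                            "feature_uri": FEATURE_URI,
                            "query": {
                                "input_mode": "text",
                                "value": "{{INPUT.query_text}}",
                            },
                            "top_k": top_k,
                        }
                    ],
                    "final_top_k": top_k,
                },
            },
        }
    ]
    if rerank:
        stages.append(
            {
                "stage_name": "rerank",
                "stage_type": "sort",
                "config": {
                    "stage_id": "rerank",
                    "parameters": {"inference_name": "BAAI__bge_reranker_v2_m3"},
                },
            }
        )
    return {
        "collection_identifiers": [collection],
        "input_schema": {"query_text": {"type": "text", "required": True}},
        "stages": stages,
        "inputs": {"query_text": query_text},
    }


body = build_anonymous_request("noise cancelling headphones", "col_products", rerank=False)
print([stage["stage_name"] for stage in body["stages"]])  # ['search']
payload = json.dumps(body)  # ready to send as the request body
```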

## Publishing & Display Config

Retrievers can be published as public search interfaces hosted at `mxp.co`. Publishing requires a `display_config` that controls how the UI renders inputs, results, and styling.

Set `display_config` when creating a retriever, or add it to an existing retriever via `PATCH /v1/retrievers/{id}`:

```json theme={null}
{
  "display_config": {
    "title": "Product Search",
    "description": "Search through thousands of products",
    "logo_url": "https://example.com/logo.png",
    "theme": {
      "primary_color": "#4F46E5",
      "mode": "light"
    },
    "inputs": [
      {
        "field_name": "query_text",
        "field_schema": { "type": "text", "required": true }
      }
    ],
    "exposed_fields": ["title", "thumbnail_url", "price", "category"],
    "layout": {
      "results_layout": "grid",
      "columns": 3
    },
    "field_config": {
      "thumbnail_url": {
        "format": "image",
        "format_options": { "width": 400, "height": 300, "aspect_ratio": "16/9" }
      },
      "price": {
        "format": "number",
        "format_options": { "label": "Price", "decimals": 2, "prefix": "$" }
      }
    }
  }
}
```

**Key fields:**

| Field            | Required | Description                                                                            |
| ---------------- | -------- | -------------------------------------------------------------------------------------- |
| `title`          | Yes      | Page heading for the public search interface                                           |
| `description`    | No       | Subtitle/description text                                                              |
| `logo_url`       | No       | URL to logo image                                                                      |
| `inputs`         | Yes      | List of input fields to render (maps to `input_schema`)                                |
| `exposed_fields` | Yes      | Document fields shown in results (at least one required)                               |
| `layout`         | No       | Result layout: `grid`, `list`, or `table` with column count                            |
| `theme`          | No       | Primary color, dark/light mode                                                         |
| `field_config`   | No       | Per-field display format: `text`, `image`, `date`, `number`, `url`, `boolean`, `array` |
| `seo`            | No       | SEO metadata (auto-generated from title/description if omitted)                        |
| `template_type`  | No       | Built-in template: `portrait-gallery`, `media-search`, `document-search`               |
| `field_mappings` | No       | Maps template slots (e.g., `thumbnail`, `title`) to actual field names                 |

Once `display_config` is set, publish the retriever with `POST /v1/retrievers/{id}/publish`.
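A quick local check against the key-fields table can catch an incomplete `display_config` before publishing. This sketch encodes only the rules stated in the table above. `validate_display_config` is a hypothetical helper, and server-side validation may be stricter.

```python
REQUIRED_FIELDS = ("title", "inputs", "exposed_fields")
LAYOUTS = {"grid", "list", "table"}
FORMATS = {"text", "image", "date", "number", "url", "boolean", "array"}


def validate_display_config(config):
    """Return a list of problems found; an empty list means the checks passed."""
    errors = []
    for field in REQUIRED_FIELDS:
        # Empty strings and empty lists count as missing, which also enforces
        # "at least one" exposed field.
        if not config.get(field):
            errors.append(f"missing required field: {field}")
    layout = config.get("layout", {}).get("results_layout")
    if layout is not None and layout not in LAYOUTS:
        errors.append(f"unknown results_layout: {layout}")
    for name, options in config.get("field_config", {}).items():
        fmt = options.get("format")
        if fmt is not None and fmt not in FORMATS:
            errors.append(f"unknown format for {name}: {fmt}")
    return errors


config = {
    "title": "Product Search",
    "inputs": [
        {"field_name": "query_text", "field_schema": {"type": "text", "required": True}}
    ],
    "exposed_fields": [],
    "layout": {"results_layout": "carousel", "columns": 3},
    "field_config": {"price": {"format": "number"}},
}
print(validate_display_config(config))
# ['missing required field: exposed_fields', 'unknown results_layout: carousel']
```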

***

## Maintenance & Versioning

* Use `PATCH /v1/retrievers/{id}` to rename retrievers or adjust cache settings (stages and schema are immutable; create a new retriever for breaking changes).
* List retrievers with filters, search, and sort: `POST /v1/retrievers/list`.
* Retrieve execution history: `GET /v1/retrievers/{id}/executions`.
* Diagnose pipelines without executing: `POST /v1/retrievers/{id}/explain`.

## Interaction Feedback

Capture user feedback with `/v1/retrievers/interactions` to power downstream analytics, learning-to-rank, or personalized retrieval:

```json theme={null}
{
  "feature_id": "doc_abc123",
  "interaction_type": ["click", "long_view"],
  "position": 2,
  "metadata": { "duration_ms": 12000 },
  "user_id": "user_456",
  "session_id": "sess_xyz789"
}
```

## Best Practices

1. **Start narrow** – run a single search stage before adding rerankers or joins.
2. **Push filters early** – stage-level filters shrink the candidate set before expensive operations.
3. **Use JOIN strategies wisely** – `direct` for key-based joins, `retriever` for similarity joins; set `join_strategy` to control merge behavior.
4. **Enable caching** – stage caching plus query caching dramatically reduces latency for repeat queries.
5. **Monitor analytics** – use retriever analytics endpoints to optimize parameters, detect slow stages, and understand cache ROI.

Retrievers turn Mixpeek’s primitives (features, taxonomies, clusters, and models) into end-user search experiences. Configure once, execute anywhere, and evolve the pipeline with confidence.
