
# Semantic Search

> Build search with vector embeddings and hybrid ranking

<Tip>Semantic search is one stage in the warehouse's Reassemble layer. This tutorial covers the basics; see [Multi-Stage Retrieval](/retrieval/multi-stage-deep-dive) for composing it with other stages.</Tip>

## 1. Create a Bucket

```bash theme={null}
POST /v1/buckets
{
  "bucket_name": "knowledge-base",
  "bucket_schema": {
    "properties": {
      "title": { "type": "text", "required": true },
      "content": { "type": "text", "required": true },
      "category": { "type": "text" },
      "tags": { "type": "array" }
    }
  }
}
```

## 2. Create a Collection

```bash theme={null}
POST /v1/collections
{
  "collection_name": "docs-search",
  "source": { "type": "bucket", "bucket_ids": ["bkt_kb"] },
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": { "text": "content" },
    "parameters": {
      "model": "multilingual-e5-large-instruct",
      "chunk_strategy": "sentence",
      "chunk_size": 512,
      "chunk_overlap": 50
    },
    "field_passthrough": [
      { "source_path": "title" },
      { "source_path": "category" },
      { "source_path": "tags" }
    ]
  }
}
```

**Chunking strategies:**

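* `sentence` – Splits on sentence boundaries. Best for Q&A, where answers tend to be short and self-contained.
* `paragraph` – Splits on paragraph boundaries. Best for long-form content, where meaning spans several sentences.
* `fixed` – Splits into fixed-size windows, giving predictable token counts per chunk.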
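
To make the `sentence` strategy and the `chunk_size` / `chunk_overlap` parameters more concrete, here is a minimal Python sketch of sentence-based chunking with overlap. It is an illustration only, not Mixpeek's implementation: the `chunk_sentences` function, the regex sentence splitter, and counting size in whitespace-separated words instead of model tokens are all assumptions.

```python
import re


def chunk_sentences(text, chunk_size=512, chunk_overlap=50):
    """Group whole sentences into chunks of roughly chunk_size words.

    Assumption: sizes are counted in whitespace-separated words, not tokens.
    """
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit the overlap budget.
            carried, carried_len = [], 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > chunk_overlap:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the overlap is carried in whole sentences, neighbouring chunks share context without cutting a sentence in half, which is the usual reason to prefer this strategy for question answering.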

## 3. Ingest Documents

```bash theme={null}
POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/docs/api",
  "blobs": [
    { "property": "title", "type": "text", "data": "Authentication Guide" },
    { "property": "content", "type": "text", "data": "Mixpeek uses Bearer token authentication..." },
    { "property": "category", "type": "text", "data": "getting-started" }
  ]
}
```
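
When ingesting many documents, it can help to generate the `blobs` array from a plain dictionary. The sketch below assumes every field is sent as a `text` blob, as in the request above, and checks the two fields the bucket schema marks as required. The `build_object_payload` helper is hypothetical and not part of any Mixpeek SDK.

```python
import json

# Fields marked "required": true in the bucket schema from step 1.
REQUIRED_FIELDS = ("title", "content")


def build_object_payload(key_prefix, fields):
    """Turn a flat dict of text fields into the request body shown above."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    blobs = [
        {"property": name, "type": "text", "data": value}
        for name, value in fields.items()
        if value is not None  # skip optional fields that were not provided
    ]
    return {"key_prefix": key_prefix, "blobs": blobs}


payload = build_object_payload(
    "/docs/api",
    {
        "title": "Authentication Guide",
        "content": "Mixpeek uses Bearer token authentication...",
        "category": "getting-started",
    },
)
print(json.dumps(payload, indent=2))
```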

## 4. Create a Retriever

```bash theme={null}
POST /v1/retrievers
{
  "retriever_name": "docs-search",
  "collection_identifiers": ["col_docs"],
  "input_schema": {
    "query": { "type": "text", "required": true }
  },
  "stages": [
    {
      "stage_name": "knn_search",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
              "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
              "top_k": 50
            }
          ],
          "final_top_k": 50
        }
      }
    }
  ]
}
```

## 5. Search

```bash theme={null}
POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": { "query": "how do I authenticate API requests?" }
}
```
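
From application code, the same call might look like the following sketch, which uses only the Python standard library. The base URL, the `ret_docs` retriever ID, and the `build_execute_request` helper are assumptions for illustration; substitute the values for your own account, and note that your deployment may require additional headers.

```python
import json
import urllib.request

BASE_URL = "https://api.mixpeek.com"  # assumption: replace with your API host


def build_execute_request(retriever_id, query, api_key):
    """Build (but do not send) the POST request that executes a retriever."""
    body = json.dumps({"inputs": {"query": query}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/retrievers/{retriever_id}/execute",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_execute_request(
    "ret_docs",  # hypothetical retriever ID
    "how do I authenticate API requests?",
    "YOUR_API_KEY",
)
```

To send it, pass the request to `urllib.request.urlopen(request)` and decode the JSON response body. Keeping request construction separate from sending makes the payload easy to inspect and unit-test.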

## Hybrid Search (Vector + BM25)

Combine semantic (vector) and keyword (BM25) matching in one stage. BM25 is **not** a separate feature: set `lexical: true` on a search to match the query against the namespace's full-text index instead of embedding it. Use `rrf` (reciprocal rank fusion), which merges the two result lists by rank rather than by raw score, so the scale mismatch between cosine similarity and BM25 scores doesn't matter.

```bash theme={null}
{
  "stages": [
    {
      "stage_name": "hybrid_search",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
              "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
              "top_k": 100
            },
            {
              "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
              "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
              "lexical": true,
              "top_k": 100
            }
          ],
          "fusion": "rrf",
          "final_top_k": 50
        }
      }
    }
  ]
}
```

<Note>
  Lexical search requires a `text` payload index on the field. See [Text Indexes (BM25)](/vector-store/namespaces#text-indexes-bm25) and the [Feature Search](/retrieval/stages/feature-search#lexical-bm25-search) reference.
</Note>
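
To see why rank-based fusion sidesteps the score-scale problem, here is a minimal sketch of reciprocal rank fusion in Python. Each document scores `1 / (k + rank)` in every list it appears in, and the scores are summed. The constant `k = 60` is the value commonly used for RRF; whether Mixpeek uses the same constant is an assumption, and `rrf_fuse` is a hypothetical helper.

```python
def rrf_fuse(rankings, k=60, final_top_k=50):
    """Fuse ranked lists of document IDs with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the position matters; the original scores are never used.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:final_top_k]


vector_hits = ["a", "b", "c"]   # ordered by cosine similarity
lexical_hits = ["c", "a", "d"]  # ordered by BM25 score
print(rrf_fuse([vector_hits, lexical_hits]))  # ['a', 'c', 'b', 'd']
```

Documents that rank well in both lists (`a` and `c`) rise above documents that appear in only one, regardless of how the underlying scores were scaled.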

## Pre-Filter by Metadata

Place an `attribute_filter` stage before `feature_search` so that the vector search only considers documents matching the metadata condition:

```bash theme={null}
{
  "stages": [
    {
      "stage_name": "category_filter",
      "stage_type": "filter",
      "config": {
        "stage_id": "attribute_filter",
        "parameters": {
          "field": "category",
          "operator": "eq",
          "value": "getting-started"
        }
      }
    },
    {
      "stage_name": "knn_search",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
              "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
              "top_k": 50
            }
          ],
          "final_top_k": 50
        }
      }
    }
  ]
}
```
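
Conceptually, the two stages above behave like the following in-memory sketch: narrow the candidate set by metadata first, then rank only the survivors by cosine similarity. The `filtered_knn` function, the document layout, and the brute-force scan are assumptions for illustration; a real vector store applies the filter inside its index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_knn(docs, query_vec, field, value, top_k=50):
    """Apply an equality filter, then rank the remaining docs by similarity."""
    candidates = [d for d in docs if d.get(field) == value]
    ranked = sorted(
        candidates,
        key=lambda d: cosine(d["vector"], query_vec),
        reverse=True,
    )
    return ranked[:top_k]


docs = [
    {"id": "auth", "category": "getting-started", "vector": [1.0, 0.0]},
    {"id": "quickstart", "category": "getting-started", "vector": [0.6, 0.8]},
    {"id": "billing", "category": "account", "vector": [1.0, 0.1]},
]
hits = filtered_knn(docs, [1.0, 0.0], "category", "getting-started")
print([d["id"] for d in hits])  # ['auth', 'quickstart']
```

Note that `billing` is close to the query vector but never scored, which is where the efficiency gain comes from.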

## Reranking

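Retrieve a broad candidate set with vector search (here, 100 results), then re-score those candidates with a cross-encoder and keep the top 20: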

```bash theme={null}
{
  "stages": [
    {
      "stage_name": "knn_search",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
              "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
              "top_k": 100
            }
          ],
          "final_top_k": 100
        }
      }
    },
    {
      "stage_name": "rerank",
      "stage_type": "sort",
      "config": {
        "stage_id": "rerank",
        "parameters": {
          "inference_name": "BAAI__bge_reranker_v2_m3",
          "top_k": 20
        }
      }
    }
  ]
}
```

See the [Rerank stage](/retrieval/stages/rerank) reference for available models and parameters.
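
The retrieve-then-rerank pattern can be sketched in a few lines of Python. Here `score_pair` stands in for the cross-encoder: a real reranker scores each (query, text) pair with a model, while this example uses a simple word-overlap function so that it runs without dependencies. Both `rerank` and `overlap_score` are hypothetical names, and the candidate layout is an assumption.

```python
def rerank(query, candidates, score_pair, top_k=20):
    """Re-score every (query, text) pair and keep the top_k candidates."""
    scored = [(score_pair(query, c["text"]), c) for c in candidates]
    # Sort on the score only; ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


def overlap_score(query, text):
    """Stand-in scorer (not a cross-encoder): share of query words in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


candidates = [  # pretend these came back from the first-stage vector search
    {"id": "1", "text": "billing and invoices"},
    {"id": "2", "text": "authenticate api requests with a bearer token"},
    {"id": "3", "text": "api rate limits"},
]
top = rerank("authenticate api requests", candidates, overlap_score, top_k=2)
print([c["id"] for c in top])  # ['2', '3']
```

The first stage is tuned for recall and the second for precision, which is why the example retriever fetches 100 candidates but returns only 20.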

## Model Options

| Model                            | Speed  | Use Case        |
| -------------------------------- | ------ | --------------- |
| `multilingual-e5-base`           | Fast   | High-volume     |
| `multilingual-e5-large-instruct` | Medium | General-purpose |
| `bge-large-en-v1.5`              | Medium | English-only    |
| `openai/text-embedding-3-large`  | Slow   | Premium         |

## Next steps

<CardGroup cols={2}>
  <Card title="Classify documents" icon="tags" href="/enrichment/taxonomies">
    Auto-categorize documents against a taxonomy of reference categories.
  </Card>

  <Card title="Discover topics" icon="diagram-project" href="/enrichment/clusters">
    Cluster document embeddings to surface topic groups without predefined categories.
  </Card>

  <Card title="Get notified" icon="bell" href="/enrichment/alerts">
    Trigger alerts when new documents match a condition.
  </Card>

  <Card title="Schedule jobs" icon="clock" href="/platform/triggers">
    Re-cluster or re-enrich on a cron or interval as new content lands.
  </Card>
</CardGroup>
