Semantic search is one stage in the warehouse’s Reassemble layer. This tutorial covers the basics; see Multi-Stage Retrieval for composing it with other stages.

1. Create a Bucket

POST /v1/buckets
{
  "bucket_name": "knowledge-base",
  "bucket_schema": {
    "properties": {
      "title": { "type": "text", "required": true },
      "content": { "type": "text", "required": true },
      "category": { "type": "text" },
      "tags": { "type": "array" }
    }
  }
}

2. Create a Collection

POST /v1/collections
{
  "collection_name": "docs-search",
  "source": { "type": "bucket", "bucket_ids": ["bkt_kb"] },
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": { "text": "content" },
    "parameters": {
      "model": "multilingual-e5-large-instruct",
      "chunk_strategy": "sentence",
      "chunk_size": 512,
      "chunk_overlap": 50
    },
    "field_passthrough": [
      { "source_path": "title" },
      { "source_path": "category" },
      { "source_path": "tags" }
    ]
  }
}
Chunking strategies:
  • sentence – Best for Q&A
  • paragraph – Best for long-form content
  • fixed – Predictable token windows
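
To make chunk_size and chunk_overlap concrete, here is a minimal sketch of the fixed strategy. It assumes a token is a whitespace-separated word, which is probably not how the extractor tokenizes, and the helper name fixed_chunks is hypothetical rather than part of the API.

```python
def fixed_chunks(text, chunk_size=512, chunk_overlap=50):
    """Split text into overlapping windows of whitespace-separated tokens.

    Assumption: "token" means a whitespace-delimited word. The real
    extractor likely uses a model tokenizer instead.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


words = " ".join(f"w{i}" for i in range(10))
print(fixed_chunks(words, chunk_size=4, chunk_overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each window repeats the last chunk_overlap tokens of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.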

3. Ingest Documents

POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/docs/api",
  "blobs": [
    { "property": "title", "type": "text", "data": "Authentication Guide" },
    { "property": "content", "type": "text", "data": "Mixpeek uses Bearer token authentication..." },
    { "property": "category", "type": "text", "data": "getting-started" }
  ]
}

4. Create a Retriever

POST /v1/retrievers
{
  "retriever_name": "docs-search",
  "collection_identifiers": ["col_docs"],
  "input_schema": {
    "query": { "type": "text", "required": true }
  },
  "stages": [
    {
      "stage_name": "knn_search",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
              "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
              "top_k": 50
            }
          ],
          "final_top_k": 50
        }
      }
    }
  ]
}

Then execute the retriever with a query:

POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": { "query": "how do I authenticate API requests?" }
}
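
The requests in steps 1 through 4 can be sent from any HTTP client. The following is a rough sketch using only the Python standard library. The host in BASE_URL is a placeholder, the retriever ID ret_123 is made up, the helper name build_request is hypothetical, and the Bearer header is an assumption based on the sample document text in step 3.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder; substitute your API host


def build_request(path, body, api_key):
    """Build (but do not send) a JSON POST request for the given path."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(
    "/v1/retrievers/ret_123/execute",  # ret_123 is a made-up retriever ID
    {"inputs": {"query": "how do I authenticate API requests?"}},
    api_key="YOUR_API_KEY",
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.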

Hybrid Search (Vector + BM25)

Combine semantic (vector) and keyword (BM25) matching in one stage. BM25 is not a separate feature: set lexical: true on a search to match the query against the namespace’s full-text index instead of embedding it. Set fusion to rrf (reciprocal rank fusion), which merges results by rank position rather than raw score, so the score-scale mismatch between cosine similarity and BM25 doesn’t matter.
{
  "stages": [
    {
      "stage_name": "hybrid_search",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
              "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
              "top_k": 100
            },
            {
              "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
              "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
              "lexical": true,
              "top_k": 100
            }
          ],
          "fusion": "rrf",
          "final_top_k": 50
        }
      }
    }
  ]
}
Lexical search requires a text payload index on the field. See Text Indexes (BM25) and the Feature Search reference.
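
To illustrate why rank-based fusion sidesteps the score-scale problem, here is a small sketch of reciprocal rank fusion. It is not the platform's implementation: the function name rrf_fuse is hypothetical, and the constant k=60 is a commonly used default that the service may or may not share.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, final_top_k=50):
    """Fuse ranked ID lists with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only positions matter, not raw scores.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:final_top_k]]


vector_hits = ["doc_a", "doc_b", "doc_c"]   # ordered by cosine similarity
lexical_hits = ["doc_c", "doc_a", "doc_d"]  # ordered by BM25
print(rrf_fuse([vector_hits, lexical_hits]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists (doc_a, doc_c) rise above those found by only one search, regardless of how large their original scores were.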

Pre-Filter by Metadata

Place an attribute_filter stage ahead of feature_search so the vector search works on a smaller candidate set:
{
  "stages": [
    {
      "stage_name": "category_filter",
      "stage_type": "filter",
      "config": {
        "stage_id": "attribute_filter",
        "parameters": {
          "field": "category",
          "operator": "eq",
          "value": "getting-started"
        }
      }
    },
    {
      "stage_name": "knn_search",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
              "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
              "top_k": 50
            }
          ],
          "final_top_k": 50
        }
      }
    }
  ]
}
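
The effect of filtering first can be sketched on toy in-memory data. Everything here is illustrative: the two-dimensional vectors are made up, and filtered_knn is a hypothetical helper, not how the service executes stages internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_knn(docs, query_vec, field, value, top_k):
    # Stage 1: attribute filter ("eq" operator) narrows the candidates.
    candidates = [d for d in docs if d.get(field) == value]
    # Stage 2: similarity is computed only for the survivors.
    ranked = sorted(
        candidates, key=lambda d: cosine(d["vector"], query_vec), reverse=True
    )
    return [d["id"] for d in ranked[:top_k]]


docs = [
    {"id": "auth", "category": "getting-started", "vector": [0.9, 0.1]},
    {"id": "billing", "category": "account", "vector": [0.95, 0.05]},
    {"id": "quickstart", "category": "getting-started", "vector": [0.2, 0.8]},
]
print(filtered_knn(docs, [1.0, 0.0], "category", "getting-started", top_k=2))
# ['auth', 'quickstart']
```

Note that billing is the closest vector to the query but never reaches the similarity step, because the filter removed it first.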

Reranking

Retrieve a wide candidate set (100 results here), then re-score it with a cross-encoder and keep the top 20 for better accuracy:
{
  "stages": [
    {
      "stage_name": "knn_search",
      "stage_type": "filter",
      "config": {
        "stage_id": "feature_search",
        "parameters": {
          "searches": [
            {
              "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
              "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
              "top_k": 100
            }
          ],
          "final_top_k": 100
        }
      }
    },
    {
      "stage_name": "rerank",
      "stage_type": "sort",
      "config": {
        "stage_id": "rerank",
        "parameters": {
          "inference_name": "BAAI__bge_reranker_v2_m3",
          "top_k": 20
        }
      }
    }
  ]
}
See the Rerank stage reference for available models and parameters.
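
The retrieve-then-rerank pattern above can be sketched as follows. The scorer here is a deliberately crude word-overlap function standing in for a cross-encoder; it is an assumption for illustration only, and a real reranker would score each query-document pair with a model. The names overlap_score and rerank are hypothetical.

```python
def overlap_score(query, text):
    """Stand-in for a cross-encoder: fraction of query words found in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank(query, candidates, score_pair=overlap_score, top_k=20):
    """Re-sort first-stage candidates by a query-document pair score."""
    scored = [(score_pair(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [c["id"] for _, c in scored[:top_k]]


candidates = [  # pretend these came back from the knn_search stage
    {"id": "c1", "text": "rate limits and quotas"},
    {"id": "c2", "text": "how to authenticate api requests with a token"},
    {"id": "c3", "text": "authenticate users in your app"},
]
print(rerank("authenticate api requests", candidates, top_k=2))
# ['c2', 'c3']
```

Because the pair scorer only runs on the first-stage candidates, its cost scales with the first stage's final_top_k rather than with the size of the collection.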

Model Options

Model                            Speed    Use Case
multilingual-e5-base             Fast     High-volume
multilingual-e5-large-instruct   Medium   General-purpose
bge-large-en-v1.5                Medium   English-only
openai/text-embedding-3-large    Slow     Premium

Next steps

  • Classify documents – Auto-categorize documents against a taxonomy of reference categories.
  • Discover topics – Cluster document embeddings to surface topic groups without predefined categories.
  • Get notified – Trigger alerts when new documents match a condition.
  • Schedule jobs – Re-cluster or re-enrich on a cron or interval as new content lands.