Semantic search is one stage in the warehouse’s Reassemble layer. This tutorial covers the basics; see
Multi-Stage Retrieval for composing it with other stages.
1. Create a Bucket
POST /v1/buckets
{
"bucket_name": "knowledge-base",
"schema": {
"properties": {
"title": { "type": "text", "required": true },
"content": { "type": "text", "required": true },
"category": { "type": "text" },
"tags": { "type": "array" }
}
}
}
2. Create a Collection
POST /v1/collections
{
"collection_name": "docs-search",
"source": { "type": "bucket", "bucket_id": "bkt_kb" },
"feature_extractor": {
"feature_extractor_name": "text_extractor",
"version": "v1",
"input_mappings": { "text": "content" },
"parameters": {
"model": "multilingual-e5-large-instruct",
"chunk_strategy": "sentence",
"chunk_size": 512,
"chunk_overlap": 50
},
"field_passthrough": [
{ "source_path": "title" },
{ "source_path": "category" },
{ "source_path": "tags" }
]
}
}
Chunking strategies:
sentence – Best for Q&A
paragraph – Best for long-form content
fixed – Predictable token windows
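To make chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-window chunking. It assumes the text is already tokenized into a list; the function name chunk_fixed is hypothetical, and the extractor's actual tokenizer and boundary handling are not shown in this tutorial.

```python
def chunk_fixed(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Illustrative input: 1000 placeholder tokens.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_fixed(tokens)
```

With these defaults each window advances by 462 tokens, so the last 50 tokens of one chunk repeat as the first 50 of the next.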
3. Ingest Documents
POST /v1/buckets/{bucket_id}/objects
{
"key_prefix": "/docs/api",
"metadata": {
"title": "Authentication Guide",
"content": "Mixpeek uses Bearer token authentication...",
"category": "getting-started",
"tags": ["auth", "security"]
}
}
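A client-side check can catch missing required fields before the request is sent. The sketch below mirrors the bucket schema from step 1; the helper missing_required is hypothetical, and whether the API performs the same validation server-side is not stated here.

```python
# Mirror of the bucket schema from step 1.
BUCKET_SCHEMA = {
    "properties": {
        "title": {"type": "text", "required": True},
        "content": {"type": "text", "required": True},
        "category": {"type": "text"},
        "tags": {"type": "array"},
    }
}


def missing_required(metadata, schema):
    """Return required property names that are absent from metadata."""
    return [
        name
        for name, spec in schema["properties"].items()
        if spec.get("required") and name not in metadata
    ]


# This document omits "content", which the schema marks as required.
doc = {"title": "Authentication Guide", "category": "getting-started"}
problems = missing_required(doc, BUCKET_SCHEMA)
```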
4. Create a Retriever
POST /v1/retrievers
{
"retriever_name": "docs-search",
"collection_ids": ["col_docs"],
"input_schema": {
"properties": {
"query": { "type": "text", "required": true }
}
},
"stages": [
{
"stage_name": "knn_search",
"version": "v1",
"parameters": {
"feature_address": "mixpeek://text_extractor@v1/text_embedding",
"input_mapping": { "text": "query" },
"limit": 50
}
}
]
}
5. Search
POST /v1/retrievers/{retriever_id}/execute
{
"inputs": { "query": "how do I authenticate API requests?" },
"limit": 10
}
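The same call can be assembled from Python with the standard library. This is a sketch under assumptions: BASE_URL and the retriever ID are placeholders, and the Bearer header follows the authentication scheme mentioned in the sample document above. The request is built but not sent, so the snippet runs offline.

```python
import json
import urllib.request

# Assumption: placeholder base URL; substitute the one for your deployment.
BASE_URL = "https://api.mixpeek.com"


def build_execute_request(retriever_id, api_key, query, limit=10):
    """Build (but do not send) a POST to the retriever execute endpoint."""
    body = {"inputs": {"query": query}, "limit": limit}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/retrievers/{retriever_id}/execute",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_execute_request(
    "ret_docs_search", "YOUR_API_KEY", "how do I authenticate API requests?"
)
# To send it: urllib.request.urlopen(req), then json.load() the response.
```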
Hybrid Search (Vector + BM25)
Combine semantic and keyword matching:
{
"stages": [
{
"stage_name": "hybrid_search",
"version": "v1",
"parameters": {
"queries": [
{
"feature_address": "mixpeek://text_extractor@v1/text_embedding",
"input_mapping": { "text": "query" },
"weight": 0.7
},
{
"feature_address": "mixpeek://text_extractor@v1/bm25_sparse",
"input_mapping": { "text": "query" },
"weight": 0.3
}
],
"fusion_method": "rrf",
"limit": 50
}
}
]
}
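For intuition about "fusion_method": "rrf", the sketch below implements a weighted form of reciprocal rank fusion. The constant k=60 is the conventional default from the RRF literature, and applying the weights as per-list multipliers is an assumption; this tutorial does not specify how the service combines them.

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked ID lists: each list adds weight / (k + rank) to a document's score."""
    scores = {}
    for ranked_ids, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Illustrative rankings from the dense and BM25 queries.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]
fused = weighted_rrf([dense, sparse], weights=[0.7, 0.3])
```

Because RRF uses only rank positions, the dense and sparse scores never need to be normalized onto a common scale.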
Apply a metadata filter before the vector search so the knn_search stage only scores documents that can match:
{
"stages": [
{
"stage_name": "filter",
"version": "v1",
"parameters": {
"filters": {
"operator": "and",
"conditions": [
{
"field": "metadata.category",
"operator": "eq",
"value": "getting-started"
}
]
}
}
},
{
"stage_name": "knn_search",
"version": "v1",
"parameters": {
"feature_address": "mixpeek://text_extractor@v1/text_embedding",
"input_mapping": { "text": "query" },
"limit": 50
}
}
]
}
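The effect of ordering the stages this way can be sketched in a few lines: narrow the candidates with the filter, then rank only the survivors by similarity. Everything here is illustrative. The helper names, the in-memory documents, and the two-dimensional vectors are assumptions, and only the eq operator is handled.

```python
import math


def get_path(doc, dotted):
    """Resolve a dotted path such as 'metadata.category' in nested dicts."""
    value = doc
    for key in dotted.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matches(doc, filters):
    """Evaluate an 'and' / 'or' group of 'eq' conditions against one document."""
    checks = []
    for cond in filters["conditions"]:
        if cond["operator"] != "eq":
            raise ValueError(f"unsupported operator: {cond['operator']}")
        checks.append(get_path(doc, cond["field"]) == cond["value"])
    return all(checks) if filters["operator"] == "and" else any(checks)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_then_knn(docs, filters, query_vec, limit):
    """Filter first, then rank the remaining documents by cosine similarity."""
    candidates = [d for d in docs if matches(d, filters)]
    candidates.sort(key=lambda d: cosine(d["vector"], query_vec), reverse=True)
    return candidates[:limit]


docs = [
    {"id": "a", "metadata": {"category": "getting-started"}, "vector": [1.0, 0.0]},
    {"id": "b", "metadata": {"category": "reference"}, "vector": [0.9, 0.1]},
    {"id": "c", "metadata": {"category": "getting-started"}, "vector": [0.0, 1.0]},
]
filters = {
    "operator": "and",
    "conditions": [
        {"field": "metadata.category", "operator": "eq", "value": "getting-started"}
    ],
}
hits = filter_then_knn(docs, filters, query_vec=[1.0, 0.2], limit=50)
```

Document b is close to the query but never gets scored, because the filter removes it first.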
Reranking
Retrieve a broad candidate set with knn_search, then rescore it with a cross-encoder and keep only the top results:
{
"stages": [
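{
"stage_name": "knn_search",
"version": "v1",
"parameters": {
"feature_address": "mixpeek://text_extractor@v1/text_embedding",
"input_mapping": { "text": "query" },
"limit": 100
}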
},
{
"stage_name": "rerank",
"version": "v1",
"parameters": {
"model": "cross-encoder/ms-marco-MiniLM-L-12-v2",
"input_mapping": {
"query": "query",
"document": "metadata.content"
},
"top_k": 20
}
}
]
}
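The retrieve-then-rerank pattern can be sketched independently of any model. In the code below, token_overlap is a deliberately simple stand-in scorer, not a cross-encoder; in practice score_pair would call a model such as the one named in the stage configuration. All names and sample documents are illustrative.

```python
def rerank(query, candidates, score_pair, top_k=20):
    """Rescore each candidate against the query and keep the top_k best."""
    scored = [(score_pair(query, doc["content"]), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def token_overlap(query, text):
    """Stand-in scorer: fraction of query tokens that also appear in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


# Pretend these came back from the first-stage knn_search.
candidates = [
    {"id": "billing", "content": "How invoices and billing cycles work"},
    {"id": "auth", "content": "How to authenticate API requests with a bearer token"},
    {"id": "limits", "content": "API rate limits for requests"},
]
top = rerank("how to authenticate api requests", candidates, token_overlap, top_k=2)
```

The first stage is tuned for recall (limit 100) and the second for precision (top_k 20), which keeps the expensive scorer off the full collection.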
Model Options
| Model | Speed | Use Case |
|---|---|---|
| multilingual-e5-base | Fast | High-volume |
| multilingual-e5-large-instruct | Medium | General-purpose |
| bge-large-en-v1.5 | Medium | English-only |
| openai/text-embedding-3-large | Slow | Premium |
Classify with Taxonomies
Auto-categorize documents using a reference collection:
POST /v1/taxonomies
{
"taxonomy_name": "doc-categories",
"taxonomy_type": "flat",
"retriever_id": "ret_docs_search",
"collection_id": "col_docs",
"input_mappings": [{ "source": "payload.content", "target": "query" }],
"enrichment_fields": [{ "source": "payload.category", "target": "auto_category" }],
"threshold": 0.7,
"execution_mode": "materialize"
}
New documents whose best match clears the threshold (0.7 here) automatically receive an auto_category field, copied from the category of the matched reference document. See Taxonomies for hierarchical taxonomies and execution modes.
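The threshold logic amounts to "take the best reference match and accept it only if it scores high enough". A minimal sketch, assuming matches arrive as (score, category) pairs and that the comparison is inclusive; both are assumptions, and assign_category is a hypothetical helper rather than part of the API.

```python
def assign_category(matches, threshold=0.7):
    """Return the category of the best-scoring match, or None if it is below threshold."""
    if not matches:
        return None
    score, category = max(matches, key=lambda m: m[0])
    return category if score >= threshold else None


# Illustrative scores from the reference retriever.
confident = assign_category([(0.82, "getting-started"), (0.40, "billing")])
uncertain = assign_category([(0.65, "billing")])
```

Raising the threshold leaves more documents unclassified but reduces wrong labels; lowering it does the reverse.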
Discover Clusters
Find topic groups in your content without predefined categories:
POST /v1/clusters
{
"cluster_name": "doc-topics",
"collection_id": "col_docs",
"feature_uri": "mixpeek://text_extractor@v1/text_embedding",
"algorithm": { "name": "hdbscan", "params": { "min_cluster_size": 5 } },
"llm_labeling": { "enabled": true },
"dimension_reduction": { "method": "umap", "n_components": 2 }
}
Once clusters stabilize, promote them to taxonomy nodes to auto-classify future documents. See Clusters for all algorithms and scheduling.
Set Up Alerts
Get notified when new documents match specific conditions:
POST /v1/alerts
{
"alert_name": "new-security-docs",
"collection_id": "col_docs",
"condition": { "field": "metadata.category", "operator": "eq", "value": "security" },
"notification": { "type": "webhook", "url": "https://example.com/webhook" }
}
Set Up Webhooks
Monitor batch processing and retriever events:
POST /v1/webhooks
{
"webhook_name": "batch-complete",
"url": "https://example.com/webhook",
"events": ["batch.completed", "batch.failed"]
}
Alerts evaluate every incoming document; webhooks fire on system events. See Operations for the full list of events.