Schema Design

  • One bucket per data domain (products, support tickets, footage). Keep schemas coarse — collections slice data differently downstream.
  • Separate mutable from immutable fields. Put stable data (file URL, created date) in the bucket schema. Put changing data (status, tags) in metadata that can be patched.
  • Use key_prefix in objects for organization (e.g., /2025/04/).
  • Enable document schema validation on collections (validation_mode: "strict") to catch malformed data early.
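To make the mutable/immutable split concrete, here is a minimal sketch of building an object payload along these lines. The field names (`file_url`, `created_at`, `key_prefix`, `metadata`) are illustrative assumptions, not the exact Mixpeek payload shape:

```python
from datetime import datetime, timezone

def build_object(file_url: str, created: datetime,
                 status: str, tags: list[str]) -> dict:
    """Split an object into immutable schema fields and patchable metadata.

    Hypothetical payload shape for illustration only.
    """
    return {
        # Stable data lives in the bucket schema and never changes.
        "file_url": file_url,
        "created_at": created.isoformat(),
        # key_prefix groups objects by ingest month, e.g. /2025/04/.
        "key_prefix": created.strftime("/%Y/%m/"),
        # Changing data goes in metadata so it can be patched later.
        "metadata": {"status": status, "tags": tags},
    }

obj = build_object(
    "https://cdn.example.com/clip.mp4",
    datetime(2025, 4, 12, tzinfo=timezone.utc),
    status="pending",
    tags=["unreviewed"],
)
print(obj["key_prefix"])  # /2025/04/
```

A later status change then touches only `metadata`, leaving the schema fields untouched.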

Feature Selection

Data Type     | Recommended Extractor | Output
--------------|-----------------------|-------------------------------------------
Text search   | text_extractor        | E5-Large 1024D embeddings
Image search  | multimodal_extractor  | Vertex AI 1408D embeddings
Face matching | face_detector         | ArcFace 512D embeddings
Video scenes  | multimodal_extractor  | Scene embeddings + transcripts + keyframes
Documents/PDF | document_extractor    | Text chunks + OCR + embeddings
Audio         | multimodal_extractor  | Transcripts + audio embeddings
  • Start with one extractor per collection. Add more collections for additional features rather than overloading one.
  • Match extractors to your query patterns. If users search by text, prioritize text embeddings. If they search by image, prioritize visual embeddings.
  • Test with a small batch first before processing your full corpus.
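The one-extractor-per-collection rule can be sketched as a lookup from data type to extractor. The extractor names follow the table above; the config dict shape (`collection`, `feature_extractor`, `validation_mode`) is an assumption, not the exact Mixpeek API:

```python
# Extractor names from the table above; everything else is illustrative.
EXTRACTOR_FOR = {
    "text": "text_extractor",
    "image": "multimodal_extractor",
    "face": "face_detector",
    "video": "multimodal_extractor",
    "document": "document_extractor",
    "audio": "multimodal_extractor",
}

def collection_config(name: str, data_type: str) -> dict:
    """One extractor per collection; add collections for extra features."""
    return {
        "collection": name,
        "feature_extractor": EXTRACTOR_FOR[data_type],
        "validation_mode": "strict",  # catch malformed documents early
    }

# Two features -> two collections, rather than one overloaded collection.
configs = [
    collection_config("products-text", "text"),
    collection_config("products-visual", "image"),
]
```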

Caching

Mixpeek caches at two levels:
  • Retriever-level — cache the full pipeline result. Set cache_config.ttl_seconds on the retriever.
  • Stage-level — cache expensive stages (KNN search, reranking) independently. Useful when early stages are stable but later stages change.
{
  "cache_config": {
    "enabled": true,
    "ttl_seconds": 300
  }
}
  • Collection index signatures auto-invalidate caches when documents change.
  • Use shorter TTLs (60-300s) for frequently updated collections, longer (3600s+) for stable corpora.
  • Monitor cache hit rates via the analytics API.
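The TTL guidance above can be expressed as a small helper that picks a TTL from update frequency and emits a `cache_config` matching the snippet earlier in this section. The one-update-per-hour threshold is an illustrative assumption:

```python
def cache_config(updates_per_hour: float) -> dict:
    """Pick a TTL from update frequency: short for hot collections,
    long for stable corpora (thresholds are illustrative)."""
    ttl = 300 if updates_per_hour >= 1 else 3600
    return {"cache_config": {"enabled": True, "ttl_seconds": ttl}}

print(cache_config(10))   # hot collection -> 300s TTL
print(cache_config(0.1))  # stable corpus  -> 3600s TTL
```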

Cost Optimization

Mixpeek uses a credit-based pricing model. Key cost drivers:
Operation             | Cost Level | Optimization
----------------------|------------|-----------------------------------------------------
Feature extraction    | High       | Choose the right extractor; don't over-extract
LLM enrichment stages | High       | Set max_tokens limits, cache results
Vector search         | Low        | Searches are free; no penalty for high query volume
Storage               | Low        | Auto-tiered: hot/warm/cold based on access patterns
Top optimizations:
  1. Deduplicate before ingesting — skip objects already processed
  2. Use field passthrough for metadata that doesn’t need extraction
  3. Batch process rather than single-object ingestion
  4. Cache retriever results for repeated query patterns
  5. Set reranking limits (top_k) to avoid scoring too many candidates
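Optimization 1 (deduplicate before ingesting) can be sketched with a content hash: skip any payload whose digest has already been processed. This is a client-side pattern, not a built-in Mixpeek call:

```python
import hashlib

def dedupe(payloads: list[bytes], seen: set[str]) -> list[bytes]:
    """Return only payloads not seen before, tracking SHA-256 digests."""
    fresh = []
    for payload in payloads:
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            fresh.append(payload)
    return fresh

seen: set[str] = set()
batch = [b"clip-a", b"clip-b", b"clip-a"]
print(len(dedupe(batch, seen)))  # 2 -- the duplicate is skipped
```

Persist the `seen` set (e.g. in a key-value store) so reruns skip objects ingested in earlier batches, which also pairs naturally with optimization 3 (batch processing).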