Schema Design

  • One bucket per data domain (products, support tickets, footage). Keep schemas coarse — collections slice data differently downstream.
  • Separate mutable from immutable fields. Put stable data (file URL, created date) in the bucket schema. Put changing data (status, tags) in metadata that can be patched.
  • Use key_prefix in objects for organization (e.g., /2025/04/).
  • Enable document schema validation on collections (validation_mode: "strict") to catch malformed data early.
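To make the mutable/immutable split concrete, here is a minimal sketch of building an object payload along these lines. The field names (`file_url`, `created_at`, `key_prefix`, `metadata`) are illustrative assumptions, not the exact Mixpeek payload shape:

```python
from datetime import datetime, timezone

def build_object(file_url: str, created: datetime,
                 status: str, tags: list[str]) -> dict:
    """Split an object into immutable schema fields and patchable metadata.

    Hypothetical payload shape for illustration only.
    """
    return {
        # Stable data lives in the bucket schema and never changes.
        "file_url": file_url,
        "created_at": created.isoformat(),
        # key_prefix groups objects by ingest month, e.g. /2025/04/.
        "key_prefix": created.strftime("/%Y/%m/"),
        # Changing data goes in metadata so it can be patched later.
        "metadata": {"status": status, "tags": tags},
    }

obj = build_object(
    "https://cdn.example.com/clip.mp4",
    datetime(2025, 4, 12, tzinfo=timezone.utc),
    status="pending",
    tags=["unreviewed"],
)
print(obj["key_prefix"])  # /2025/04/
```

A later status change then touches only `metadata`, leaving the schema fields untouched.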

Feature Selection

Data Type     | Recommended Extractor | Output
--------------|-----------------------|-------------------------------------------
Text search   | text_extractor        | E5-Large 1024D embeddings
Image search  | multimodal_extractor  | Vertex AI 1408D embeddings
Face matching | face_detector         | ArcFace 512D embeddings
Video scenes  | multimodal_extractor  | Scene embeddings + transcripts + keyframes
Documents/PDF | document_extractor    | Text chunks + OCR + embeddings
Audio         | multimodal_extractor  | Transcripts + audio embeddings
  • Start with one extractor per collection. Add more collections for additional features rather than overloading one.
  • Match extractors to your query patterns. If users search by text, prioritize text embeddings. If they search by image, prioritize visual embeddings.
  • Test with a small batch first before processing your full corpus.
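The one-extractor-per-collection rule can be sketched as a lookup from data type to extractor. The extractor names follow the table above; the config dict shape (`collection`, `feature_extractor`, `validation_mode`) is an assumption, not the exact Mixpeek API:

```python
# Extractor names from the table above; everything else is illustrative.
EXTRACTOR_FOR = {
    "text": "text_extractor",
    "image": "multimodal_extractor",
    "face": "face_detector",
    "video": "multimodal_extractor",
    "document": "document_extractor",
    "audio": "multimodal_extractor",
}

def collection_config(name: str, data_type: str) -> dict:
    """One extractor per collection; add collections for extra features."""
    return {
        "collection": name,
        "feature_extractor": EXTRACTOR_FOR[data_type],
        "validation_mode": "strict",  # catch malformed documents early
    }

# Two features -> two collections, rather than one overloaded collection.
configs = [
    collection_config("products-text", "text"),
    collection_config("products-visual", "image"),
]
```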

Caching

Mixpeek caches at two levels:
  • Retriever-level — cache the full pipeline result. Set cache_config.ttl_seconds on the retriever.
  • Stage-level — cache expensive stages (KNN search, reranking) independently. Useful when early stages are stable but later stages change.
{
  "cache_config": {
    "enabled": true,
    "ttl_seconds": 300
  }
}
  • Collection index signatures auto-invalidate caches when documents change.
  • Use shorter TTLs (60-300s) for frequently updated collections, longer (3600s+) for stable corpora.
  • Monitor cache hit rates via the analytics API.
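The TTL guidance above can be expressed as a small helper that picks a TTL from update frequency and emits a `cache_config` matching the snippet earlier in this section. The one-update-per-hour threshold is an illustrative assumption:

```python
def cache_config(updates_per_hour: float) -> dict:
    """Pick a TTL from update frequency: short for hot collections,
    long for stable corpora (thresholds are illustrative)."""
    ttl = 300 if updates_per_hour >= 1 else 3600
    return {"cache_config": {"enabled": True, "ttl_seconds": ttl}}

print(cache_config(10))   # hot collection -> 300s TTL
print(cache_config(0.1))  # stable corpus  -> 3600s TTL
```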

Cost Optimization

Mixpeek uses a credit-based pricing model. Key cost drivers:
Operation             | Cost Level | Optimization
----------------------|------------|-----------------------------------------------------
Feature extraction    | High       | Choose the right extractor; don't over-extract
LLM enrichment stages | High       | Set max_tokens limits, cache results
Vector search         | Low        | Searches are free; no penalty for high query volume
Storage               | Low        | Auto-tiered: hot/warm/cold based on access patterns
Top optimizations:
  1. Deduplicate before ingesting — skip objects already processed
  2. Use field passthrough for metadata that doesn’t need extraction
  3. Batch process rather than single-object ingestion
  4. Cache retriever results for repeated query patterns
  5. Set reranking limits (top_k) to avoid scoring too many candidates
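Optimization 1 (deduplicate before ingesting) can be sketched with a content hash: skip any payload whose digest has already been processed. This is a client-side pattern, not a built-in Mixpeek call:

```python
import hashlib

def dedupe(payloads: list[bytes], seen: set[str]) -> list[bytes]:
    """Return only payloads not seen before, tracking SHA-256 digests."""
    fresh = []
    for payload in payloads:
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            fresh.append(payload)
    return fresh

seen: set[str] = set()
batch = [b"clip-a", b"clip-b", b"clip-a"]
print(len(dedupe(batch, seen)))  # 2 -- the duplicate is skipped
```

Persist the `seen` set (e.g. in a key-value store) so reruns skip objects ingested in earlier batches, which also pairs naturally with optimization 3 (batch processing).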