Schema Design
- One bucket per data domain (products, support tickets, footage). Keep schemas coarse — collections slice data differently downstream.
- Separate mutable from immutable fields. Put stable data (file URL, created date) in the bucket schema. Put changing data (status, tags) in metadata that can be patched.
- Use `key_prefix` in objects for organization (e.g., `/2025/04/`).
- Enable document schema validation on collections (`validation_mode: "strict"`) to catch malformed data early.
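The mutable/immutable split above can be sketched as a payload builder. This is an illustrative example, not the actual Mixpeek SDK; the field names (`file_url`, `created_at`, `metadata`) are assumptions, while `key_prefix` comes from the guidance above.

```python
# Illustrative sketch: keep stable data in the bucket schema fields and
# changing data in a patchable metadata dict. Field names are assumptions.

def build_object_payload(file_url: str, created_at: str,
                         year: str, month: str,
                         status: str, tags: list[str]) -> dict:
    """Split stable vs. changing data so later updates only touch metadata."""
    return {
        # Stable data: written once at ingest, never patched afterwards.
        "file_url": file_url,
        "created_at": created_at,
        # key_prefix groups objects for organization, e.g. /2025/04/.
        "key_prefix": f"/{year}/{month}/",
        # Changing data: lives in metadata, which can be patched in place.
        "metadata": {"status": status, "tags": tags},
    }

payload = build_object_payload(
    "https://example.com/clip.mp4", "2025-04-01T00:00:00Z",
    "2025", "04", "processing", ["demo"],
)
```

A status change then only needs a metadata patch; the schema fields stay untouched.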
Feature Selection
| Data Type | Recommended Extractor | Output |
|---|---|---|
| Text search | text_extractor | E5-Large 1024D embeddings |
| Image search | multimodal_extractor | Vertex AI 1408D embeddings |
| Face matching | face_detector | ArcFace 512D embeddings |
| Video scenes | multimodal_extractor | Scene embeddings + transcripts + keyframes |
| Documents/PDF | document_extractor | Text chunks + OCR + embeddings |
| Audio | multimodal_extractor | Transcripts + audio embeddings |
- Start with one extractor per collection. Add more collections for additional features rather than overloading one.
- Match extractors to your query patterns. If users search by text, prioritize text embeddings. If they search by image, prioritize visual embeddings.
- Test with a small batch before processing your full corpus.
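The table above can be expressed as a small lookup, useful when ingestion code routes by data type. The mapping mirrors the table; the helper itself is hypothetical, not part of any Mixpeek API.

```python
# Extractor recommendations from the table above; the helper is illustrative.
EXTRACTOR_FOR = {
    "text": "text_extractor",         # E5-Large 1024D embeddings
    "image": "multimodal_extractor",  # Vertex AI 1408D embeddings
    "face": "face_detector",          # ArcFace 512D embeddings
    "video": "multimodal_extractor",  # scene embeddings + transcripts
    "document": "document_extractor", # text chunks + OCR + embeddings
    "audio": "multimodal_extractor",  # transcripts + audio embeddings
}

def pick_extractor(data_type: str) -> str:
    """One extractor per collection; add collections for extra features."""
    try:
        return EXTRACTOR_FOR[data_type]
    except KeyError:
        raise ValueError(f"no recommended extractor for {data_type!r}")
```

Keeping the mapping explicit makes "one extractor per collection" easy to enforce at ingest time.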
Caching
Mixpeek caches at two levels:
- Retriever-level — cache the full pipeline result. Set `cache_config.ttl_seconds` on the retriever.
- Stage-level — cache expensive stages (KNN search, reranking) independently. Useful when early stages are stable but later stages change.
- Collection index signatures auto-invalidate caches when documents change.
- Use shorter TTLs (60-300s) for frequently updated collections, longer (3600s+) for stable corpora.
- Monitor cache hit rates via the analytics API.
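Picking a TTL from the bands above can be sketched as a tiny policy function. The thresholds here are illustrative assumptions; only the 60-300s and 3600s+ bands come from the guidance above, and the `cache_config` shape is a guess at the retriever config.

```python
# Hedged sketch: choose cache_config.ttl_seconds from a collection's
# update rate. Thresholds are illustrative, not a Mixpeek default.

def choose_ttl_seconds(updates_per_hour: float) -> int:
    """Shorter TTLs for hot collections, longer for stable corpora."""
    if updates_per_hour >= 10:
        return 60     # very hot: keep the cache near-fresh
    if updates_per_hour >= 1:
        return 300    # frequently updated: top of the 60-300s band
    return 3600       # stable corpus: an hour or more is fine

retriever_config = {"cache_config": {"ttl_seconds": choose_ttl_seconds(0.2)}}
```

Since collection index signatures auto-invalidate on document changes, the TTL mainly bounds staleness for metadata-only or external changes.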
Cost Optimization
Mixpeek uses a credit-based pricing model. Key cost drivers:

| Operation | Cost Level | Optimization |
|---|---|---|
| Feature extraction | High | Choose the right extractor — don’t over-extract |
| LLM enrichment stages | High | Set max_tokens limits, cache results |
| Vector search | Low | Searches are free — no penalty for high query volume |
| Storage | Low | Auto-tiered — hot/warm/cold based on access patterns |
- Deduplicate before ingesting — skip objects already processed
- Use field passthrough for metadata that doesn’t need extraction
- Batch process rather than single-object ingestion
- Cache retriever results for repeated query patterns
- Set reranking limits (`top_k`) to avoid scoring too many candidates
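The first two bullets above — deduplicate before ingesting and batch rather than single-object ingestion — can be combined in one helper. This is a generic sketch: `ingest_batch` stands in for whatever batch ingestion call you use, and the SHA-256 content hash is an illustrative dedup key.

```python
import hashlib

# Illustrative dedup-before-ingest sketch: skip objects whose content hash
# was already processed, then send the remainder in a single batched call.

def dedupe_and_batch(objects: list[bytes], seen_hashes: set[str],
                     ingest_batch) -> int:
    """Return the number of new objects sent; mutates seen_hashes."""
    fresh = []
    for blob in objects:
        digest = hashlib.sha256(blob).hexdigest()
        if digest in seen_hashes:
            continue          # already processed: skip to save credits
        seen_hashes.add(digest)
        fresh.append(blob)
    if fresh:
        ingest_batch(fresh)   # one batched call, not per-object ingestion
    return len(fresh)
```

Persisting `seen_hashes` (e.g. alongside your object metadata) lets re-runs of an ingestion job skip everything already extracted.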