# Best Practices

> Schema design, feature selection, caching, and cost optimization

## Schema Design

* **One bucket per data domain** (products, support tickets, footage). Keep bucket schemas coarse; collections can slice the same data in different ways downstream.
* **Separate mutable from immutable fields.** Put stable data (file URL, created date) in the bucket schema. Put changing data (status, tags) in metadata that can be patched.
* **Use `key_prefix`** in objects for organization (e.g., `/2025/04/`).
* **Enable document schema validation** on collections (`validation_mode: "strict"`) to catch malformed data early.
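
The field split and prefix convention above can be sketched in a few lines of Python. This is a minimal illustration, not Mixpeek's API: the field names (`file_url`, `created_at`, `status`, `tags`) and the helpers `split_fields` and `date_key_prefix` are hypothetical, and the date-based layout is only one way to read the `/2025/04/` example.

```python
from datetime import datetime, timezone

# Hypothetical set of stable fields; adjust to your own bucket schema.
IMMUTABLE_FIELDS = {"file_url", "created_at"}


def split_fields(record: dict) -> tuple[dict, dict]:
    """Separate stable fields (bucket schema) from patchable metadata."""
    stable = {k: v for k, v in record.items() if k in IMMUTABLE_FIELDS}
    mutable = {k: v for k, v in record.items() if k not in IMMUTABLE_FIELDS}
    return stable, mutable


def date_key_prefix(created_at: datetime) -> str:
    """Build a year/month prefix such as /2025/04/."""
    return f"/{created_at:%Y}/{created_at:%m}/"


record = {
    "file_url": "s3://example-bucket/clip.mp4",
    "created_at": datetime(2025, 4, 17, tzinfo=timezone.utc),
    "status": "pending",
    "tags": ["demo"],
}
stable, mutable = split_fields(record)
prefix = date_key_prefix(record["created_at"])
```

Keeping the split in one helper makes it easy to audit which fields are expected to change after ingestion.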

## Feature Selection

| Use Case      | Recommended Extractor     | Output                                     |
| ------------- | ------------------------- | ------------------------------------------ |
| Text search   | `text_extractor`          | E5-Large 1024D embeddings                  |
| Image search  | `multimodal_extractor`    | Vertex AI 1408D embeddings                 |
| Face matching | `face_identity_extractor` | ArcFace 512D embeddings                    |
| Video scenes  | `multimodal_extractor`    | Scene embeddings + transcripts + keyframes |
| Documents/PDF | `universal_extractor`     | Text chunks + OCR + embeddings             |
| Audio         | `multimodal_extractor`    | Transcripts + audio embeddings             |

* **Start with one extractor per collection.** Add more collections for additional features rather than overloading one.
* **Match extractors to your query patterns.** If users search by text, prioritize text embeddings. If they search by image, prioritize visual embeddings.
* **Test with a small batch first** before processing your full corpus.
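
As a rough sketch of these guidelines, the table above can be kept as a lookup, and a small random sample can serve as the pilot batch. The helper names `pick_extractor` and `pilot_batch` and the default sample size of 50 are assumptions made for illustration; only the extractor names come from the table.

```python
import random

# Extractor names copied from the table above.
EXTRACTOR_BY_USE_CASE = {
    "text search": "text_extractor",
    "image search": "multimodal_extractor",
    "face matching": "face_identity_extractor",
    "video scenes": "multimodal_extractor",
    "documents/pdf": "universal_extractor",
    "audio": "multimodal_extractor",
}


def pick_extractor(use_case: str) -> str:
    """Return the recommended extractor for a use case (case-insensitive)."""
    key = use_case.strip().lower()
    if key not in EXTRACTOR_BY_USE_CASE:
        raise ValueError(f"no recommendation for use case: {use_case!r}")
    return EXTRACTOR_BY_USE_CASE[key]


def pilot_batch(object_ids, size: int = 50, seed: int = 0) -> list:
    """Draw a reproducible random sample to process before the full corpus."""
    ids = list(object_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(size, len(ids)))


extractor = pick_extractor("Text search")
sample = pilot_batch(range(1000), size=50)
```

A fixed seed keeps the pilot batch stable between runs, so extraction results can be compared after a configuration change.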

## Caching

Mixpeek caches at two levels:

* **Retriever-level** — cache the full pipeline result. Set `cache_config.ttl_seconds` on the retriever.
* **Stage-level** — cache expensive stages (KNN search, reranking) independently. Useful when early stages are stable but later stages change.

```json
{
  "cache_config": {
    "enabled": true,
    "ttl_seconds": 300
  }
}
```

* Collection index signatures auto-invalidate caches when documents change.
* Use shorter TTLs (60-300s) for frequently updated collections and longer ones (3600s+) for stable corpora.
* Monitor cache hit rates via the [analytics API](/api-reference/retriever-evaluations/list-evaluations).
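
The TTL guidance above can be turned into a small helper that emits a `cache_config` block. This is a sketch under assumptions: the function name `cache_config_for` and the update-rate thresholds (60 and 1 updates per hour) are illustrative choices, not values from Mixpeek; only the TTL ranges and the `cache_config` shape come from this page.

```python
import json


def cache_config_for(updates_per_hour: float) -> dict:
    """Pick a TTL from the collection's update rate (thresholds are assumed)."""
    if updates_per_hour < 0:
        raise ValueError("updates_per_hour must be non-negative")
    if updates_per_hour >= 60:
        ttl_seconds = 60      # very active collection: shortest suggested TTL
    elif updates_per_hour >= 1:
        ttl_seconds = 300     # regularly updated: upper end of the short range
    else:
        ttl_seconds = 3600    # stable corpus: long TTL
    return {"cache_config": {"enabled": True, "ttl_seconds": ttl_seconds}}


config = cache_config_for(updates_per_hour=5)
print(json.dumps(config, indent=2))
```

For a collection updated about five times an hour, this produces the same 300-second configuration shown in the JSON example.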

## Cost Optimization

Mixpeek uses a credit-based pricing model. Key cost drivers:

| Operation             | Cost Level | Optimization                                                                                                                                    |
| --------------------- | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| Feature extraction    | High       | Choose the right extractor — don't over-extract                                                                                                 |
| LLM enrichment stages | High       | Set `max_tokens` limits, cache results                                                                                                          |
| Vector search         | Low        | \~0.1 credits/query (\~0.2 for hybrid). Search is cheap; enrichment and LLM stages dominate cost. See [Rate Limits & Quotas](/operations/rate-limits-quotas) |
| Storage               | Low        | Auto-tiered — hot/warm/cold based on access patterns                                                                                            |
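
To put the search figures in perspective, here is a back-of-the-envelope estimate using the approximate per-query rates from the table. The function name `search_credits` and the query volumes are hypothetical; actual billing may differ, so treat the result as an order-of-magnitude check only.

```python
# Approximate rates from the table above; real billing may differ.
VECTOR_QUERY_CREDITS = 0.1
HYBRID_QUERY_CREDITS = 0.2


def search_credits(vector_queries: int, hybrid_queries: int) -> float:
    """Estimate credits spent on search alone for a given query volume."""
    if vector_queries < 0 or hybrid_queries < 0:
        raise ValueError("query counts must be non-negative")
    return (vector_queries * VECTOR_QUERY_CREDITS
            + hybrid_queries * HYBRID_QUERY_CREDITS)


# Hypothetical month: 100,000 vector queries and 50,000 hybrid queries.
monthly = search_credits(vector_queries=100_000, hybrid_queries=50_000)
print(f"estimated search credits: {monthly:,.0f}")
```

Running the same arithmetic for extraction and LLM stages, with your own measured rates, shows where optimization effort pays off first.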

**Top optimizations:**

1. **Deduplicate before ingesting** — skip objects already processed
2. **Use field passthrough** for metadata that doesn't need extraction
3. **Batch process** rather than single-object ingestion
4. **Cache retriever results** for repeated query patterns
5. **Set reranking limits** (`top_k`) to avoid scoring too many candidates
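
The first and third optimizations can be combined in a client-side pre-processing step. The sketch below assumes deduplication by SHA-256 content hash and a caller-managed set of already-seen hashes; the helpers `new_objects` and `batched` are hypothetical and do not call any Mixpeek endpoint.

```python
import hashlib
from itertools import islice
from typing import Iterable, Iterator


def content_hash(data: bytes) -> str:
    """Fingerprint an object's bytes so identical content is detected."""
    return hashlib.sha256(data).hexdigest()


def new_objects(objects: Iterable[tuple[str, bytes]],
                seen_hashes: set[str]) -> Iterator[tuple[str, bytes]]:
    """Yield only objects whose content has not been seen; updates seen_hashes."""
    for key, data in objects:
        digest = content_hash(data)
        if digest in seen_hashes:
            continue  # already processed: skip to avoid paying for extraction twice
        seen_hashes.add(digest)
        yield key, data


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Group items into lists of at most `size` for batch ingestion."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


objects = [("a.txt", b"hello"), ("b.txt", b"world"), ("c.txt", b"hello")]
seen: set[str] = set()
batches = list(batched(new_objects(objects, seen), size=2))
```

Here `c.txt` duplicates the content of `a.txt`, so a single batch of two objects remains. In practice the seen-hash set would be persisted between runs.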
