Collections
Collections bind a bucket to a feature extractor. When you submit a batch, the engine runs the extractor against each object and produces searchable documents.root_object_id.
Embedding Task
Instruction-aware embedding models (E5, Gemini) use a task hint to optimize the embedding for a specific downstream use case. Set embedding_task at the collection level so it applies to every task-aware model in the pipeline.
| Task | Use Case | Default |
|---|---|---|
| retrieval_document | Search: find documents from queries | Yes |
| retrieval_query | Rare at index time; query-side is automatic | No |
| semantic_similarity | Symmetric comparison (dedup, matching) | No |
| classification | Document categorization pipelines | No |
| clustering | Grouping documents into clusters | No |
You almost never need to set this. The default retrieval_document is correct for search, and at query time Mixpeek automatically uses retrieval_query. Only override for clustering, classification, or symmetric similarity.
Non-instruction-aware models (SigLIP, CLIP, Vertex multimodal) ignore this setting.
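As a rough illustration of the override rule above, the following sketch assembles a collection config and only includes embedding_task when it differs from the default. The task values come from the table; the helper `build_collection_config`, the `collection_name` key, and the shapes of the `source` and `feature_extractor` values are hypothetical and are not taken from the Mixpeek API.

```python
# Task values are taken from the table above; the rest is illustrative.
EMBEDDING_TASKS = {
    "retrieval_document",
    "retrieval_query",
    "semantic_similarity",
    "classification",
    "clustering",
}
DEFAULT_EMBEDDING_TASK = "retrieval_document"


def build_collection_config(name, source, feature_extractor, embedding_task=None):
    """Assemble a hypothetical collection config dict.

    `collection_name` is an assumed key name; check the Collection API
    reference for the real request schema.
    """
    task = embedding_task or DEFAULT_EMBEDDING_TASK
    if task not in EMBEDDING_TASKS:
        raise ValueError(f"unknown embedding_task: {task!r}")

    config = {
        "collection_name": name,
        "source": source,
        "feature_extractor": feature_extractor,
    }
    # Only send the override when it differs from the default.
    if task != DEFAULT_EMBEDDING_TASK:
        config["embedding_task"] = task
    return config


# A dedup pipeline is one of the cases where an override makes sense.
dedup_config = build_collection_config(
    "dedup",
    source={"type": "bucket"},  # assumed shape
    feature_extractor={"name": "text_extractor"},  # assumed shape
    embedding_task="semantic_similarity",
)
```

A search collection would simply omit the argument and rely on the default.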
Feature URIs
Every extracted feature is addressed by a URI that pins it to a specific extractor version:
- mixpeek://multimodal_extractor@v1/multimodal_embedding
- mixpeek://text_extractor@v1/text_embedding
- mixpeek://face_detector@v1/face_embedding
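The three URIs above appear to share the shape `mixpeek://<extractor>@<version>/<feature>`. A minimal parsing sketch under that assumption follows; the pattern is inferred from the examples only, and `FeatureURI` and `parse_feature_uri` are hypothetical names, not part of any Mixpeek SDK.

```python
import re
from typing import NamedTuple


class FeatureURI(NamedTuple):
    extractor: str
    version: str
    feature: str


# Shape inferred from the example URIs; not an official grammar.
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<feature>.+)$"
)


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into extractor, version, and feature name."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(**match.groupdict())


parsed = parse_feature_uri("mixpeek://text_extractor@v1/text_embedding")
```

Here `parsed.version` is `"v1"`, which is the part that pins the feature to a specific extractor version.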
Tiered Pipelines
When a batch is submitted, the engine runs a DAG of extractors:
- Tier 1 collections process raw objects from the bucket
- Tier 2 collections consume Tier 1 documents as input
- Each tier waits for dependencies before executing
source and feature_extractor configuration. Dependencies are resolved automatically.
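To make the tier ordering concrete, the sketch below derives tier numbers from a dependency map using the standard-library `graphlib` module (Python 3.9+). It illustrates the general idea of resolving a DAG and is not Mixpeek's actual scheduler; the collection names and the `assign_tiers` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def assign_tiers(dependencies):
    """Map each collection to a tier number.

    `dependencies` maps a collection name to the set of upstream
    collections it consumes. An empty set means the collection reads
    raw objects from the bucket, which puts it in Tier 1.
    """
    tiers = {}
    # static_order() yields upstream collections before their consumers
    # and raises graphlib.CycleError if the graph is not a DAG.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, ())
        tiers[name] = 1 + max((tiers[u] for u in upstream), default=0)
    return tiers


# Hypothetical pipeline: two Tier 2 collections feed a Tier 3 summary.
pipeline = {
    "frames": set(),
    "faces": {"frames"},
    "captions": {"frames"},
    "summary": {"faces", "captions"},
}
tiers = assign_tiers(pipeline)
```

A collection's tier is one more than the highest tier among its inputs, so it can only run once every dependency has finished.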
Built-in Extractors
| Extractor | Modality | Output |
|---|---|---|
| Multimodal | Video, image, audio | Vertex AI 1408D embeddings, transcripts, scene descriptions |
| Text | Text | E5-Large 1024D embeddings |
| Image | Image | SigLIP 768D embeddings, descriptions, structured extraction |
| Face Identity | Video, image | ArcFace 512D face embeddings, bounding boxes |
| Document | PDF, DOCX | Text chunks, OCR, embeddings |
| Gemini Multi-file | Any | Gemini-powered cross-file analysis |
| Web Scraper | URLs | Scraped text content + embeddings |
| Course Content | Video | Lecture segments, slides, transcripts |
| Passthrough | Any | Forward metadata without extraction |

