Documentation Index
Fetch the complete documentation index at: https://docs.mixpeek.com/docs/llms.txt
Use this file to discover all available pages before exploring further.
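For example, from a shell:
curl -s https://docs.mixpeek.com/docs/llms.txt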
Collections
Collections bind a bucket to a feature extractor. When you submit a batch, the engine runs the extractor against each object and produces searchable documents.
curl -X POST "https://api.mixpeek.com/v1/collections" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-embeddings",
    "source": { "type": "bucket", "bucket_id": "'$BUCKET_ID'" },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": { "image": "payload.hero_image", "text": "payload.product_text" }
    }
  }'
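Once the collection exists, submitting a batch of objects to the bucket triggers extraction. The sketch below is illustrative only: the batch endpoint path and request body are assumptions, not the documented API.
curl -X POST "https://api.mixpeek.com/v1/buckets/$BUCKET_ID/batches" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["'$OBJECT_ID'"] }'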
A single object can feed multiple collections, each running a different extractor. Documents retain lineage to the source object via root_object_id.
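For instance, a second collection on the same bucket could run the text extractor alone. This sketch reuses the request shape above; only the collection name, extractor, and input mapping differ:
curl -X POST "https://api.mixpeek.com/v1/collections" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-text",
    "source": { "type": "bucket", "bucket_id": "'$BUCKET_ID'" },
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": { "text": "payload.product_text" }
    }
  }'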
Collection API →
Feature URIs
Every extracted feature is addressed by a URI that pins it to a specific extractor version:
mixpeek://{extractor_name}@{version}/{output_name}
Examples:
mixpeek://multimodal_extractor@v1/multimodal_embedding
mixpeek://text_extractor@v1/text_embedding
mixpeek://face_detector@v1/face_embedding
Feature URIs are referenced by retriever stages, taxonomies, and clustering jobs. They guarantee query-time compatibility with the extraction pipeline: swap the URI, re-embed, and everything downstream stays consistent.
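As an illustration, a retriever stage might pin its vector search to one extractor's output via a feature URI. This is a hypothetical sketch: the endpoint path, stage name, and parameter fields here are assumptions; consult the retriever docs for the actual schema.
curl -X POST "https://api.mixpeek.com/v1/retrievers" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "product-search",
    "collection_ids": ["'$COLLECTION_ID'"],
    "stages": [{
      "stage_name": "knn_search",
      "parameters": {
        "feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding"
      }
    }]
  }'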
Tiered Pipelines
When a batch is submitted, the engine runs a DAG of extractors:
- Tier 1 collections process raw objects from the bucket
- Tier 2 collections consume Tier 1 documents as input
- Each tier waits for dependencies before executing
video → scenes (Tier 1) → faces per scene (Tier 2) → expressions per face (Tier 3)
Collections define the pipeline through their source and feature_extractor configuration. Dependencies are resolved automatically.
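A Tier 2 collection is declared the same way, but sources another collection instead of a bucket. In this sketch the "collection" source type, the collection_id field, and the features.scene_frame input path are assumptions mirroring the bucket example above; face_detector matches the feature URI shown earlier.
curl -X POST "https://api.mixpeek.com/v1/collections" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "scene-faces",
    "source": { "type": "collection", "collection_id": "'$SCENES_COLLECTION_ID'" },
    "feature_extractor": {
      "feature_extractor_name": "face_detector",
      "version": "v1",
      "input_mappings": { "image": "features.scene_frame" }
    }
  }'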
Extractors
| Extractor | Modality | Output |
|---|---|---|
| Multimodal | Video, image, audio | Vertex AI 1408D embeddings, transcripts, scene descriptions |
| Text | Text | E5-Large 1024D embeddings |
| Image | Image | SigLIP 768D embeddings, descriptions, structured extraction |
| Face Identity | Video, image | ArcFace 512D face embeddings, bounding boxes |
| Document | PDF, DOCX | Text chunks, OCR, embeddings |
| Gemini Multi-file | Any | Gemini-powered cross-file analysis |
| Web Scraper | URLs | Scraped text content + embeddings |
| Course Content | Video | Lecture segments, slides, transcripts |
| Passthrough | Any | Forward metadata without extraction |
See the full Extractor Reference for configuration details.
For extraction logic beyond built-in models, build custom extractors:
pip install mixpeek
mixpeek plugin init my-extractor # Scaffold from template
mixpeek plugin test my-extractor # Validate locally
mixpeek plugin publish my-extractor # Upload and deploy
Custom extractors run on managed infrastructure with access to GPU/CPU resources, HuggingFace models, and LLM services. They support batch processing, real-time endpoints, and custom model loading.
See the full extractors guide for manifest format, pipeline hooks, security constraints, and deployment lifecycle.
Extractor API →