This page covers the core concept — how a single extractor is configured and what it produces. For chaining extractors across multiple collections in a DAG, see Multi-Tier Feature Extraction. For the catalog of built-in extractors, see the pages under Built-in Feature Extractors. For BYO extraction, see Custom Extractors.
Anatomy of a Feature Extractor Config
Extractors are attached to a collection via the singular feature_extractor field:
feature_extractor_name + version
Identifies which extractor and which version to run. Versions are pinned per collection — a new v2 ships without disturbing collections on v1. You roll forward deliberately via a namespace migration, not in place.
input_mappings — Bind Extractor Inputs to Object Fields
Every extractor declares an input_schema (e.g., text_extractor expects a text input). input_mappings bind those named inputs to JSONPath-like paths in your source object: metadata.category, $.payload.description, etc. Without the mapping, the extractor has no idea which field holds its input.
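The binding can be sketched as a small resolver. This is an illustrative helper, not the actual Mixpeek implementation; the resolve_path function and sample object are assumptions that show how a JSONPath-like path selects the extractor's input.

```python
# Sketch of how an input_mapping resolves against a source object.
# resolve_path and the sample data are illustrative, not Mixpeek internals.

def resolve_path(obj: dict, path: str):
    """Resolve a JSONPath-like dotted path (optionally prefixed '$.')."""
    for key in path.removeprefix("$.").split("."):
        obj = obj[key]
    return obj

source_object = {
    "payload": {"description": "Waterproof hiking boot"},
    "metadata": {"category": "footwear"},
}

# Bind the extractor's declared 'text' input to a field in the object.
input_mappings = {"text": "$.payload.description"}

extractor_inputs = {
    name: resolve_path(source_object, path)
    for name, path in input_mappings.items()
}
# extractor_inputs == {"text": "Waterproof hiking boot"}
```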
field_passthrough — Carry Source Fields into Documents
Extractor outputs alone are rarely enough at query time. You want to filter by metadata.category or status without re-joining against the source bucket. field_passthrough copies selected source fields onto every output document so retrievers can use them directly in metadata filters.
parameters — Tune the Extractor
Parameters are extractor-specific: chunk strategy, embedding model variant, OCR enable/disable, transcription language, thumbnail quality. Each extractor’s reference page documents its parameter schema.
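Putting the four pieces together, a single-extractor config might look like the sketch below. The top-level field names follow the concepts above (feature_extractor_name, version, input_mappings, field_passthrough, parameters), but the exact value shapes — especially the field_passthrough entry format and the parameter names — are assumptions for illustration.

```python
# Illustrative collection config attaching one extractor.
# Field names follow this page's concepts; value shapes are assumptions.

collection_config = {
    "feature_extractor": {
        "feature_extractor_name": "text_extractor",
        "version": "v1",
        # Bind the extractor's declared inputs to source-object paths.
        "input_mappings": {"text": "$.payload.description"},
        # Copy these source fields onto every output document so
        # retrievers can filter on them without re-joining the bucket.
        "field_passthrough": [
            {"source_path": "metadata.category"},
            {"source_path": "metadata.status"},
        ],
        # Extractor-specific tuning; see the extractor's reference page.
        "parameters": {"chunk_strategy": "sentence"},
    }
}
```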
The Output Schema
When you configure an extractor, the collection immediately calculates an output_schema that merges passthrough fields with extractor outputs. You can inspect this before any processing runs.
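The merge itself is straightforward to sketch. The field names and type labels below are illustrative; the point is that passthrough fields and extractor outputs land in one flat schema.

```python
# Minimal sketch of deriving an output_schema by merging passthrough
# fields with the extractor's declared outputs. Names are illustrative.

extractor_outputs = {
    "text_embedding": {"type": "vector", "dims": 1024},
    "text": {"type": "string"},
}
passthrough_fields = {
    "metadata.category": {"type": "string"},
    "metadata.status": {"type": "string"},
}

# One flat schema; extractor outputs win on a name collision in this sketch.
output_schema = {**passthrough_fields, **extractor_outputs}
```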
Feature URIs: The Stable Contract
Every extractor output publishes a URI in the form mixpeek://{name}@{version}/{output}:
- mixpeek://text_extractor@v1/text_embedding
- mixpeek://multimodal_extractor@v1/scene_embedding
- mixpeek://multimodal_extractor@v1/transcription
- mixpeek://document_extractor@v1/ocr_text
- Retrievers reference features by URI in feature_search stages.
- Taxonomies classify against a feature URI.
- Clusters group documents by a feature URI.
- Migrations swap one URI for another across retrievers atomically.
v1 keeps working even after v2 is released — it simply doesn’t see v2 features unless you explicitly migrate.
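The URI format is simple enough to parse with a small helper. parse_feature_uri below is an illustrative utility, not part of any Mixpeek SDK.

```python
# Helper sketch for the mixpeek://{name}@{version}/{output} URI format.
# parse_feature_uri is illustrative, not part of an official SDK.

def parse_feature_uri(uri: str) -> dict:
    assert uri.startswith("mixpeek://"), "not a feature URI"
    name_version, output = uri.removeprefix("mixpeek://").split("/", 1)
    name, version = name_version.split("@", 1)
    return {"name": name, "version": version, "output": output}

info = parse_feature_uri("mixpeek://text_extractor@v1/text_embedding")
# info == {"name": "text_extractor", "version": "v1", "output": "text_embedding"}
```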
Vector Index Registration
When an extractor publishes an embedding, Mixpeek registers a vector index in MVS at the matching URI. Retrievers then query that exact index via feature_search.
For built-in extractors this is automatic — the vector index is declared in the extractor’s definition. For custom extractors, it’s the single biggest pitfall: the features list in manifest.py must use the exact keys feature_type, feature_name, embedding_dim, distance_metric. Intuitive-but-wrong names (type, name, dimensions, distance) silently create a collection with zero vector indexes, and the ingestion task reports COMPLETED with 0 documents written.
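A pre-flight check on the manifest catches this before ingestion. Only the four required key names come from this page; the validator function and sample entries below are illustrative.

```python
# Sketch of a pre-flight check for a custom extractor's manifest.py
# 'features' list, guarding against the wrong-key pitfall described above.
# The four required key names come from the docs; the rest is illustrative.

REQUIRED_KEYS = {"feature_type", "feature_name", "embedding_dim", "distance_metric"}

def validate_features(features: list[dict]) -> list[str]:
    errors = []
    for i, entry in enumerate(features):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            errors.append(f"features[{i}] missing keys: {sorted(missing)}")
    return errors

good = [{"feature_type": "embedding", "feature_name": "my_embedding",
         "embedding_dim": 768, "distance_metric": "cosine"}]
bad = [{"type": "embedding", "name": "my_embedding",
        "dimensions": 768, "distance": "cosine"}]  # intuitive but wrong keys

assert validate_features(good) == []
assert validate_features(bad)  # non-empty: would register zero vector indexes
```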
What Happens at Runtime
For a single-extractor collection, the end-to-end path is:
- Flatten. The API flattens the manifest into per-extractor row artifacts (Parquet) and stores them in S3.
- Schedule. Ray poller discovers pending batches and submits a job.
- Run. Workers load the dataset, run the extractor flow (GPU if available), and emit features plus passthrough fields.
- Write. QdrantBatchProcessor writes vectors and payloads to MVS, emits webhook events, and updates index signatures.
In multi-extractor collections, the processing_tier field becomes load-bearing; see Multi-Tier Feature Extraction for how the DAG executes.
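The four stages above can be sketched as plain functions to make the data flow concrete. All names here are illustrative stand-ins; the real pipeline uses Ray jobs, S3 Parquet artifacts, and QdrantBatchProcessor.

```python
# High-level sketch of the four runtime stages (flatten, schedule, run,
# write). All function bodies are illustrative stand-ins.

def flatten(objects):
    # API step: one row artifact per source object.
    return [{"row": obj} for obj in objects]

def schedule(rows):
    # Poller step: group pending rows into a batch job.
    return {"job": rows, "status": "PENDING"}

def run(job):
    # Worker step: run the extractor flow over the batch, emitting features.
    return [{"features": {"text_embedding": [0.0] * 4}, **r} for r in job["job"]]

def write(documents):
    # Writer step: persist vectors + payloads, emit webhook events.
    return {"status": "COMPLETED", "documents_written": len(documents)}

result = write(run(schedule(flatten([{"id": 1}, {"id": 2}]))))
# result == {"status": "COMPLETED", "documents_written": 2}
```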
Performance and Scaling
- GPU workers deliver 5–10× faster throughput for embeddings, reranking, and video processing. CPU-only extractors skip GPU allocation (saves ~3 minutes of cluster startup and ~6× on cost).
- Ray Data handles batching, shuffling, and parallelization automatically. Default batch size is 64 rows; tune via the extractor’s compute profile.
- Autoscaling maintains target utilization (0.7 CPU, 0.8 GPU by default).
- Inference cache short-circuits repeated model calls when inputs hash to the same key, which is handy when reprocessing or when ingesting near-duplicates.
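The inference-cache idea reduces to content-addressed memoization: hash the input, skip the model call on a hit. The hashing scheme and cache structure below are assumptions; the stand-in embed function just counts invocations so the cache effect is visible.

```python
# Sketch of the inference-cache idea: hash the model input and
# short-circuit repeated calls. Hashing scheme and cache are illustrative.

import hashlib

_cache: dict[str, list[float]] = {}
calls = 0

def embed(text: str) -> list[float]:
    """Stand-in model call; counts invocations to show the cache effect."""
    global calls
    calls += 1
    return [float(len(text))]  # placeholder for a real embedding

def cached_embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)
    return _cache[key]

cached_embed("hello")
cached_embed("hello")  # second call hits the cache; the model runs once
# calls == 1
```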
Operational Checklist
- Pin versions. Upgrade deliberately via new collection + migration, never in-place.
- Batch uploads. Keep ingestion batches in the 1k–10k object range to maximize parallelism without overwhelming the scheduler.
- Use field_passthrough for every metadata filter. If you plan to filter by category at query time, pass it through at ingestion time.
- Inspect features before querying. GET /v1/collections/{id}/features returns the feature URIs, dimensions, and descriptions actually registered; use this to confirm the extractor ran correctly before building retrievers.
- Watch lineage, not just counts. A task reporting COMPLETED with 0 features written almost always means an input_mappings mistake or (for custom extractors) a bad features manifest. The Document Lineage API and the vector_indexes field on the collection are your diagnostic tools.
Reference Catalog
The pages under Built-in Feature Extractors document every built-in extractor: input types, output schema, parameters, latency, and cost.

| Extractor | Description | Input Types | Dimensions | Latency | Cost |
|---|---|---|---|---|---|
| passthrough_extractor_v1 | Copies source fields without processing. Use for passing metadata or vectors between collections. | Any | N/A | <1ms | Free |
| text_extractor_v1 | Dense embeddings via E5-Large multilingual. Supports chunking and LLM extraction via response_shape. | text, string | 1024 | ~5ms/doc | Free |
| multimodal_extractor_v1 | Unified embeddings for video, image, text, and GIF via Google Vertex AI. Videos decomposed by time, scene, or silence with transcription, OCR, and thumbnails. | video, image, text, string | 1408 | 0.5–2× realtime | $0.01–0.15/min |

