This is the composition story for feature extractors. For the core concept — how a single extractor is configured — see Feature Extractors. For the catalog, see the pages under Built-in Feature Extractors.
Why One Extractor Isn’t Enough
Three problems force you out of the single-extractor model:

1. Heterogeneous inputs. A single 30-minute video is not one document: it's dozens of segments, each with its own transcription, thumbnail, embedding, and OCR. A PDF is N pages with layout, text chunks, and figures. One extractor can't own both the decomposition and every downstream enrichment without becoming an unmaintainable god-model.
2. Model boundaries. Transcription is an audio model. Text embedding is a text model. Taxonomy classification is an LLM. Each has different hardware (GPU vs. API), different batch sizes, and different failure modes. Running them all in a single actor wastes GPU time on I/O-bound work and vice versa.
3. Reusability. You want to build one transcription step and reuse its output for embedding, summarization, diarization, and taxonomy classification. If transcription is buried inside a monolithic extractor, every downstream use triggers a redundant pass.

Multi-tier extraction moves composition into the ingestion graph itself: one extractor per job, one concern per extractor, composed via collections.

The Object → Document → Feature Model
Three primitives do all the work in Mixpeek:

| Primitive | What It Is | Why It Matters |
|---|---|---|
| Object | Raw file or record in a bucket (a video, a PDF, a row of JSON). | The input boundary. You ingest objects. |
| Document | One row of output in a collection, produced by decomposition. | The query boundary. You search documents. |
| Feature | A named output attached to a document (embedding, transcription, OCR text, label, score). | The composition boundary. Retrievers, taxonomies, and clusters reference features by URI. |
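The relationships between the three primitives can be sketched with plain dataclasses. This is an illustrative model only, not the Mixpeek SDK; all class and field names here are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Object:
    # Raw file or record in a bucket -- the input boundary.
    object_id: str
    bucket_id: str

@dataclass
class Feature:
    # Named output attached to a document -- the composition boundary,
    # referenced by retrievers, taxonomies, and clusters.
    name: str
    value: object

@dataclass
class Document:
    # One row of output in a collection -- the query boundary.
    document_id: str
    collection_id: str
    source_object_id: str
    features: list = field(default_factory=list)

video = Object("obj_video_7", "bkt_media")
segment = Document("doc_seg_1", "col_segments", video.object_id)
segment.features.append(Feature("transcription", "hello world"))
```

One object fans out to many documents, and each document carries its own named features; that asymmetry is what the rest of this page builds on.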
Chaining Collections Across Tiers
To chain extractors, you chain collections: collection B's `source_collection_id` points at collection A, and B's extractor reads A's documents as input. The engine assigns a processing tier to each collection based on the dependency graph.
Worked Example: Video → Transcription → Text Embeddings → Classifications
At query time, run `feature_search` over `col_classified` (fine-grained, classified chunks) and trace each hit back to the original video segment via `field_passthrough` metadata.
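The worked example's chain can be written out as data. This is a hypothetical shape whose field names mirror the concepts on this page, not an exact API schema; the collection and extractor names are assumptions:

```python
# Hypothetical three-tier chain: video -> transcription -> embeddings -> labels.
chain = [
    {"collection_id": "col_segments",
     "source": {"bucket_id": "bkt_videos"},        # tier 1: video -> segments
     "feature_extractor": "video_extractor"},
    {"collection_id": "col_chunks",
     "source": {"collection_id": "col_segments"},  # tier 2: segment -> transcript chunks + text embeddings
     "feature_extractor": "text_extractor"},
    {"collection_id": "col_classified",
     "source": {"collection_id": "col_chunks"},    # tier 3: chunk -> taxonomy labels
     "feature_extractor": "taxonomy_classifier"},
]
```

Each entry owns exactly one concern, and each non-first entry names the previous collection as its source; that is the whole composition mechanism.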
How the Engine Schedules Tiers
- Flatten. The API flattens the manifest into per-extractor row artifacts (Parquet) and stores them in S3, along with a dependency graph.
- Poll. Ray pollers walk the graph tier-by-tier. A tier’s batch is eligible only once every upstream tier has written its MVS rows.
- Submit. When a tier’s dependencies are satisfied, the poller submits a Ray job for that tier. Tiers at the same depth can run in parallel.
- Write. Workers run the extractor, emit features and passthrough fields, and `QdrantBatchProcessor` writes vectors and payloads to MVS. Webhooks fire on completion.
- Cascade. Downstream tiers become eligible as soon as their inputs land.
Cross-Tier Lineage
Every document carries the metadata needed to walk back up the chain:

- Down from an object: "show me every document derived from `obj_video_7`" returns all tier-1 segments, all tier-2 chunks, and all tier-3 classifications in a single tree.
- Up from a document: "where did `doc_chunk_42` come from?" returns the source segment, the source object, and the extractors that produced each.
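Both directions reduce to walking parent pointers. A minimal sketch, assuming each document records its parent (a document or an object) and the extractor that produced it; the record shape is an assumption:

```python
docs = [
    {"id": "doc_seg_1",   "parent": "obj_video_7",  "extractor": "video_extractor"},
    {"id": "doc_chunk_42", "parent": "doc_seg_1",   "extractor": "text_extractor"},
    {"id": "doc_cls_9",   "parent": "doc_chunk_42", "extractor": "taxonomy_classifier"},
]

def down(root: str, docs: list[dict]) -> list[str]:
    """Every document derived, directly or transitively, from root."""
    children = [d["id"] for d in docs if d["parent"] == root]
    return children + [g for c in children for g in down(c, docs)]

def up(doc_id: str, docs: list[dict]) -> list[tuple[str, str]]:
    """Provenance trail: (parent, producing extractor) back to the object."""
    by_id = {d["id"]: d for d in docs}
    trail = []
    while doc_id in by_id:
        d = by_id[doc_id]
        trail.append((d["parent"], d["extractor"]))
        doc_id = d["parent"]
    return trail
```

`down("obj_video_7", docs)` yields the whole derivation tree, while `up("doc_cls_9", docs)` ends at the source object, matching the two queries above.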
Decomposition Patterns
The shape of the DAG reflects the shape of your data:

| Pattern | Tier 1 | Tier 2+ | When to Use |
|---|---|---|---|
| Fan-out | One extractor decomposes objects into many documents (video → segments, PDF → pages). | Per-segment enrichments (embedding, classification). | Video libraries, document intelligence. |
| Fan-in | Multiple source collections. | One extractor joins or aggregates across them. | Multimodal catalogs that combine text + image embeddings into a single document per product. |
| Sidecar | Primary extractor runs. | Secondary tier adds a derived feature (e.g., a summary or classification) to the same documents. | Adding enrichments after the fact without reprocessing the primary content. |
| Versioned swap | v1 extractor runs on all objects. | v2 runs on the same objects in a parallel collection. | A/B testing a new model version before migrating retrievers. |
Migrations Across a Chain
Upgrading a tier-1 extractor is not a local change: every downstream tier reads tier-1's output, so a schema change ripples through the chain. The safe migration pattern:

- Branch. Create a parallel tier-1 collection (`col_segments_v2`) using the new extractor version. Ingest the same objects from the same bucket.
- Rebuild downstream. Create tier-2 and tier-3 collections pointing at `col_segments_v2` as their source. You now have a second, fully independent chain running in parallel.
- Evaluate. Point an evaluation dataset at both chains (two retrievers referencing the different feature URIs). Compare recall, precision, latency, and cost.
- Cut over. Use a namespace migration to swap retriever references from the `v1` feature URIs to the `v2` URIs atomically.
- Retire. Once traffic is stable on the new chain, delete the old collections.
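The cut-over step amounts to rewriting feature-URI references in one atomic pass. A hedged sketch of the idea; the retriever record shape and the URI format are assumptions, not Mixpeek's actual schema:

```python
def cut_over(retrievers: list[dict], v1_prefix: str, v2_prefix: str) -> list[dict]:
    """Rewrite every retriever's feature URI from the v1 chain to the v2 chain.
    Building a fresh list (rather than mutating in place) models the atomic
    swap: callers switch to the new list all at once, or not at all."""
    return [
        {**r, "feature_uri": r["feature_uri"].replace(v1_prefix, v2_prefix, 1)}
        for r in retrievers
    ]

retrievers = [
    {"id": "ret_search",  "feature_uri": "col_chunks_v1/text_embedding"},
    {"id": "ret_cluster", "feature_uri": "col_chunks_v1/transcription"},
]
migrated = cut_over(retrievers, "col_chunks_v1", "col_chunks_v2")
```

Because the old chain keeps running until the retire step, a bad cut-over is reversible by swapping the prefixes back.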
Operational Gotchas
- `field_passthrough` at every tier. Each tier only carries forward what you explicitly pass through. If you want `metadata.category` visible at tier 3, pass it through at tier 1 and tier 2 as well. Missing passthroughs are the most common source of "why is this field gone?" questions.
- Tier-0 `input_mappings` always read from the bucket schema; tier-N reads from the previous collection's document schema. They look the same, but the resolution rules differ.
- Re-ingesting a tier-1 object cascades the whole chain. Expect tier-2 and tier-3 jobs to fire automatically once tier 1 writes. This is usually what you want, but it's worth knowing before you kick off a 10M-object reprocess.
- Deleting a tier-1 document tombstones everything downstream. Lineage is preserved so retrievers can filter out orphans; the actual downstream features are garbage-collected asynchronously.
- Custom plugins work at any tier. The same `manifest.py` / `pipeline.py` / `realtime.py` contract applies whether the plugin is reading raw objects or the output of another extractor.
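The passthrough gotcha is easiest to see executed. A toy model of per-tier field propagation (the function and its signature are illustrative, not an API): each tier keeps only what is explicitly named, so forgetting the passthrough at any single tier silently drops the field for everything downstream.

```python
def propagate(doc_fields: dict, passthrough: list[str], new_features: dict) -> dict:
    """One tier's output document: explicitly passed-through fields
    plus the features this tier's extractor emits. Anything not named
    in `passthrough` is dropped at this tier."""
    out = {k: doc_fields[k] for k in passthrough if k in doc_fields}
    out.update(new_features)
    return out

obj = {"metadata.category": "sports", "title": "demo"}
tier1 = propagate(obj, ["metadata.category"], {"transcript": "..."})
tier2 = propagate(tier1, [], {"label": "news"})  # passthrough forgotten here
# metadata.category survives tier 1 but is gone from tier 2 onward
```

This is why the rule is "pass it through at every tier": propagation is per-hop, and there is no fallback lookup into earlier tiers at query time.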

