A single extractor call gives you one pass over raw data. That works for homogeneous inputs and a single model. It falls apart the moment you need to transcribe audio and then embed the transcription, or OCR a PDF and then chunk the text and then classify each chunk. Multi-tier extraction solves this by turning ingestion into a declarative DAG: collections chain together, each tier reads the output of the previous one, and the engine schedules them in the correct order.
This is the composition story for feature extractors. For the core concept — how a single extractor is configured — see Feature Extractors. For the catalog, see the pages under Built-in Feature Extractors.

Why One Extractor Isn’t Enough

Three problems force you out of the single-extractor model:
  1. Heterogeneous inputs. A single 30-minute video is not one document — it’s dozens of segments, each with its own transcription, thumbnail, embedding, and OCR. A PDF is N pages with layout, text chunks, and figures. One extractor can’t own both the decomposition and every downstream enrichment without becoming an unmaintainable god-model.
  2. Model boundaries. Transcription is an audio model. Text embedding is a text model. Taxonomy classification is an LLM. Each has different hardware (GPU vs API), different batch sizes, and different failure modes. Trying to run them in a single actor wastes GPU time on I/O-bound work and vice versa.
  3. Reusability. You want to build one transcription step and reuse its output for embedding, summarization, diarization, and taxonomy classification. If transcription is buried inside a monolithic extractor, every downstream use triggers a redundant pass.
Multi-tier extraction moves composition into the ingestion graph itself: one extractor per job, one concern per extractor, composed via collections.

The Object → Document → Feature Model

Three primitives do all the work in Mixpeek:
  • Object: a raw file or record in a bucket (a video, a PDF, a row of JSON). The input boundary. You ingest objects.
  • Document: one row of output in a collection, produced by decomposition. The query boundary. You search documents.
  • Feature: a named output attached to a document (an embedding, transcription, OCR text, label, or score). The composition boundary. Retrievers, taxonomies, and clusters reference features by URI.
The pipeline is always:
Object (bucket) → Decomposition → Document (collection) → Feature extractor → Feature (MVS + Mongo)
Decomposition is the step people miss. A 30-minute video isn’t one document — it’s dozens of segments with their own transcriptions, thumbnails, and embeddings. A PDF isn’t one document — it’s N pages with OCR, layout, and text chunks. The extractor decides the decomposition strategy (time/scene/silence for video; page/paragraph/sentence for text), and every resulting document gets its own row. Retrievers search at the document grain, not the object grain.
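As a sketch of what decomposition produces (the `Document` class and `decompose_video` function here are illustrative, not the Mixpeek SDK), one video object fans out into one document per scene segment, each carrying its own segment-level fields:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """One row in a collection, produced by decomposing an object."""
    document_id: str
    root_object_id: str
    features: dict = field(default_factory=dict)

def decompose_video(object_id, scene_boundaries_ms):
    """Emit one document per scene segment; enrichments attach per document later."""
    docs = []
    for i, (start, end) in enumerate(zip(scene_boundaries_ms, scene_boundaries_ms[1:])):
        docs.append(Document(
            document_id=f"{object_id}_seg_{i}",
            root_object_id=object_id,
            features={"segment_start_ms": start, "segment_end_ms": end},
        ))
    return docs

# One 60-second video object becomes three segment documents.
docs = decompose_video("obj_video_7", [0, 12_000, 31_500, 60_000])
```

Retrievers then operate on these rows, so a hit is a segment, not the whole video.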

Chaining Collections Across Tiers

To chain extractors, you chain collections: collection B’s source_collection_id points at collection A, and B’s extractor reads A’s documents as input. The engine assigns a processing tier to each collection based on the dependency graph:
Tier 1  raw objects       → multimodal_extractor  → segments with transcription + scene embeddings
Tier 2  tier-1 documents  → text_extractor        → chunks with text embeddings
Tier 3  tier-2 documents  → taxonomy_extractor    → classifications per chunk
Tier 1 runs first on raw objects. Tier 2 runs only once tier 1 has written results. Tier 3 only after tier 2. The engine respects this DAG automatically — you don’t trigger tiers manually.
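The tier number follows directly from the source_collection_id chain. A minimal sketch, assuming each collection declares either a bucket source (no source collection) or exactly one source collection:

```python
def assign_tiers(sources):
    """Tier = 1 + the tier of the source collection; bucket-sourced collections are tier 1.
    `sources` maps collection_id -> source collection_id (None means it reads a raw bucket)."""
    tiers = {}

    def tier(col):
        if col not in tiers:
            src = sources[col]
            tiers[col] = 1 if src is None else 1 + tier(src)
        return tiers[col]

    for col in sources:
        tier(col)
    return tiers

# The three-collection chain from the diagram above.
chain = {
    "col_segments": None,            # reads bkt_videos
    "col_chunks": "col_segments",
    "col_classified": "col_chunks",
}
tiers = assign_tiers(chain)
# tiers["col_segments"] == 1, tiers["col_chunks"] == 2, tiers["col_classified"] == 3
```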

Worked Example: Video → Transcription → Text Embeddings → Classifications

// Tier 1: raw video objects → segments with transcription
{
  "collection_id": "col_segments",
  "source_bucket_id": "bkt_videos",
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": { "video": "file" },
    "parameters": { "segment_strategy": "scene", "run_transcription": true }
  }
}

// Tier 2: segment transcriptions → text chunks + embeddings
{
  "collection_id": "col_chunks",
  "source_collection_id": "col_segments",
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": { "text": "transcription" },
    "field_passthrough": [
      { "source_path": "segment_start_ms" },
      { "source_path": "segment_end_ms" },
      { "source_path": "root_object_id" }
    ],
    "parameters": { "chunk_strategy": "sentences" }
  }
}

// Tier 3: text chunks → taxonomy classifications
{
  "collection_id": "col_classified",
  "source_collection_id": "col_chunks",
  "feature_extractor": {
    "feature_extractor_name": "taxonomy_extractor",
    "version": "v1",
    "input_mappings": { "text": "text_embedding" },
    "parameters": { "taxonomy_id": "tax_topics" }
  }
}
At query time, a retriever can feature_search over col_classified (fine-grained, classified chunks) and trace each hit back to the original video segment via field_passthrough metadata.
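A sketch of that trace-back at query time. The hit payload shape and the link format are assumptions, but the fields are exactly the ones carried forward by field_passthrough in the tier-2 config above:

```python
# Hypothetical retriever hit over col_classified: segment timing and root object
# survive only because they were listed in field_passthrough at each tier.
hit = {
    "document_id": "doc_chunk_42",
    "root_object_id": "obj_video_7",
    "segment_start_ms": 12_000,
    "segment_end_ms": 31_500,
}

def segment_link(hit):
    """Build a seek link back to the source video segment (illustrative URL scheme)."""
    start_s = hit["segment_start_ms"] // 1000
    return f"videos/{hit['root_object_id']}#t={start_s}"

segment_link(hit)  # "videos/obj_video_7#t=12"
```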

How the Engine Schedules Tiers

  1. Flatten. The API flattens the manifest into per-extractor row artifacts (Parquet) and stores them in S3, along with a dependency graph.
  2. Poll. Ray pollers walk the graph tier-by-tier. A tier’s batch is eligible only once every upstream tier has written its MVS rows.
  3. Submit. When a tier’s dependencies are satisfied, the poller submits a Ray job for that tier. Tiers at the same depth can run in parallel.
  4. Write. Workers run the extractor, emit features and passthrough fields, and QdrantBatchProcessor writes vectors and payloads to MVS. Webhooks fire on completion.
  5. Cascade. Downstream tiers become eligible as soon as their inputs land.
This is fundamentally a pull-based scheduler: nothing runs until the data it needs exists. It means you can re-ingest a single object at the top of the DAG and the entire chain cascades automatically.
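The steps above can be sketched as a pull-based loop, where nothing is scheduled until its source collection has completed. All names here are illustrative:

```python
def run_order(sources):
    """Compute the waves in which collections become eligible.
    `sources` maps collection_id -> source collection_id (None = reads a raw bucket).
    Collections in the same wave have no dependency on each other and can run in parallel."""
    completed, waves = set(), []
    while len(completed) < len(sources):
        ready = sorted(
            col for col, src in sources.items()
            if col not in completed and (src is None or src in completed)
        )
        if not ready:
            raise RuntimeError("cycle in collection graph")
        waves.append(ready)
        completed.update(ready)
    return waves

run_order({
    "col_segments": None,
    "col_chunks": "col_segments",
    "col_classified": "col_chunks",
})
# [["col_segments"], ["col_chunks"], ["col_classified"]]
```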

Cross-Tier Lineage

Every document carries the metadata needed to walk back up the chain:
{
  "document_id": "doc_chunk_42",
  "root_object_id": "obj_video_7",
  "source_collection_id": "col_segments",
  "source_document_id": "doc_segment_3",
  "processing_tier": 2,
  "feature_address": "mixpeek://text_extractor@v1/text_embedding"
}
Use the Document Lineage API to walk the tree in either direction:
  • Down from an object — “show me every document derived from obj_video_7” returns all tier-1 segments, all tier-2 chunks, all tier-3 classifications in a single tree.
  • Up from a document — “where did doc_chunk_42 come from?” returns the source segment, the source object, and the extractors that produced each.
The decomposition tree visualization renders this as a graph. It’s the primary tool for debugging “why is this document showing up in my retriever results?”
When a tier reports COMPLETED with fewer documents than expected, the lineage tree tells you exactly where the drop happened — which input documents failed to produce output, and which extractor was running at the time.
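A sketch of that debugging step, assuming each downstream document records its source_document_id as shown in the lineage payload above:

```python
def dropped_inputs(parent_ids, child_docs):
    """Return the upstream documents that produced no downstream output.
    These mark exactly where the document-count drop happened."""
    produced = {d["source_document_id"] for d in child_docs}
    return [p for p in parent_ids if p not in produced]

segments = ["doc_segment_1", "doc_segment_2", "doc_segment_3"]
chunks = [
    {"document_id": "doc_chunk_41", "source_document_id": "doc_segment_1"},
    {"document_id": "doc_chunk_42", "source_document_id": "doc_segment_3"},
]
dropped_inputs(segments, chunks)  # ["doc_segment_2"]
```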

Decomposition Patterns

The shape of the DAG reflects the shape of your data:
  • Fan-out. Tier 1: one extractor decomposes objects into many documents (video → segments, PDF → pages). Tier 2+: per-segment enrichments (embedding, classification). Use for video libraries and document intelligence.
  • Fan-in. Tier 1: multiple source collections. Tier 2+: one extractor joins or aggregates across them. Use for multimodal catalogs that combine text + image embeddings into a single document per product.
  • Sidecar. Tier 1: the primary extractor runs. Tier 2+: a secondary tier adds a derived feature (e.g., a summary or classification) to the same documents. Use for adding enrichments after the fact without reprocessing the primary content.
  • Versioned swap. Tier 1: the v1 extractor runs on all objects. Tier 2+: v2 runs on the same objects in a parallel collection. Use for A/B testing a new model version before migrating retrievers.

Migrations Across a Chain

Upgrading a tier-1 extractor is not a local change — every downstream tier reads tier-1’s output, so a schema change ripples through the chain. The safe migration pattern:
  1. Branch. Create a parallel tier-1 collection (col_segments_v2) using the new extractor version. Ingest the same objects from the same bucket.
  2. Rebuild downstream. Create tier-2 and tier-3 collections pointing at col_segments_v2 as their source. You now have a second, fully independent chain running in parallel.
  3. Evaluate. Point an evaluation dataset at both chains (two retrievers referencing the different feature URIs). Compare recall, precision, latency, and cost.
  4. Cut over. Use a namespace migration to swap retriever references from the v1 feature URIs to the v2 URIs atomically.
  5. Retire. Once traffic is stable on the new chain, delete the old collections.
This is the same blue/green pattern as service deploys, scaled up to the DAG: the feature URI is the stable address, and the migration flips which version of the chain answers to it.
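The cut-over in step 4 is essentially an atomic rename of feature URIs. A minimal sketch of the all-or-nothing check (the function and mapping are hypothetical, not the migration API):

```python
def cut_over(retriever_uris, v1_to_v2):
    """Swap every v1 feature URI for its v2 equivalent, failing before any mutation
    if a URI has no mapping. Atomic in spirit: all references flip, or none do."""
    missing = [u for u in retriever_uris if u not in v1_to_v2]
    if missing:
        raise ValueError(f"no v2 feature for: {missing}")
    return [v1_to_v2[u] for u in retriever_uris]

mapping = {
    "mixpeek://text_extractor@v1/text_embedding":
        "mixpeek://text_extractor@v2/text_embedding",
}
cut_over(["mixpeek://text_extractor@v1/text_embedding"], mapping)
# ["mixpeek://text_extractor@v2/text_embedding"]
```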

Operational Gotchas

  • field_passthrough at every tier. Each tier only carries forward what you explicitly pass through. If you want metadata.category visible at tier 3, pass it through at tier 1 and tier 2 as well. Missing passthroughs are the most common source of “why is this field gone?” questions.
  • Tier-1 input_mappings always read from the bucket schema; deeper tiers read from the previous collection’s document schema. They look the same but the resolution rules differ.
  • Re-ingesting a tier-1 object cascades the whole chain. Expect tier-2 and tier-3 jobs to fire automatically once tier 1 writes. This is usually what you want, but it’s worth knowing before you kick off a 10M-object reprocess.
  • Deleting a tier-1 document tombstones everything downstream. Lineage is preserved so retrievers can filter out orphans; the actual downstream features are garbage-collected asynchronously.
  • Custom plugins work at any tier. The same manifest.py / pipeline.py / realtime.py contract applies whether the plugin is reading raw objects or the output of another extractor.
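For the passthrough gotcha above, a small helper can tell you which tier dropped a field. A sketch, assuming you can list each tier's configured source_path values:

```python
def first_missing_tier(field, passthrough_by_tier):
    """Return the first tier whose field_passthrough omits `field`, or None if the
    field survives the whole chain. A field is visible at tier N only if every
    tier before N passed it through."""
    for tier in sorted(passthrough_by_tier):
        if field not in passthrough_by_tier[tier]:
            return tier
    return None

configs = {
    1: ["metadata.category", "root_object_id"],
    2: ["root_object_id"],  # forgot metadata.category here
}
first_missing_tier("metadata.category", configs)  # 2
```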