Feature Extractors

[Figure: feature extractor pipeline showing objects processed through ML models to produce reusable features]
Feature extractors are Ray-powered workflows that read objects, run ML models, and write features into collection documents. Each extractor output is published at a stable feature URI, so retrievers, taxonomies, and clusters can reference it with confidence.
This page covers the core concept — how a single extractor is configured and what it produces. For chaining extractors across multiple collections in a DAG, see Multi-Tier Feature Extraction. For the catalog of built-in extractors, see the pages under Built-in Feature Extractors. For BYO extraction, see Custom Extractors.

Anatomy of a Feature Extractor Config

Extractors are attached to a collection via the singular feature_extractor field:
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": {
      "text": "product_text"
    },
    "field_passthrough": [
      { "source_path": "metadata.category" },
      { "source_path": "metadata.brand" }
    ],
    "parameters": {
      "model": "multilingual-e5-large-instruct",
      "normalize": true
    }
  }
}
Four knobs control what the extractor sees and produces.

feature_extractor_name + version

Identifies which extractor and which version to run. Versions are pinned per collection — a new v2 ships without disturbing collections on v1. You roll forward deliberately via a namespace migration, not in place.

input_mappings — Bind Extractor Inputs to Object Fields

Every extractor declares an input_schema (e.g., text_extractor expects a text input). input_mappings bind those named inputs to JSONPath-like paths in your source object: metadata.category, $.payload.description, etc. Without the mapping, the extractor has no idea which field holds its input.
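For example, binding the text input to a nested payload field looks like this (the source field name is illustrative):
{
  "input_mappings": {
    "text": "$.payload.description"
  }
}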

field_passthrough — Carry Source Fields into Documents

Extractor outputs alone are rarely enough at query time. You want to filter by metadata.category or status without re-joining against the source bucket. field_passthrough copies selected source fields onto every output document so retrievers can use them directly in metadata filters.
If you plan to filter by a field at query time, pass it through at ingestion time. Reaching back into the source bucket from a retriever stage is 100–1000× slower than a local metadata predicate.
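With the config above, each output document carries the copied metadata.category and metadata.brand values next to the extractor outputs. A sketch of the resulting document (field layout illustrative, not the exact payload shape):
{
  "document_id": "doc_123",
  "metadata": { "category": "electronics", "brand": "acme" },
  "text_embedding": "<1024-dim vector stored in MVS>"
}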

parameters — Tune the Extractor

Parameters are extractor-specific: chunk strategy, embedding model variant, OCR enable/disable, transcription language, thumbnail quality. Each extractor’s reference page documents its parameter schema.
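The parameter names below are purely illustrative (each extractor's reference page has the real schema); they show the kind of knobs a video-oriented configuration might tune:
{
  "parameters": {
    "scene_strategy": "silence",
    "enable_ocr": true,
    "transcription_language": "en",
    "thumbnail_quality": 80
  }
}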

The Output Schema

When you configure an extractor, the collection immediately calculates an output_schema that merges passthrough fields with extractor outputs. You can inspect this before any processing runs:
curl "https://api.mixpeek.com/v1/collections/{collection_id}/features" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY"
The response lists every feature URI the extractor will produce, its type, its dimensions (for embeddings), and its description. Use this to validate downstream retrievers, taxonomies, and clusters before you ingest a single object.
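The shape below is a sketch, not the verbatim payload; it illustrates the kind of entries the endpoint returns:
{
  "features": [
    {
      "feature_uri": "mixpeek://text_extractor@v1/text_embedding",
      "type": "embedding",
      "dimensions": 1024,
      "description": "Dense multilingual text embedding"
    },
    {
      "path": "metadata.category",
      "type": "string",
      "description": "Passthrough field from the source object"
    }
  ]
}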

Feature URIs: The Stable Contract

Every extractor output publishes a URI in the form mixpeek://{name}@{version}/{output}:
  • mixpeek://text_extractor@v1/text_embedding
  • mixpeek://multimodal_extractor@v1/scene_embedding
  • mixpeek://multimodal_extractor@v1/transcription
  • mixpeek://document_extractor@v1/ocr_text
The URI is the contract between ingestion and everything downstream:
  • Retrievers reference features by URI in feature_search stages.
  • Taxonomies classify against a feature URI.
  • Clusters group documents by a feature URI.
  • Migrations swap one URI for another across retrievers atomically.
Because the URI pins the extractor version, a retriever built against v1 keeps working even after v2 is released — it simply doesn’t see v2 features unless you explicitly migrate.
Treat feature URIs like API versions: pin them in production retrievers, and roll them forward deliberately rather than in-place.
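As a sketch only (the exact feature_search stage schema is defined in the retriever docs), a production retriever pinned to the v1 URI might look like:
{
  "stage": "feature_search",
  "feature_uri": "mixpeek://text_extractor@v1/text_embedding",
  "query_text": "wireless earbuds",
  "filters": { "metadata.category": "electronics" }
}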

Vector Index Registration

When an extractor publishes an embedding, Mixpeek registers a vector index in MVS at the matching URI. Retrievers then query that exact index via feature_search. For built-in extractors this is automatic — the vector index is declared in the extractor’s definition. For custom extractors, it’s the single biggest pitfall: the features list in manifest.py must use the exact keys feature_type, feature_name, embedding_dim, distance_metric. Intuitive-but-wrong names (type, name, dimensions, distance) silently create a collection with zero vector indexes, and the ingestion task reports COMPLETED with 0 documents written.
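A minimal sketch of a correct features entry in manifest.py (only the four key names are confirmed here; every value is illustrative):
# manifest.py: the four key names below are required exactly as written;
# the values sketch a 1024-dim text embedding.
features = [
    {
        "feature_type": "embedding",       # not "type"
        "feature_name": "text_embedding",  # not "name"
        "embedding_dim": 1024,             # not "dimensions"
        "distance_metric": "cosine",       # not "distance"
    }
]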

What Happens at Runtime

For a single-extractor collection, the end-to-end path is:
  1. Flatten. API flattens the manifest into per-extractor row artifacts (Parquet) and stores them in S3.
  2. Schedule. Ray poller discovers pending batches and submits a job.
  3. Run. Workers load the dataset, run the extractor flow (GPU if available), and emit features plus passthrough fields.
  4. Write. QdrantBatchProcessor writes vectors and payloads to MVS, emits webhook events, and updates index signatures.
Every document records lineage metadata so you can trace any feature back to the object that produced it:
{
  "root_object_id": "obj_123",
  "source_collection_id": "col_source",
  "processing_tier": 1,
  "feature_address": "mixpeek://text_extractor@v1/text_embedding"
}
When extractors are chained across collections, the processing_tier field becomes load-bearing — see Multi-Tier Feature Extraction for how the DAG executes.

Performance and Scaling

  • GPU workers deliver 5–10× faster throughput for embeddings, reranking, and video processing. CPU-only extractors skip GPU allocation (saves ~3 minutes of cluster startup and ~6× on cost).
  • Ray Data handles batching, shuffling, and parallelization automatically. Default batch size is 64 rows; tune via the extractor’s compute profile (see the sketch after this list).
  • Autoscaling maintains target utilization (0.7 CPU, 0.8 GPU by default).
  • Inference cache short-circuits repeated model calls when inputs hash to the same key — handy when reprocessing or when ingesting near-duplicates.
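The compute profile schema is extractor-specific and not spelled out on this page; as a purely illustrative sketch (field names assumed, only the 64-row default and 0.8 GPU target come from the bullets above):
{
  "compute_profile": {
    "batch_size": 64,
    "gpu": true,
    "target_utilization": 0.8
  }
}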

Operational Checklist

  1. Pin versions. Upgrade deliberately via new collection + migration, never in-place.
  2. Batch uploads. Keep ingestion batches in the 1k–10k object range to maximize parallelism without overwhelming the scheduler.
  3. Use field_passthrough for every metadata filter. If you plan to filter by category at query time, pass it through at ingestion time.
  4. Inspect features before querying. GET /v1/collections/{id}/features returns the feature URIs, dimensions, and descriptions actually registered — use this to confirm the extractor ran correctly before building retrievers.
  5. Watch lineage, not just counts. A task reporting COMPLETED with 0 features written almost always means an input_mappings mistake or (for custom extractors) a bad features manifest. The Document Lineage API and the vector_indexes field on the collection are your diagnostic tools.
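For step 5, one quick check is to fetch the collection and inspect its vector_indexes field; the endpoint path here is an assumption, mirroring the features call shown earlier:
curl "https://api.mixpeek.com/v1/collections/{collection_id}" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY"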

Reference Catalog

The pages under Built-in Feature Extractors document every built-in extractor — input types, output schema, parameters, latency, and cost.
  • passthrough_extractor_v1: Copies source fields without processing. Use for passing metadata or vectors between collections. Input types: Any. Dimensions: N/A. Latency: <1ms. Cost: Free.
  • text_extractor_v1: Dense embeddings via E5-Large multilingual. Supports chunking and LLM extraction via response_shape. Input types: text, string. Dimensions: 1024. Latency: ~5ms/doc. Cost: Free.
  • multimodal_extractor_v1: Unified embeddings for video, image, text, and GIF via Google Vertex AI. Videos are decomposed by time, scene, or silence, with transcription, OCR, and thumbnails. Input types: video, image, text, string. Dimensions: 1408. Latency: 0.5–2× realtime. Cost: $0.01–0.15/min.
Need something the catalog doesn’t cover? Enterprise customers can bring their own models via zip or container image, running on the same Ray infrastructure with full GPU support, versioning, and observability.