Browse the extractor catalog on GitHub
Runnable reference for every built-in Mixpeek extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry, so it always matches production.
Availability
What You Can Build
Custom extractors plug into two places in the warehouse:
1. Feature Extractors (Decomposition)
Attach custom logic to a collection’s feature_extractor so every ingested object flows through your pipeline during decomposition. Use this to:
- Embed domain-specific content with your own model (fine-tuned CLIP, proprietary audio encoder, etc.)
- Extract structured attributes via a VLM you manage (brand compliance, regulated content classification)
- Transcribe, OCR, or segment media with a custom pre/post-processing chain
- Produce multiple named vector indexes from a single pass
Each vector index is addressable by a feature URI (for example, mixpeek://my_extractor@1.0.0/my_embedding), so retrievers, taxonomies, and clusters can reference them.
2. Retriever Operations (Query Time)
An extractor’s realtime.py exposes a Ray Serve HTTP endpoint that retriever stages can call during execution. Use this to power:
- feature_search: embed queries at search time with the same model you used during ingestion, so the query vector lives in the same space as the indexed vectors
- Inference operations: on-the-fly classification, scoring, or re-ranking against your model
- LLM calls: wrap a hosted or private LLM behind a stable contract, with platform-managed secrets and cost tracking
- Classifiers: apply your own classifier to candidate results mid-pipeline
Delivery Formats
Ship your extractor in either of two formats:
| Format | When to Use |
|---|---|
| Zip archive (.zip) | Pure-Python extractors. Mixpeek resolves dependencies against the managed runtime. Scanned by the security linter before deploy. Limit: 500 MB / 1,000 files. |
| Container image (OCI) | Extractors that need system packages, custom CUDA builds, compiled binaries, or non-Python runtimes. Base your image on the Mixpeek engine image, push to your org-scoped Artifact Registry repo, and set container_image in manifest.py. See BYO Container Image. |
Both formats follow the same extractor contract (batch __call__, real-time run_inference, platform LLM/secret accessors).
Extractor Structure
Every extractor has the same layout:
- manifest.py declares what your extractor accepts, produces, and which vector indexes to create
- pipeline.py wires your processor into the Ray Data batch pipeline
- realtime.py exposes a Ray Serve endpoint for query-time inference (e.g., embedding queries for feature_search)
- processors/ contains your actual logic: model loading, embedding, classification, etc.
Manifest
The manifest is your extractor’s contract with the platform.
Vector Index Keys
| Key | Required | Description |
|---|---|---|
| feature_type | Yes | Must be "embedding" |
| feature_name | Yes | Name of the vector index |
| embedding_dim | Yes | Vector dimensionality |
| distance_metric | Yes | "cosine", "euclid", or "dot" |
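A minimal sketch of a vector index declaration that uses the four keys above. The `features` name follows the troubleshooting note about `features` key names in manifest.py; the rest of the manifest is omitted. The values (`my_embedding`, 384) are placeholders, and `check_feature` is a hypothetical local helper for illustration, not part of the Mixpeek SDK. Treating `embedding_dim` as a positive integer is an assumption.

```python
# Sketch of a vector index declaration using the four required keys.
# "my_embedding" and 384 are illustrative placeholder values.
features = [
    {
        "feature_type": "embedding",
        "feature_name": "my_embedding",
        "embedding_dim": 384,
        "distance_metric": "cosine",
    }
]

REQUIRED_KEYS = {"feature_type", "feature_name", "embedding_dim", "distance_metric"}
ALLOWED_METRICS = {"cosine", "euclid", "dot"}


def check_feature(feature: dict) -> list:
    """Hypothetical local check; returns a list of problems (empty if none)."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - feature.keys())]
    if feature.get("feature_type") != "embedding":
        problems.append('feature_type must be "embedding"')
    if feature.get("distance_metric") not in ALLOWED_METRICS:
        problems.append("distance_metric must be cosine, euclid, or dot")
    dim = feature.get("embedding_dim")
    # Assumption: dimensionality is a positive integer.
    if not isinstance(dim, int) or dim <= 0:
        problems.append("embedding_dim must be a positive integer")
    return problems
```

A check like this mirrors the kind of mistake the lint command is described as catching, such as writing `name` where `feature_name` is expected.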
Inference Type
Declare what kind of real-time inference your extractor provides by setting inference_type in metadata. This lets retriever stages validate that an extractor is compatible with the stage slot.
| Value | Contract | Compatible Stages |
|---|---|---|
| embedding | Returns {vector: [float]} | feature_search |
| rerank | Accepts {pairs: [[q, d]]}, returns {scores: [float]} | rerank |
| classify | Accepts {text: str}, returns {labels: [{label, confidence}]} | classify |
| generate | Accepts {prompt: str}, returns {text: str} | llm_filter, llm_enrich |
| general | No specific contract | Raw /inference endpoint only |
If inference_type is omitted, the extractor can only be called via the raw inference endpoint.
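The compatibility table can be read as a simple lookup. The sketch below copies the stage names from the table; `can_fill_stage` is a hypothetical helper (not a platform API), and the rerank payload values are made up to show the request and response shapes.

```python
from typing import Optional

# Stage compatibility per inference_type, copied from the table above.
COMPATIBLE_STAGES = {
    "embedding": {"feature_search"},
    "rerank": {"rerank"},
    "classify": {"classify"},
    "generate": {"llm_filter", "llm_enrich"},
    "general": set(),  # raw /inference endpoint only
}


def can_fill_stage(inference_type: Optional[str], stage: str) -> bool:
    """Hypothetical check: may an extractor with this inference_type back this stage?"""
    # An omitted inference_type is treated like "general": raw endpoint only.
    return stage in COMPATIBLE_STAGES.get(inference_type or "general", set())


# Example payload shapes for the "rerank" contract (values are invented).
rerank_request = {"pairs": [["red shoes", "crimson sneakers"], ["red shoes", "blue hat"]]}
rerank_response = {"scores": [0.91, 0.12]}  # one score per [query, document] pair
```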
Compute Profile
Control resource allocation by adding compute_profile to your manifest:
BYO Container Image
If your extractor needs native binaries or system packages, specify a custom container image:
Batch Processor
Your processor receives a pandas DataFrame and returns it with new columns added.
DataFrame Columns
Your __call__ receives these columns:
| Column | Description |
|---|---|
| data | For text blobs: the raw string. For binary blobs: an S3 URL (not raw bytes). |
| document_id | Unique document ID |
| object_id | Source object in the bucket |
| blob_type | "image", "video", "audio", "text" |
| blob_property | Property name from your bucket schema |
| mime_type | MIME type (e.g. image/jpeg) |
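The column contract above can be sketched as a small processor that reads the `data` column and adds one embedding column for the whole batch at once. This is an illustration under stated assumptions: `HashEmbedder` and its token-hashing "model" are invented placeholders so the example runs without weights, and `my_embedding` is a placeholder column name.

```python
import numpy as np
import pandas as pd


class HashEmbedder:
    """Toy processor: adds a `my_embedding` column in one batched pass."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def _embed_all(self, texts: list) -> np.ndarray:
        # Stand-in for ONE batched model call over all texts. The loop below
        # only simulates what a model does internally; it is not a per-row
        # model or API call.
        vectors = np.zeros((len(texts), self.dim), dtype=np.float32)
        for row, text in enumerate(texts):
            for token in text.lower().split():
                vectors[row, sum(token.encode()) % self.dim] += 1.0
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors / np.maximum(norms, 1e-12)  # unit-length rows

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        # Text rows carry the raw string in `data`; binary rows carry an S3 URL.
        texts = batch["data"].astype(str).tolist()
        batch["my_embedding"] = list(self._embed_all(texts))
        return batch
```

Reading from `batch["data"]` rather than a property-named column matches the troubleshooting guidance later on this page.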
Batched Processing
Process all rows together; never call a model or API inside a per-row loop:
Loading Assets from S3
For binary blobs (images, video, audio), the data column contains S3 URLs. Use the Extractor SDK to download them:
Pipeline
Wire your processor into the Ray Data pipeline:
Row Conditions
Filter which rows a step processes:
| Condition | Matches |
|---|---|
| RowCondition.IS_TEXT | text/* MIME types |
| RowCondition.IS_IMAGE | image/* MIME types |
| RowCondition.IS_VIDEO | video/* MIME types |
| RowCondition.IS_AUDIO | audio/* MIME types |
| RowCondition.IS_PDF | application/pdf |
| RowCondition.ALWAYS | All rows (default) |
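To make the matching rules concrete, here is a small stand-in that reproduces the table's MIME-type logic. `Condition` and `matches` are illustrative names, not the SDK's `RowCondition` implementation, and lowercasing the MIME type before comparing is an assumption.

```python
from enum import Enum


class Condition(Enum):
    """Illustrative stand-in for the RowCondition values in the table above."""
    IS_TEXT = "text/"
    IS_IMAGE = "image/"
    IS_VIDEO = "video/"
    IS_AUDIO = "audio/"
    IS_PDF = "application/pdf"
    ALWAYS = ""


def matches(condition: Condition, mime_type: str) -> bool:
    """Return True if a row with this MIME type would pass the condition."""
    mime = mime_type.lower()  # assumption: matching is case-insensitive
    if condition is Condition.ALWAYS:
        return True
    if condition is Condition.IS_PDF:
        return mime == condition.value  # exact match, not a prefix
    return mime.startswith(condition.value)
```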
Using Built-in Models
Compose existing Mixpeek services instead of loading models yourself:
| Service | Type | Dimensions |
|---|---|---|
| intfloat/multilingual-e5-large-instruct | Embedding | 1024 |
| google/siglip-base-patch16-224 | Embedding | 512 |
| jinaai/jina-embeddings-v2-base-code | Embedding | 768 |
| BAAI/bge-reranker-v2-m3 | Reranker | n/a |
| openai/whisper-large-v3-turbo | Transcription | n/a |
Real-time Endpoint
Add realtime.py to expose an HTTP endpoint for query-time inference. This is what lets retriever feature_search stages embed queries with your model.
realtime.py handles embedding only, not retrieval. If your pipeline needs retrieval context (comparing against stored references), configure retriever stages to handle that logic. The real-time endpoint serves on dedicated infrastructure (see Availability).
Platform Services
LLM Access
Use container.llm to call platform-managed LLMs with built-in cost tracking and caching:
| Provider | Models |
|---|---|
| google | gemini-2.5-flash, gemini-2.5-pro |
| openai | gpt-4o, gpt-4o-mini |
| anthropic | claude-sonnet-4-20250514 |
Secrets
Access encrypted org secrets at runtime via container.secrets:
To call LLMs, use container.llm instead; it handles API keys automatically.
CLI Tools
Custom extractors can’t import subprocess directly. Use run_tool for whitelisted CLI tools:
Whitelisted tools: ffmpeg, ffprobe, convert, identify, magick, exiftool, mediainfo, sox, soxi, REDline, art-cmd
Pre-installed Tools
The engine runtime includes these media tools, available via run_tool:
| Tool | Format | Description |
|---|---|---|
| ffmpeg / ffprobe | Standard video/audio | Transcode, extract frames, probe metadata |
| REDline | RED R3D | Decode RED cinema camera raw files to ProRes/DPX/EXR |
| art-cmd | ARRI RAW | Decode ARRI raw (.ari/.arriraw/.arx) to ProRes |
| exiftool | All media | Read/write EXIF and XMP metadata |
| mediainfo | All media | Detailed format and codec inspection |
| convert / identify | Images | ImageMagick image processing |
| sox / soxi | Audio | Audio processing and info |
Typed SDK
The Extractor SDK provides typed base classes that replace bare Python variables with validated, IDE-friendly types: autocompletion, validation at upload time, and output column checking in the test harness. Import from shared.extractors.sdk or shared.extractors.
setup() runs once, on the first batch (lazy model loading); process() runs on every batch. prefetch_hf_model() pre-downloads the model to the HF cache to reduce cold-start time.
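The setup()/process() lifecycle described above can be illustrated with a small stand-in. `LazyProcessor` is not the SDK's BatchProcessor; it is a hypothetical base class that only demonstrates the "setup once on the first batch, process on every batch" behavior, and it uses plain lists instead of DataFrames to stay self-contained.

```python
class LazyProcessor:
    """Stand-in base class showing the setup()/process() lifecycle."""

    def __init__(self):
        self._ready = False

    def setup(self) -> None:
        """Load models or other heavy state. Called once, lazily."""

    def process(self, batch):
        raise NotImplementedError

    def __call__(self, batch):
        if not self._ready:  # the first batch triggers setup()
            self.setup()
            self._ready = True
        return self.process(batch)


class UppercaseProcessor(LazyProcessor):
    def __init__(self):
        super().__init__()
        self.setup_calls = 0

    def setup(self) -> None:
        # A real extractor would load model weights here.
        self.setup_calls += 1
        self.suffix = "!"

    def process(self, batch):
        return [text.upper() + self.suffix for text in batch]
```

Deferring the expensive work to the first call keeps construction cheap, which matters when a worker is created well before any batch reaches it.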
SDK Reference
| Function / Class | Purpose |
|---|---|
| ExtractorManifest | Typed manifest with validation |
| Feature.embedding(name, dim) | Create an embedding feature definition |
| Feature.classification(name, labels) | Create a classification feature definition |
| BatchProcessor | Base class with setup() / process() lifecycle |
| InferenceService | Base class with setup() / infer() lifecycle |
| prefetch_hf_model(model_id) | Pre-download HF model to cache (cold start mitigation) |
| parallel_io(items, fn, max_workers) | Parallel file downloads and I/O |
| concurrent_api_calls(items, async_fn, max_concurrent) | Concurrent LLM/API calls |
| open_asset(url, suffix) | Context manager for S3 downloads |
| download_asset(url, suffix) | Manual S3 download with cleanup flag |
| run_tool(tool, args, timeout) | Execute whitelisted CLI tools |
| upload_asset(path, namespace_id, internal_id, resource_id) | Upload processed files back to S3 |
Security Rules
Custom extractors are scanned before deployment. Code violating these rules is rejected.
- Allowed: numpy, pandas, torch, transformers, sentence_transformers, onnxruntime, PIL, cv2, requests, httpx, os (safe functions only), json, re, pydantic, logging, getattr, hasattr
- Blocked: subprocess, os.system, os.popen, os.exec*, eval, exec, ctypes, socket, multiprocessing, open, setattr, delattr
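For a rough local pre-check, the blocked list can be approximated with Python's `ast` module. This is an illustrative sketch only: it is not the Mixpeek scanner, its rules are an assumption derived from the list above, and the real linter may accept or reject code differently.

```python
import ast

BLOCKED_IMPORTS = {"subprocess", "ctypes", "socket", "multiprocessing"}
BLOCKED_CALLS = {"eval", "exec", "open", "setattr", "delattr"}
BLOCKED_OS_PREFIXES = ("system", "popen", "exec")  # os.system, os.popen, os.exec*


def find_violations(source: str) -> list:
    """Return human-readable violations found in one source file."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] in BLOCKED_IMPORTS:
                violations.append(f"line {node.lineno}: import of {name}")
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in BLOCKED_CALLS:
                violations.append(f"line {node.lineno}: call to {func.id}()")
            elif (
                isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id == "os"
                and func.attr.startswith(BLOCKED_OS_PREFIXES)
            ):
                violations.append(f"line {node.lineno}: call to os.{func.attr}()")
    return violations
```

Because the check inspects only the source it is given, calls made inside third-party libraries are not flagged.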
import os is allowed; only dangerous functions are blocked. Library-internal file I/O (torch.load, transformers.from_pretrained, pd.read_csv) is fine since the scanner only inspects your extractor’s source code.
Local Development
Validate and test your extractor locally with the CLI before uploading. No API key is needed: these commands run fully offline and work on any plan.
lint catches common mistakes before upload:
- Wrong feature key names (name instead of feature_name)
- Missing required fields
- Security scanner violations
test runs your processor through real Ray Data map_batches with Arrow serialization — the same path used in production. Sample rows are fed in the data column (matching production), and the harness validates that your output columns match the manifest features.
Version Management
On a dedicated deployment, the same CLI manages deployed versions with a git-like workflow:
| Command | Description |
|---|---|
| pull | Download the active version’s source files to a local directory |
| push | Zip, upload, and confirm a new version (auto-bumps patch version if omitted) |
| log | Show version history with deploy timestamps and commit messages |
| status | Show active version, extractor ID, and deployment status |
| rollback | Restore a previous version as active |
| diff | Compare source files between two versions |
The CLI reads MIXPEEK_API_KEY, MIXPEEK_NAMESPACE, and MIXPEEK_API_URL from the environment (or --api-key / --namespace).
Archive Limits
| Limit | Value |
|---|---|
| Upload size | 500 MB |
| Max files | 1,000 |
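A quick pre-upload check against these limits might look like the sketch below. `check_archive` is a hypothetical helper, not part of the CLI. It assumes "500 MB" means 500 × 1024 × 1024 bytes, that the limit applies to the zipped file, and that directory entries do not count toward the file limit; the platform's exact accounting may differ.

```python
import os
import zipfile

MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # assumption: "500 MB" read as 500 MiB
MAX_FILES = 1_000


def check_archive(path: str) -> list:
    """Return a list of limit violations for a zip archive (empty if it fits)."""
    problems = []
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        problems.append(f"archive is {size} bytes, limit is {MAX_UPLOAD_BYTES}")
    with zipfile.ZipFile(path) as archive:
        # Count file entries only; directory entries are skipped.
        file_count = sum(1 for info in archive.infolist() if not info.is_dir())
    if file_count > MAX_FILES:
        problems.append(f"archive has {file_count} files, limit is {MAX_FILES}")
    return problems
```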
End-to-End: Extractor → Collection → Retriever
This walkthrough connects all the pieces. Set MIXPEEK_API_KEY, MIXPEEK_NAMESPACE, and MIXPEEK_API_URL first; these are the same variables the plugins.py CLI reads.
Steps 1 and 5 (the extractor upload/deploy) require a dedicated deployment. On the shared API, use a built-in extractor (e.g. text_extractor) for the feature_extractor in step 3 and skip steps 1 and 5.
1. Deploy the Extractor (dedicated infra)
The simplest path is the CLI: python server/scripts/api/plugins.py push, then deploy. The equivalent raw HTTP (custom extractors are addressed as /plugins on this surface; see the Custom Extractor API reference):
2. Create a Bucket and Upload Data
Buckets, collections, retrievers, and batches are top-level resources addressed by the X-Namespace header, not nested under /namespaces/{ns}/. (Only extractors and models are path-scoped.)
3. Create a Collection with the Extractor
Every object uploaded to the bucket flows through this extractor’s batch processor.
4. Process the Data (two-step batch)
A batch is first created (with the objects and target collections) and then submitted:
5. Build a Retriever and Search
At query time the retriever calls your extractor’s realtime.py (dedicated infra) to embed the query, then searches the vectors your batch processor produced.
Troubleshooting
Upload/deploy endpoints return 404 or 405
The upload/deploy/realtime lifecycle is only available on a dedicated deployment; the shared api.mixpeek.com exposes only GET list/details for extractors. Develop, lint, and test locally, then either provision a dedicated deployment or ship via Submissions.
Task COMPLETED but 0 documents
Usually caused by wrong features key names in manifest.py. Run python server/scripts/api/plugins.py lint path/to/my_extractor to validate. Check that your collection has non-empty vector_indexes via GET /v1/collections/{id}. See Vector Index Keys.
Batch produces 0 documents
Most common cause: reading from the wrong column. Always use batch["data"], not batch["text"] or other property names. Check Ray logs for [FailureAggregator] entries.
Extractor validation failed
Check validation_errors. Common issues: using subprocess (use run_tool), using open() directly (use library I/O), using eval/exec (use json.loads).
Model loading is slow
Use prefetch_hf_model() in your setup() method to pre-download models to the HF cache. On GKE, HF_HOME points to a shared PVC so models persist across pod restarts. The first cold start downloads the model (~1-2 min); subsequent starts use the cache.
First run is slow / first search returns 0 (cold start)
On a cold engine, the embedding model loads on demand: the first batch can take several minutes to leave PROCESSING, and the first retriever execute afterward may return 0 results (status completed or degraded) while the query-side model warms. This is a cold-start artifact, not a real no-match: retry after a few seconds and results appear. A warm namespace responds immediately. Keep a namespace warm by issuing a periodic lightweight query, or contact your account team about a warm replica floor for latency-sensitive workloads.
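The retry advice above can be wrapped in a small helper. This is a sketch under assumptions: `execute` is a hypothetical zero-argument callable that runs your retriever and returns its result list, and the attempt count and delays are arbitrary starting points rather than documented values.

```python
import time
from typing import Callable


def execute_with_warmup_retry(
    execute: Callable[[], list],
    attempts: int = 4,
    initial_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Call `execute` until it returns results or attempts run out."""
    delay = initial_delay
    results: list = []
    for attempt in range(attempts):
        results = execute()
        if results:
            return results
        if attempt < attempts - 1:
            sleep(delay)  # give the query-side model time to warm up
            delay *= 2    # exponential backoff between retries
    return results  # still empty after all attempts: likely a genuine no-match
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.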
feature_uri without realtime.py
If your manifest includes a feature_uri, the system expects a corresponding realtime.py. Without it, feature_search queries against that URI will fail. Omit both if you only need batch processing.
Next Steps
Quickstart
Build, test, and query a minimal text embedding extractor end-to-end.
Model Registry
Load HuggingFace models or your own fine-tuned weights inside an extractor.
Extractor Submissions
Submit your extractor for review to be merged into the built-in catalog.
Multi-Tier Extraction
Chain collections into a DAG — transcribe, then embed, then classify.
Reprocess Existing Content
Run a new extractor over an already-ingested corpus — scoped, cost-safe, priced before you run.

