Custom Extractors

Figure: Custom extractor flow. A single extractor archive powers both the ingestion pipeline (batch Ray Data) and the real-time retrieval endpoint (Ray Serve HTTP), which is used by retriever stages such as feature_search, rerank, and agentic_enrich.

Enterprise only. Custom extractors are gated to Enterprise plans. Batch-only deployments and real-time inference endpoints both require Enterprise infrastructure. Contact your account team to enable extractor uploads for your organization.

Custom extractors let you run your own code on Mixpeek infrastructure — inside the same Ray cluster that powers the built-in extractors and retriever stages. You keep full control of the logic, model, and I/O; Mixpeek handles packaging, scheduling, GPU allocation, caching, and observability.

You can also publish your extractors to the Extractor Marketplace for other organizations to discover and install, or install community extractors built by others.

Delivery Formats

Ship your extractor in either of two formats:

Format	When to Use
Zip archive (`.zip`)	Pure-Python extractors. Mixpeek resolves dependencies against the managed runtime. The archive is uploaded via presigned URL and scanned by the security linter before deploy. Limit: 500 MB / 1,000 files.
Container image (OCI)	Extractors that need system packages, custom CUDA builds, compiled binaries, or non-Python runtimes. Base your image on the Mixpeek engine image, push to your org-scoped Artifact Registry repo (`plugins-<org-id>/`), and set `container_image` in `manifest.py`. See BYO Container Image.

Both formats go through the same upload → confirm → deploy lifecycle and expose the same runtime APIs (batch __call__, real-time run_inference, platform LLM/secret accessors).
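
Before uploading a zip, it can help to check it against the limits in the table above. The following is a minimal sketch, not part of the Mixpeek SDK: `check_archive` and `build_archive` are hypothetical helpers, and it assumes that "500 MB" means 500 * 1024 * 1024 bytes of archive size and that "1,000 files" counts file entries (not directories). The platform may measure either limit differently.

```python
import io
import zipfile

# Assumed interpretation of the documented limits (see lead-in).
MAX_ARCHIVE_BYTES = 500 * 1024 * 1024
MAX_FILES = 1_000


def check_archive(data: bytes) -> list[str]:
    """Return a list of limit violations for a zip archive held in memory."""
    problems = []
    if len(data) > MAX_ARCHIVE_BYTES:
        problems.append(f"archive is {len(data)} bytes, limit is {MAX_ARCHIVE_BYTES}")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # Directory entries are skipped so only real files are counted.
        files = [info for info in zf.infolist() if not info.is_dir()]
    if len(files) > MAX_FILES:
        problems.append(f"archive has {len(files)} files, limit is {MAX_FILES}")
    return problems


def build_archive(files: dict[str, str]) -> bytes:
    """Zip a mapping of relative path -> file text into bytes (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


if __name__ == "__main__":
    archive = build_archive({"manifest.py": "x = 1", "realtime.py": "y = 2"})
    print(check_archive(archive) or "within limits")
```

The server-side security linter remains the authority; a local check like this only catches the obvious size problems before an upload is attempted.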

What You Can Build

Custom extractors plug into two places in the warehouse:

1. Feature Extractors (Decomposition)

Attach custom logic to a collection’s feature_extractor so every ingested object flows through your pipeline during decomposition. Use this to:

Embed domain-specific content with your own model (fine-tuned CLIP, proprietary audio encoder, etc.)
Extract structured attributes via a VLM you manage (brand compliance, regulated content classification)
Transcribe, OCR, or segment media with a custom pre/post-processing chain
Produce multiple named vector indexes from a single pass

Outputs land in MVS and MongoDB with the same feature URI scheme as built-in extractors (mixpeek://my_extractor@1.0.0/my_embedding), so retrievers, taxonomies, and clusters can reference them.
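
The feature URI in the example above can be split into its extractor name, version, and feature name. This sketch infers the pattern from that single example, so the real grammar may allow more than this regular expression accepts; `FeatureURI` and `parse_feature_uri` are hypothetical names, not SDK functions.

```python
import re
from typing import NamedTuple


class FeatureURI(NamedTuple):
    extractor: str
    version: str
    feature: str


# Pattern inferred from mixpeek://my_extractor@1.0.0/my_embedding only.
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<feature>.+)$"
)


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into (extractor, version, feature)."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(**match.groupdict())


if __name__ == "__main__":
    print(parse_feature_uri("mixpeek://my_extractor@1.0.0/my_embedding"))
```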

2. Retriever Operations (Query Time)

An extractor’s realtime.py exposes a Ray Serve HTTP endpoint that retriever stages can call during execution. Use this to power:

feature_search — embed queries at search time with the same model you used during ingestion, so the query vector lives in the same space as the indexed vectors
Inference operations — on-the-fly classification, scoring, or re-ranking against your model
LLM calls — wrap a hosted or private LLM behind a stable contract, with platform-managed secrets and cost tracking
Classifiers — apply your own classifier to candidate results mid-pipeline

This is what lets a single custom extractor own both halves of a retrieval flow: it encodes documents on the way in, and encodes the query on the way out.
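
As a rough illustration of "both halves", the sketch below routes the batch entry point and the real-time entry point through one shared encoder. Only the method names `__call__` and `run_inference` come from this page. The batch shape (a dict of columns), the `text` input key, the output keys, and the toy hash-based `embed` function are all assumptions standing in for a real model and the actual SDK contract.

```python
import hashlib
import math


def embed(text: str, dim: int = 8) -> list[float]:
    """Toy deterministic embedding (stand-in for a real model), unit length."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [byte - 127.5 for byte in digest[:dim]]  # never all zeros
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


class MyExtractor:
    """Hypothetical extractor: one encoder, two entry points."""

    def __call__(self, batch: dict[str, list]) -> dict[str, list]:
        # Batch path (ingestion): add an embedding column to the batch.
        batch["my_embedding"] = [embed(text) for text in batch["text"]]
        return batch

    def run_inference(self, inputs: dict) -> dict:
        # Real-time path (query time): encode one query with the same model.
        return {"embedding": embed(inputs["text"])}


if __name__ == "__main__":
    extractor = MyExtractor()
    docs = extractor({"text": ["red shoes", "blue hat"]})
    query = extractor.run_inference({"text": "red shoes"})
    # Same text, same encoder: the query lands exactly on the indexed vector.
    print(query["embedding"] == docs["my_embedding"][0])
```

Because both paths call the same `embed`, a query and a document with identical content produce identical vectors, which is the property that keeps feature_search results meaningful.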

Version Management

Extractors support a git-like workflow for iterating on deployed code:

Command	Description
`lint`	Validate manifest and run security scanner locally (no API key needed)
`test`	Run pipeline through Ray Data test harness locally (no API key needed)
`pull`	Download the active version’s source files to a local directory
`push`	Zip, upload, and confirm a new version (auto-bumps patch version if omitted)
`log`	Show version history with deploy timestamps and commit messages
`status`	Show active version, extractor ID, and security scan status
`rollback`	Restore a previous version as active
`diff`	Compare source files between two versions
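
The `push` row mentions an automatic patch bump when no version is given. A minimal sketch of what such a bump could look like, assuming plain MAJOR.MINOR.PATCH version strings; `bump_patch` is a hypothetical helper, and the CLI's actual logic is not shown on this page.

```python
def bump_patch(version: str) -> str:
    """Increment the last component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


if __name__ == "__main__":
    print(bump_patch("1.0.0"))
```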

Use the CLI at server/scripts/api/plugins.py or call the version management endpoints directly:

GET /v1/namespaces/{ns}/extractors/by-name/{slug}/versions — version history
POST /v1/namespaces/{ns}/extractors/by-name/{slug}/rollback — rollback
GET /v1/namespaces/{ns}/extractors/by-name/{slug}/diff?v1=1.0.0&v2=1.0.1 — diff
GET /v1/namespaces/{ns}/extractors/{id}/source — download source files
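
The by-name endpoints above share one path layout, so the URLs can be built with a small helper. In this sketch `BASE_URL` is a placeholder rather than the real API host, the bearer-token header is an assumed authentication scheme, and `extractor_url` and `get_request` are hypothetical names. The rollback request body is not documented on this page, so only the GET endpoints are covered.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com"  # placeholder host, not the real API


def extractor_url(ns: str, slug: str, action: str, **params: str) -> str:
    """Build a by-name extractor URL such as .../{slug}/versions or .../diff."""
    path = (
        f"/v1/namespaces/{quote(ns, safe='')}"
        f"/extractors/by-name/{quote(slug, safe='')}/{action}"
    )
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE_URL}{path}{query}"


def get_request(url: str, api_key: str) -> Request:
    """Prepare (but do not send) a GET request; the auth header is assumed."""
    return Request(url, headers={"Authorization": f"Bearer {api_key}"}, method="GET")


if __name__ == "__main__":
    print(extractor_url("my_ns", "my_extractor", "versions"))
    print(extractor_url("my_ns", "my_extractor", "diff", v1="1.0.0", v2="1.0.1"))
```

Passing the prepared `Request` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access or credentials.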

Pre-installed Tools

The engine runtime includes media processing tools available via run_tool from the Extractor SDK:

Tool	Format	Description
`ffmpeg` / `ffprobe`	Standard video/audio	Transcode, extract frames, probe metadata
`REDline`	RED R3D	Decode RED cinema camera raw files to ProRes/DPX/EXR
`art-cmd`	ARRI RAW	Decode ARRI raw (.ari/.arriraw/.arx) to ProRes
`exiftool`	All media	Read/write EXIF and XMP metadata
`mediainfo`	All media	Detailed format and codec inspection
`convert` / `identify`	Images	ImageMagick image processing
`sox` / `soxi`	Audio	Audio processing and info

See Cinema Camera Raw File Support for usage examples.
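
As an illustration of working with one of these tools, the sketch below builds an `ffprobe` argument list and summarizes its JSON output. The signature of `run_tool` is not shown on this page, so the sketch stops short of calling it: how the arguments are handed to `run_tool` is left open, and `ffprobe_args` and `summarize_probe` are hypothetical helpers. The `ffprobe` flags themselves are standard.

```python
import json


def ffprobe_args(path: str) -> list[str]:
    """Arguments asking ffprobe for container and stream metadata as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-print_format", "json",
        "-show_format", "-show_streams",
        path,
    ]


def summarize_probe(stdout: str) -> dict:
    """Pull duration and codec names out of ffprobe's JSON output."""
    info = json.loads(stdout)
    duration = info.get("format", {}).get("duration")  # ffprobe emits a string
    return {
        "duration_s": float(duration) if duration is not None else None,
        "codecs": [s.get("codec_name") for s in info.get("streams", [])],
    }


if __name__ == "__main__":
    sample = '{"streams": [{"codec_name": "h264"}], "format": {"duration": "12.5"}}'
    print(ffprobe_args("clip.mp4"))
    print(summarize_probe(sample))
```

Outside the engine runtime, the same argument list can be passed to `subprocess.run(args, capture_output=True, text=True)` to test the parsing locally.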

Next Steps

Quickstart

Build, upload, and deploy a minimal text embedding extractor end-to-end.

Extractor Developer Guide

Full reference: manifest format, DataFrame schema, model loading, security rules, batch optimization, examples.

Upload & Deploy API

Presigned uploads, confirm, deploy, status, undeploy, delete.

Realtime Inference API

Call a deployed extractor’s realtime.py endpoint for retriever operations.
