# Custom Extractors

> Bring your own container image or zip archive to run feature extractors and retriever operations on Mixpeek infrastructure

<Card title="Browse the extractor catalog on GitHub" icon="github" href="https://github.com/mixpeek/mixpeek-extractors" horizontal>
  Runnable reference for every built-in Mixpeek extractor — inputs, parameters, output fields, embedding models, and copy-paste examples. Auto-generated from the live registry, so it always matches production.
</Card>

<Frame>
  <img src="https://mintcdn.com/mixpeek/JZ8e-UKMKuc4FtNv/assets/mixpeek-custom-plugins.svg?fit=max&auto=format&n=JZ8e-UKMKuc4FtNv&q=85&s=1f1d4f9fe0d081825609873010e514fc" alt="Custom extractor flow: a single extractor archive powers both the ingestion pipeline (batch Ray Data) and the retrieval realtime endpoint (Ray Serve HTTP), used by retriever stages like feature_search, rerank, and agentic_enrich" width="1100" height="560" data-path="assets/mixpeek-custom-plugins.svg" />
</Frame>

<Warning>
  **Enterprise only.** Custom extractors are gated to Enterprise plans. Batch-only deployments and real-time inference endpoints both require Enterprise infrastructure. [Contact your account team](https://mixpeek.com/contact) to enable extractor uploads for your organization.
</Warning>

Custom extractors let you run your own code on Mixpeek infrastructure — inside the same Ray cluster that powers the built-in extractors and retriever stages. You keep full control of the logic, model, and I/O; Mixpeek handles packaging, scheduling, GPU allocation, caching, and observability.

<Tip>
  Built something broadly useful? You can submit it for review to be merged into the built-in extractor catalog — see [Extractor Submissions](/processing/extractor-marketplace).
</Tip>

## Delivery Formats

Ship your extractor in either of two formats:

| Format                    | When to Use                                                                                                                                                                                                                                                                                                                                               |
| ------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Zip archive** (`.zip`)  | Pure-Python extractors. Mixpeek resolves dependencies against the managed runtime. Archives are uploaded via presigned URL and scanned by the security linter before deployment. Limit: 500 MB / 1,000 files.                                                                                                                                   |
| **Container image** (OCI) | Extractors that need system packages, custom CUDA builds, compiled binaries, or non-Python runtimes. Base your image on the Mixpeek engine image, push to your org-scoped Artifact Registry repo (`plugins-<org-id>/`), and set `container_image` in `manifest.py`. See [BYO Container Image](/processing/extractor-developer-guide#byo-container-image). |

Both formats go through the same upload → confirm → deploy lifecycle and expose the same runtime APIs (batch `__call__`, real-time `run_inference`, platform LLM/secret accessors).

## What You Can Build

Custom extractors plug into two places in the warehouse:

### 1. Feature Extractors (Decomposition)

Attach custom logic to a collection's `feature_extractor` so every ingested object flows through your pipeline during decomposition. Use this to:

* Embed domain-specific content with your own model (fine-tuned CLIP, proprietary audio encoder, etc.)
* Extract structured attributes via a VLM you manage (brand compliance, regulated content classification)
* Transcribe, OCR, or segment media with a custom pre/post-processing chain
* Produce multiple named vector indexes from a single pass

Outputs land in MVS and MongoDB with the same feature URI scheme as built-in extractors (`mixpeek://my_extractor@1.0.0/my_embedding`), so retrievers, taxonomies, and clusters can reference them.
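
As a rough illustration of how the feature URI above breaks down, the sketch below splits it into extractor name, version, and feature name. The `FeatureURI` type and `parse_feature_uri` helper are hypothetical names invented for this example, not part of the Mixpeek SDK, and the grammar is inferred only from the single example URI shown above.

```python
import re
from typing import NamedTuple


class FeatureURI(NamedTuple):
    """Hypothetical container for the three parts of a feature URI."""
    extractor: str
    version: str
    feature: str


# Assumed shape: mixpeek://<extractor>@<version>/<feature>
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<feature>.+)$"
)


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into its parts, or raise ValueError if it does not match."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(match["extractor"], match["version"], match["feature"])


parsed = parse_feature_uri("mixpeek://my_extractor@1.0.0/my_embedding")
print(parsed.extractor, parsed.version, parsed.feature)
```

Because the version is part of the URI, two versions of the same extractor would address distinct features under this reading.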

### 2. Retriever Operations (Query Time)

An extractor's `realtime.py` exposes a Ray Serve HTTP endpoint that retriever stages can call during execution. Use this to power:

* **`feature_search`** — embed queries at search time with the same model you used during ingestion, so the query vector lives in the same space as the indexed vectors
* **Inference operations** — on-the-fly classification, scoring, or re-ranking against your model
* **LLM calls** — wrap a hosted or private LLM behind a stable contract, with platform-managed secrets and cost tracking
* **Classifiers** — apply your own classifier to candidate results mid-pipeline

This is what lets a single custom extractor own both halves of a retrieval flow: it encodes documents at ingestion and encodes queries at search time, so both land in the same vector space.
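
The idea of one extractor serving both ingestion and query time can be sketched with a toy class. This is a minimal illustration under stated assumptions, not the real contract: only the method names `__call__` and `run_inference` come from this page, while the argument shapes (a list of dicts for the batch, a dict for the realtime input), the `ToyExtractor` class, and the hash-based `toy_embed` function are all hypothetical. The actual signatures and DataFrame schema are defined in the Extractor Developer Guide.

```python
import hashlib
import math
from typing import Dict, List


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Deterministic stand-in for a real model: hash tokens into a unit vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class ToyExtractor:
    """Hypothetical extractor: both paths share one embedding function."""

    def __call__(self, batch: List[Dict[str, str]]) -> List[Dict[str, object]]:
        # Batch (ingestion) path: add an embedding to every row.
        return [{**row, "my_embedding": toy_embed(row["text"])} for row in batch]

    def run_inference(self, inputs: Dict[str, str]) -> Dict[str, List[float]]:
        # Realtime (query) path: embed a single query with the same function.
        return {"embedding": toy_embed(inputs["text"])}


extractor = ToyExtractor()
docs = extractor([{"text": "red camera footage"}])
query = extractor.run_inference({"text": "red camera footage"})
print(docs[0]["my_embedding"] == query["embedding"])
```

Routing both paths through the same embedding function is what keeps query vectors comparable with indexed vectors; swapping the model on only one side would break that.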

## Version Management

Extractors support a git-like workflow for iterating on deployed code:

| Command    | Description                                                                  |
| ---------- | ---------------------------------------------------------------------------- |
| `lint`     | Validate manifest and run security scanner locally (no API key needed)       |
| `test`     | Run pipeline through Ray Data test harness locally (no API key needed)       |
| `pull`     | Download the active version's source files to a local directory              |
| `push`     | Zip, upload, and confirm a new version (auto-bumps patch version if omitted) |
| `log`      | Show version history with deploy timestamps and commit messages              |
| `status`   | Show active version, extractor ID, and security scan status                  |
| `rollback` | Restore a previous version as active                                         |
| `diff`     | Compare source files between two versions                                    |

Use the CLI at `server/scripts/api/plugins.py` or call the version management endpoints directly:

* `GET /v1/namespaces/{ns}/extractors/by-name/{slug}/versions` — version history
* `POST /v1/namespaces/{ns}/extractors/by-name/{slug}/rollback` — rollback
* `GET /v1/namespaces/{ns}/extractors/by-name/{slug}/diff?v1=1.0.0&v2=1.0.1` — diff
* `GET /v1/namespaces/{ns}/extractors/{id}/source` — download source files
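
If you call the endpoints directly, building the URLs might look like the sketch below. The paths and the `v1`/`v2` query parameters are taken from the list above; the `BASE_URL` value and the helper names are assumptions for illustration, and authentication headers are omitted because they are not described on this page.

```python
from urllib.parse import quote, urlencode

# Assumption: illustrative API host; substitute the one for your deployment.
BASE_URL = "https://api.mixpeek.com"


def _extractor_path(namespace: str, slug: str) -> str:
    """Shared prefix for the by-name version management endpoints."""
    ns = quote(namespace, safe="")
    name = quote(slug, safe="")
    return f"/v1/namespaces/{ns}/extractors/by-name/{name}"


def versions_url(namespace: str, slug: str, base_url: str = BASE_URL) -> str:
    """URL for the version history endpoint (GET)."""
    return f"{base_url}{_extractor_path(namespace, slug)}/versions"


def diff_url(namespace: str, slug: str, v1: str, v2: str, base_url: str = BASE_URL) -> str:
    """URL for comparing source files between two versions (GET)."""
    query = urlencode({"v1": v1, "v2": v2})
    return f"{base_url}{_extractor_path(namespace, slug)}/diff?{query}"


print(versions_url("my_ns", "my_extractor"))
print(diff_url("my_ns", "my_extractor", "1.0.0", "1.0.1"))
```

The resulting strings can be passed to any HTTP client along with your organization's credentials.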

## Pre-installed Tools

The engine runtime includes media processing tools available via `run_tool` from the Extractor SDK:

| Tool                   | Format               | Description                                          |
| ---------------------- | -------------------- | ---------------------------------------------------- |
| `ffmpeg` / `ffprobe`   | Standard video/audio | Transcode, extract frames, probe metadata            |
| `REDline`              | RED R3D              | Decode RED cinema camera raw files to ProRes/DPX/EXR |
| `art-cmd`              | ARRI RAW             | Decode ARRI raw (.ari/.arriraw/.arx) to ProRes       |
| `exiftool`             | All media            | Read/write EXIF and XMP metadata                     |
| `mediainfo`            | All media            | Detailed format and codec inspection                 |
| `convert` / `identify` | Images               | ImageMagick image processing                         |
| `sox` / `soxi`         | Audio                | Audio processing and info                            |

See [Cinema Camera Raw File Support](/processing/extractor-developer-guide#cinema-camera-raw-file-support) for usage examples.
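
For experimenting with these tools outside the engine runtime, a local stand-in can help. The sketch below does not use `run_tool` (its signature is documented in the Developer Guide, not here); it calls `ffprobe` directly through Python's `subprocess` module instead. The helper names `ffprobe_command`, `summarize_probe`, and `probe` are hypothetical.

```python
import json
import shutil
import subprocess
from typing import Any, Dict, List


def ffprobe_command(path: str) -> List[str]:
    """Build an ffprobe invocation that emits container and stream metadata as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-print_format", "json",
        "-show_format", "-show_streams",
        path,
    ]


def summarize_probe(raw_json: str) -> Dict[str, Any]:
    """Reduce ffprobe's JSON output to a duration and a list of codec names."""
    data = json.loads(raw_json)
    streams = data.get("streams", [])
    return {
        "duration_s": float(data.get("format", {}).get("duration", 0.0)),
        "codecs": [stream.get("codec_name") for stream in streams],
    }


def probe(path: str) -> Dict[str, Any]:
    """Run ffprobe locally; raises if the binary is missing or exits non-zero."""
    if shutil.which("ffprobe") is None:
        raise RuntimeError("ffprobe not found on PATH")
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True, check=True
    )
    return summarize_probe(result.stdout)


sample = '{"format": {"duration": "12.5"}, "streams": [{"codec_name": "h264"}, {"codec_name": "aac"}]}'
print(summarize_probe(sample))
```

Separating command construction from output parsing keeps the parsing logic testable without media files or the binary installed.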

## Next Steps

<CardGroup cols={2}>
  <Card title="Quickstart" icon="rocket" href="/tutorials/custom-extractor-quickstart">
    Build, upload, and deploy a minimal text embedding extractor end-to-end.
  </Card>

  <Card title="Extractor Developer Guide" icon="book" href="/processing/extractor-developer-guide">
    Full reference: manifest format, DataFrame schema, model loading, security rules, batch optimization, examples.
  </Card>

  <Card title="Upload & Deploy API" icon="cloud-arrow-up" href="/api-reference/custom-plugins-namespace/upload-a-custom-plugin">
    Presigned uploads, confirm, deploy, status, undeploy, delete.
  </Card>

  <Card title="Realtime Inference API" icon="bolt" href="/api-reference/custom-plugins-namespace/test-plugin-realtime-inference">
    Call a deployed extractor's `realtime.py` endpoint for retriever operations.
  </Card>
</CardGroup>
