# Custom Extractors

> Bring your own container image or zip archive to run feature extractors and retriever operations on Mixpeek infrastructure

The built-in extractors have their own runnable reference covering inputs, parameters, output fields, embedding models, and copy-paste examples. It is auto-generated from the live registry, so it always matches production.

Custom extractor flow: a single extractor archive powers both the ingestion pipeline (batch Ray Data) and the retrieval real-time endpoint (Ray Serve HTTP), used by retriever stages like `feature_search`, `rerank`, and `agentic_enrich`.


**Enterprise only.** Custom extractors are gated to Enterprise plans. Batch-only deployments and real-time inference endpoints both require Enterprise infrastructure. [Contact your account team](https://mixpeek.com/contact) to enable extractor uploads for your organization.

Custom extractors let you run your own code on Mixpeek infrastructure, inside the same Ray cluster that powers the built-in extractors and retriever stages. You keep full control of the logic, model, and I/O; Mixpeek handles packaging, scheduling, GPU allocation, caching, and observability.

Built something broadly useful? You can submit it for review to be merged into the built-in extractor catalog; see [Extractor Submissions](/processing/extractor-marketplace).

## Delivery Formats

Ship your extractor in either of two formats:

| Format | When to Use |
| ------------------------- | ----------- |
| **Zip archive** (`.zip`) | Pure-Python extractors. Mixpeek resolves dependencies against the managed runtime. Upload via presigned URL, scanned by the security linter before deploy. Limit: 500 MB / 1,000 files. |
| **Container image** (OCI) | Extractors that need system packages, custom CUDA builds, compiled binaries, or non-Python runtimes. Base your image on the Mixpeek engine image, push to your org-scoped Artifact Registry repo (`plugins-/`), and set `container_image` in `manifest.py`. See [BYO Container Image](/processing/extractor-developer-guide#byo-container-image). |

Both formats go through the same upload → confirm → deploy lifecycle and expose the same runtime APIs (batch `__call__`, real-time `run_inference`, platform LLM/secret accessors).
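To make the two runtime entry points more concrete, here is a minimal sketch of a class that exposes both a batch `__call__` and a real-time `run_inference`. Only those two method names come from the text above; the class name, the argument and return shapes, and the toy `_embed` helper are assumptions for illustration, and the actual SDK contract may differ (see the developer guide for the real DataFrame schema).

```python
from typing import Any


class MyTextExtractor:
    """Hypothetical extractor showing the two entry points named above."""

    def _embed(self, text: str) -> list[float]:
        # Stand-in for a real model: a tiny deterministic two-value "embedding".
        return [float(len(text)), float(sum(map(ord, text)) % 97)]

    def __call__(self, batch: list[dict[str, Any]]) -> list[dict[str, Any]]:
        # Batch path (ingestion): add an embedding to every row in the batch.
        # Assumption: rows are plain dicts carrying a "text" field.
        return [{**row, "my_embedding": self._embed(row["text"])} for row in batch]

    def run_inference(self, inputs: dict[str, Any]) -> dict[str, Any]:
        # Real-time path (query time): embed a single request payload.
        return {"my_embedding": self._embed(inputs["text"])}


extractor = MyTextExtractor()
indexed_rows = extractor([{"id": "doc-1", "text": "hello"}])
query_result = extractor.run_inference({"text": "hello"})
```

Because both paths go through the same `_embed` helper, an indexed document and a query with the same text produce the same vector, which is the property that keeps ingestion and retrieval in one embedding space.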
## What You Can Build

Custom extractors plug into two places in the warehouse:

### 1. Feature Extractors (Decomposition)

Attach custom logic to a collection's `feature_extractor` so every ingested object flows through your pipeline during decomposition. Use this to:

* Embed domain-specific content with your own model (fine-tuned CLIP, proprietary audio encoder, etc.)
* Extract structured attributes via a VLM you manage (brand compliance, regulated content classification)
* Transcribe, OCR, or segment media with a custom pre/post-processing chain
* Produce multiple named vector indexes from a single pass

Outputs land in MVS and MongoDB with the same feature URI scheme as built-in extractors (`mixpeek://my_extractor@1.0.0/my_embedding`), so retrievers, taxonomies, and clusters can reference them.

### 2. Retriever Operations (Query Time)

An extractor's `realtime.py` exposes a Ray Serve HTTP endpoint that retriever stages can call during execution. Use this to power:

* **`feature_search`**: embed queries at search time with the same model you used during ingestion, so the query vector lives in the same space as the indexed vectors
* **Inference operations**: on-the-fly classification, scoring, or re-ranking against your model
* **LLM calls**: wrap a hosted or private LLM behind a stable contract, with platform-managed secrets and cost tracking
* **Classifiers**: apply your own classifier to candidate results mid-pipeline

This is what lets a single custom extractor own both halves of a retrieval flow: it encodes documents on the way in, and encodes the query on the way out.
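The feature URI shown above (`mixpeek://my_extractor@1.0.0/my_embedding`) can be split into its parts with a small helper. This is a sketch only: the `extractor@version/output` grammar is inferred from that single example, and the names `FeatureURI`, `parse_feature_uri`, and `format_feature_uri` are hypothetical, not part of any Mixpeek SDK.

```python
import re
from typing import NamedTuple

# Assumed grammar, inferred from the one example URI in this page:
#   mixpeek://<extractor>@<version>/<output>
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<output>.+)$"
)


class FeatureURI(NamedTuple):
    extractor: str
    version: str
    output: str


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into extractor name, version, and output name."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognized feature URI: {uri!r}")
    return FeatureURI(match["extractor"], match["version"], match["output"])


def format_feature_uri(extractor: str, version: str, output: str) -> str:
    """Build a feature URI from its three parts (inverse of parse_feature_uri)."""
    return f"mixpeek://{extractor}@{version}/{output}"


parsed = parse_feature_uri("mixpeek://my_extractor@1.0.0/my_embedding")
```

A helper like this is mainly useful for validating URIs in configuration before referencing them from retrievers, taxonomies, or clusters.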
## Version Management

Extractors support a git-like workflow for iterating on deployed code:

| Command | Description |
| ---------- | ---------------------------------------------------------------------------- |
| `lint` | Validate manifest and run security scanner locally (no API key needed) |
| `test` | Run pipeline through Ray Data test harness locally (no API key needed) |
| `pull` | Download the active version's source files to a local directory |
| `push` | Zip, upload, and confirm a new version (auto-bumps patch version if omitted) |
| `log` | Show version history with deploy timestamps and commit messages |
| `status` | Show active version, extractor ID, and security scan status |
| `rollback` | Restore a previous version as active |
| `diff` | Compare source files between two versions |

Use the CLI at `server/scripts/api/plugins.py` or call the version management endpoints directly:

* `GET /v1/namespaces/{ns}/extractors/by-name/{slug}/versions`: version history
* `POST /v1/namespaces/{ns}/extractors/by-name/{slug}/rollback`: rollback
* `GET /v1/namespaces/{ns}/extractors/by-name/{slug}/diff?v1=1.0.0&v2=1.0.1`: diff
* `GET /v1/namespaces/{ns}/extractors/{id}/source`: download source files

## Pre-installed Tools

The engine runtime includes media processing tools available via `run_tool` from the Extractor SDK:

| Tool | Format | Description |
| ---------------------- | -------------------- | ---------------------------------------------------- |
| `ffmpeg` / `ffprobe` | Standard video/audio | Transcode, extract frames, probe metadata |
| `REDline` | RED R3D | Decode RED cinema camera raw files to ProRes/DPX/EXR |
| `art-cmd` | ARRI RAW | Decode ARRI raw (.ari/.arriraw/.arx) to ProRes |
| `exiftool` | All media | Read/write EXIF and XMP metadata |
| `mediainfo` | All media | Detailed format and codec inspection |
| `convert` / `identify` | Images | ImageMagick image processing |
| `sox` / `soxi` | Audio | Audio processing and info |

See [Cinema Camera Raw File Support](/processing/extractor-developer-guide#cinema-camera-raw-file-support) for usage examples.

## Next Steps

* Build, upload, and deploy a minimal text embedding extractor end-to-end.
* Full reference: manifest format, DataFrame schema, model loading, security rules, batch optimization, examples.
* Presigned uploads, confirm, deploy, status, undeploy, delete.
* Call a deployed extractor's `realtime.py` endpoint for retriever operations.
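As a rough illustration of the version management endpoints listed under Version Management, the sketch below builds the version-history and diff URLs and shows what a patch bump might look like. The endpoint paths come from this page; the base URL `https://api.example.com`, the helper names, and the assumption that versions follow a `MAJOR.MINOR.PATCH` pattern are hypothetical. No request is sent, and authentication is out of scope here.

```python
from urllib.parse import quote, urlencode


def _extractor_base(base_url: str, namespace: str, slug: str) -> str:
    # Shared prefix of the by-name endpoints; path segments are percent-encoded.
    return (
        f"{base_url.rstrip('/')}/v1/namespaces/{quote(namespace, safe='')}"
        f"/extractors/by-name/{quote(slug, safe='')}"
    )


def versions_url(base_url: str, namespace: str, slug: str) -> str:
    """URL for GET .../versions (version history)."""
    return f"{_extractor_base(base_url, namespace, slug)}/versions"


def diff_url(base_url: str, namespace: str, slug: str, v1: str, v2: str) -> str:
    """URL for GET .../diff?v1=...&v2=... (compare two versions)."""
    query = urlencode({"v1": v1, "v2": v2})
    return f"{_extractor_base(base_url, namespace, slug)}/diff?{query}"


def bump_patch(version: str) -> str:
    """Increment the patch number, assuming a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


# Hypothetical values for illustration only.
BASE_URL = "https://api.example.com"
history = versions_url(BASE_URL, "my-ns", "my_extractor")
comparison = diff_url(BASE_URL, "my-ns", "my_extractor", "1.0.0", bump_patch("1.0.0"))
```

Keeping URL construction in small pure functions makes it easy to check the paths in isolation before wiring them into whichever HTTP client you use.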