

Custom extractor flow: a single extractor archive powers both the ingestion pipeline (batch, via Ray Data) and the real-time retrieval endpoint (Ray Serve HTTP), which retriever stages such as feature_search, rerank, and agentic_enrich can call.
Enterprise only. Custom extractors are gated to Enterprise plans. Batch-only deployments and real-time inference endpoints both require Enterprise infrastructure. Contact your account team to enable extractor uploads for your organization.
Custom extractors let you run your own code on Mixpeek infrastructure — inside the same Ray cluster that powers the built-in extractors and retriever stages. You keep full control of the logic, model, and I/O; Mixpeek handles packaging, scheduling, GPU allocation, caching, and observability.
You can also publish your extractors to the Extractor Marketplace for other organizations to discover and install, or install community extractors built by others.

Delivery Formats

Ship your extractor in either of two formats:
  • Zip archive (.zip): Pure-Python extractors. Mixpeek resolves dependencies against the managed runtime. Upload via presigned URL; the archive is scanned by the security linter before deploy. Limits: 500 MB / 1,000 files.
  • Container image (OCI): Extractors that need system packages, custom CUDA builds, compiled binaries, or non-Python runtimes. Base your image on the Mixpeek engine image, push it to your org-scoped Artifact Registry repo (plugins-<org-id>/), and set container_image in manifest.py. See BYO Container Image.
Both formats go through the same upload → confirm → deploy lifecycle and expose the same runtime APIs (batch __call__, real-time run_inference, platform LLM/secret accessors).
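To make the two entry points concrete, here is a minimal sketch of what that shared runtime surface could look like. The method names follow this page's description (batch __call__, real-time run_inference), but the class shape, column names, and payload keys are illustrative assumptions, not the official SDK contract; the embedder is a toy stand-in for a real model.

```python
class MyEmbedExtractor:
    """One archive, two entry points: batch decomposition and realtime inference.

    Hypothetical sketch -- the real base class and signatures come from the
    Mixpeek Extractor SDK.
    """

    def __init__(self):
        # Load your model once per replica (placeholder: a toy hash embedder).
        self.dim = 8

    def _embed(self, text: str) -> list[float]:
        # Deterministic stand-in for a real model forward pass.
        return [((hash(text) >> (4 * i)) & 0xF) / 15.0 for i in range(self.dim)]

    def __call__(self, batch: dict) -> dict:
        # Batch path (Ray Data): map a column batch to an embedding column.
        batch["my_embedding"] = [self._embed(t) for t in batch["text"]]
        return batch

    def run_inference(self, payload: dict) -> dict:
        # Realtime path (Ray Serve): embed a single query at search time.
        return {"vector": self._embed(payload["query"])}
```

Because both paths share one loaded model, ingested documents and search-time queries are guaranteed to be encoded identically.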

What You Can Build

Custom extractors plug into two places in the warehouse:

1. Feature Extractors (Decomposition)

Attach custom logic to a collection’s feature_extractor so every ingested object flows through your pipeline during decomposition. Use this to:
  • Embed domain-specific content with your own model (fine-tuned CLIP, proprietary audio encoder, etc.)
  • Extract structured attributes via a VLM you manage (brand compliance, regulated content classification)
  • Transcribe, OCR, or segment media with a custom pre/post-processing chain
  • Produce multiple named vector indexes from a single pass
Outputs land in MVS and MongoDB with the same feature URI scheme as built-in extractors (mixpeek://my_extractor@1.0.0/my_embedding), so retrievers, taxonomies, and clusters can reference them.
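A hedged sketch of how multiple named outputs and the feature URI scheme might fit together in manifest.py. The field names below (slug, version, outputs) are guesses at the shape this page implies; consult the Extractor Developer Guide for the real manifest schema.

```python
# Hypothetical manifest.py fragment -- field names are illustrative.
manifest = {
    "slug": "my_extractor",
    "version": "1.0.0",
    # Declaring several named outputs lets one decomposition pass produce
    # multiple vector indexes, each addressable by its feature URI.
    "outputs": [
        {"name": "my_embedding", "type": "vector", "dim": 768},
        {"name": "attributes", "type": "json"},
    ],
}

def feature_uri(manifest: dict, output_name: str) -> str:
    # Composes the URI scheme shown above, e.g.
    # mixpeek://my_extractor@1.0.0/my_embedding
    return f"mixpeek://{manifest['slug']}@{manifest['version']}/{output_name}"
```

Retrievers, taxonomies, and clusters would then reference the output by that URI rather than by storage location.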

2. Retriever Operations (Query Time)

An extractor’s realtime.py exposes a Ray Serve HTTP endpoint that retriever stages can call during execution. Use this to power:
  • feature_search — embed queries at search time with the same model you used during ingestion, so the query vector lives in the same space as the indexed vectors
  • Inference operations — on-the-fly classification, scoring, or re-ranking against your model
  • LLM calls — wrap a hosted or private LLM behind a stable contract, with platform-managed secrets and cost tracking
  • Classifiers — apply your own classifier to candidate results mid-pipeline
This is what lets a single custom extractor own both halves of a retrieval flow: it encodes documents on the way in, and encodes the query on the way out.
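The reason both halves must live in one extractor is that similarity scores are only meaningful when documents and queries come from the same encoder. A toy illustration (the bag-of-letters embedder is a stand-in for your model, not anything Mixpeek ships):

```python
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-letters encoder, L2-normalized; a real extractor would
    # run your model here -- the point is that ingest and query share it.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Ingest time: the batch path indexes documents with embed().
docs = {d: embed(d) for d in ["red camera footage", "audio transcript"]}

# Query time: feature_search calls run_inference, which uses the same embed(),
# so the query vector lives in the same space as the indexed vectors.
query = embed("camera footage")
best = max(docs, key=lambda d: cosine(query, docs[d]))
```

Swap either side for a different encoder and the cosine scores become noise, which is why the platform routes both through one deployed artifact.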

Version Management

Extractors support a git-like workflow for iterating on deployed code:
  • lint: Validate the manifest and run the security scanner locally (no API key needed)
  • test: Run the pipeline through the Ray Data test harness locally (no API key needed)
  • pull: Download the active version's source files to a local directory
  • push: Zip, upload, and confirm a new version (auto-bumps the patch version if none is given)
  • log: Show version history with deploy timestamps and commit messages
  • status: Show the active version, extractor ID, and security scan status
  • rollback: Restore a previous version as active
  • diff: Compare source files between two versions
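The push command's auto-bump behavior presumably follows plain semver patch incrementing; a sketch of that rule (an assumption about the implementation, consistent with the 1.0.0 → 1.0.1 progression in the diff example below):

```python
def bump_patch(version: str) -> str:
    # "1.0.0" -> "1.0.1": increment the patch component, keep major/minor.
    major, minor, patch = (int(p) for p in version.split("."))
    return f"{major}.{minor}.{patch + 1}"
```

Passing an explicit version to push would skip this and deploy exactly what you specify.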
Use the CLI at server/scripts/api/plugins.py or call the version management endpoints directly:
  • GET /v1/namespaces/{ns}/extractors/by-name/{slug}/versions — version history
  • POST /v1/namespaces/{ns}/extractors/by-name/{slug}/rollback — rollback
  • GET /v1/namespaces/{ns}/extractors/by-name/{slug}/diff?v1=1.0.0&v2=1.0.1 — diff
  • GET /v1/namespaces/{ns}/extractors/{id}/source — download source files
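If you call the by-name endpoints directly, the paths above can be assembled with a small helper. The helper itself is illustrative (the API host and auth scheme are not specified on this page); only the path shapes come from the list above.

```python
def extractor_path(ns: str, slug: str, action: str = "versions", **params) -> str:
    # Builds e.g. /v1/namespaces/{ns}/extractors/by-name/{slug}/diff?v1=...&v2=...
    path = f"/v1/namespaces/{ns}/extractors/by-name/{slug}/{action}"
    if params:
        path += "?" + "&".join(f"{k}={v}" for k, v in params.items())
    return path
```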

Pre-installed Tools

The engine runtime includes media processing tools available via run_tool from the Extractor SDK:
  • ffmpeg / ffprobe (standard video/audio): Transcode, extract frames, probe metadata
  • REDline (RED R3D): Decode RED cinema camera raw files to ProRes/DPX/EXR
  • art-cmd (ARRI RAW): Decode ARRI raw (.ari/.arriraw/.arx) files to ProRes
  • exiftool (all media): Read/write EXIF and XMP metadata
  • mediainfo (all media): Detailed format and codec inspection
  • convert / identify (images): ImageMagick image processing
  • sox / soxi (audio): Audio processing and info
See Cinema Camera Raw File Support for usage examples.
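Since run_tool's exact signature isn't documented on this page, here is a hedged stand-in showing the kind of invocation it presumably wraps, using real ffprobe flags via subprocess. In an actual extractor you would call run_tool from the Extractor SDK instead.

```python
import json
import subprocess

def build_probe_cmd(path: str) -> list[str]:
    # Real ffprobe options: suppress logging, emit JSON, include
    # container-level format info and per-stream details.
    return ["ffprobe", "-v", "quiet", "-print_format", "json",
            "-show_format", "-show_streams", path]

def probe(path: str) -> dict:
    # Illustrative direct invocation; inside the engine runtime the SDK's
    # run_tool would handle sandboxing and tool resolution for you.
    out = subprocess.run(build_probe_cmd(path), capture_output=True,
                         text=True, check=True)
    return json.loads(out.stdout)
```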

Next Steps

Quickstart

Build, upload, and deploy a minimal text embedding extractor end-to-end.

Extractor Developer Guide

Full reference: manifest format, DataFrame schema, model loading, security rules, batch optimization, examples.

Upload & Deploy API

Presigned uploads, confirm, deploy, status, undeploy, delete.

Realtime Inference API

Call a deployed extractor’s realtime.py endpoint for retriever operations.