Use the exact key names below. Wrong keys silently create a collection with no vector indexes — your batch will show COMPLETED but produce 0 documents.
| Key | Required | Description |
|---|---|---|
| `feature_type` | Yes | Must be `"embedding"` |
| `feature_name` | Yes | Name of the vector index |
| `embedding_dim` | Yes | Vector dimensionality |
| `distance_metric` | Yes | `"cosine"`, `"euclid"`, or `"dot"` |
Multiple vectors are supported — add one entry per embedding your extractor produces.
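For example, a `features` declaration for an extractor that produces two embeddings might look like this (a sketch: the key names match the table above, but the surrounding structure is assumed):

```python
# manifest.py -- illustrative; only the key names are guaranteed by the table above
features = [
    {
        "feature_type": "embedding",      # must be exactly "embedding"
        "feature_name": "text_vector",    # illustrative index name
        "embedding_dim": 1024,
        "distance_metric": "cosine",      # "cosine", "euclid", or "dot"
    },
    {
        "feature_type": "embedding",
        "feature_name": "title_vector",   # one entry per embedding produced
        "embedding_dim": 384,
        "distance_metric": "dot",
    },
]
```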
Declare what kind of real-time inference your extractor provides by setting `inference_type` in `metadata`. This lets retriever stages validate that an extractor is compatible with the stage slot.
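A loose sketch, assuming `metadata` is a plain dict in `manifest.py` and that `"embedding"` is a valid value (both are assumptions):

```python
# manifest.py (hypothetical placement and value)
metadata = {
    "inference_type": "embedding",   # what realtime.py provides to retriever stages
}
```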
Control resource allocation by adding `compute_profile` to your manifest:

```python
compute_profile = {
    "resource_type": "cpu",    # "cpu", "gpu", or "api"
    "batch_size": 32,          # Rows per __call__ (default: 64)
    "max_concurrency": 4,      # Parallel Ray actors (default: 2)
}
```
For API-based or hash-based extractors that don't need a GPU, set `"resource_type": "cpu"` to skip GPU allocation; this saves ~3 minutes of startup time and costs ~6x less.
Base your image on the Mixpeek engine image to get Ray, FFmpeg, and all SDK helpers:
```dockerfile
FROM us-east1-docker.pkg.dev/mixpeek-inference-463103/mixpeek-engine/engine-base:latest
RUN apt-get update && apt-get install -y libcustom-sdk ...
```
Images must be pushed to your org-scoped Artifact Registry repo. GKE Workload Identity handles pull auth. Contact your account team to provision access.
| Column | Description |
|---|---|
| `data` | For text blobs: the raw string. For binary blobs: an S3 URL (not raw bytes) |
| `document_id` | Unique document ID |
| `object_id` | Source object in the bucket |
| `blob_type` | `"image"`, `"video"`, `"audio"`, or `"text"` |
| `blob_property` | Property name from your bucket schema |
| `mime_type` | MIME type (e.g. `image/jpeg`) |
Always read from `batch["data"]`, not from a column named after your blob property. If your bucket has a `text` property, the content is still in `batch["data"]`.
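For example, inside `process()`:

```python
texts = batch["data"]     # correct: content is always in the "data" column
# texts = batch["text"]   # wrong: columns are not named after blob properties
```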
Compose existing Mixpeek services instead of loading models yourself:
```python
from shared.inference.registry import get_batch_service

WhisperBatch = get_batch_service("openai/whisper-large-v3-turbo")
E5Batch = get_batch_service("intfloat/multilingual-e5-large-instruct")

# Use in your pipeline steps
StepDefinition(service_class=E5Batch, resource_type=ResourceType.CPU, config=e5_config)
```
The return dict must include an `"embedding"` key; this is what `feature_search` uses as the query vector. For multi-vector extractors, include additional keys matching your `feature_name` values.
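A sketch of a `realtime.py` honoring that contract (the entry-point name and module-level model loading are assumptions):

```python
# realtime.py -- entry-point name is assumed; the return contract is as described above
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

def infer(query: str) -> dict:
    return {
        "embedding": model.encode(query),   # required: used as the query vector
        # Multi-vector extractors add keys matching their feature_name values:
        # "title_vector": title_model.encode(query),
    }
```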
`realtime.py` handles embedding only, not retrieval. If your pipeline needs retrieval context (comparing against stored references), configure retriever stages to handle that logic.
The Extractor SDK provides typed base classes that replace bare Python variables with validated, IDE-friendly types. Use these for autocompletion, validation at upload time, and output column checking in the test harness.
`setup()` runs once on the first batch (lazy model loading). `process()` runs on each batch. `prefetch_hf_model()` pre-downloads the model to the HF cache to reduce cold-start latency. A lifecycle sketch follows.
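This sketch assumes a base class named `BaseExtractor` and that `prefetch_hf_model()` is importable from the SDK; both import paths are assumptions, while the hook semantics follow the description above:

```python
from extractor_sdk import BaseExtractor, prefetch_hf_model   # hypothetical module path
from sentence_transformers import SentenceTransformer

class MyTextEmbedder(BaseExtractor):
    def setup(self):
        # Runs once, on the first batch: lazy-load the model
        prefetch_hf_model("intfloat/multilingual-e5-large-instruct")  # warm the HF cache
        self.model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

    def process(self, batch: dict) -> dict:
        # Runs on every batch; the output column must match a manifest feature_name
        return {"text_vector": self.model.encode(list(batch["data"]))}
```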
This example builds a complete extractor for domain-specific text processing: it chunks documents by paragraph and embeds each chunk with a fine-tuned model. The pattern works for legal briefs, financial filings, medical records, or any text-heavy corpus.
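The core of that pattern, sketched (the model name and chunking heuristic are illustrative):

```python
from sentence_transformers import SentenceTransformer

def chunk_paragraphs(text: str, min_chars: int = 50) -> list[str]:
    # Split on blank lines and drop fragments too short to embed meaningfully
    return [p.strip() for p in text.split("\n\n") if len(p.strip()) >= min_chars]

model = SentenceTransformer("my-org/filings-embedder")   # hypothetical fine-tuned model
chunks = chunk_paragraphs("Paragraph one...\n\nParagraph two about revenue...")
vectors = model.encode(chunks)   # one vector per chunk, stored under your feature_name
```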
The retriever uses `feature_search` with your extractor's feature URI. At query time, the retriever calls your extractor's `realtime.py` to embed the query, then searches against the vectors your batch processor produced.
Before uploading, validate and test your extractor locally with the CLI:
```bash
# Validate manifest and run security scanner (no API key needed)
python scripts/api/extractors.py lint path/to/my_extractor

# Run pipeline through Ray Data test harness (no API key needed)
python scripts/api/extractors.py test path/to/my_extractor
```
`lint` catches common mistakes before upload:

- Wrong feature key names (`name` instead of `feature_name`)
- Missing required fields
- Security scanner violations
`test` runs your processor through real Ray Data `map_batches` with Arrow serialization, the same path used in production. It validates that output columns match your manifest features.
`import os` is allowed; only dangerous functions are blocked. Library-internal file I/O (`torch.load`, `transformers` `.from_pretrained`, `pd.read_csv`) is also fine, since the scanner only inspects your extractor's own source code.
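Concretely, the kinds of patterns the scanner allows versus blocks (blocked lines shown commented out; `run_tool` is named in the troubleshooting section below, and its signature is not shown here):

```python
import os      # allowed: imports themselves are never blocked
import json

cfg = json.loads('{"batch_size": 32}')   # allowed: use instead of eval/exec

# Also allowed: library-internal file I/O, e.g.
#   torch.load("weights.pt")
#   transformers AutoModel.from_pretrained("...")

# Blocked by the scanner:
#   eval(user_input)                      # dangerous function
#   open("/tmp/scratch.txt", "w")         # direct open(); use library I/O
#   subprocess.run(["ffmpeg", "-i", "a", "b"])   # use run_tool instead
```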
Usually wrong `features` key names in `manifest.py`. Run `python scripts/api/extractors.py lint path/to/my_extractor` to validate your manifest keys. Check that your collection has non-empty `vector_indexes` via `GET /v1/collections/{id}`. If empty, fix the manifest keys and recreate the collection. See Vector Index Keys.

**Batch produces 0 documents**

Most common cause: reading from the wrong column. Always use `batch["data"]`, not `batch["text"]` or other property names. Check the Ray logs for `[FailureAggregator]` entries.

**Extractor validation failed**

Check `validation_errors`. Common issues: using `subprocess` (use `run_tool`), using `open()` directly (use library I/O), and using `eval`/`exec` (use `json.loads`).

**Model loading is slow**

Use `prefetch_hf_model()` in your `setup()` method to pre-download models to the HF cache. On GKE, `HF_HOME` points to a shared PVC, so models persist across pod restarts. The first cold start downloads the model (~1-2 min); subsequent starts use the cache.

**Can't access retrieval context in realtime.py**

`realtime.py` handles embedding, not retrieval. Use retriever stages (`semantic_search` → `rerank` → `agentic_enrich`) for retrieval logic.

**feature_uri without realtime.py**

If your manifest includes a `feature_uri`, the system expects a corresponding `realtime.py`. Without it, `feature_search` queries against that URI will fail. Omit both if you only need batch processing.

**Cloudflare blocks the Python client**

Use `requests` with a `User-Agent` header instead of `urllib`; `curl` works fine. A minimal sketch follows.
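The endpoint URL below is illustrative:

```python
import requests

# Cloudflare rejects the default Python urllib User-Agent, so set one explicitly
resp = requests.get(
    "https://api.mixpeek.com/v1/collections",      # illustrative endpoint
    headers={"User-Agent": "my-extractor-tests/1.0"},
)
resp.raise_for_status()
print(resp.json())
```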