Execute Raw Inference
Execute raw inference with provider+model or custom plugin.
This endpoint provides direct access to inference services without the overhead of the retriever framework. It supports two modes:
- Provider + Model: Use standard providers (openai, google, anthropic)
- Custom Plugin: Use your custom inference plugins by inference_name
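A client can enforce the choice between the two modes before sending anything. The sketch below is one way to do that in Python. `build_payload` is a hypothetical helper, not part of any Mixpeek SDK, and treating the two modes as mutually exclusive is an assumption drawn from the wording above rather than documented server behavior.

```python
from typing import Any, Dict, Optional


def build_payload(
    inputs: Dict[str, Any],
    provider: Optional[str] = None,
    model: Optional[str] = None,
    inference_name: Optional[str] = None,
    parameters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a request body for exactly one of the two modes."""
    wants_provider_mode = provider is not None or model is not None

    # ASSUMPTION: mixing both modes in one request is treated as an error here.
    if wants_provider_mode and inference_name is not None:
        raise ValueError("use either provider+model or inference_name, not both")

    if wants_provider_mode:
        if provider is None or model is None:
            raise ValueError("provider and model must be given together")
        target = {"provider": provider, "model": model}
    elif inference_name is not None:
        target = {"inference_name": inference_name}
    else:
        raise ValueError("give provider+model or inference_name")

    # An empty parameters object mirrors the examples in this section.
    return {**target, "inputs": inputs, "parameters": parameters or {}}
```

Calling `build_payload({"text": "hello world"}, inference_name="my_text_embedder_1_0_0")` produces the same body as the first custom plugin example in this section.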
Supported Providers
- openai: GPT models, embeddings, Whisper transcription
- google: Gemini models, Vertex multimodal embeddings (1408D)
- anthropic: Claude models
Examples
Custom Plugin (by inference_name)
{
"inference_name": "my_text_embedder_1_0_0",
"inputs": {"text": "hello world"},
"parameters": {}
}
Custom Plugin (by feature_uri)
{
"feature_uri": "mixpeek://my_custom_embedder@1.0.0/embedding",
"inputs": {"text": "hello world"},
"parameters": {}
}
Builtin Embedder (by feature_uri)
{
"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
"inputs": {"text": "hello world"},
"parameters": {}
}
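Both feature_uri values above follow the pattern mixpeek://{extractor}@{version}/{vector_index_name} given in the Body section. As a rough illustration, a client could split such a URI into its parts before sending it. `FeatureURI` and `parse_feature_uri` are hypothetical names, and the regular expression is an assumption about which characters each part may contain.

```python
import re
from typing import NamedTuple


class FeatureURI(NamedTuple):
    extractor: str
    version: str
    vector_index_name: str


# ASSUMPTION: no part contains "/", and the extractor name contains no "@".
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<vector_index_name>[^/]+)$"
)


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into extractor, version, and vector index name."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(**match.groupdict())
```

With this sketch, `parse_feature_uri("mixpeek://my_custom_embedder@1.0.0/embedding")` yields the extractor `my_custom_embedder`, the version `1.0.0`, and the vector index name `embedding`.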
Chat Completion
{
"provider": "openai",
"model": "gpt-4o-mini",
"inputs": {"prompts": ["What is AI?"]},
"parameters": {"temperature": 0.7, "max_tokens": 500}
}
Text Embedding (OpenAI)
{
"provider": "openai",
"model": "text-embedding-3-large",
"inputs": {"text": "machine learning"},
"parameters": {}
}
Text Embedding (Google Vertex Multimodal - 1408D)
{
"provider": "google",
"model": "multimodalembedding",
"inputs": {"text": "machine learning"},
"parameters": {}
}
Image Embedding (Google Vertex Multimodal - 1408D)
{
"provider": "google",
"model": "multimodalembedding",
"inputs": {"image_url": "https://example.com/image.jpg"},
"parameters": {}
}
Image Embedding from Base64
{
"provider": "google",
"model": "multimodalembedding",
"inputs": {"image_base64": "<base64-encoded-image>"},
"parameters": {}
}
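The `<base64-encoded-image>` placeholder above can be produced from raw image bytes with Python's standard library. This is a minimal sketch: `image_embedding_payload` is a hypothetical helper, and it assumes the endpoint expects plain standard base64 with no `data:` URI prefix, which this section does not state.

```python
import base64
from typing import Any, Dict


def image_embedding_payload(image_bytes: bytes) -> Dict[str, Any]:
    """Wrap raw image bytes in the image_base64 request body shown above."""
    # ASSUMPTION: standard base64 alphabet, no "data:image/...;base64," prefix.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "provider": "google",
        "model": "multimodalembedding",
        "inputs": {"image_base64": encoded},
        "parameters": {},
    }
```

In practice the bytes would come from something like `open("image.jpg", "rb").read()`.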
Video Embedding (Google Vertex Multimodal - 1408D)
{
"provider": "google",
"model": "multimodalembedding",
"inputs": {"video_url": "https://example.com/video.mp4"},
"parameters": {}
}
Video Embedding from Base64
{
"provider": "google",
"model": "multimodalembedding",
"inputs": {"video_base64": "<base64-encoded-video>"},
"parameters": {}
}
Audio Transcription
{
"provider": "openai",
"model": "whisper-1",
"inputs": {"audio_url": "https://example.com/audio.mp3"},
"parameters": {}
}
Vision (Multimodal LLM)
{
"provider": "openai",
"model": "gpt-4o",
"inputs": {
"prompts": ["Describe this image"],
"image_url": "https://example.com/image.jpg"
},
"parameters": {"temperature": 0.5}
}
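Any of the example bodies above can be sent as a JSON POST request. The sketch below only builds the request object, using Python's standard library. The endpoint URL is not given in this section, so the caller supplies it. `Authorization` is the standard HTTP header for Bearer tokens, while the `X-Namespace` header name is an assumption, since the Headers section describes a namespace header without naming it. `build_headers` and `build_request` are hypothetical helpers.

```python
import json
import urllib.request
from typing import Any, Dict


def build_headers(api_key: str, namespace: str) -> Dict[str, str]:
    """Headers for an authenticated, namespace-scoped JSON request."""
    return {
        "Authorization": f"Bearer {api_key}",
        "X-Namespace": namespace,  # ASSUMPTION: header name not given in this section
        "Content-Type": "application/json",
    }


def build_request(
    url: str, api_key: str, namespace: str, payload: Dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the inference payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers=build_headers(api_key, namespace), method="POST"
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the request easy to inspect or test without network access.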
Args:
- request: FastAPI request object (populated by middleware)
- payload: Raw inference request
Returns: Inference response with results and metadata
Raises:
- 400 Bad Request: Invalid provider, model, or inputs
- 401 Unauthorized: Missing or invalid API key
- 429 Too Many Requests: Rate limit exceeded
- 500 Internal Server Error: Inference execution failed
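One way a client might react to the status codes above is to retry the rate-limit and server-error cases with exponential backoff and fail fast on the rest. The sketch below is illustrative only. Treating 429 and 500 as retryable is an assumption, not documented behavior, and `call_with_retry` is a hypothetical helper that takes any zero-argument `send` function returning a `(status_code, body)` pair.

```python
import time
from typing import Any, Callable, Tuple

# Meanings as listed in this section.
STATUS_MEANINGS = {
    400: "invalid provider, model, or inputs",
    401: "missing or invalid API key",
    429: "rate limit exceeded",
    500: "inference execution failed",
}

# ASSUMPTION: only these two are worth retrying.
RETRYABLE = frozenset({429, 500})


def call_with_retry(
    send: Callable[[], Tuple[int, Any]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until it succeeds, a fatal status appears, or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status < 400:
            return body
        reason = STATUS_MEANINGS.get(status, "unexpected error")
        is_last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE or is_last_attempt:
            raise RuntimeError(f"{status}: {reason}")
        # Wait base_delay, then 2x, 4x, ... before the next attempt.
        sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the backoff can be replaced in tests. Production code would leave it at the default.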
Headers
Bearer token authentication using your API key. Format: 'Bearer sk_xxxxxxxxxxxxx'. You can create API keys in the Mixpeek dashboard under Organization Settings.
"Bearer YOUR_MIXPEEK_API_KEY"
Namespace identifier for scoping this request. All resources (collections, buckets, taxonomies, etc.) are scoped to a namespace. You can provide either the namespace name or namespace ID. Format: ns_xxxxxxxxxxxxx (ID) or a custom name like 'my-namespace'. Falls back to ?namespace= query parameter if the header is omitted.
"ns_abc123def456"
"production"
"my-namespace"
Body
Request for raw inference without retriever framework.
This endpoint provides direct access to inference services with minimal configuration. Ideal for simple LLM calls, embeddings, transcription, or vision tasks without requiring collection setup or retriever configuration.
You can either use:
- provider + model for standard providers (openai, google, anthropic)
- inference_name for custom plugins
Examples:
# Chat completion (provider + model)
{
"provider": "openai",
"model": "gpt-4o-mini",
"inputs": {"prompts": ["What is AI?"]},
"parameters": {"temperature": 0.7, "max_tokens": 500}
}
# Text embedding (provider + model)
{
"provider": "openai",
"model": "text-embedding-3-large",
"inputs": {"text": "machine learning"},
"parameters": {}
}
# Custom plugin (inference_name)
{
"inference_name": "my_text_embedder_1_0_0",
"inputs": {"text": "hello world"},
"parameters": {}
}
# Audio transcription
{
"provider": "openai",
"model": "whisper-1",
"inputs": {"audio_url": "https://example.com/audio.mp3"},
"parameters": {}
}
# Vision (multimodal)
{
"provider": "openai",
"model": "gpt-4o",
"inputs": {
"prompts": ["Describe this image"],
"image_url": "https://example.com/image.jpg"
},
"parameters": {"temperature": 0.5}
}
Model-specific inputs. Chat: {prompts: [str]}, Embeddings: {text: str} or {texts: [str]}, Transcription: {audio_url: str}, Vision: {prompts: [str], image_url: str}
{
"prompts": ["What is the capital of France?"]
}
{ "text": "machine learning" }
{
"audio_url": "https://example.com/audio.mp3"
}
Provider name: openai, google, anthropic (required if inference_name not set)
"openai"
Model identifier specific to the provider (required if inference_name not set)
"gpt-4o-mini"
Custom plugin inference name (alternative to provider+model)
"my_text_embedder_1_0_0"
Feature URI to resolve to inference_name (alternative to inference_name). Format: mixpeek://{extractor}@{version}/{vector_index_name}
"mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"
Optional parameters for inference. Common: temperature (float), max_tokens (int), schema (dict for structured output)
{ "max_tokens": 500, "temperature": 0.7 }
Enable semantic caching (vCache) for LLM chat operations. When enabled, semantically similar prompts may return cached responses, reducing latency and cost. Only applies to chat/completion models.
Maximum error rate for semantic cache (0.0-1.0). Lower values are more conservative. Default uses system setting (0.02 = 2%).
0 <= x <= 1
Response
Successful Response
Response from raw inference.
Returns the inference results along with metadata about the request.
Inference results (structure varies by modality)
Provider that was used
Model that was used
Total inference latency in milliseconds
Token usage statistics (if available)
{
"completion": 120,
"prompt": 15,
"total": 135
}
Whether the response was served from semantic cache (vCache)
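The token usage object shown above can be turned into a short log line on the client side. The sketch below uses the three keys from the example (`prompt`, `completion`, `total`) and handles the case where usage is absent, since the field is described as present only "if available". `summarize_usage` is a hypothetical helper. Falling back to prompt + completion when `total` is missing is an assumption that matches the example (15 + 120 = 135) but is not stated as a guarantee.

```python
from typing import Dict, Optional


def summarize_usage(usage: Optional[Dict[str, int]]) -> str:
    """Describe a token usage object, or note that none was returned."""
    if not usage:
        return "usage not reported"
    prompt = usage.get("prompt", 0)
    completion = usage.get("completion", 0)
    # ASSUMPTION: when "total" is missing, prompt + completion is a fair stand-in.
    total = usage.get("total", prompt + completion)
    return f"{prompt} prompt + {completion} completion = {total} total tokens"
```

For the example object above this returns `15 prompt + 120 completion = 135 total tokens`.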

