Mixpeek extracts visual embeddings, OCR text, descriptions, and structured metadata from images. Each image becomes a document with dense vector indexes for visual similarity search, text-to-image search, and filtered retrieval.
| Feature | Model | Dimensions | Extractor |
|---|---|---|---|
| Visual embeddings (image-only) | SigLIP | 768D | image_extractor |
| Visual embeddings (cross-modal) | Vertex AI multimodal | 1408D | multimodal_extractor |
| OCR text | Gemini | — | multimodal_extractor |
| Image descriptions | Gemini | — | multimodal_extractor |
| Face embeddings | ArcFace (SCRFD detect) | 512D | face_identity_extractor |
| Thumbnails | FFmpeg | — | image_extractor, multimodal_extractor |
Choosing an Extractor
| Goal | Extractor | Why |
|---|---|---|
| Visual similarity search (image-to-image) | image_extractor | SigLIP 768D embeddings, fast (~50ms/image), supports cross-modal text queries |
| Cross-modal search (text-to-image, image-to-video) | multimodal_extractor | Vertex AI 1408D unified embedding space across video, image, and text |
| OCR or image descriptions | multimodal_extractor | Gemini-based text extraction and description generation |
| Face detection and matching | face_identity_extractor | ArcFace 512D with 99.8% verification accuracy |
| Structured extraction (products, labels) | multimodal_extractor with response_shape | LLM extracts structured JSON from image content |
Use image_extractor when you only need image search. Use multimodal_extractor when you need images searchable alongside video or text in the same embedding space.
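The decision rule above can be sketched as a small helper. This is illustrative only, not part of any Mixpeek SDK; it simply mirrors the table:

```python
# Illustrative helper mirroring the extractor decision table (not a Mixpeek API):
# pick an extractor name from what the pipeline needs.
def choose_extractor(needs_cross_modal: bool = False,
                     needs_faces: bool = False) -> str:
    if needs_faces:
        return "face_identity_extractor"   # ArcFace 512D face embeddings
    if needs_cross_modal:
        return "multimodal_extractor"      # 1408D unified space, OCR, descriptions
    return "image_extractor"               # SigLIP 768D, fastest for image-only search

print(choose_extractor())
print(choose_extractor(needs_cross_modal=True))
```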
Create a Collection for Images
This collection generates SigLIP embeddings and thumbnails for an image catalog.
```bash
curl -X POST https://api.mixpeek.com/v1/collections \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-images",
    "source": { "type": "bucket", "bucket_id": "bkt_products" },
    "feature_extractor": {
      "feature_extractor_name": "image_extractor",
      "version": "v1",
      "input_mappings": {
        "image": "payload.image_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.product_id" },
        { "source_path": "metadata.brand" },
        { "source_path": "metadata.category" }
      ],
      "parameters": {
        "enable_thumbnails": true
      }
    }
  }'
```
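The same request body can be built programmatically. This Python sketch only constructs and serializes the payload; sending it with an HTTP client and the same `Authorization` and `X-Namespace` headers is left out:

```python
import json

# Build the collection-creation payload from the curl example above.
# No network call is made here; this only shows the request body shape.
payload = {
    "collection_name": "product-images",
    "source": {"type": "bucket", "bucket_id": "bkt_products"},
    "feature_extractor": {
        "feature_extractor_name": "image_extractor",
        "version": "v1",
        "input_mappings": {"image": "payload.image_url"},
        "field_passthrough": [
            {"source_path": "metadata.product_id"},
            {"source_path": "metadata.brand"},
            {"source_path": "metadata.category"},
        ],
        "parameters": {"enable_thumbnails": True},
    },
}
body = json.dumps(payload)
```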
Text-to-Image Search
Create a retriever and execute it with a text query. SigLIP’s shared text-image embedding space lets you search images with natural language.
```bash
curl -X POST https://api.mixpeek.com/v1/retrievers \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "image-search",
    "collection_ids": ["col_product_images"],
    "input_schema": {
      "properties": {
        "query": { "type": "text", "required": true }
      }
    },
    "stages": [
      {
        "stage_name": "visual_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "query": "{{INPUT.query}}",
            "top_k": 20
          }
        }
      }
    ]
  }'
```
Execute a text-to-image search:
```bash
curl -X POST https://api.mixpeek.com/v1/retrievers/ret_abc123/execute \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "red leather handbag with gold buckle" },
    "limit": 10
  }'
```
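The response shape in this sketch is an assumption for illustration (`results` entries with `document_id` and `score` fields); consult the API reference for the exact schema. One way to post-process the hits:

```python
# Sketch of handling a retriever response. The "results" shape here is
# assumed for illustration, not taken from the Mixpeek API reference.
def top_matches(response: dict, min_score: float = 0.5) -> list[str]:
    """Return document IDs above a score threshold, highest score first."""
    hits = [r for r in response.get("results", []) if r.get("score", 0) >= min_score]
    hits.sort(key=lambda r: r["score"], reverse=True)
    return [r["document_id"] for r in hits]

sample = {"results": [
    {"document_id": "doc_img_456", "score": 0.91},
    {"document_id": "doc_img_789", "score": 0.42},
]}
print(top_matches(sample))  # only the high-confidence hit survives
```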
Structured Metadata Extraction
Use multimodal_extractor with response_shape to extract structured product metadata from images.
```bash
curl -X POST https://api.mixpeek.com/v1/collections \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-catalog-enriched",
    "source": { "type": "bucket", "bucket_id": "bkt_products" },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "image": "payload.image_url"
      },
      "parameters": {
        "run_multimodal_embedding": true,
        "run_ocr": true,
        "run_video_description": true,
        "description_prompt": "Describe the product in this image including color, material, and style.",
        "response_shape": {
          "type": "object",
          "properties": {
            "product_type": { "type": "string" },
            "color": { "type": "string" },
            "material": { "type": "string" },
            "brand_visible": { "type": "boolean" },
            "text_on_product": { "type": "string" }
          }
        }
      }
    }
  }'
```
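Because the LLM output should follow response_shape, a quick client-side check can catch missing or mistyped fields before the document is used downstream. A minimal sketch, with field names taken from the schema above:

```python
# Minimal client-side check (illustration only) that a returned document
# carries the fields declared in response_shape, with matching Python types.
RESPONSE_SHAPE = {
    "product_type": str,
    "color": str,
    "material": str,
    "brand_visible": bool,
    "text_on_product": str,
}

def matches_shape(doc: dict) -> bool:
    """True if every declared field is present with the expected type."""
    return all(isinstance(doc.get(k), t) for k, t in RESPONSE_SHAPE.items())

doc = {"product_type": "handbag", "color": "red", "material": "leather",
       "brand_visible": True, "text_on_product": "ACME LEATHER CO."}
print(matches_shape(doc))
```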
Output Schema
Each image produces a document like this:
```json
{
  "document_id": "doc_img_456",
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/product_001.jpg",
  "metadata": {
    "product_id": "SKU-12345",
    "brand": "Acme",
    "category": "accessories"
  },
  "image_extractor_v1_embedding": [0.045, -0.012, "...768 floats"]
}
```
When using multimodal_extractor with descriptions and OCR:
```json
{
  "document_id": "doc_img_789",
  "description": "Red leather handbag with gold buckle closure, front pocket with magnetic snap",
  "ocr_text": "ACME LEATHER CO.",
  "product_type": "handbag",
  "color": "red",
  "material": "leather",
  "brand_visible": true,
  "text_on_product": "ACME LEATHER CO.",
  "multimodal_extractor_v1_multimodal_embedding": [0.023, -0.041, "...1408 floats"]
}
```
| Field | Type | Description |
|---|---|---|
| image_extractor_v1_embedding | float[768] | SigLIP visual embedding |
| multimodal_extractor_v1_multimodal_embedding | float[1408] | Vertex AI cross-modal embedding |
| description | string | Gemini-generated image description |
| ocr_text | string | Text extracted from the image |
| thumbnail_url | string | S3 URL of the resized thumbnail (640px width) |
| response_shape fields | varies | Structured fields from LLM extraction |
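Under the hood, visual similarity between documents reduces to comparing their embedding vectors, typically by cosine similarity. A toy sketch with 3-float vectors standing in for the 768-dimensional SigLIP embeddings:

```python
import math

# Cosine similarity between two embedding vectors: the core operation
# behind image-to-image and text-to-image search over SigLIP embeddings.
# Toy 3-float vectors stand in for the real 768D arrays.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.045, -0.012, 0.031]
doc_vec = [0.044, -0.010, 0.030]
print(cosine(query_vec, doc_vec))  # close to 1.0 for near-identical images
```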