The image extractor generates dense 768-dimensional vector embeddings from images using Google’s SigLIP model. It is optimized for visual similarity search, product matching, and cross-modal search with text queries. It is fast (~50-100 ms per image) and cost-effective.
Pipeline Steps
1. Filter Dataset (if collection_id provided)
   - Filter to specified collection
2. Detect Content Types
   - Sample 100 rows to identify images vs PDFs
3. PDF Page Expansion (conditional: if PDF content detected)
   - Render each PDF page at 72 DPI using PyMuPDF
   - Create separate image for each page
4. SigLIP Image Embedding Generation
   - Resize to 224×224 internally
   - GPU-accelerated inference
   - Generate 768D visual embeddings
5. Thumbnail Generation (conditional: if enable_thumbnails=true)
   - Resize to 640px width at 85% quality
   - Upload to S3 with optional CDN
6. Output
   - Image/page documents with embeddings
   - Optional thumbnail URLs
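As a rough illustration of steps 2 and 5, the sketch below samples up to 100 rows to count images versus PDFs and computes a 640px-wide thumbnail size. The source does not say how content types are actually detected or how thumbnail height is chosen, so the extension-based check, the aspect-ratio-preserving resize, and the helper names (`classify`, `detect_content_types`, `thumbnail_size`) are all assumptions made for this example.

```python
import random
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: content type is inferred from the file extension.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".gif"}


def classify(url: str) -> str:
    """Classify a URL or S3 path as 'image', 'pdf', or 'unknown'."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTS:
        return "image"
    return "unknown"


def detect_content_types(urls, sample_size=100, seed=0):
    """Sample up to `sample_size` rows and count the content types found."""
    rng = random.Random(seed)
    sample = urls if len(urls) <= sample_size else rng.sample(urls, sample_size)
    return Counter(classify(u) for u in sample)


def thumbnail_size(width, height, target_width=640):
    """Scale to `target_width`, preserving aspect ratio (an assumption)."""
    scale = target_width / width
    return target_width, max(1, round(height * scale))


counts = detect_content_types([
    "s3://my-bucket/products/laptop-pro.jpg",
    "s3://my-bucket/manuals/laptop-pro.pdf",
    "https://cdn.example.com/photos/sunset-beach.PNG?v=1",
])
needs_pdf_expansion = counts["pdf"] > 0  # step 3 would run only in this case
```

With these inputs, `counts` reports two images and one PDF, so the conditional PDF expansion step would be triggered.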
When to Use
| Use Case | Description |
| --- | --- |
| Image search | Find visually similar images in large collections |
| Visual similarity | Match products, artwork, or content by appearance |
| Content discovery | Recommend similar visual content |
| Cross-modal search | Find images using text queries (via SigLIP text encoder) |
| E-commerce | Product image search and visual recommendations |
| Stock photo search | Media library search by visual content |
When NOT to Use
| Scenario | Recommended Alternative |
| --- | --- |
| Face recognition | face_identity_extractor |
| Video content | multimodal_extractor |
| Text-heavy images requiring OCR | multimodal_extractor with OCR enabled |
| Audio content | audio_extractor |
Input Schema
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| image | string | Yes | URL or S3 path to image file. Formats: JPEG, PNG, WebP, BMP. Any resolution (resized to 224x224 internally). |
{
  "image": "s3://my-bucket/products/laptop-pro.jpg"
}
Input Examples:
| Type | Example |
| --- | --- |
| Product image | s3://my-bucket/products/laptop-pro.jpg |
| Stock photo | https://cdn.example.com/photos/sunset-beach.jpg |
| Catalog image | s3://catalog/items/SKU-12345.png |
Supported Formats: JPEG, PNG, WebP, BMP, GIF (static)
Recommended Resolution: 224x224 or larger (automatically resized)
Max File Size: 10MB recommended
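A client-side pre-check can catch obviously bad inputs before they are submitted. The following is a minimal sketch, not part of the Mixpeek API: the function name `check_image_input`, the accepted URL schemes, the extension-based format check, and treating 10MB as 10 × 1024 × 1024 bytes are all assumptions for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions matching the supported formats listed above.
SUPPORTED_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".gif"}
# Assumption: "URL or S3 path" means http(s) or s3 schemes.
ALLOWED_SCHEMES = {"http", "https", "s3"}
# Assumption: 10MB interpreted as 10 MiB; it is a recommendation, not a hard limit.
RECOMMENDED_MAX_BYTES = 10 * 1024 * 1024


def check_image_input(doc, size_bytes=None):
    """Return a list of problems found in an input document (empty if none)."""
    image = doc.get("image")
    if not isinstance(image, str) or not image:
        return ["'image' is required and must be a non-empty string"]
    problems = []
    parsed = urlparse(image)
    if parsed.scheme not in ALLOWED_SCHEMES:
        problems.append(f"unsupported scheme: {parsed.scheme!r}")
    ext = PurePosixPath(parsed.path).suffix.lower()
    if ext not in SUPPORTED_EXTS:
        problems.append(f"unexpected extension: {ext!r}")
    if size_bytes is not None and size_bytes > RECOMMENDED_MAX_BYTES:
        problems.append("file exceeds the recommended 10MB")
    return problems


ok = check_image_input({"image": "s3://my-bucket/products/laptop-pro.jpg"})
bad = check_image_input({"image": "ftp://host/scan.tiff"})
```

Here `ok` is an empty list, while `bad` reports both the scheme and the extension. The service itself may inspect file contents rather than extensions, so treat this only as an early warning.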
Output Schema
| Field | Type | Description |
| --- | --- | --- |
| image_extractor_v1_embedding | float[768] | SigLIP image embedding, L2 normalized |
| processing_time_ms | number | Processing time in milliseconds |
| thumbnail_url | string | S3 URL of the thumbnail image (if generated) |
{
  "image_extractor_v1_embedding": [0.023, -0.041, 0.018, ...],
  "processing_time_ms": 85.2,
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/thumb_001.jpg"
}
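Because the embedding is documented as a 768-element, L2-normalized vector, a consumer can sanity-check documents before indexing them elsewhere. This is a small sketch under that reading of the schema; the helper name `check_embedding` and the tolerance value are assumptions.

```python
import math

EMBEDDING_FIELD = "image_extractor_v1_embedding"
EXPECTED_DIM = 768


def check_embedding(doc, tol=1e-3):
    """Validate dimension and unit length; return the vector if it passes."""
    vec = doc[EMBEDDING_FIELD]
    if len(vec) != EXPECTED_DIM:
        raise ValueError(f"expected {EXPECTED_DIM} values, got {len(vec)}")
    norm = math.sqrt(sum(x * x for x in vec))
    if abs(norm - 1.0) > tol:  # tolerance is an arbitrary choice for this sketch
        raise ValueError(f"embedding is not L2 normalized (norm={norm:.4f})")
    return vec


# Synthetic unit vector standing in for a real extractor output.
sample_doc = {EMBEDDING_FIELD: [1 / math.sqrt(EXPECTED_DIM)] * EXPECTED_DIM}
vector = check_embedding(sample_doc)
```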
Parameters
The image extractor uses sensible defaults and requires no additional parameters for basic usage.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| None required | - | - | All parameters use optimized defaults |
Configuration Examples
Basic Image Embedding
{
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "version": "v1",
    "input_mappings": {
      "image": "payload.image_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.product_id" }
    ],
    "parameters": {}
  }
}
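If configurations are generated programmatically, the JSON above can be assembled from a small helper. The sketch below only builds and prints the payload; `build_image_extractor_config` is a hypothetical helper name, and no Mixpeek client or endpoint is assumed.

```python
import json


def build_image_extractor_config(image_path, passthrough_paths=()):
    """Build the feature extractor config shown above as a Python dict."""
    return {
        "feature_extractor": {
            "feature_extractor_name": "image_extractor",
            "version": "v1",
            # Maps the extractor's `image` input to a field in the source object.
            "input_mappings": {"image": image_path},
            # Fields copied through to the output documents unchanged.
            "field_passthrough": [{"source_path": p} for p in passthrough_paths],
            # No parameters are required; defaults are used.
            "parameters": {},
        }
    }


config = build_image_extractor_config(
    "payload.image_url", ["metadata.product_id"]
)
print(json.dumps(config, indent=2))
```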
| Metric | Value |
| --- | --- |
| Processing speed | ~50-100ms per image |
| Batch processing | Up to 16 images per batch |
| GPU acceleration | Supported for faster inference |
| Cost | 2 credits per image |
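These figures make it easy to budget a job in advance. The sketch below is simple arithmetic on the numbers above; the time estimate assumes strictly sequential processing at 50-100 ms per image, which ignores any batching or parallelism and is therefore only a rough upper-range guide. The function name `estimate_job` is hypothetical.

```python
import math

CREDITS_PER_IMAGE = 2    # cost per image
MAX_BATCH_SIZE = 16      # images per batch
LATENCY_MS = (50, 100)   # approximate per-image processing time


def estimate_job(num_images):
    """Estimate batches, credits, and sequential processing time."""
    return {
        "batches": math.ceil(num_images / MAX_BATCH_SIZE),
        "credits": num_images * CREDITS_PER_IMAGE,
        # Assumption: images are processed one after another.
        "sequential_seconds": tuple(num_images * ms / 1000 for ms in LATENCY_MS),
    }


estimate = estimate_job(10_000)
print(estimate)
```

For 10,000 images this gives 625 batches, 20,000 credits, and roughly 500 to 1,000 seconds of sequential processing.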
Vector Index
| Property | Value |
| --- | --- |
| Index name | image_extractor_v1_embedding |
| Dimensions | 768 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Inference model | google_siglip_base_v1 |
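The dimensions and datatype determine the raw storage footprint of the index. As a back-of-the-envelope sketch (raw vector bytes only; index structures, metadata, and replication are not included and would add overhead):

```python
DIMENSIONS = 768
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors):
    """Raw storage for `num_vectors` float32 embeddings, excluding overhead."""
    return num_vectors * DIMENSIONS * BYTES_PER_FLOAT32


per_vector = raw_vector_bytes(1)             # 3072 bytes per image
per_million = raw_vector_bytes(1_000_000)    # 3,072,000,000 bytes
print(f"{per_million / 1024**3:.2f} GiB per million images")
```

That works out to 3,072 bytes per embedding, or about 2.86 GiB of raw vectors per million images.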
Cross-Modal Search
The SigLIP embeddings are compatible with SigLIP text embeddings, enabling cross-modal search where you can:
- Find images using natural language text queries
- Match images to text descriptions
- Build hybrid search combining visual and textual similarity
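To make the idea concrete, the sketch below ranks images against a text query by cosine similarity. It assumes you already have a SigLIP text embedding for the query and that all vectors are L2 normalized, in which case the dot product equals the cosine similarity. The 3-dimensional vectors and image IDs are toy values for readability (real embeddings have 768 dimensions), and `rank_images` is a hypothetical helper, not a Mixpeek API.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_images(query_vec, image_vecs, top_k=3):
    """Return (image_id, score) pairs sorted by descending similarity."""
    scored = [(image_id, dot(query_vec, vec)) for image_id, vec in image_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy unit vectors standing in for SigLIP embeddings.
text_query = [1.0, 0.0, 0.0]          # e.g. the embedding of a text query
images = {
    "beach.jpg": [0.8, 0.6, 0.0],
    "laptop.jpg": [0.0, 1.0, 0.0],
    "sunset.jpg": [0.6, 0.0, 0.8],
}

results = rank_images(text_query, images)
```

In this toy data, `beach.jpg` ranks first, followed by `sunset.jpg` and then `laptop.jpg`. In practice the vector index performs this ranking at scale; the loop here only shows what is being compared.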
| Feature | image_extractor | multimodal_extractor |
| --- | --- | --- |
| Dimensions | 768 | 1408 |
| Model | SigLIP | Vertex AI Multimodal |
| Processing | Image only | Video, Image, Text, GIF |
| Cross-modal | SigLIP text encoder | Vertex text encoder |
| Best For | Fast image search | Unified multimodal search |
| Cost | 2 credits/image | Higher (includes more features) |
Limitations
- Image only: Does not process video, audio, or text content
- No OCR: Cannot extract text from images; use multimodal_extractor with OCR
- No face recognition: For face matching, use face_identity_extractor
- Single image: Processes one image at a time (batch via API)
- Resolution: Input is resized to 224x224 internally