Mixpeek extracts visual embeddings, OCR text, descriptions, and structured metadata from images. Each image becomes a document with dense vector indexes for visual similarity search, text-to-image search, and filtered retrieval.

What Gets Extracted

| Feature | Model | Dimensions | Extractor |
| --- | --- | --- | --- |
| Visual embeddings (image-only) | SigLIP | 768D | image_extractor |
| Visual embeddings (cross-modal) | Vertex AI multimodal | 1408D | multimodal_extractor |
| OCR text | Gemini | | multimodal_extractor |
| Image descriptions | Gemini | | multimodal_extractor |
| Face embeddings | ArcFace (SCRFD detection) | 512D | face_identity_extractor |
| Thumbnails | FFmpeg | | image_extractor, multimodal_extractor |

Choosing an Extractor

| Goal | Extractor | Why |
| --- | --- | --- |
| Visual similarity search (image-to-image) | image_extractor | SigLIP 768D embeddings, fast (~50ms/image), supports cross-modal text queries |
| Cross-modal search (text-to-image, image-to-video) | multimodal_extractor | Vertex AI 1408D unified embedding space across video, image, and text |
| OCR or image descriptions | multimodal_extractor | Gemini-based text extraction and description generation |
| Face detection and matching | face_identity_extractor | ArcFace 512D with 99.8% verification accuracy |
| Structured extraction (products, labels) | multimodal_extractor with response_shape | LLM extracts structured JSON from image content |
Use image_extractor when you only need image search. Use multimodal_extractor when you need images searchable alongside video or text in the same embedding space.

Create a Collection for Images

This collection generates SigLIP embeddings and thumbnails for an image catalog.
```bash
curl -X POST https://api.mixpeek.com/v1/collections \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-images",
    "source": { "type": "bucket", "bucket_id": "bkt_products" },
    "feature_extractor": {
      "feature_extractor_name": "image_extractor",
      "version": "v1",
      "input_mappings": {
        "image": "payload.image_url"
      },
      "field_passthrough": [
        { "source_path": "metadata.product_id" },
        { "source_path": "metadata.brand" },
        { "source_path": "metadata.category" }
      ],
      "parameters": {
        "enable_thumbnails": true
      }
    }
  }'
```
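If you are building this request from code rather than the shell, the same payload can be assembled and sent with only the Python standard library. A minimal sketch: the endpoint, headers, and payload mirror the curl example above; the function names and env-var fallbacks are illustrative, not part of the Mixpeek SDK.

```python
import json
import os
import urllib.request


def build_collection_payload(bucket_id: str) -> dict:
    # Same body as the curl example above.
    return {
        "collection_name": "product-images",
        "source": {"type": "bucket", "bucket_id": bucket_id},
        "feature_extractor": {
            "feature_extractor_name": "image_extractor",
            "version": "v1",
            "input_mappings": {"image": "payload.image_url"},
            "field_passthrough": [
                {"source_path": "metadata.product_id"},
                {"source_path": "metadata.brand"},
                {"source_path": "metadata.category"},
            ],
            "parameters": {"enable_thumbnails": True},
        },
    }


def create_collection_request(payload: dict) -> urllib.request.Request:
    # Builds the request; send it with urllib.request.urlopen(req).
    return urllib.request.Request(
        "https://api.mixpeek.com/v1/collections",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('MIXPEEK_API_KEY', '')}",
            "X-Namespace": os.environ.get("NAMESPACE", ""),
            "Content-Type": "application/json",
        },
        method="POST",
    )
```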
Create a retriever and execute it with a text query. SigLIP’s shared text-image embedding space lets you search images with natural language.
```bash
curl -X POST https://api.mixpeek.com/v1/retrievers \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "retriever_name": "image-search",
    "collection_ids": ["col_product_images"],
    "input_schema": {
      "properties": {
        "query": { "type": "text", "required": true }
      }
    },
    "stages": [
      {
        "stage_name": "visual_search",
        "stage_type": "filter",
        "config": {
          "stage_id": "feature_search",
          "parameters": {
            "query": "{{INPUT.query}}",
            "top_k": 20
          }
        }
      }
    ]
  }'
```
Execute a text-to-image search:
```bash
curl -X POST https://api.mixpeek.com/v1/retrievers/ret_abc123/execute \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": { "query": "red leather handbag with gold buckle" },
    "limit": 10
  }'
```

Structured Extraction from Images

Use multimodal_extractor with response_shape to extract structured product metadata from images.
```bash
curl -X POST https://api.mixpeek.com/v1/collections \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-catalog-enriched",
    "source": { "type": "bucket", "bucket_id": "bkt_products" },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": {
        "image": "payload.image_url"
      },
      "parameters": {
        "run_multimodal_embedding": true,
        "run_ocr": true,
        "run_video_description": true,
        "description_prompt": "Describe the product in this image including color, material, and style.",
        "response_shape": {
          "type": "object",
          "properties": {
            "product_type": { "type": "string" },
            "color": { "type": "string" },
            "material": { "type": "string" },
            "brand_visible": { "type": "boolean" },
            "text_on_product": { "type": "string" }
          }
        }
      }
    }
  }'
```
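Because response_shape uses JSON Schema-style type names, extracted documents can be sanity-checked client-side before you trust LLM output downstream. A minimal hand-rolled checker, not Mixpeek functionality; a full JSON Schema validator would be the heavier-weight alternative.

```python
# Map JSON Schema type names to Python types (subset used in response_shape).
TYPE_MAP = {"string": str, "boolean": bool, "number": (int, float), "integer": int}


def check_shape(doc: dict, response_shape: dict) -> list[str]:
    """Return the names of fields whose values don't match the declared type."""
    errors = []
    for field, spec in response_shape.get("properties", {}).items():
        expected = TYPE_MAP.get(spec.get("type"))
        if field in doc and expected and not isinstance(doc[field], expected):
            errors.append(field)
    return errors
```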

Output Schema

Each image produces a document like this:
```json
{
  "document_id": "doc_img_456",
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/product_001.jpg",
  "metadata": {
    "product_id": "SKU-12345",
    "brand": "Acme",
    "category": "accessories"
  },
  "image_extractor_v1_embedding": [0.045, -0.012, "...768 floats"]
}
```
When using multimodal_extractor with descriptions and OCR:
```json
{
  "document_id": "doc_img_789",
  "description": "Red leather handbag with gold buckle closure, front pocket with magnetic snap",
  "ocr_text": "ACME LEATHER CO.",
  "product_type": "handbag",
  "color": "red",
  "material": "leather",
  "brand_visible": true,
  "text_on_product": "ACME LEATHER CO.",
  "multimodal_extractor_v1_multimodal_embedding": [0.023, -0.041, "...1408 floats"]
}
```
| Field | Type | Description |
| --- | --- | --- |
| image_extractor_v1_embedding | float[768] | SigLIP visual embedding |
| multimodal_extractor_v1_multimodal_embedding | float[1408] | Vertex AI cross-modal embedding |
| description | string | Gemini-generated image description |
| ocr_text | string | Text extracted from the image |
| thumbnail_url | string | S3 URL of the resized thumbnail (640px width) |
| response_shape fields | varies | Structured fields from LLM extraction |
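Visual similarity over these embedding fields reduces to cosine similarity between vectors. A self-contained sketch in pure Python, using toy 4-D vectors in place of the 768-D SigLIP embeddings (the document IDs are the ones from the examples above):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for image_extractor_v1_embedding values.
query = [0.1, 0.3, 0.5, 0.2]
candidates = {
    "doc_img_456": [0.1, 0.29, 0.52, 0.18],   # nearly identical to the query
    "doc_img_789": [-0.4, 0.1, -0.2, 0.9],    # visually unrelated
}
ranked = sorted(
    candidates, key=lambda d: cosine_similarity(query, candidates[d]), reverse=True
)
```

In production this ranking happens inside the vector index; the point is only that a higher cosine score means closer visual similarity in the shared embedding space.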