Documentation Index

Fetch the complete documentation index at: https://docs.mixpeek.com/docs/llms.txt

Use this file to discover all available pages before exploring further.

Collections

Collections bind a bucket to a feature extractor. When you submit a batch, the engine runs the extractor against each object and produces searchable documents.
curl -X POST "https://api.mixpeek.com/v1/collections" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-embeddings",
    "source": { "type": "bucket", "bucket_id": "'$BUCKET_ID'" },
    "feature_extractor": {
      "feature_extractor_name": "multimodal_extractor",
      "version": "v1",
      "input_mappings": { "image": "payload.hero_image", "text": "payload.product_text" }
    }
  }'
A single object can feed multiple collections — each running a different extractor. Documents retain lineage to the source object via root_object_id. Collection API →
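Because every document carries root_object_id, output from multiple collections can be traced back to the same source object. A minimal sketch of that lineage join — the document dicts below are illustrative, not the exact API response shape:

```python
from collections import defaultdict

def group_by_source(documents):
    """Group documents from any number of collections by their source object."""
    lineage = defaultdict(list)
    for doc in documents:
        lineage[doc["root_object_id"]].append(doc["document_id"])
    return dict(lineage)

# One object feeding two collections yields two documents with the same root:
docs = [
    {"document_id": "doc_a", "collection": "product-embeddings", "root_object_id": "obj_1"},
    {"document_id": "doc_b", "collection": "product-clusters",   "root_object_id": "obj_1"},
]
print(group_by_source(docs))  # {'obj_1': ['doc_a', 'doc_b']}
```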

Embedding Task

Instruction-aware embedding models (E5, Gemini) use a task hint to optimize the embedding for a specific downstream use case. Set embedding_task at the collection level so it applies to every task-aware model in the pipeline.
curl -X POST "https://api.mixpeek.com/v1/collections" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -H "X-Namespace: $NAMESPACE_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-clusters",
    "embedding_task": "clustering",
    "source": { "type": "bucket", "bucket_ids": ["'$BUCKET_ID'"] },
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": { "text": "product_text" }
    }
  }'
Task                | Use Case                                     | Default
retrieval_document  | Search: find documents from queries          | Yes
retrieval_query     | Rare at index time — query-side is automatic | No
semantic_similarity | Symmetric comparison (dedup, matching)       | No
classification      | Document categorization pipelines            | No
clustering          | Grouping documents into clusters             | No
You almost never need to set this. The default retrieval_document is correct for search, and at query time Mixpeek automatically uses retrieval_query. Only override for clustering, classification, or symmetric similarity.
Non-instruction-aware models (SigLIP, CLIP, Vertex multimodal) ignore this setting.

Feature URIs

Every extracted feature is addressed by a URI that pins it to a specific extractor version:
mixpeek://{extractor_name}@{version}/{output_name}
Examples:
  • mixpeek://multimodal_extractor@v1/multimodal_embedding
  • mixpeek://text_extractor@v1/text_embedding
  • mixpeek://face_detector@v1/face_embedding
Feature URIs are referenced by retriever stages, taxonomies, and clustering jobs. They guarantee query-time compatibility with the extraction pipeline — swap the URI, re-embed, and everything downstream stays consistent.
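The URI grammar above is simple enough to parse and build locally. A small sketch — these helper functions are illustrative, not part of the Mixpeek SDK:

```python
import re

# mixpeek://{extractor_name}@{version}/{output_name}
URI_PATTERN = re.compile(
    r"^mixpeek://(?P<extractor>[\w-]+)@(?P<version>[\w.-]+)/(?P<output>[\w-]+)$"
)

def parse_feature_uri(uri):
    """Split a feature URI into extractor, version, and output components."""
    match = URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a valid feature URI: {uri}")
    return match.groupdict()

def build_feature_uri(extractor, version, output):
    """Assemble a feature URI pinned to a specific extractor version."""
    return f"mixpeek://{extractor}@{version}/{output}"

parsed = parse_feature_uri("mixpeek://text_extractor@v1/text_embedding")
# {'extractor': 'text_extractor', 'version': 'v1', 'output': 'text_embedding'}
```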

Tiered Pipelines

When a batch is submitted, the engine runs a DAG of extractors:
  1. Tier 1 collections process raw objects from the bucket
  2. Tier 2 collections consume Tier 1 documents as input
  3. Each tier waits for dependencies before executing
video → scenes (Tier 1) → faces per scene (Tier 2) → expressions per face (Tier 3)
Collections define the pipeline through their source and feature_extractor configuration. Dependencies are resolved automatically.
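The engine resolves these dependencies for you; the sketch below only illustrates how tiered execution falls out of a topological sort over collection sources. The dependency map mirrors the video example above and is hypothetical, not an API object:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Collection name -> collections it consumes as input.
# An empty list means Tier 1: the collection reads raw objects from the bucket.
pipeline = {
    "scenes":      [],           # Tier 1: scene detection on raw video
    "faces":       ["scenes"],   # Tier 2: faces per scene document
    "expressions": ["faces"],    # Tier 3: expressions per face document
}

def execution_tiers(deps):
    """Group collections into tiers; each tier runs after its dependencies finish."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    tiers = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # every collection whose inputs are done
        tiers.append(ready)
        ts.done(*ready)
    return tiers

print(execution_tiers(pipeline))  # [['scenes'], ['faces'], ['expressions']]
```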

Built-in Extractors

Extractor         | Modality            | Output
Multimodal        | Video, image, audio | Vertex AI 1408D embeddings, transcripts, scene descriptions
Text              | Text                | E5-Large 1024D embeddings
Image             | Image               | SigLIP 768D embeddings, descriptions, structured extraction
Face Identity     | Video, image        | ArcFace 512D face embeddings, bounding boxes
Document          | PDF, DOCX           | Text chunks, OCR, embeddings
Gemini Multi-file | Any                 | Gemini-powered cross-file analysis
Web Scraper       | URLs                | Scraped text content + embeddings
Course Content    | Video               | Lecture segments, slides, transcripts
Passthrough       | Any                 | Forward metadata without extraction
See the full Extractor Reference for configuration details.

Custom Extractors

For extraction logic beyond built-in models, build custom extractors:
pip install mixpeek
mixpeek plugin init my-extractor     # Scaffold from template
mixpeek plugin test my-extractor     # Validate locally
mixpeek plugin publish my-extractor  # Upload and deploy
Custom extractors run on managed infrastructure with access to GPU/CPU resources, HuggingFace models, and LLM services. They support batch processing, real-time endpoints, and custom model loading. See the full extractors guide for manifest format, pipeline hooks, security constraints, and deployment lifecycle. Extractor API →