[Figure: Cluster visualization showing document groupings with centroids and member assignments]
Clusters group similar documents using configurable algorithms running on the Engine’s Ray workers. They produce reusable artifacts and optional enrichments, and can even feed new taxonomies. Clustering is warehouse-native grouping: the multimodal equivalent of SQL GROUP BY, operating on embedding space rather than discrete column values.

Workflow Overview

  1. Define cluster (POST /v1/clusters) – choose source collections, feature URIs, algorithm, and optional labeling strategy.
  2. Execute – run manually (POST /v1/clusters/{id}/execute) or schedule via triggers (/v1/clusters/triggers/...).
  3. Inspect artifacts – fetch centroids, members, or reduced coordinates (/v1/clusters/{id}/artifacts).
  4. Enrich documents – write cluster_id, labels, and keywords back into collections.
  5. Promote to taxonomy (optional) – convert stable clusters into reference nodes.
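
The numbered steps can be condensed into an endpoint sequence. A minimal sketch (placeholder cluster ID, request bodies and auth headers omitted; step 5 is left out since promotion goes through the taxonomy APIs):

```python
# Sketch: the cluster workflow above as an ordered list of
# (HTTP method, path) pairs. Bodies, headers, and responses are omitted.

def workflow_calls(cluster_id: str) -> list:
    return [
        ("POST", "/v1/clusters"),                          # 1. define
        ("POST", f"/v1/clusters/{cluster_id}/execute"),    # 2. execute
        ("GET",  f"/v1/clusters/{cluster_id}/artifacts"),  # 3. inspect artifacts
        ("POST", f"/v1/clusters/{cluster_id}/enrich"),     # 4. enrich documents
    ]

calls = workflow_calls("cls_abc123")
```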

Configuration Highlights

| Setting | Description |
| --- | --- |
| feature_addresses | One or more feature URIs to cluster on (dense, sparse, multi-vector). |
| algorithm | kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, or optics. |
| dimension_reduction | Optional UMAP / PCA for visualization coordinates. |
| llm_labeling | Generate cluster labels, summaries, and keywords using configured LLM providers. |
| hierarchical | Enable to compute parent-child cluster relationships. |
| sample_size | Run on a subset before clustering the full dataset. |
Example definition:
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "product_topics",
    "source_collection_ids": ["col_products"],
    "feature_addresses": ["mixpeek://text_extractor@v1/text_embedding"],
    "algorithm": "kmeans",
    "algorithm_config": { "num_clusters": 50 },
    "dimension_reduction": {
      "method": "umap",
      "components": 2
    },
    "llm_labeling": {
      "provider": "openai_chat_v1",
      "model": "gpt-4o-mini"
    }
  }'

LLM Labeling

LLM labeling generates human-readable names, summaries, and keywords for each cluster by sending representative documents to an LLM. You control exactly which document fields the LLM sees using input mappings — the same system used by retrievers and taxonomies.

Input Mappings

Each input mapping tells the labeling engine how to extract a value from a document payload and pass it to the LLM. You specify:
| Field | Description |
| --- | --- |
| input_key | The key the LLM receives — e.g. text, image_url, video_url, audio_url. |
| source_type | Where to pull the value from: payload, blob, or literal. |
| path | Dot-notation path into the document (for payload and blob source types). |
| override | Static value (only used with literal source type). |
Without labeling_inputs, the full document payload is serialized as JSON and sent to the LLM. This works but is noisy — input mappings let you send only the fields that matter.
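
To make the mapping semantics concrete, here is a rough sketch of how a mapping might be resolved against a document. The helper name and the space join are illustrative assumptions; the actual labeling engine runs server-side:

```python
# Sketch: resolving input mappings against a document payload.
# resolve_mapping is a hypothetical helper, not part of the API.

def resolve_mapping(doc: dict, mapping: dict):
    """Extract one LLM input value from a document per its mapping."""
    if mapping["source_type"] == "literal":
        return mapping["override"]            # fixed value, document ignored
    value = doc
    for part in mapping["path"].split("."):   # dot-notation traversal
        value = value[int(part)] if part.isdigit() else value[part]
    return value

doc = {"headline": "Summer sale", "metadata": {"description": "50% off"}}
mappings = [
    {"input_key": "text", "source_type": "payload", "path": "headline"},
    {"input_key": "text", "source_type": "payload", "path": "metadata.description"},
]
# Multiple "text" mappings are concatenated; the separator here is illustrative.
text = " ".join(resolve_mapping(doc, m) for m in mappings)
```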

Source Types

payload: Pull a value from any field in the document payload using dot-notation:
{
  "input_key": "text",
  "source_type": "payload",
  "path": "headline"
}
Nested paths work too:
{
  "input_key": "text",
  "source_type": "payload",
  "path": "metadata.description"
}
Multiple text mappings with input_key: "text" are concatenated automatically.
blob: Pull a URL from the document’s stored blobs. Blobs are assets generated during processing — thumbnails, scene frames, source files, etc.
{
  "input_key": "image_url",
  "source_type": "blob",
  "path": "document_blobs.0.url"
}
The path navigates into blob arrays:
  • document_blobs.0.url — first derived blob (e.g. scene thumbnail)
  • source_blobs.0.url — original source file
  • 0.url — first blob from either array (document_blobs checked first)
S3 URLs are automatically presigned so the LLM provider can access them.
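
One plausible reading of that lookup order, as a local sketch (resolve_blob_path is a hypothetical helper; the real resolver runs server-side and presigning is not shown):

```python
# Sketch: resolving a blob path. A bare index path (e.g. "0.url") checks
# document_blobs first, then falls back to source_blobs. Illustrative only.

def resolve_blob_path(doc: dict, path: str):
    parts = path.split(".")
    if parts[0] in ("document_blobs", "source_blobs"):
        arrays = [doc.get(parts[0], [])]      # explicit array named in the path
        parts = parts[1:]
    else:
        # Bare index: document_blobs is checked before source_blobs.
        arrays = [doc.get("document_blobs", []), doc.get("source_blobs", [])]
    idx, field = int(parts[0]), parts[1]
    for arr in arrays:
        if idx < len(arr):
            return arr[idx][field]
    return None

doc = {"document_blobs": [], "source_blobs": [{"url": "s3://bucket/video.mp4"}]}
```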
literal: Pass a fixed value, useful for adding context or instructions:
{
  "input_key": "text",
  "source_type": "literal",
  "override": "Category: marketing videos"
}

Text-Only Labeling

For collections with text content, map the relevant text fields:
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "script_archetypes",
    "source_collection_ids": ["col_ad_scripts"],
    "feature_addresses": ["mixpeek://text_extractor@v1/text_embedding"],
    "algorithm": "kmeans",
    "algorithm_config": { "num_clusters": 20 },
    "llm_labeling": {
      "provider": "openai",
      "model_name": "gpt-4o-mini",
      "labeling_inputs": {
        "input_mappings": [
          { "input_key": "text", "source_type": "payload", "path": "headline" },
          { "input_key": "text", "source_type": "payload", "path": "primary_text" },
          { "input_key": "text", "source_type": "payload", "path": "description" }
        ]
      }
    }
  }'

Multimodal Labeling

Send images or video alongside text for richer labels. Use a vision-capable model like Gemini 2.5 Flash:
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "scene_themes",
    "source_collection_ids": ["col_ad_scenes"],
    "feature_addresses": ["mixpeek://multimodal_extractor@v1/multimodal_embedding"],
    "algorithm": "kmeans",
    "algorithm_config": { "num_clusters": 25 },
    "llm_labeling": {
      "provider": "google",
      "model_name": "gemini-2.5-flash-preview-04-17",
      "labeling_inputs": {
        "input_mappings": [
          { "input_key": "text", "source_type": "payload", "path": "headline" },
          { "input_key": "text", "source_type": "payload", "path": "primary_text" },
          { "input_key": "image_url", "source_type": "blob", "path": "document_blobs.0.url" }
        ]
      }
    }
  }'
Use image_url for scene thumbnails and frames. Use video_url for full video clips. The LLM receives both the text and visual content when generating labels.

Custom Prompts and Response Shapes

Override the default labeling prompt for domain-specific terminology:
{
  "llm_labeling": {
    "provider": "openai",
    "model_name": "gpt-4o",
    "custom_prompt": "You are a creative strategist. Analyze these ad clusters and generate labels that describe the creative archetype (e.g. 'UGC Testimonial', 'Problem-Solution Demo'). Return JSON array: [{\"cluster_id\": \"cl_0\", \"label\": \"...\", \"keywords\": [...]}]",
    "labeling_inputs": {
      "input_mappings": [
        { "input_key": "text", "source_type": "payload", "path": "headline" },
        { "input_key": "text", "source_type": "payload", "path": "primary_text" }
      ]
    }
  }
}
Define a custom response_shape to get structured output beyond the default label/keywords:
{
  "llm_labeling": {
    "response_shape": {
      "label": "string",
      "keywords": ["string"],
      "sentiment": "positive | negative | neutral",
      "target_audience": "string"
    }
  }
}
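
Client-side, you might sanity-check labeled output against the declared shape. A rough sketch under stated assumptions: check_shape is a hypothetical helper, the "string" / ["string"] / pipe-separated-enum conventions are read off the example above, and the server's own validation behavior is not documented here:

```python
# Sketch: checking an LLM labeling result against a response_shape.
# "string" -> str, ["string"] -> list of str, "a | b | c" -> string enum.

def check_shape(result: dict, shape: dict) -> bool:
    for key, spec in shape.items():
        if key not in result:
            return False
        value = result[key]
        if isinstance(spec, list):            # e.g. ["string"]
            if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
                return False
        elif "|" in spec:                     # enum, e.g. "positive | negative | neutral"
            if value not in [opt.strip() for opt in spec.split("|")]:
                return False
        elif not isinstance(value, str):      # plain "string"
            return False
    return True

shape = {
    "label": "string",
    "keywords": ["string"],
    "sentiment": "positive | negative | neutral",
    "target_audience": "string",
}
result = {"label": "UGC Testimonial", "keywords": ["ugc"],
          "sentiment": "positive", "target_audience": "Gen Z shoppers"}
```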

Advanced Settings

| Setting | Default | Description |
| --- | --- | --- |
| max_samples_per_cluster | 5 | Number of representative documents sent to the LLM per cluster. More samples = better labels but higher cost. |
| sample_text_max_length | | Truncate text inputs to this character length. |
| include_summary | true | Generate a longer summary alongside the label. |
| include_keywords | true | Generate keyword tags for each cluster. |
| use_embedding_dedup | false | Deduplicate similar labels across clusters using embedding similarity. |
| embedding_similarity_threshold | 0.8 | Threshold for dedup — labels above this similarity are merged. |
| cache_ttl_seconds | 604800 | Cache labels for this duration (default: 7 days). Set to 0 to disable. |
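
The dedup behavior can be illustrated with a small sketch using cosine similarity on toy 3-dimensional vectors. The merge strategy (keep the earlier label) and the embeddings are illustrative assumptions; the real implementation and embedding model live server-side:

```python
import math

# Sketch: merging near-duplicate cluster labels by cosine similarity,
# as use_embedding_dedup does with embedding_similarity_threshold (default 0.8).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def dedup_labels(labels: dict, embeddings: dict, threshold: float = 0.8) -> dict:
    """Map each cluster to a canonical label, merging pairs above the threshold."""
    canonical = {}
    kept = []  # (cluster_id, embedding) of labels kept so far
    for cid, label in labels.items():
        for kept_cid, kept_emb in kept:
            if cosine(embeddings[cid], kept_emb) >= threshold:
                canonical[cid] = labels[kept_cid]   # merge into the earlier label
                break
        else:
            canonical[cid] = label
            kept.append((cid, embeddings[cid]))
    return canonical

labels = {"cl_0": "Product demos", "cl_1": "Product demonstrations", "cl_2": "Pricing FAQs"}
embeddings = {"cl_0": [1.0, 0.1, 0.0], "cl_1": [0.9, 0.2, 0.0], "cl_2": [0.0, 0.1, 1.0]}
merged = dedup_labels(labels, embeddings)
```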

Execution & Triggers

  • Manual run: POST /v1/clusters/{id}/execute
  • Submit asynchronous job: POST /v1/clusters/{id}/execute/submit
  • Automated triggers: create cron, interval, or event-based triggers under /v1/clusters/triggers. Execution history is accessible via trigger endpoints.
  • Every run yields a run_id, exposes status via /v1/clusters/{id}/executions, and can be monitored through task polling.
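
Task polling from the last bullet can be sketched as follows. fetch_status is a stand-in for a GET against the executions endpoint, and the status strings are illustrative assumptions, not documented values:

```python
import time

# Sketch: poll a cluster run until it reaches a terminal state.
# fetch_status is injected so the loop is independent of HTTP details; in
# practice it would GET /v1/clusters/{id}/executions and read the run's status.

def wait_for_run(fetch_status, run_id: str, interval_s: float = 0.0, max_polls: int = 100) -> str:
    for _ in range(max_polls):
        status = fetch_status(run_id)
        if status in ("completed", "failed"):   # assumed terminal states
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"run {run_id} still pending after {max_polls} polls")

# Stubbed status sequence for illustration:
_states = iter(["pending", "running", "completed"])
final = wait_for_run(lambda run_id: next(_states), "run_xyz789")
```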

Artifacts

| Artifact | Endpoint | Contents |
| --- | --- | --- |
| Centroids | /executions/{run_id}/artifacts?include_centroids=true | Cluster ID, centroid vectors, counts, labels, summaries, keywords |
| Members | /executions/{run_id}/artifacts?include_members=true | Point IDs, reduced coordinates (x, y, z), cluster assignment |
| Streaming data | /executions/{run_id}/data | Stream centroids and members (Parquet-backed) for visualization |
Artifacts are stored as Parquet in S3 for efficient downstream analytics and visualization.

Enrichment

Apply cluster membership back to collections:
curl -sS -X POST "$MP_API_URL/v1/clusters/{cluster_id}/enrich" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "run_id": "run_xyz789",
    "target_collection_id": "col_products_enriched",
    "fields": ["cluster_id", "label", "summary", "keywords"]
  }'
Enrichment writes cluster_id (and optionally labels/summaries) into document payloads, enabling cluster-based filters and facets.
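
What enrichment writes can be pictured as a payload merge. A local sketch only: the field names follow the request above, but the helper, the assignment map, and the sample data are illustrative, and the actual merge happens inside the Engine:

```python
# Sketch: merging cluster results into document payloads, mirroring what
# enrichment writes back (cluster_id plus the requested label fields).

def enrich_payloads(docs, assignments, cluster_meta, fields):
    """Return new payloads with the requested cluster fields merged in."""
    enriched = []
    for doc in docs:
        cid = assignments[doc["document_id"]]
        extra = {"cluster_id": cid}
        extra.update({f: cluster_meta[cid][f] for f in fields if f != "cluster_id"})
        enriched.append({**doc, **extra})
    return enriched

docs = [{"document_id": "doc_1", "headline": "Summer sale"}]
assignments = {"doc_1": "cl_7"}
cluster_meta = {"cl_7": {"label": "Seasonal promos", "summary": "Discount-led ads.",
                         "keywords": ["sale", "discount"]}}
out = enrich_payloads(docs, assignments, cluster_meta, ["cluster_id", "label", "keywords"])
```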

Monitoring & Management

  • GET /v1/clusters/{id} – inspect definition, latest run, enrichment status.
  • POST /v1/clusters/list – search and filter cluster definitions.
  • GET /v1/clusters/{id}/executions – view execution history and metrics.
  • DELETE /v1/clusters/{id} – remove obsolete definitions (artifacts remain unless deleted separately).
  • Webhooks notify you when clustering jobs complete; integrate with alerting or automation.

Best Practices

  1. Prototype on samples – tune algorithm parameters using a small sample_size before running at scale.
  2. Automate freshness – use triggers (cron or event-based) to keep clusters aligned with new data.
  3. Label efficiently – enable LLM labeling once clusters look coherent; store labels with confidence scores.
  4. Close the loop – evaluate clusters as candidate taxonomy nodes or enrichments for retrievers.
  5. Watch metrics – use execution statistics (duration, member counts) to detect drift or parameter issues.
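Practice 1 might look like the earlier product_topics definition with a sampling cap added for a trial run. An illustrative payload; the exact semantics of sample_size (document count vs. fraction) are not specified here:

```json
{
  "cluster_name": "product_topics_prototype",
  "source_collection_ids": ["col_products"],
  "feature_addresses": ["mixpeek://text_extractor@v1/text_embedding"],
  "algorithm": "kmeans",
  "algorithm_config": { "num_clusters": 50 },
  "sample_size": 5000
}
```

Once labels and member counts look coherent on the sample, drop sample_size and re-execute against the full dataset.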
Clustering is the bridge between raw embeddings and structured understanding. Use it to discover themes, power analytics, and bootstrap taxonomies that feed retrieval.