Clusters

[Figure: cluster visualization showing document groupings with centroids and member assignments]
Clusters group similar documents using configurable algorithms running on the Engine’s Ray workers. They produce reusable artifacts, optional enrichments, and can even feed new taxonomies. Clustering is warehouse-native grouping: the multimodal equivalent of SQL GROUP BY, operating in embedding space rather than on discrete column values.

Workflow Overview

  1. Define cluster (POST /v1/clusters) – choose source collections, feature URIs, algorithm, and optional labeling strategy.
  2. Execute – run manually (POST /v1/clusters/{id}/execute) or schedule via triggers (/v1/clusters/triggers/...).
  3. Inspect artifacts – fetch centroids, members, or reduced coordinates (/v1/clusters/{id}/artifacts).
  4. Enrich documents – write cluster_id, labels, and keywords back into collections.
  5. Promote to taxonomy (optional) – convert stable clusters into reference nodes.

Configuration Highlights

  • feature_addresses – One or more feature URIs to cluster on (dense, sparse, or multi-vector).
  • algorithm – kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, or optics.
  • preprocessing_steps – Ordered preprocessing before clustering: whitening, UMAP reduction, or both chained.
  • hierarchical – Enable recursive sub-clustering within each cluster for multi-level grouping.
  • llm_labeling – Generate cluster labels, summaries, and keywords using configured LLM providers.
  • sample_size – Run on a subset before clustering the full dataset.
Example definition:
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "product_topics",
    "collection_ids": ["col_products"],
    "cluster_type": "vector",
    "vector_config": {
      "feature_uri": "mixpeek://text_extractor@v1/text_embedding",
      "clustering_method": "hdbscan",
      "algorithm_params": { "min_cluster_size": 10, "min_samples": 5 },
      "preprocessing_steps": [
        { "method": "whitening" },
        { "method": "umap", "n_components": 50, "n_neighbors": 30 }
      ]
    },
    "llm_labeling": {
      "enabled": true,
      "provider": "openai",
      "model_name": "gpt-4o-mini"
    }
  }'

Preprocessing Steps

High-dimensional embeddings (e.g., 1408 or 3072 dimensions) benefit from preprocessing before density-based algorithms like HDBSCAN. The preprocessing_steps field accepts an ordered list of transformations applied before clustering.

Embedding Whitening (Soft-ZCA)

Whitening decorrelates embedding dimensions, removing redundant structure that causes density-based algorithms to over-fragment clusters. ZCA whitening is used because it preserves the original coordinate space.
{
  "preprocessing_steps": [
    { "method": "whitening", "regularization": 1e-5 }
  ]
}

UMAP Pre-Reduction

UMAP reduces dimensionality while preserving local neighborhood structure — critical for HDBSCAN, which suffers from the curse of dimensionality on raw embeddings. Defaults are optimized for clustering: 50 components, cosine metric, 30 neighbors, 0.0 min_dist.
{
  "preprocessing_steps": [
    { "method": "umap", "n_components": 50, "n_neighbors": 30, "min_dist": 0.0, "metric": "cosine" }
  ]
}

Chained Preprocessing

Chain whitening and UMAP together for best results on high-dimensional embeddings. Steps execute in order.
{
  "preprocessing_steps": [
    { "method": "whitening", "regularization": 1e-5 },
    { "method": "umap", "n_components": 50, "n_neighbors": 30, "min_dist": 0.0 }
  ]
}
For raw video/image embeddings above 1000 dimensions, the whitening + UMAP chain typically improves HDBSCAN cluster purity by 15-30% compared to clustering on raw embeddings.

Cosine Similarity Metrics

Every cluster execution automatically computes cosine similarity between each member and its assigned centroid. These metrics appear on both member documents and centroids:
  • cosine_similarity_to_centroid (member) – How well this document fits its cluster (0-1, higher is better).
  • mean_cosine_similarity (centroid) – Average member similarity (cluster cohesion).
  • min_cosine_similarity (centroid) – Weakest member similarity (indicates outliers).
Use these to identify borderline assignments, filter low-confidence members, or set quality thresholds for recluster triggers.
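For example, to pull the weakest-fitting members of a run for review (a sketch: the artifacts endpoint is the one documented under Artifacts below, but the response shape, including a top-level members array, is an assumption):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}/executions/{run_id}/artifacts?include_members=true" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" |
  # keep members whose fit to their centroid is weak (assumed response shape)
  jq '[.members[] | select(.cosine_similarity_to_centroid < 0.5)]'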

Incremental Assignment

After an initial clustering run, assign new documents to existing clusters without re-running the full algorithm. Pass mode: "assign" when executing:
curl -sS -X POST "$MP_API_URL/v1/clusters/{cluster_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "assign",
    "assignment_threshold": 0.5
  }'
  • mode (default "full") – "full" re-clusters from scratch; "assign" uses existing centroids.
  • assignment_threshold (default 0.5) – Minimum cosine similarity to assign to a cluster. Below this, the document is marked as noise (cluster_id = -1).
Assign mode loads centroids from the most recent successful execution. Documents are compared to every centroid using cosine similarity and assigned to the closest match above the threshold. This is O(n·k) for n documents and k centroids, fast enough for continuous ingestion.
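After an assign run, you can check how many documents fell below the threshold by counting noise assignments (a sketch; cluster_id = -1 comes from the table above, while the top-level members array in the response is an assumption):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}/executions/{run_id}/artifacts?include_members=true" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" |
  # noise documents carry cluster_id = -1 (assumed response shape)
  jq '[.members[] | select(.cluster_id == -1)] | length'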

Hierarchical Sub-Clustering

Enable recursive sub-clustering to discover structure within clusters. Each cluster with enough members is further divided using UMAP + HDBSCAN, producing a hierarchy of nested cluster IDs.
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "content_hierarchy",
    "collection_ids": ["col_videos"],
    "cluster_type": "vector",
    "vector_config": {
      "feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
      "clustering_method": "hdbscan",
      "algorithm_params": { "min_cluster_size": 50, "min_samples": 10 },
      "hierarchical": true,
      "max_hierarchy_depth": 3
    }
  }'
Sub-clusters get IDs like cl_0_sub_1_sub_0, and each centroid includes parent_cluster_id, child_cluster_ids, and hierarchy_level fields. Use this for large collections where top-level clusters are too broad (e.g., “sports” → “basketball” → “NBA highlights”).
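To inspect the nesting, you can walk the centroids artifact (a sketch; parent_cluster_id and hierarchy_level are the fields named above, and the top-level centroids array is an assumption):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}/executions/{run_id}/artifacts?include_centroids=true" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" |
  # indent each cluster by its hierarchy_level to render the tree
  jq -r '.centroids[] | "\("  " * .hierarchy_level // "")\(.cluster_id) (parent: \(.parent_cluster_id // "none"))"'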

Quality Metrics

Every execution returns quality metrics that measure cluster separation and cohesion:
  • silhouette_score (-1 to 1) – Overall cluster separation quality. Above 0.5 is good.
  • mean_cosine_to_centroid (0 to 1) – Average assignment confidence across all members.
  • noise_ratio (0 to 1) – Fraction of documents classified as noise.
  • cluster_size_entropy (0 to 1) – Normalized entropy of cluster sizes (1.0 = perfectly balanced).
  • should_recluster (0 or 1) – Automatic recommendation based on metric thresholds.
These are available in the execution results via GET /v1/clusters/{id}/executions. Use them to set up monitoring or trigger reclustering when quality degrades.
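A minimal health check might look like this (a sketch; the endpoint is as documented, but the response is assumed to be a list of runs, newest first, with metrics at the top level):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}/executions" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" |
  # pull the latest run's separation, noise, and recluster flag
  jq '.[0] | {silhouette_score, noise_ratio, should_recluster}'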

LLM Labeling

LLM labeling generates human-readable names, summaries, and keywords for each cluster by sending representative documents to an LLM. You control exactly which document fields the LLM sees using input mappings — the same system used by retrievers and taxonomies.

Input Mappings

Each input mapping tells the labeling engine how to extract a value from a document payload and pass it to the LLM. You specify:
  • input_key – The key the LLM receives, e.g. text, image_url, video_url, audio_url.
  • source_type – Where to pull the value from: payload, blob, or literal.
  • path – Dot-notation path into the document (for payload and blob source types).
  • override – Static value (only used with the literal source type).
Without labeling_inputs, the full document payload is serialized as JSON and sent to the LLM. This works but is noisy — input mappings let you send only the fields that matter.

Source Types

payload

Pull a value from any field in the document payload using dot-notation:
{
  "input_key": "text",
  "source_type": "payload",
  "path": "headline"
}
Nested paths work too:
{
  "input_key": "text",
  "source_type": "payload",
  "path": "metadata.description"
}
Multiple text mappings with input_key: "text" are concatenated automatically.

blob

Pull a URL from the document’s stored blobs. Blobs are assets generated during processing — thumbnails, scene frames, source files, etc.
{
  "input_key": "image_url",
  "source_type": "blob",
  "path": "document_blobs.0.url"
}
The path navigates into blob arrays:
  • document_blobs.0.url — first derived blob (e.g. scene thumbnail)
  • source_blobs.0.url — original source file
  • 0.url — first blob from either array (document_blobs checked first)
S3 URLs are automatically presigned so the LLM provider can access them.

literal

Pass a fixed value, useful for adding context or instructions:
{
  "input_key": "text",
  "source_type": "literal",
  "override": "Category: marketing videos"
}

Text-Only Labeling

For collections with text content, map the relevant text fields:
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "script_archetypes",
    "source_collection_ids": ["col_ad_scripts"],
    "feature_addresses": ["mixpeek://text_extractor@v1/text_embedding"],
    "algorithm": "kmeans",
    "algorithm_config": { "num_clusters": 20 },
    "llm_labeling": {
      "provider": "openai",
      "model_name": "gpt-4o-mini",
      "labeling_inputs": {
        "input_mappings": [
          { "input_key": "text", "source_type": "payload", "path": "headline" },
          { "input_key": "text", "source_type": "payload", "path": "primary_text" },
          { "input_key": "text", "source_type": "payload", "path": "description" }
        ]
      }
    }
  }'

Multimodal Labeling

Send images or video alongside text for richer labels. Use a vision-capable model like Gemini 2.5 Flash:
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "scene_themes",
    "source_collection_ids": ["col_ad_scenes"],
    "feature_addresses": ["mixpeek://multimodal_extractor@v1/multimodal_embedding"],
    "algorithm": "kmeans",
    "algorithm_config": { "num_clusters": 25 },
    "llm_labeling": {
      "provider": "google",
      "model_name": "gemini-2.5-flash-preview-04-17",
      "labeling_inputs": {
        "input_mappings": [
          { "input_key": "text", "source_type": "payload", "path": "headline" },
          { "input_key": "text", "source_type": "payload", "path": "primary_text" },
          { "input_key": "image_url", "source_type": "blob", "path": "document_blobs.0.url" }
        ]
      }
    }
  }'
Use image_url for scene thumbnails and frames. Use video_url for full video clips. The LLM receives both the text and visual content when generating labels.
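A video mapping looks the same with a different input key (a sketch reusing the blob paths described under Source Types; whether the source blob is a playable clip depends on your pipeline):

{
  "input_key": "video_url",
  "source_type": "blob",
  "path": "source_blobs.0.url"
}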

Custom Prompts and Response Shapes

Override the default labeling prompt for domain-specific terminology:
{
  "llm_labeling": {
    "provider": "openai",
    "model_name": "gpt-4o",
    "custom_prompt": "You are a creative strategist. Analyze these ad clusters and generate labels that describe the creative archetype (e.g. 'UGC Testimonial', 'Problem-Solution Demo'). Return JSON array: [{\"cluster_id\": \"cl_0\", \"label\": \"...\", \"keywords\": [...]}]",
    "labeling_inputs": {
      "input_mappings": [
        { "input_key": "text", "source_type": "payload", "path": "headline" },
        { "input_key": "text", "source_type": "payload", "path": "primary_text" }
      ]
    }
  }
}
Define a custom response_shape to get structured output beyond the default label/keywords:
{
  "llm_labeling": {
    "response_shape": {
      "label": "string",
      "keywords": ["string"],
      "sentiment": "positive | negative | neutral",
      "target_audience": "string"
    }
  }
}

Advanced Settings

  • max_samples_per_cluster (default 5) – Number of representative documents sent to the LLM per cluster. More samples = better labels but higher cost.
  • sample_text_max_length – Truncate text inputs to this character length.
  • include_summary (default true) – Generate a longer summary alongside the label.
  • include_keywords (default true) – Generate keyword tags for each cluster.
  • use_embedding_dedup (default false) – Deduplicate similar labels across clusters using embedding similarity.
  • embedding_similarity_threshold (default 0.8) – Threshold for dedup; labels above this similarity are merged.
  • cache_ttl_seconds (default 604800, i.e. 7 days) – Cache labels for this duration. Set to 0 to disable.
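Combined, a labeling config using several of these might look like the following (a sketch; it assumes these settings nest directly under llm_labeling alongside provider and model_name, and the values are illustrative):

{
  "llm_labeling": {
    "provider": "openai",
    "model_name": "gpt-4o-mini",
    "max_samples_per_cluster": 10,
    "sample_text_max_length": 2000,
    "use_embedding_dedup": true,
    "embedding_similarity_threshold": 0.85
  }
}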

Execution & Triggers

  • Manual run: POST /v1/clusters/{id}/execute
  • Submit asynchronous job: POST /v1/clusters/{id}/execute/submit (see the sketch after this list).
  • Automated triggers: create cron, interval, or event-based triggers under /v1/clusters/triggers. Execution history is accessible via trigger endpoints.
  • Every run yields a run_id, exposes status via /v1/clusters/{id}/executions, and can be monitored through task polling.
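For example, submitting an asynchronous run (a sketch; the endpoint is listed above, and the body is assumed to accept the same mode parameter as the synchronous execute call):

curl -sS -X POST "$MP_API_URL/v1/clusters/{cluster_id}/execute/submit" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "mode": "full" }'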

Artifacts

  • Centroids – /executions/{run_id}/artifacts?include_centroids=true – cluster ID, centroid vectors, counts, labels, summaries, keywords.
  • Members – /executions/{run_id}/artifacts?include_members=true – point IDs, reduced coordinates (x, y, z), cluster assignment.
  • Streaming data – /executions/{run_id}/data – stream centroids and members (Parquet-backed) for visualization.
Artifacts are stored as Parquet in S3 for efficient downstream analytics and visualization.
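For example, streaming a run's data for visualization (a sketch; the /data path comes from the table above, and the /v1/clusters/{cluster_id} prefix is an assumption consistent with the other endpoints on this page):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}/executions/{run_id}/data" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"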

3D Visualization

When dimension_reduction.components is set to 3, the API returns three coordinates per member (x, y, z). In Studio, the third dimension is rendered as dot size — points with a higher z value appear larger, creating a depth cue on the 2D scatter plot. This avoids the complexity of a full 3D renderer while still surfacing the additional structure captured by the third principal component or UMAP axis.
{
  "dimension_reduction": {
    "method": "umap",
    "components": 3
  }
}

Enrichment

Apply cluster membership back to collections:
curl -sS -X POST "$MP_API_URL/v1/clusters/{cluster_id}/enrich" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "run_id": "run_xyz789",
    "target_collection_id": "col_products_enriched",
    "fields": ["cluster_id", "label", "summary", "keywords"]
  }'
Enrichment writes cluster_id (and optionally labels/summaries) into document payloads, enabling cluster-based filters and facets.

Monitoring & Management

  • GET /v1/clusters/{id} – inspect definition, latest run, enrichment status (example after this list).
  • POST /v1/clusters/list – search and filter cluster definitions.
  • GET /v1/clusters/{id}/executions – view execution history and metrics.
  • DELETE /v1/clusters/{id} – remove obsolete definitions (artifacts remain unless deleted separately).
  • Webhooks notify you when clustering jobs complete; integrate with alerting or automation.
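For example, inspecting a definition and its latest run (per the first bullet above, with the same headers used throughout):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"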

Best Practices

  1. Preprocess high-dimensional embeddings – use preprocessing_steps with whitening + UMAP for embeddings above 512 dimensions. This dramatically improves HDBSCAN cluster quality.
  2. Prototype on samples – tune algorithm parameters using a small sample_size before running at scale.
  3. Use incremental assignment for streaming data – run mode: "full" periodically, then mode: "assign" for new documents between full runs.
  4. Monitor quality metrics – check silhouette_score, noise_ratio, and mean_cosine_to_centroid after each execution. Recluster when should_recluster fires.
  5. Use hierarchical clustering for large collections – enable hierarchical: true when top-level clusters are too broad to be actionable.
  6. Label efficiently – enable LLM labeling once clusters look coherent; store labels with confidence scores.
  7. Close the loop – evaluate clusters as candidate taxonomy nodes or enrichments for retrievers.
Clustering is the bridge between raw embeddings and structured understanding. Use it to discover themes, power analytics, and bootstrap taxonomies that feed retrieval.