[Figure: cluster visualization showing document groupings with centroids and member assignments]
Clusters automatically group documents into meaningful categories. Define what to cluster on, pick an algorithm, execute, and get back labeled groups you can visualize, enrich into collections, or promote to taxonomies.

Two Clustering Types

Mixpeek supports two fundamentally different ways to cluster documents:

Vector (Semantic)

Groups documents by embedding similarity: what they mean, not what metadata they have. Uses vector embeddings from any extractor (text, image, multimodal) and supports 8 algorithms.
Best for: topic discovery, content deduplication, visual similarity, finding themes across modalities.

Attribute (Metadata)

Groups documents by metadata field values, like a GROUP BY on structured columns. No embeddings needed; operates directly on payload fields.
Best for: categorical grouping, hierarchical organization by brand/category/status, faceted analytics.

Vector Clustering

curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "product_topics",
    "collection_ids": ["col_products"],
    "cluster_type": "vector",
    "vector_config": {
      "feature_uris": ["mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"],
      "clustering_method": "hdbscan",
      "algorithm_params": { "min_cluster_size": 10, "min_samples": 5 },
      "preprocessing_steps": [
        { "method": "whitening" },
        { "method": "umap", "n_components": 50, "n_neighbors": 30 }
      ]
    },
    "llm_labeling": {
      "provider": "openai",
      "model_name": "gpt-4o-mini"
    }
  }'

Attribute Clustering

curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "product_categories",
    "collection_ids": ["col_products"],
    "cluster_type": "attribute",
    "attribute_config": {
      "attributes": ["category", "brand"],
      "hierarchical_grouping": true
    }
  }'
With hierarchical_grouping: true, this creates nested groups: “Electronics” containing “Apple”, “Samsung”, etc. Without it, you get flat groups like “Electronics_Apple”, “Electronics_Samsung”.
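To make the difference concrete, here is a minimal Python sketch of the two grouping shapes. It is an illustration only, not Mixpeek's implementation: the `group_flat` and `group_nested` helpers and the sample documents are hypothetical, and the underscore-joined flat key simply mirrors the "Electronics_Apple" naming described above.

```python
from collections import defaultdict


def group_flat(docs, attributes):
    """Group document IDs under one key built by joining attribute values with '_'."""
    groups = defaultdict(list)
    for doc in docs:
        key = "_".join(str(doc[attr]) for attr in attributes)
        groups[key].append(doc["id"])
    return dict(groups)


def group_nested(docs, attributes):
    """Group document IDs into nested dicts, one level per attribute, in order."""
    root = {}
    for doc in docs:
        node = root
        # Walk (or create) one dict level for every attribute except the last.
        for attr in attributes[:-1]:
            node = node.setdefault(str(doc[attr]), {})
        # The last attribute holds the list of member document IDs.
        node.setdefault(str(doc[attributes[-1]]), []).append(doc["id"])
    return root


# Hypothetical sample documents.
docs = [
    {"id": "d1", "category": "Electronics", "brand": "Apple"},
    {"id": "d2", "category": "Electronics", "brand": "Samsung"},
    {"id": "d3", "category": "Electronics", "brand": "Apple"},
]

flat = group_flat(docs, ["category", "brand"])
nested = group_nested(docs, ["category", "brand"])
```

Here `flat` has the keys "Electronics_Apple" and "Electronics_Samsung", while `nested` has a single "Electronics" key whose value maps each brand to its documents.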

Algorithms

Vector clustering supports 8 algorithms. Pick based on whether you know how many clusters to expect:
| Algorithm | Best When | Key Parameters |
| --- | --- | --- |
| HDBSCAN | You don’t know the number of clusters. Handles variable density, auto-detects noise. | min_cluster_size, min_samples |
| K-Means | You know K. Fast, spherical clusters. | n_clusters, max_iter |
| DBSCAN | You want density-based grouping with a fixed distance threshold. | eps, min_samples |
| Agglomerative | You want hierarchical merging with a specific linkage strategy. | n_clusters, linkage (ward/complete/average/single) |
| Spectral | Clusters have complex, non-convex shapes. | n_clusters_spectral |
| Gaussian Mixture | You need soft (probabilistic) assignments. | n_components_gmm |
| Mean Shift | You want automatic cluster count via bandwidth-based mode finding. | bandwidth params |
| OPTICS | Similar to DBSCAN but handles varying density better. | eps, min_samples |
Start with HDBSCAN if you don’t know how many clusters to expect. Use K-Means when you have a target count and want fast results.

Multi-Feature Strategy

When clustering on multiple embeddings (e.g., text + image), choose how to combine them:
| Strategy | How It Works | Use When |
| --- | --- | --- |
| concatenate (default) | Fuses all embeddings into a single vector, then clusters once. Supports per-feature weights. | Features are complementary and you want one set of clusters. |
| independent | Runs separate clustering per feature. Produces one output per modality. | You want to compare how text clusters vs image clusters differ. |
| weighted | Auto-learns optimal feature weights via Bayesian optimization. | You’re not sure which modality matters more; let the algorithm decide. |
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "multimodal_themes",
    "collection_ids": ["col_ads"],
    "cluster_type": "vector",
    "vector_config": {
      "feature_uris": [
        "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
        "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding"
      ],
      "multi_feature_strategy": "weighted",
      "clustering_method": "hdbscan",
      "algorithm_params": { "min_cluster_size": 15 }
    }
  }'
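As a rough illustration of what the concatenate strategy does with per-feature weights, the sketch below L2-normalizes each embedding, scales it by its weight, and joins the results into one vector. The normalization step, the `fuse_embeddings` helper, and the example weights are assumptions made for illustration; this page does not describe how Mixpeek fuses vectors internally.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_embeddings(features, weights):
    """Concatenate normalized, weight-scaled embeddings into a single vector."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature")
    fused = []
    for vec, weight in zip(features, weights):
        fused.extend(weight * x for x in l2_normalize(vec))
    return fused


# Hypothetical toy embeddings: a 2-d "text" vector and a 3-d "image" vector.
text_vec = [3.0, 4.0]
image_vec = [0.0, 2.0, 0.0]

# Text is weighted more heavily than image in this example.
fused = fuse_embeddings([text_vec, image_vec], [0.7, 0.3])
```

Normalizing before weighting keeps a feature with a large raw magnitude from dominating distances purely because of its scale, so the weights alone control each modality's influence.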

Visualization Dimensions

In Studio, the cluster scatter plot encodes three visual dimensions so you can explore cluster structure at a glance:
| Visual Dimension | What It Represents | Controlled By |
| --- | --- | --- |
| Position (X, Y) | Semantic proximity: nearby points are similar in embedding space | First two coordinates from dimensionality reduction (UMAP, PCA, or t-SNE) |
| Color | Cluster membership: each cluster gets a distinct color | Automatic assignment based on cluster ID |
| Dot size | Depth (Z axis): larger dots have higher Z values, creating a depth cue | Third coordinate when dimension_reduction.components is set to 3 |
To enable the size dimension, set 3 components in your dimensionality reduction config:
{
  "dimension_reduction": {
    "method": "umap",
    "components": 3
  }
}
Without the third component, all dots render at the same size. With it, the Z value is linearly mapped to dot radius (10px–50px), so larger dots mark points with higher values on the third reduced dimension and read as “closer” to the viewer.
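The linear Z-to-radius mapping can be sketched as a min-max rescale. This is an assumption about the details: the page states only that Z maps linearly onto 10px–50px, so scaling relative to the smallest and largest Z among the plotted points, and the `z_to_radius` name, are illustrative choices.

```python
def z_to_radius(z_values, min_px=10.0, max_px=50.0):
    """Linearly map Z coordinates onto a dot radius between min_px and max_px."""
    if not z_values:
        return []
    lo, hi = min(z_values), max(z_values)
    if hi == lo:
        # All points share one depth, so every dot gets the same (smallest) size.
        return [min_px for _ in z_values]
    scale = (max_px - min_px) / (hi - lo)
    return [min_px + (z - lo) * scale for z in z_values]


# Hypothetical Z coordinates from a 3-component reduction.
radii = z_to_radius([-1.0, 0.0, 1.0, 3.0])
```

With these inputs the smallest Z gets a 10px dot and the largest a 50px dot, with the others spaced proportionally in between.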

Centroid Methods

Control how cluster centers are calculated:
| Method | Description |
| --- | --- |
| mean (default) | Average of all member vectors. Smooth, stable centroids. |
| median | Median vector. More robust to outliers than mean. |
| medoid | The actual cluster member closest to the center. Most interpretable, because the centroid is a real document. |
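The three centroid methods can be compared on a toy cluster. The sketch below is illustrative and makes assumptions the page does not state: the median is taken per dimension, distances are Euclidean, and the medoid is the member closest to the mean center.

```python
import math
from statistics import mean, median


def mean_centroid(vectors):
    """Per-dimension average of all member vectors."""
    return [mean(column) for column in zip(*vectors)]


def median_centroid(vectors):
    """Per-dimension median of all member vectors."""
    return [median(column) for column in zip(*vectors)]


def medoid_centroid(vectors):
    """The real member vector that lies closest to the mean center."""
    center = mean_centroid(vectors)
    return min(vectors, key=lambda vec: math.dist(vec, center))


# Hypothetical cluster: three tightly grouped members and one outlier.
members = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [9.0, 8.0]]

mean_c = mean_centroid(members)
median_c = median_centroid(members)
medoid_c = medoid_centroid(members)
```

The outlier drags the mean to a point that matches no member, the median stays near the dense group, and the medoid returns one of the actual members.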

Preprocessing

High-dimensional embeddings (1408d, 3072d) benefit from preprocessing before density-based algorithms. The preprocessing_steps field accepts an ordered list:
Whitening: decorrelates embedding dimensions, removing redundant structure that causes density-based algorithms to over-fragment.
{ "method": "whitening", "regularization": 1e-5 }
UMAP: reduces dimensionality while preserving neighborhood structure. Critical for HDBSCAN on high-dimensional data.
{ "method": "umap", "n_components": 50, "n_neighbors": 30, "min_dist": 0.0, "metric": "cosine" }

Execution Modes

| Mode | What It Does | When To Use |
| --- | --- | --- |
| full (default) | Clusters all documents from scratch. | Initial run, or when you want fresh clusters. |
| assign | Assigns new documents to existing centroids without re-clustering. O(n×k). | Streaming ingestion: run full periodically, assign for new docs in between. |
| composite | Clusters the centroids from prior executions together. | Cross-modality comparison, temporal drift detection, parameter tuning. |

Incremental Assignment

curl -sS -X POST "$MP_API_URL/v1/clusters/{cluster_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "assign",
    "assignment_threshold": 0.5
  }'
Documents whose cosine similarity to the closest centroid falls below assignment_threshold are marked as noise (cluster_id = -1).
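The assignment rule can be sketched in a few lines: compare each new document with every existing centroid, keep the best match, and fall back to -1 when the best similarity is below the threshold. This is a simplified illustration, not Mixpeek's code; the `assign_to_centroids` helper is hypothetical, and treating a similarity exactly equal to the threshold as a match is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_to_centroids(doc_vectors, centroids, threshold=0.5):
    """Return the best centroid index per document, or -1 for noise."""
    assignments = []
    for vec in doc_vectors:
        # n documents times k centroids gives the O(n*k) cost of assign mode.
        sims = [cosine_similarity(vec, centroid) for centroid in centroids]
        best = max(range(len(centroids)), key=lambda i: sims[i])
        assignments.append(best if sims[best] >= threshold else -1)
    return assignments


# Hypothetical centroids from an earlier full run, plus three new documents.
centroids = [[1.0, 0.0], [0.0, 1.0]]
new_docs = [[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0]]

labels = assign_to_centroids(new_docs, centroids, threshold=0.5)
```

The first two documents land on centroids 0 and 1; the third points away from both centroids and is labeled -1.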

Composite Clustering

Clusters centroids from prior runs to reveal higher-order patterns. A run with 10,000 documents and 50 clusters contributes only 50 vectors, so composite execution is fast.
curl -sS -X POST "$MP_API_URL/v1/clusters/{cluster_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "composite",
    "source_execution_ids": ["run_abc123", "run_def456"]
  }'

Hierarchical Sub-Clustering

Enable recursive sub-clustering when top-level clusters are too broad. Each cluster with enough members is further divided using UMAP + HDBSCAN.
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "content_hierarchy",
    "collection_ids": ["col_videos"],
    "cluster_type": "vector",
    "vector_config": {
      "feature_uris": ["mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding"],
      "clustering_method": "hdbscan",
      "algorithm_params": { "min_cluster_size": 50, "min_samples": 10 },
      "hierarchical": true,
      "max_hierarchy_depth": 3
    }
  }'
Sub-clusters get IDs like cl_0_sub_1_sub_0. Each centroid includes parent_cluster_id, child_cluster_ids, and hierarchy_level. Example: “sports” → “basketball” → “NBA highlights”.
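When only a sub-cluster ID is at hand, its depth and parent can be derived from the ID string. The sketch below assumes that every nesting level appends one `_sub_<n>` suffix, as in the example ID above, and that top-level clusters count as level 0; both are assumptions, and the `parent_cluster_id` and `hierarchy_level` fields on the centroid remain the authoritative source.

```python
def hierarchy_level(cluster_id):
    """Count '_sub_' segments; a top-level ID such as 'cl_0' is level 0 (assumed)."""
    return cluster_id.count("_sub_")


def parent_cluster_id(cluster_id):
    """Strip the last '_sub_<n>' suffix; return None for a top-level ID."""
    head, separator, _ = cluster_id.rpartition("_sub_")
    return head if separator else None


level = hierarchy_level("cl_0_sub_1_sub_0")
parent = parent_cluster_id("cl_0_sub_1_sub_0")
top_parent = parent_cluster_id("cl_0")
```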

Quality Metrics

Every execution returns metrics that tell you whether your clusters are meaningful:
| Metric | Range | What It Means |
| --- | --- | --- |
| silhouette_score | -1 to 1 | Cluster separation quality. Above 0.5 is good. |
| mean_cosine_to_centroid | 0 to 1 | Average assignment confidence. Higher = tighter clusters. |
| noise_ratio | 0 to 1 | Fraction classified as noise. High values suggest parameters are too strict. |
| cluster_size_entropy | 0 to 1 | How balanced cluster sizes are. 1.0 = perfectly even. |
| should_recluster | 0 or 1 | Automatic recommendation based on metric thresholds. |
Per-member similarity is also tracked:
| Field | Level | Description |
| --- | --- | --- |
| cosine_similarity_to_centroid | Member | How well this document fits its cluster (0–1) |
| mean_cosine_similarity | Centroid | Average member similarity (cohesion) |
| min_cosine_similarity | Centroid | Weakest member (indicates outliers) |
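For intuition about two of the execution-level metrics, the sketch below computes a noise ratio and a size-balance score from a list of cluster labels. The formulas are assumptions: the page gives only the ranges and meanings, so `cluster_size_entropy` here is Shannon entropy of the cluster sizes divided by its maximum (log of the cluster count), with noise points excluded, which is one common way to get a 0 to 1 balance score.

```python
import math
from collections import Counter


def noise_ratio(labels):
    """Fraction of points labeled -1 (noise)."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == -1) / len(labels)


def cluster_size_entropy(labels):
    """Normalized Shannon entropy of cluster sizes; 1.0 means perfectly even."""
    sizes = Counter(label for label in labels if label != -1)
    if len(sizes) <= 1:
        # With zero or one cluster there is no balance to measure.
        return 0.0
    total = sum(sizes.values())
    entropy = -sum((n / total) * math.log(n / total) for n in sizes.values())
    return entropy / math.log(len(sizes))


# Hypothetical label lists: an even split with some noise, and a skewed split.
even_labels = [0, 0, 1, 1, 2, 2, -1, -1]
skewed_labels = [0] * 9 + [1]

even_noise = noise_ratio(even_labels)
even_entropy = cluster_size_entropy(even_labels)
skewed_entropy = cluster_size_entropy(skewed_labels)
```

The even split scores close to 1.0, while the 9-to-1 split scores well below it, which is the kind of imbalance this metric is meant to surface.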

LLM Labeling

Generate human-readable names, summaries, and keywords for each cluster. Control which document fields the LLM sees using input mappings.

Text-Only Labeling

curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "script_archetypes",
    "collection_ids": ["col_ad_scripts"],
    "cluster_type": "vector",
    "vector_config": {
      "feature_uris": ["mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"],
      "clustering_method": "kmeans",
      "kmeans_parameters": { "n_clusters": 20 }
    },
    "llm_labeling": {
      "provider": "openai",
      "model_name": "gpt-4o-mini",
      "labeling_inputs": {
        "input_mappings": [
          { "input_key": "text", "source_type": "payload", "path": "headline" },
          { "input_key": "text", "source_type": "payload", "path": "primary_text" },
          { "input_key": "text", "source_type": "payload", "path": "description" }
        ]
      }
    }
  }'

Multimodal Labeling

Send images or video alongside text for richer labels. Use a vision-capable model:
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "scene_themes",
    "collection_ids": ["col_ad_scenes"],
    "cluster_type": "vector",
    "vector_config": {
      "feature_uris": ["mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding"],
      "clustering_method": "kmeans",
      "kmeans_parameters": { "n_clusters": 25 }
    },
    "llm_labeling": {
      "provider": "google",
      "model_name": "gemini-2.5-flash-preview-04-17",
      "labeling_inputs": {
        "input_mappings": [
          { "input_key": "text", "source_type": "payload", "path": "headline" },
          { "input_key": "text", "source_type": "payload", "path": "primary_text" },
          { "input_key": "image_url", "source_type": "blob", "path": "document_blobs.0.url" }
        ]
      }
    }
  }'

Input Mapping Reference

| Field | Description |
| --- | --- |
| input_key | Key the LLM receives: text, image_url, video_url, audio_url |
| source_type | Where to pull the value: payload (document fields), blob (stored assets), literal (static value) |
| path | Dot-notation path into the document (for payload and blob) |
| override | Static value (only with literal source type) |
Without labeling_inputs, the full document payload is serialized as JSON. Input mappings let you send only the fields that matter.
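To show how dot-notation paths such as document_blobs.0.url select values, here is a small resolver sketch. It is hypothetical and simplified: it reads payload and blob paths from the same document dictionary, treats numeric path segments as list indexes, and collects values that share an input_key into a list. None of these details are confirmed by this page.

```python
def resolve_path(document, path):
    """Follow a dot-notation path; numeric segments index into lists."""
    node = document
    for part in path.split("."):
        if isinstance(node, list):
            node = node[int(part)]
        else:
            node = node[part]
    return node


def build_llm_inputs(document, input_mappings):
    """Collect mapped values per input_key (assumed behavior for repeated keys)."""
    inputs = {}
    for mapping in input_mappings:
        if mapping["source_type"] == "literal":
            value = mapping["override"]
        else:
            value = resolve_path(document, mapping["path"])
        inputs.setdefault(mapping["input_key"], []).append(value)
    return inputs


# Hypothetical document; field names follow the multimodal labeling example.
document = {
    "headline": "Summer sale",
    "primary_text": "Up to 50% off",
    "document_blobs": [{"url": "https://example.com/ad.jpg"}],
}
mappings = [
    {"input_key": "text", "source_type": "payload", "path": "headline"},
    {"input_key": "text", "source_type": "payload", "path": "primary_text"},
    {"input_key": "image_url", "source_type": "blob", "path": "document_blobs.0.url"},
]

llm_inputs = build_llm_inputs(document, mappings)
```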

Custom Prompts and Response Shapes

Override the default prompt for domain-specific labels:
{
  "llm_labeling": {
    "provider": "openai",
    "model_name": "gpt-4o",
    "custom_prompt": "Analyze these ad clusters and label each creative archetype (e.g. 'UGC Testimonial', 'Problem-Solution Demo').",
    "response_shape": {
      "label": "string",
      "keywords": ["string"],
      "sentiment": "positive | negative | neutral",
      "target_audience": "string"
    }
  }
}

Labeling Settings

| Setting | Default | Description |
| --- | --- | --- |
| max_samples_per_cluster | auto (3–20) | Representative documents sent to the LLM per cluster. |
| sample_text_max_length | | Truncate text inputs to this character length. |
| include_summary | true | Generate a longer summary alongside the label. |
| include_keywords | true | Generate keyword tags for each cluster. |
| use_embedding_dedup | false | Merge similar labels across clusters using embedding similarity. |
| embedding_similarity_threshold | 0.8 | Threshold for label dedup. |
| cache_ttl_seconds | 604800 | Cache labels for 7 days. Set to 0 to disable. |

Enrichment

Write cluster membership back into your source collections:
curl -sS -X POST "$MP_API_URL/v1/clusters/{cluster_id}/enrich" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "run_id": "run_xyz789",
    "target_collection_id": "col_products_enriched",
    "fields": ["cluster_id", "label", "summary", "keywords"]
  }'
This writes cluster_id and labels into document payloads, enabling cluster-based filters and facets in retrievers.

Execution & Triggers

  • Manual: POST /v1/clusters/{id}/execute
  • Async job: POST /v1/clusters/{id}/execute/submit
  • Automated: create cron, interval, or event-based triggers under /v1/clusters/triggers
  • Every run yields a run_id and exposes status via GET /v1/clusters/{id}/executions

Artifacts

| Artifact | Endpoint | Contents |
| --- | --- | --- |
| Centroids | /executions/{run_id}/artifacts?include_centroids=true | Cluster ID, centroid vectors, counts, labels, summaries, keywords |
| Members | /executions/{run_id}/artifacts?include_members=true | Point IDs, reduced coordinates (x, y, z), cluster assignment |
| Streaming data | /executions/{run_id}/data | Stream centroids and members for visualization |

Management

| Operation | Endpoint |
| --- | --- |
| Inspect definition | GET /v1/clusters/{id} |
| List clusters | POST /v1/clusters/list |
| Execution history | GET /v1/clusters/{id}/executions |
| Delete | DELETE /v1/clusters/{id} |

Best Practices

  1. Start with HDBSCAN + whitening + UMAP for vector clustering on high-dimensional embeddings.
  2. Prototype on samples — tune parameters with a small sample_size before running at scale.
  3. Use incremental assignment for streaming data: run full periodically, assign for new documents in between.
  4. Monitor quality metrics — check silhouette_score and noise_ratio after each run. Recluster when should_recluster fires.
  5. Use attribute clustering for categorical grouping — don’t force embeddings when metadata fields already capture the structure.
  6. Try multi-feature weighted strategy when combining modalities — let Bayesian optimization find the right blend.
  7. Enable 3-component dimensionality reduction to get the depth (size) dimension in Studio visualizations.

Clusters vs Taxonomies vs Alerts

| I want to… | Use |
| --- | --- |
| Discover what categories exist in my data | Clusters |
| Apply known categories to new documents | Taxonomies |
| Get notified when something specific appears | Alerts |
| Turn discovered groups into reusable labels | Clusters → promote to taxonomy |