Two Clustering Types
Mixpeek supports two fundamentally different ways to cluster documents:
Vector (Semantic)
Groups documents by embedding similarity: what they mean, not what metadata they have. Uses vector embeddings from any extractor (text, image, multimodal) and supports 8 algorithms.
Best for: topic discovery, content deduplication, visual similarity, finding themes across modalities.
Attribute (Metadata)
Groups documents by metadata field values, like a GROUP BY on structured columns. No embeddings needed; operates directly on payload fields.
Best for: categorical grouping, hierarchical organization by brand/category/status, faceted analytics.
Vector Clustering
Attribute Clustering
With hierarchical_grouping: true, attribute clustering creates nested groups: “Electronics” containing “Apple”, “Samsung”, etc. Without it, you get flat groups like “Electronics_Apple”, “Electronics_Samsung”.
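The difference between flat and nested grouping can be illustrated with a small, self-contained sketch. This is not Mixpeek's implementation: the function names, the document shape, and the `category` and `brand` fields are hypothetical and chosen only to mirror the example above.

```python
from collections import defaultdict


def group_flat(docs, fields):
    """Flat grouping: one group per value combination, keys joined with '_'."""
    groups = defaultdict(list)
    for doc in docs:
        key = "_".join(str(doc[f]) for f in fields)
        groups[key].append(doc["id"])
    return dict(groups)


def group_nested(docs, fields):
    """Hierarchical grouping: each field adds one level of nesting."""
    root = {}
    for doc in docs:
        node = root
        # Walk (or create) one dict level per field except the last.
        for f in fields[:-1]:
            node = node.setdefault(str(doc[f]), {})
        # The last field holds the list of member document ids.
        node.setdefault(str(doc[fields[-1]]), []).append(doc["id"])
    return root


# Hypothetical documents with two metadata fields.
docs = [
    {"id": "d1", "category": "Electronics", "brand": "Apple"},
    {"id": "d2", "category": "Electronics", "brand": "Samsung"},
    {"id": "d3", "category": "Electronics", "brand": "Apple"},
]
flat = group_flat(docs, ["category", "brand"])
nested = group_nested(docs, ["category", "brand"])
```

Here `flat` is keyed by strings such as "Electronics_Apple", while `nested` has "Electronics" at the top level with one entry per brand beneath it.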
Algorithms
Vector clustering supports 8 algorithms. Pick based on whether you know how many clusters to expect:
| Algorithm | Best When | Key Parameters |
|---|---|---|
| HDBSCAN | You don’t know the number of clusters. Handles variable density, auto-detects noise. | min_cluster_size, min_samples |
| K-Means | You know K. Fast, spherical clusters. | n_clusters, max_iter |
| DBSCAN | You want density-based grouping with a fixed distance threshold. | eps, min_samples |
| Agglomerative | You want hierarchical merging with a specific linkage strategy. | n_clusters, linkage (ward/complete/average/single) |
| Spectral | Clusters have complex, non-convex shapes. | n_clusters_spectral |
| Gaussian Mixture | You need soft (probabilistic) assignments. | n_components_gmm |
| Mean Shift | You want automatic cluster count via bandwidth-based mode finding. | bandwidth params |
| OPTICS | Similar to DBSCAN but handles varying density better. | eps, min_samples |
Multi-Feature Strategy
When clustering on multiple embeddings (e.g., text + image), choose how to combine them:
| Strategy | How It Works | Use When |
|---|---|---|
| concatenate (default) | Fuses all embeddings into a single vector, then clusters once. Supports per-feature weights. | Features are complementary and you want one set of clusters. |
| independent | Runs separate clustering per feature. Produces one output per modality. | You want to compare how text clusters vs image clusters differ. |
| weighted | Auto-learns optimal feature weights via Bayesian optimization. | You’re not sure which modality matters more and want the algorithm to decide. |
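As a rough illustration of the concatenate strategy, the sketch below fuses per-feature embeddings into one vector with optional weights. It assumes each feature is L2-normalized before weighting so that no modality dominates purely because of its scale; that normalization step, the function names, and the feature names are assumptions, not documented Mixpeek behavior.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def concatenate_features(features, weights=None):
    """Fuse several embeddings of one document into a single vector.

    features: dict mapping feature name -> embedding (list of floats)
    weights:  optional dict mapping feature name -> weight (default 1.0)
    """
    weights = weights or {}
    fused = []
    # Sort by name so every document is fused in the same feature order.
    for name in sorted(features):
        w = weights.get(name, 1.0)
        fused.extend(w * x for x in l2_normalize(features[name]))
    return fused


# Hypothetical two-feature document; text counts twice as much as image.
fused = concatenate_features(
    {"text": [3.0, 4.0], "image": [0.0, 2.0]},
    weights={"text": 2.0},
)
```

The fused vector has length equal to the sum of the input dimensions, and a single clustering run then operates on it.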
Visualization Dimensions
In Studio, the cluster scatter plot encodes three visual dimensions so you can explore cluster structure at a glance:
| Visual Dimension | What It Represents | Controlled By |
|---|---|---|
| Position (X, Y) | Semantic proximity — nearby points are similar in embedding space | First two coordinates from dimensionality reduction (UMAP, PCA, or t-SNE) |
| Color | Cluster membership — each cluster gets a distinct color | Automatic assignment based on cluster ID |
| Dot size | Depth (Z axis) — larger dots have higher Z values, creating a depth cue | Third coordinate when dimension_reduction.components is set to 3 |
Without the third component, all dots render at the same size. With it, the Z value is linearly mapped to dot radius (10px–50px), so larger dots appear “closer” along the third reduced dimension.
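The linear Z-to-radius mapping can be sketched as a min-max rescale. The 10px and 50px bounds come from the text above; rescaling against the minimum and maximum Z of the plotted points, and falling back to the smallest radius when all Z values are equal, are assumptions made for this example.

```python
def z_to_radius(z_values, min_px=10.0, max_px=50.0):
    """Linearly map Z coordinates onto dot radii in [min_px, max_px]."""
    lo, hi = min(z_values), max(z_values)
    if hi == lo:
        # No depth variation: render every dot at one size (assumed: min_px).
        return [min_px for _ in z_values]
    scale = (max_px - min_px) / (hi - lo)
    return [min_px + (z - lo) * scale for z in z_values]


radii = z_to_radius([-1.0, 0.0, 1.0])
```

With these inputs the lowest Z value maps to 10px, the highest to 50px, and the midpoint to 30px.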
Centroid Methods
Control how cluster centers are calculated:
| Method | Description |
|---|---|
| mean (default) | Average of all member vectors. Smooth, stable centroids. |
| median | Median vector. More robust to outliers than mean. |
| medoid | The actual cluster member closest to the center. Most interpretable, because the centroid is a real document. |
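To make the three methods concrete, here is a minimal sketch. It assumes the median is taken coordinate by coordinate and that the medoid is the member nearest the mean in Euclidean distance; the product may define either differently (for example, using cosine distance).

```python
import math
from statistics import median


def mean_centroid(vectors):
    """Average of all member vectors, coordinate by coordinate."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def median_centroid(vectors):
    """Coordinate-wise median; less sensitive to outliers than the mean."""
    return [median(col) for col in zip(*vectors)]


def medoid_centroid(vectors):
    """The actual member closest to the mean (Euclidean distance assumed)."""
    center = mean_centroid(vectors)
    return min(vectors, key=lambda v: math.dist(v, center))


# Three nearby members and one outlier at x = 9.
members = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [9.0, 0.0]]
mean_c = mean_centroid(members)
median_c = median_centroid(members)
medoid_c = medoid_centroid(members)
```

The outlier pulls the mean to x = 3.0, the median stays at x = 1.5, and the medoid is the real member at x = 2.0.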
Preprocessing
High-dimensional embeddings (1408d, 3072d) benefit from preprocessing before density-based algorithms. The preprocessing_steps field accepts an ordered list:
Whitening (ZCA)
Decorrelates embedding dimensions, removing redundant structure that causes density-based algorithms to over-fragment.
UMAP reduction
Reduces dimensionality while preserving neighborhood structure. Critical for HDBSCAN on high-dimensional data.
Chained (recommended for high-dim)
Whitening + UMAP together typically improves HDBSCAN cluster purity by 15–30% on embeddings above 1000 dimensions.
Execution Modes
| Mode | What It Does | When To Use |
|---|---|---|
| full (default) | Clusters all documents from scratch. | Initial run, or when you want fresh clusters. |
| assign | Assigns new documents to existing centroids without re-clustering. O(n×k). | Streaming ingestion: run full periodically, assign for new docs in between. |
| composite | Clusters the centroids from prior executions together. | Cross-modality comparison, temporal drift detection, parameter tuning. |
Incremental Assignment
New documents are matched against the existing centroids. Documents that fall below the assignment_threshold cosine similarity are marked as noise (cluster_id = -1).
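The assignment step can be sketched as a nearest-centroid search with a similarity cutoff, which is where the O(n×k) cost comes from: each of n documents is compared against each of k centroids. The function names are hypothetical, and treating a similarity exactly equal to the threshold as a match is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def assign(doc_vectors, centroids, assignment_threshold):
    """Assign each document to its most similar centroid, or -1 for noise.

    centroids: dict mapping cluster_id -> centroid vector
    """
    labels = []
    for vec in doc_vectors:
        best_id, best_sim = -1, -1.0
        for cluster_id, centroid in centroids.items():
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = cluster_id, sim
        # Below the threshold, the document is treated as noise.
        labels.append(best_id if best_sim >= assignment_threshold else -1)
    return labels


centroids = {0: [1.0, 0.0], 1: [0.0, 1.0]}
new_docs = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]
labels = assign(new_docs, centroids, assignment_threshold=0.8)
```

The third document sits halfway between both centroids (cosine similarity of about 0.71 to each), so it falls below the 0.8 cutoff and is labeled -1.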
Composite Clustering
Clusters centroids from prior runs to reveal higher-order patterns. A run with 10,000 documents and 50 clusters contributes only 50 vectors, so composite execution is fast.
Hierarchical Sub-Clustering
Enable recursive sub-clustering when top-level clusters are too broad. Each cluster with enough members is further divided using UMAP + HDBSCAN. Sub-cluster IDs encode their ancestry, for example cl_0_sub_1_sub_0. Each centroid includes parent_cluster_id, child_cluster_ids, and hierarchy_level. Example: “sports” → “basketball” → “NBA highlights”.
Quality Metrics
Every execution returns metrics that tell you whether your clusters are meaningful:
| Metric | Range | What It Means |
|---|---|---|
| silhouette_score | -1 to 1 | Cluster separation quality. Above 0.5 is good. |
| mean_cosine_to_centroid | 0 to 1 | Average assignment confidence. Higher = tighter clusters. |
| noise_ratio | 0 to 1 | Fraction classified as noise. High values suggest parameters are too strict. |
| cluster_size_entropy | 0 to 1 | How balanced cluster sizes are. 1.0 = perfectly even. |
| should_recluster | 0 or 1 | Automatic recommendation based on metric thresholds. |
Finer-grained similarity fields are available at the member and centroid level:
| Field | Level | Description |
|---|---|---|
| cosine_similarity_to_centroid | Member | How well this document fits its cluster (0–1) |
| mean_cosine_similarity | Centroid | Average member similarity (cohesion) |
| min_cosine_similarity | Centroid | Weakest member (indicates outliers) |
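For intuition about two of the metrics above, the sketch below computes a noise ratio and a size-balance score from a list of cluster labels. The exact formulas Mixpeek uses are not stated here; this example assumes cluster_size_entropy is Shannon entropy of the cluster sizes normalized by log(k), that noise points (label -1) are excluded from it, and that a single cluster scores 0.0.

```python
import math
from collections import Counter


def noise_ratio(labels):
    """Fraction of points labeled as noise (cluster_id = -1)."""
    return sum(1 for label in labels if label == -1) / len(labels)


def cluster_size_entropy(labels):
    """Normalized Shannon entropy of cluster sizes, in [0, 1].

    1.0 means all clusters have the same size; values near 0 mean a few
    clusters hold almost all points. Noise points are ignored (assumption).
    """
    sizes = Counter(label for label in labels if label != -1)
    k = len(sizes)
    if k <= 1:
        return 0.0  # Assumed convention when balance is undefined.
    total = sum(sizes.values())
    entropy = -sum((n / total) * math.log(n / total) for n in sizes.values())
    return entropy / math.log(k)


balanced = [0, 0, 1, 1, -1]
skewed = [0, 0, 0, 1]
ratio = noise_ratio(balanced)
even_score = cluster_size_entropy(balanced)
skewed_score = cluster_size_entropy(skewed)
```

Two equally sized clusters score 1.0, while a 3-to-1 split scores lower, which matches the "1.0 = perfectly even" reading in the table.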
LLM Labeling
Generate human-readable names, summaries, and keywords for each cluster. Control which document fields the LLM sees using input mappings.
Text-Only Labeling
Multimodal Labeling
Send images or video alongside text for richer labels. Use a vision-capable model.
Input Mapping Reference
| Field | Description |
|---|---|
| input_key | Key the LLM receives: text, image_url, video_url, audio_url |
| source_type | Where to pull the value: payload (document fields), blob (stored assets), literal (static value) |
| path | Dot-notation path into the document (for payload and blob) |
| override | Static value (only with literal source type) |
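The sketch below shows one way the four mapping fields in the table could be resolved against a document. It is illustrative only: the mapping objects are written as plain dicts, the document fields (`title`, `media.thumbnail_url`) are hypothetical, and blob sources are treated as a simple path lookup, which may not reflect how stored assets are actually fetched.

```python
def get_path(document, path):
    """Follow a dot-notation path through nested dicts; None if missing."""
    node = document
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def resolve_inputs(document, mappings):
    """Turn input mappings into (input_key, value) pairs for the LLM."""
    resolved = []
    for mapping in mappings:
        if mapping["source_type"] == "literal":
            value = mapping["override"]
        else:
            # "payload" and "blob" both read from a path in this sketch.
            value = get_path(document, mapping["path"])
        if value is not None:
            resolved.append((mapping["input_key"], value))
    return resolved


# Hypothetical document and mappings.
document = {
    "title": "Court-side recap",
    "media": {"thumbnail_url": "https://example.com/a.jpg"},
    "internal_notes": "not sent to the LLM",
}
mappings = [
    {"input_key": "text", "source_type": "payload", "path": "title"},
    {"input_key": "image_url", "source_type": "blob", "path": "media.thumbnail_url"},
    {"input_key": "text", "source_type": "literal", "override": "Sports catalog"},
]
inputs = resolve_inputs(document, mappings)
```

Only the mapped values reach the model; unmapped fields such as `internal_notes` are left out.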
Without labeling_inputs, the full document payload is serialized as JSON. Input mappings let you send only the fields that matter.
Custom Prompts and Response Shapes
Override the default prompt for domain-specific labels.
Labeling Settings
| Setting | Default | Description |
|---|---|---|
| max_samples_per_cluster | auto (3–20) | Representative documents sent to the LLM per cluster. |
| sample_text_max_length | none | Truncate text inputs to this character length. |
| include_summary | true | Generate a longer summary alongside the label. |
| include_keywords | true | Generate keyword tags for each cluster. |
| use_embedding_dedup | false | Merge similar labels across clusters using embedding similarity. |
| embedding_similarity_threshold | 0.8 | Threshold for label dedup. |
| cache_ttl_seconds | 604800 | Cache labels for 7 days. Set to 0 to disable. |
Enrichment
Write cluster membership back into your source collections. Enrichment writes cluster_id and labels into document payloads, enabling cluster-based filters and facets in retrievers.
Execution & Triggers
- Manual: POST /v1/clusters/{id}/execute
- Async job: POST /v1/clusters/{id}/execute/submit
- Automated: create cron, interval, or event-based triggers under /v1/clusters/triggers
- Every run yields a run_id and exposes status via GET /v1/clusters/{id}/executions
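The endpoints in the list above could be wrapped in small request builders like the following. Only the paths come from this page; the base URL is a placeholder, the bearer-token header is an assumed authentication scheme, and any request body the API may require is omitted. The sketch builds the requests without sending them.

```python
import urllib.request


def build_execute_request(base_url, cluster_id, api_key, background=False):
    """Build a POST request for a manual run or an async job submission."""
    path = f"/v1/clusters/{cluster_id}/execute"
    if background:
        path += "/submit"  # Async job variant.
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method="POST",
        # Assumed auth scheme; check the API reference for the real header.
        headers={"Authorization": f"Bearer {api_key}"},
    )


def build_status_request(base_url, cluster_id, api_key):
    """Build a GET request for the execution history of a cluster."""
    return urllib.request.Request(
        base_url.rstrip("/") + f"/v1/clusters/{cluster_id}/executions",
        method="GET",
        headers={"Authorization": f"Bearer {api_key}"},
    )


# Placeholder values; nothing is sent over the network here.
submit_req = build_execute_request(
    "https://api.example.com", "cl_123", "YOUR_API_KEY", background=True
)
status_req = build_status_request("https://api.example.com", "cl_123", "YOUR_API_KEY")
```

Passing either request to `urllib.request.urlopen` would send it; the response shape, including where run_id appears, is not shown here and should be taken from the API reference.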
Artifacts
| Artifact | Endpoint | Contents |
|---|---|---|
| Centroids | /executions/{run_id}/artifacts?include_centroids=true | Cluster ID, centroid vectors, counts, labels, summaries, keywords |
| Members | /executions/{run_id}/artifacts?include_members=true | Point IDs, reduced coordinates (x, y, z), cluster assignment |
| Streaming data | /executions/{run_id}/data | Stream centroids and members for visualization |
Management
| Operation | Endpoint |
|---|---|
| Inspect definition | GET /v1/clusters/{id} |
| List clusters | POST /v1/clusters/list |
| Execution history | GET /v1/clusters/{id}/executions |
| Delete | DELETE /v1/clusters/{id} |
Best Practices
- Start with HDBSCAN + whitening + UMAP for vector clustering on high-dimensional embeddings.
- Prototype on samples: tune parameters with a small sample_size before running at scale.
- Use incremental assignment for streaming data: run full periodically, assign for new documents in between.
- Monitor quality metrics: check silhouette_score and noise_ratio after each run. Recluster when should_recluster fires.
- Use attribute clustering for categorical grouping; don’t force embeddings when metadata fields already capture the structure.
- Try the multi-feature weighted strategy when combining modalities and let Bayesian optimization find the right blend.
- Enable 3-component dimensionality reduction to get the depth (size) dimension in Studio visualizations.
Clusters vs Taxonomies vs Alerts
| I want to… | Use |
|---|---|
| Discover what categories exist in my data | Clusters |
| Apply known categories to new documents | Taxonomies |
| Get notified when something specific appears | Alerts |
| Turn discovered groups into reusable labels | Clusters → promote to taxonomy |

