Clusters - Mixpeek

Clusters provide warehouse-native grouping, the multimodal equivalent of SQL GROUP BY.

Create and run clustering jobs

Create: Click New Cluster, select collections, pick vector or attribute clustering, and configure algorithm params. API: Create Cluster.
Execute: Run real-time clustering on the Engine or submit as a job for async processing. API: Execute Clustering and Submit Job.
Inspect: Review centroids, metrics, and (if saved) members. Download artifacts such as parquet paths under Artifacts. API: Get Artifacts.
List/Get/Delete: Manage clustering configurations and results. API: List, Get, Delete.
Stream data: Browse cluster centroids and members directly. API: Stream Data.
Apply enrichment: Attach cluster labels back to a source or target collection at scale. API: Apply Enrichment.

Choosing an algorithm at scale. For collections over ~100K documents, pick an algorithm that scales linearly: K-Means, Gaussian Mixture, or Leiden. The density and graph algorithms that build a pairwise distance matrix (HDBSCAN, DBSCAN, Agglomerative, Spectral, OPTICS, Mean Shift) do not scale past 100K documents and will error on larger datasets; an N×N distance matrix at 1M rows would need ~7,000 GB of RAM. To run one of those on a large collection, set a sample_size to cluster a representative subset instead. The create-cluster wizard surfaces this guidance inline when you choose the algorithm.
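The memory figure quoted for pairwise-distance algorithms can be reproduced with a little arithmetic. The sketch below assumes a dense matrix of 8-byte floats; the helper names are illustrative and are not part of the Mixpeek API.

```python
def pairwise_matrix_bytes(n_rows: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n_rows x n_rows distance matrix.

    Assumes every pairwise distance is stored (no sparsity, no symmetry trick)
    and that each value is a float of bytes_per_value bytes.
    """
    return n_rows * n_rows * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


# Memory grows quadratically: 10x more rows needs 100x more memory.
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} rows -> {to_gib(pairwise_matrix_bytes(n)):,.1f} GiB")
```

Under these assumptions, 1M rows works out to 8 × 10^12 bytes, in line with the ~7,000 GB figure, while 100K rows already needs tens of gigabytes.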

Visualization

The cluster scatter plot maps reduced coordinates to position and size:

x, y → point position on the chart
z (when dimension_reduction.components is 3) → dot size, where larger dots represent higher z-values

This depth-cue approach surfaces the third dimension without requiring a full 3D renderer, making it easy to spot structure that would be lost in a flat 2D projection.
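The z-to-size mapping can be sketched as a rescaling of z-values into a range of dot sizes. The docs do not state which scaling the scatter plot uses, so the linear min-max mapping, the function name, and the size bounds below are all assumptions made for illustration.

```python
def z_to_sizes(z_values, min_size=4.0, max_size=24.0):
    """Linearly rescale z-values to dot sizes so that larger z gives a larger dot.

    min_size and max_size are hypothetical bounds, not Mixpeek defaults.
    """
    lo, hi = min(z_values), max(z_values)
    if hi == lo:
        # All points share the same depth: use one mid-range size.
        mid = (min_size + max_size) / 2
        return [mid for _ in z_values]
    scale = (max_size - min_size) / (hi - lo)
    return [min_size + (z - lo) * scale for z in z_values]


# Three reduced points as (x, y, z); x and y set position, z sets size.
points = [(0.1, 0.4, -1.0), (0.3, 0.2, 0.0), (0.9, 0.7, 1.0)]
sizes = z_to_sizes([p[2] for p in points])
print(sizes)
```

The point with the highest z-value receives the largest dot, which is the depth cue described above.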

Tips

Start with a sample size to validate parameters before full runs.
Use LLM labeling for human-friendly labels when vectors are dense and unlabeled.
Set dimension_reduction.components to 3 to see depth-based sizing in the scatter plot.
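The first tip, validating on a sample before a full run, can be tried locally with a reproducible random subset. How Mixpeek selects documents when sample_size is set is not described here; the uniform sampling, the helper name, and the document IDs below are assumptions for illustration only.

```python
import random


def sample_ids(doc_ids, sample_size, seed=0):
    """Return a reproducible, uniformly random subset of document IDs.

    If sample_size covers the whole collection, all IDs are returned.
    """
    doc_ids = list(doc_ids)
    if sample_size >= len(doc_ids):
        return doc_ids
    # A dedicated Random instance keeps the draw repeatable across runs.
    return random.Random(seed).sample(doc_ids, sample_size)


# Hypothetical collection of 1,000 document IDs.
doc_ids = [f"doc_{i}" for i in range(1_000)]
subset = sample_ids(doc_ids, sample_size=100)
print(len(subset))
```

Fixing the seed means parameter changes can be compared on the same subset before committing to a full run.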

Create a cluster job

Choose collections and configure algorithm parameters; optionally set dimensionality reduction.

Execute or submit

Run in real-time or submit as an asynchronous job and track via Tasks.

Inspect and enrich

Review centroids and metrics, then apply enrichment back to collections if desired.

Artifacts such as parquet paths allow downstream analytics and reproducible exploration.
