Documentation Index
Fetch the complete documentation index at: https://docs.mixpeek.com/docs/llms.txt
Use this file to discover all available pages before exploring further.
Workflow Overview
- Define cluster (POST /v1/clusters) – choose source collections, feature URIs, algorithm, and optional labeling strategy.
- Execute – run manually (POST /v1/clusters/{id}/execute) or schedule via triggers (/v1/clusters/triggers/...).
- Inspect artifacts – fetch centroids, members, or reduced coordinates (/v1/clusters/{id}/artifacts).
- Enrich documents – write cluster_id, labels, and keywords back into collections.
- Promote to taxonomy (optional) – convert stable clusters into reference nodes.
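The steps above can be sketched as request builders (pure Python, no network calls). The endpoint paths come from this page; the base URL, feature URI, and payload fields are assumptions, not a complete schema:

```python
# Sketch of the workflow above: define, execute, inspect.
# Endpoint paths follow this page; payload fields are illustrative.

BASE = "https://api.mixpeek.com/v1"  # assumed base URL

def define_cluster_request(collections, feature_uri, algorithm="hdbscan"):
    """Build the request for POST /v1/clusters (step 1: define)."""
    return {
        "method": "POST",
        "url": f"{BASE}/clusters",
        "json": {
            "collection_ids": collections,       # source collections (assumed key)
            "feature_addresses": [feature_uri],  # feature URI(s) to cluster on
            "algorithm": algorithm,
        },
    }

def execute_cluster_request(cluster_id):
    """Build the request for POST /v1/clusters/{id}/execute (step 2: run)."""
    return {"method": "POST", "url": f"{BASE}/clusters/{cluster_id}/execute"}

def artifacts_request(cluster_id):
    """Build the request for /v1/clusters/{id}/artifacts (step 3: inspect)."""
    return {"method": "GET", "url": f"{BASE}/clusters/{cluster_id}/artifacts"}

req = define_cluster_request(["col_videos"], "feature://clip-embedding")
print(req["url"])
```

Sending these with an HTTP client of your choice (plus authentication headers) completes the loop; enrichment and taxonomy promotion happen after inspection.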
Configuration Highlights
| Setting | Description |
|---|---|
| feature_addresses | One or more feature URIs to cluster on (dense, sparse, multi-vector). |
| algorithm | kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, or optics. |
| preprocessing_steps | Ordered preprocessing before clustering: whitening, UMAP reduction, or both chained. |
| hierarchical | Enable recursive sub-clustering within each cluster for multi-level grouping. |
| llm_labeling | Generate cluster labels, summaries, and keywords using configured LLM providers. |
| sample_size | Run on a subset before clustering the full dataset. |
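A cluster definition combining these settings might look like the following. The top-level keys follow the table; the nested shapes and the feature URI are assumptions, since the full request schema is not shown on this page:

```python
# Illustrative cluster definition using the settings from the table above.
# Nested shapes (e.g. the preprocessing step dicts) are assumptions.
cluster_config = {
    "feature_addresses": ["feature://text-embedding"],  # hypothetical URI
    "algorithm": "hdbscan",
    "preprocessing_steps": [
        {"type": "whitening"},                   # Soft-ZCA whitening first
        {"type": "umap", "n_components": 50},    # then UMAP reduction
    ],
    "hierarchical": True,          # recursive sub-clustering
    "llm_labeling": {"enabled": True},
    "sample_size": 10000,          # prototype on a subset first
}
print(cluster_config["algorithm"])
```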
Preprocessing Steps
High-dimensional embeddings (1408d, 3072d) benefit from preprocessing before density-based algorithms like HDBSCAN. The preprocessing_steps field accepts an ordered list of transformations applied before clustering.
Embedding Whitening (Soft-ZCA)
Whitening decorrelates embedding dimensions, removing redundant structure that causes density-based algorithms to over-fragment clusters. It uses ZCA whitening to preserve the original coordinate space.

UMAP Pre-Reduction
UMAP reduces dimensionality while preserving local neighborhood structure — critical for HDBSCAN, which suffers from the curse of dimensionality on raw embeddings. Defaults are optimized for clustering: 50 components, cosine metric, 30 neighbors, 0.0 min_dist.

Chained Preprocessing
Chain whitening and UMAP together for best results on high-dimensional embeddings. Steps execute in order.

Cosine Similarity Metrics
Every cluster execution automatically computes cosine similarity between each member and its assigned centroid. These metrics appear on both member documents and centroids:

| Field | Level | Description |
|---|---|---|
| cosine_similarity_to_centroid | Member | How well this document fits its cluster (0-1, higher is better) |
| mean_cosine_similarity | Centroid | Average member similarity (cluster cohesion) |
| min_cosine_similarity | Centroid | Weakest member similarity (indicates outliers) |
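For intuition, the per-member metric can be computed like this (a pure-Python sketch, not the service's implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: the member-to-centroid metric."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

member = [0.9, 0.1, 0.0]
centroid = [1.0, 0.0, 0.0]
sim = cosine_similarity(member, centroid)  # close to 1.0 = good fit
print(round(sim, 3))
```

The centroid-level fields are then just the mean and min of this value over a cluster's members.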
Incremental Assignment
After an initial clustering run, assign new documents to existing clusters without re-running the full algorithm. Pass mode: "assign" when executing:
| Parameter | Default | Description |
|---|---|---|
| mode | "full" | "full" re-clusters from scratch, "assign" uses existing centroids |
| assignment_threshold | 0.5 | Minimum cosine similarity to assign to a cluster. Below this, the document is marked as noise (cluster_id = -1). |
Assign mode loads centroids from the most recent successful execution. Documents are compared to every centroid using cosine similarity and assigned to the closest match above the threshold. This is O(n*k) — fast enough for continuous ingestion.
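The assignment loop described above can be sketched in a few lines (illustrative only; the service's implementation is not shown on this page):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def assign(embedding, centroids, assignment_threshold=0.5):
    """Compare a new document to every centroid (O(k) per document).

    Returns (cluster_id, similarity); cluster_id -1 means noise.
    """
    best_id, best_sim = -1, -1.0
    for cluster_id, centroid in centroids.items():
        sim = cosine(embedding, centroid)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    if best_sim < assignment_threshold:
        return -1, best_sim  # below threshold: mark as noise
    return best_id, best_sim

centroids = {0: [1.0, 0.0], 1: [0.0, 1.0]}
print(assign([0.9, 0.1], centroids))    # closest to centroid 0
print(assign([-0.7, -0.7], centroids))  # no centroid above 0.5: noise (-1)
```

Running this over n new documents gives the O(n*k) cost mentioned above.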
Hierarchical Sub-Clustering
Enable recursive sub-clustering to discover structure within clusters. Each cluster with enough members is further divided using UMAP + HDBSCAN, producing a hierarchy of nested cluster IDs such as cl_0_sub_1_sub_0. Each centroid includes parent_cluster_id, child_cluster_ids, and hierarchy_level fields. Use this for large collections where top-level clusters are too broad (e.g., “sports” → “basketball” → “NBA highlights”).
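Given centroids carrying the parent/child fields described above, walking the hierarchy is a simple tree traversal. The record shapes here are illustrative:

```python
# Hypothetical centroid records with the hierarchy fields named above.
centroids = {
    "cl_0": {"parent_cluster_id": None, "child_cluster_ids": ["cl_0_sub_0", "cl_0_sub_1"], "hierarchy_level": 0},
    "cl_0_sub_0": {"parent_cluster_id": "cl_0", "child_cluster_ids": [], "hierarchy_level": 1},
    "cl_0_sub_1": {"parent_cluster_id": "cl_0", "child_cluster_ids": ["cl_0_sub_1_sub_0"], "hierarchy_level": 1},
    "cl_0_sub_1_sub_0": {"parent_cluster_id": "cl_0_sub_1", "child_cluster_ids": [], "hierarchy_level": 2},
}

def leaves(cluster_id):
    """Collect the most specific (leaf) cluster IDs under a node, depth-first."""
    children = centroids[cluster_id]["child_cluster_ids"]
    if not children:
        return [cluster_id]
    out = []
    for child in children:
        out.extend(leaves(child))
    return out

print(leaves("cl_0"))  # ['cl_0_sub_0', 'cl_0_sub_1_sub_0']
```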
Quality Metrics
Every execution returns quality metrics that measure cluster separation and cohesion:

| Metric | Range | Meaning |
|---|---|---|
| silhouette_score | -1 to 1 | Overall cluster separation quality. Above 0.5 is good. |
| mean_cosine_to_centroid | 0 to 1 | Average assignment confidence across all members. |
| noise_ratio | 0 to 1 | Fraction of documents classified as noise. |
| cluster_size_entropy | 0 to 1 | Normalized entropy of cluster sizes (1.0 = perfectly balanced). |
| should_recluster | 0 or 1 | Automatic recommendation based on metric thresholds. |
These metrics are returned with each execution record via GET /v1/clusters/{id}/executions. Use them to set up monitoring or trigger reclustering when quality degrades.
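A minimal monitor over these metrics might look like this. The should_recluster flag comes from the API; the extra thresholds are illustrative, not recommended values:

```python
def needs_recluster(metrics, max_noise=0.3, min_silhouette=0.2):
    """Decide whether to trigger a full re-cluster from execution metrics.

    Honors the API's should_recluster flag, then applies local thresholds
    (the 0.3 / 0.2 defaults here are illustrative).
    """
    if metrics.get("should_recluster"):
        return True  # the API's own recommendation fires first
    return (metrics["noise_ratio"] > max_noise
            or metrics["silhouette_score"] < min_silhouette)

good = {"should_recluster": 0, "silhouette_score": 0.55, "noise_ratio": 0.05}
bad = {"should_recluster": 0, "silhouette_score": 0.10, "noise_ratio": 0.40}
print(needs_recluster(good), needs_recluster(bad))  # False True
```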
LLM Labeling
LLM labeling generates human-readable names, summaries, and keywords for each cluster by sending representative documents to an LLM. You control exactly which document fields the LLM sees using input mappings — the same system used by retrievers and taxonomies.

Input Mappings
Each input mapping tells the labeling engine how to extract a value from a document payload and pass it to the LLM. You specify:

| Field | Description |
|---|---|
| input_key | The key the LLM receives — e.g. text, image_url, video_url, audio_url. |
| source_type | Where to pull the value from: payload, blob, or literal. |
| path | Dot-notation path into the document (for payload and blob source types). |
| override | Static value (only used with the literal source type). |
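A labeling_inputs list combining all three source types might look like the following. The mapping fields follow the table; the payload paths and the literal text are illustrative:

```python
# Illustrative input mappings; field names follow the table above,
# paths and values are hypothetical.
labeling_inputs = [
    {"input_key": "text", "source_type": "payload", "path": "title"},
    {"input_key": "text", "source_type": "payload", "path": "metadata.description"},
    {"input_key": "image_url", "source_type": "blob", "path": "document_blobs.0.url"},
    {"input_key": "text", "source_type": "literal",
     "override": "These documents are product videos."},
]
print(len(labeling_inputs))
```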
Without labeling_inputs, the full document payload is serialized as JSON and sent to the LLM. This works but is noisy — input mappings let you send only the fields that matter.

Source Types
payload — Read from document fields
Pull a value from any field in the document payload using dot-notation. Nested paths work too. Multiple text mappings with input_key: "text" are concatenated automatically.
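A dot-notation lookup like the one described can be sketched as follows (not the service's implementation):

```python
def resolve_path(payload, path):
    """Resolve a dot-notation path like 'metadata.description' into a payload.

    Numeric segments index into lists, so paths like 'blobs.0.url' work too.
    """
    value = payload
    for part in path.split("."):
        if isinstance(value, list):
            value = value[int(part)]  # numeric parts index into arrays
        else:
            value = value[part]
    return value

doc = {"title": "NBA highlights",
       "metadata": {"description": "Best dunks of the week"}}
print(resolve_path(doc, "metadata.description"))  # Best dunks of the week
```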
blob — Read from document blobs (images, videos, audio)
Pull a URL from the document’s stored blobs. Blobs are assets generated during processing — thumbnails, scene frames, source files, etc. The path navigates into blob arrays:
- document_blobs.0.url — first derived blob (e.g. scene thumbnail)
- source_blobs.0.url — original source file
- 0.url — first blob from either array (document_blobs checked first)
literal — Static value
Pass a fixed value, useful for adding context or instructions:
Text-Only Labeling
For collections with text content, map the relevant text fields to the text input_key.

Multimodal Labeling
Send images or video alongside text for richer labels. Use a vision-capable model like Gemini 2.5 Flash.

Custom Prompts and Response Shapes
Override the default labeling prompt for domain-specific terminology, and set response_shape to get structured output beyond the default label/keywords.
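Putting these together, an llm_labeling block with a custom prompt and response shape might look like this. The exact schema is not shown on this page, so every key here is an assumption:

```python
# Hypothetical llm_labeling configuration; key names are assumptions.
llm_labeling = {
    "model": "gemini-2.5-flash",  # vision-capable model, as suggested above
    "prompt": "Name each cluster using radiology terminology.",
    "response_shape": {
        "label": "string",
        "keywords": "list[string]",
        "confidence": "float",  # extra structured field beyond the defaults
    },
}
print(sorted(llm_labeling["response_shape"]))
```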
Advanced Settings
| Setting | Default | Description |
|---|---|---|
max_samples_per_cluster | 5 | Number of representative documents sent to the LLM per cluster. More samples = better labels but higher cost. |
sample_text_max_length | — | Truncate text inputs to this character length. |
include_summary | true | Generate a longer summary alongside the label. |
include_keywords | true | Generate keyword tags for each cluster. |
use_embedding_dedup | false | Deduplicate similar labels across clusters using embedding similarity. |
embedding_similarity_threshold | 0.8 | Threshold for dedup — labels above this similarity are merged. |
cache_ttl_seconds | 604800 | Cache labels for this duration (default: 7 days). Set to 0 to disable. |
Execution & Triggers
- Manual run: POST /v1/clusters/{id}/execute
- Submit asynchronous job: POST /v1/clusters/{id}/execute/submit
- Automated triggers: create cron, interval, or event-based triggers under /v1/clusters/triggers. Execution history is accessible via trigger endpoints.
- Every run yields a run_id, exposes status via /v1/clusters/{id}/executions, and can be monitored through task polling.
Artifacts
| Artifact | Endpoint | Contents |
|---|---|---|
| Centroids | /executions/{run_id}/artifacts?include_centroids=true | Cluster ID, centroid vectors, counts, labels, summaries, keywords |
| Members | /executions/{run_id}/artifacts?include_members=true | Point IDs, reduced coordinates (x, y, z), cluster assignment |
| Streaming data | /executions/{run_id}/data | Stream centroids and members (Parquet-backed) for visualization |
3D Visualization
When dimension_reduction.components is set to 3, the API returns three coordinates per member (x, y, z). In Studio, the third dimension is rendered as dot size — points with a higher z value appear larger, creating a depth cue on the 2D scatter plot. This avoids the complexity of a full 3D renderer while still surfacing the additional structure captured by the third principal component or UMAP axis.
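The z-to-size mapping can be sketched as a normalization into a radius range. The radius bounds here are illustrative, not Studio's actual values:

```python
def dot_radius(z, z_min, z_max, r_min=2.0, r_max=10.0):
    """Map a member's z coordinate to a dot radius for a 2D scatter plot.

    Normalizes z across all members, then scales linearly into
    [r_min, r_max] (bounds are illustrative).
    """
    if z_max == z_min:
        return (r_min + r_max) / 2  # flat z axis: uniform dot size
    t = (z - z_min) / (z_max - z_min)  # normalize to [0, 1]
    return r_min + t * (r_max - r_min)

zs = [-1.0, 0.0, 2.0]
radii = [dot_radius(z, min(zs), max(zs)) for z in zs]
print(radii)  # smallest z -> 2.0, largest z -> 10.0
```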
Enrichment
Apply cluster membership back to collections: enrichment writes cluster_id (and optionally labels/summaries) into document payloads, enabling cluster-based filters and facets.
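Conceptually, enrichment is a merge of cluster assignments into document payloads. The record shapes below are illustrative, not the service's actual schema:

```python
def enrich(documents, assignments, labels=None):
    """Write cluster_id (and optionally a label) into each document payload.

    Documents with no assignment get cluster_id -1 (noise), matching the
    assignment-threshold behavior described earlier.
    """
    labels = labels or {}
    for doc in documents:
        cluster_id = assignments.get(doc["document_id"], -1)
        doc["cluster_id"] = cluster_id
        if cluster_id in labels:
            doc["cluster_label"] = labels[cluster_id]
    return documents

docs = [{"document_id": "d1"}, {"document_id": "d2"}]
enrich(docs, {"d1": 0}, {0: "NBA highlights"})
print(docs[0]["cluster_label"], docs[1]["cluster_id"])  # NBA highlights -1
```

Once cluster_id lives in the payload, it behaves like any other field for filters and facets.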
Monitoring & Management
- GET /v1/clusters/{id} – inspect definition, latest run, enrichment status.
- POST /v1/clusters/list – search and filter cluster definitions.
- GET /v1/clusters/{id}/executions – view execution history and metrics.
- DELETE /v1/clusters/{id} – remove obsolete definitions (artifacts remain unless deleted separately).
- Webhooks notify you when clustering jobs complete; integrate with alerting or automation.
Best Practices
- Preprocess high-dimensional embeddings – use preprocessing_steps with whitening + UMAP for embeddings above 512 dimensions. This dramatically improves HDBSCAN cluster quality.
- Prototype on samples – tune algorithm parameters using a small sample_size before running at scale.
- Use incremental assignment for streaming data – run mode: "full" periodically, then mode: "assign" for new documents between full runs.
- Monitor quality metrics – check silhouette_score, noise_ratio, and mean_cosine_to_centroid after each execution. Recluster when should_recluster fires.
- Use hierarchical clustering for large collections – enable hierarchical: true when top-level clusters are too broad to be actionable.
- Label efficiently – enable LLM labeling once clusters look coherent; store labels with confidence scores.
- Close the loop – evaluate clusters as candidate taxonomy nodes or enrichments for retrievers.

