Clusters

[Figure: cluster visualization showing document groupings with centroids and member assignments]
Clusters group similar documents using configurable algorithms running on the Engine’s Ray workers. They produce reusable artifacts, optional enrichments, and can even feed new taxonomies. Clustering is warehouse-native grouping: the multimodal equivalent of SQL GROUP BY, operating in embedding space rather than on discrete column values.

Workflow Overview

  1. Define cluster (POST /v1/clusters) – choose source collections, feature URIs, algorithm, and optional labeling strategy.
  2. Execute – run manually (POST /v1/clusters/{id}/execute) or schedule via triggers (/v1/clusters/triggers/...).
  3. Inspect artifacts – fetch centroids, members, or reduced coordinates (/v1/clusters/{id}/artifacts).
  4. Enrich documents – write cluster_id, labels, and keywords back into collections.
  5. Promote to taxonomy (optional) – convert stable clusters into reference nodes.

Configuration Highlights

  • feature_addresses – One or more feature URIs to cluster on (dense, sparse, or multi-vector).
  • algorithm – kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, or optics.
  • preprocessing_steps – Ordered preprocessing before clustering: whitening, UMAP reduction, or both chained.
  • hierarchical – Enable recursive sub-clustering within each cluster for multi-level grouping.
  • llm_labeling – Generate cluster labels, summaries, and keywords using configured LLM providers.
  • sample_size – Run on a subset before clustering the full dataset.
Example definition:
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "product_topics",
    "collection_ids": ["col_products"],
    "cluster_type": "vector",
    "vector_config": {
      "feature_uri": "mixpeek://text_extractor@v1/text_embedding",
      "clustering_method": "hdbscan",
      "algorithm_params": { "min_cluster_size": 10, "min_samples": 5 },
      "preprocessing_steps": [
        { "method": "whitening" },
        { "method": "umap", "n_components": 50, "n_neighbors": 30 }
      ]
    },
    "llm_labeling": {
      "enabled": true,
      "provider": "openai",
      "model_name": "gpt-4o-mini"
    }
  }'

Preprocessing Steps

High-dimensional embeddings (e.g., 1408 or 3072 dimensions) benefit from preprocessing before density-based algorithms like HDBSCAN. The preprocessing_steps field accepts an ordered list of transformations applied before clustering.

Embedding Whitening (Soft-ZCA)

Whitening decorrelates embedding dimensions, removing redundant structure that causes density-based algorithms to over-fragment clusters. ZCA whitening is used because it preserves the original coordinate space.
{
  "preprocessing_steps": [
    { "method": "whitening", "regularization": 1e-5 }
  ]
}

UMAP Pre-Reduction

UMAP reduces dimensionality while preserving local neighborhood structure — critical for HDBSCAN, which suffers from the curse of dimensionality on raw embeddings. Defaults are optimized for clustering: 50 components, cosine metric, 30 neighbors, 0.0 min_dist.
{
  "preprocessing_steps": [
    { "method": "umap", "n_components": 50, "n_neighbors": 30, "min_dist": 0.0, "metric": "cosine" }
  ]
}

Chained Preprocessing

Chain whitening and UMAP together for best results on high-dimensional embeddings. Steps execute in order.
{
  "preprocessing_steps": [
    { "method": "whitening", "regularization": 1e-5 },
    { "method": "umap", "n_components": 50, "n_neighbors": 30, "min_dist": 0.0 }
  ]
}
For raw video/image embeddings above 1000 dimensions, the whitening + UMAP chain typically improves HDBSCAN cluster purity by 15-30% compared to clustering on raw embeddings.

Cosine Similarity Metrics

Every cluster execution automatically computes cosine similarity between each member and its assigned centroid. These metrics appear on both member documents and centroids:
  • cosine_similarity_to_centroid (member) – How well this document fits its cluster (0-1, higher is better).
  • mean_cosine_similarity (centroid) – Average member similarity (cluster cohesion).
  • min_cosine_similarity (centroid) – Weakest member similarity (indicates outliers).
Use these to identify borderline assignments, filter low-confidence members, or set quality thresholds for recluster triggers.
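For example, to pull the weakest-fitting members of a run for review (a sketch: the artifacts endpoint is the one documented under Artifacts below, but the response shape, including a top-level members array, is an assumption):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}/executions/{run_id}/artifacts?include_members=true" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" |
  # keep members whose fit to their centroid is weak (assumed response shape)
  jq '[.members[] | select(.cosine_similarity_to_centroid < 0.5)]'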

Incremental Assignment

After an initial clustering run, assign new documents to existing clusters without re-running the full algorithm. Pass mode: "assign" when executing:
curl -sS -X POST "$MP_API_URL/v1/clusters/{cluster_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "assign",
    "assignment_threshold": 0.5
  }'
  • mode (default "full") – "full" re-clusters from scratch; "assign" uses existing centroids.
  • assignment_threshold (default 0.5) – Minimum cosine similarity to assign to a cluster. Below this, the document is marked as noise (cluster_id = -1).
Assign mode loads centroids from the most recent successful execution. Documents are compared to every centroid using cosine similarity and assigned to the closest match above the threshold. This is O(n·k) for n documents and k centroids, fast enough for continuous ingestion.
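After an assign run, you can check how many documents fell below the threshold by counting noise assignments (a sketch; cluster_id = -1 comes from the table above, while the top-level members array in the response is an assumption):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}/executions/{run_id}/artifacts?include_members=true" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" |
  # noise documents carry cluster_id = -1 (assumed response shape)
  jq '[.members[] | select(.cluster_id == -1)] | length'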

Hierarchical Sub-Clustering

Enable recursive sub-clustering to discover structure within clusters. Each cluster with enough members is further divided using UMAP + HDBSCAN, producing a hierarchy of nested cluster IDs.
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "content_hierarchy",
    "collection_ids": ["col_videos"],
    "cluster_type": "vector",
    "vector_config": {
      "feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
      "clustering_method": "hdbscan",
      "algorithm_params": { "min_cluster_size": 50, "min_samples": 10 },
      "hierarchical": true,
      "max_hierarchy_depth": 3
    }
  }'
Sub-clusters get IDs like cl_0_sub_1_sub_0, and each centroid includes parent_cluster_id, child_cluster_ids, and hierarchy_level fields. Use this for large collections where top-level clusters are too broad (e.g., “sports” → “basketball” → “NBA highlights”).
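To inspect the nesting, you can walk the centroids artifact (a sketch; parent_cluster_id and hierarchy_level are the fields named above, and the top-level centroids array is an assumption):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}/executions/{run_id}/artifacts?include_centroids=true" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" |
  # indent each cluster by its hierarchy_level to render the tree
  jq -r '.centroids[] | "\("  " * .hierarchy_level // "")\(.cluster_id) (parent: \(.parent_cluster_id // "none"))"'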

Quality Metrics

Every execution returns quality metrics that measure cluster separation and cohesion:
  • silhouette_score (-1 to 1) – Overall cluster separation quality. Above 0.5 is good.
  • mean_cosine_to_centroid (0 to 1) – Average assignment confidence across all members.
  • noise_ratio (0 to 1) – Fraction of documents classified as noise.
  • cluster_size_entropy (0 to 1) – Normalized entropy of cluster sizes (1.0 = perfectly balanced).
  • should_recluster (0 or 1) – Automatic recommendation based on metric thresholds.
These are available in the execution results via GET /v1/clusters/{id}/executions. Use them to set up monitoring or trigger reclustering when quality degrades.
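A minimal health check might look like this (a sketch; the endpoint is as documented, but the response is assumed to be a list of runs, newest first, with metrics at the top level):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}/executions" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" |
  # pull the latest run's separation, noise, and recluster flag
  jq '.[0] | {silhouette_score, noise_ratio, should_recluster}'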

LLM Labeling

LLM labeling generates human-readable names, summaries, and keywords for each cluster by sending representative documents to an LLM. You control exactly which document fields the LLM sees using input mappings — the same system used by retrievers and taxonomies.

Input Mappings

Each input mapping tells the labeling engine how to extract a value from a document payload and pass it to the LLM. You specify:
  • input_key – The key the LLM receives, e.g. text, image_url, video_url, audio_url.
  • source_type – Where to pull the value from: payload, blob, or literal.
  • path – Dot-notation path into the document (for payload and blob source types).
  • override – Static value (only used with the literal source type).
Without labeling_inputs, the full document payload is serialized as JSON and sent to the LLM. This works but is noisy — input mappings let you send only the fields that matter.

Source Types

payload

Pull a value from any field in the document payload using dot-notation:
{
  "input_key": "text",
  "source_type": "payload",
  "path": "headline"
}
Nested paths work too:
{
  "input_key": "text",
  "source_type": "payload",
  "path": "metadata.description"
}
Multiple text mappings with input_key: "text" are concatenated automatically.

blob

Pull a URL from the document’s stored blobs. Blobs are assets generated during processing — thumbnails, scene frames, source files, etc.
{
  "input_key": "image_url",
  "source_type": "blob",
  "path": "document_blobs.0.url"
}
The path navigates into blob arrays:
  • document_blobs.0.url — first derived blob (e.g. scene thumbnail)
  • source_blobs.0.url — original source file
  • 0.url — first blob from either array (document_blobs checked first)
S3 URLs are automatically presigned so the LLM provider can access them.

literal

Pass a fixed value, useful for adding context or instructions:
{
  "input_key": "text",
  "source_type": "literal",
  "override": "Category: marketing videos"
}

Text-Only Labeling

For collections with text content, map the relevant text fields:
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "script_archetypes",
    "source_collection_ids": ["col_ad_scripts"],
    "feature_addresses": ["mixpeek://text_extractor@v1/text_embedding"],
    "algorithm": "kmeans",
    "algorithm_config": { "num_clusters": 20 },
    "llm_labeling": {
      "provider": "openai",
      "model_name": "gpt-4o-mini",
      "labeling_inputs": {
        "input_mappings": [
          { "input_key": "text", "source_type": "payload", "path": "headline" },
          { "input_key": "text", "source_type": "payload", "path": "primary_text" },
          { "input_key": "text", "source_type": "payload", "path": "description" }
        ]
      }
    }
  }'

Multimodal Labeling

Send images or video alongside text for richer labels. Use a vision-capable model like Gemini 2.5 Flash:
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "scene_themes",
    "source_collection_ids": ["col_ad_scenes"],
    "feature_addresses": ["mixpeek://multimodal_extractor@v1/multimodal_embedding"],
    "algorithm": "kmeans",
    "algorithm_config": { "num_clusters": 25 },
    "llm_labeling": {
      "provider": "google",
      "model_name": "gemini-2.5-flash-preview-04-17",
      "labeling_inputs": {
        "input_mappings": [
          { "input_key": "text", "source_type": "payload", "path": "headline" },
          { "input_key": "text", "source_type": "payload", "path": "primary_text" },
          { "input_key": "image_url", "source_type": "blob", "path": "document_blobs.0.url" }
        ]
      }
    }
  }'
Use image_url for scene thumbnails and frames. Use video_url for full video clips. The LLM receives both the text and visual content when generating labels.
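A video mapping looks the same with a different input key (a sketch reusing the blob paths described under Source Types; whether the source blob is a playable clip depends on your pipeline):

{
  "input_key": "video_url",
  "source_type": "blob",
  "path": "source_blobs.0.url"
}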

Custom Prompts and Response Shapes

Override the default labeling prompt for domain-specific terminology:
{
  "llm_labeling": {
    "provider": "openai",
    "model_name": "gpt-4o",
    "custom_prompt": "You are a creative strategist. Analyze these ad clusters and generate labels that describe the creative archetype (e.g. 'UGC Testimonial', 'Problem-Solution Demo'). Return JSON array: [{\"cluster_id\": \"cl_0\", \"label\": \"...\", \"keywords\": [...]}]",
    "labeling_inputs": {
      "input_mappings": [
        { "input_key": "text", "source_type": "payload", "path": "headline" },
        { "input_key": "text", "source_type": "payload", "path": "primary_text" }
      ]
    }
  }
}
Define a custom response_shape to get structured output beyond the default label/keywords:
{
  "llm_labeling": {
    "response_shape": {
      "label": "string",
      "keywords": ["string"],
      "sentiment": "positive | negative | neutral",
      "target_audience": "string"
    }
  }
}

Advanced Settings

  • max_samples_per_cluster (default 5) – Number of representative documents sent to the LLM per cluster. More samples = better labels but higher cost.
  • sample_text_max_length – Truncate text inputs to this character length.
  • include_summary (default true) – Generate a longer summary alongside the label.
  • include_keywords (default true) – Generate keyword tags for each cluster.
  • use_embedding_dedup (default false) – Deduplicate similar labels across clusters using embedding similarity.
  • embedding_similarity_threshold (default 0.8) – Threshold for dedup; labels above this similarity are merged.
  • cache_ttl_seconds (default 604800, i.e. 7 days) – Cache labels for this duration. Set to 0 to disable.
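Combined, a labeling config using several of these might look like the following (a sketch; it assumes these settings nest directly under llm_labeling alongside provider and model_name, and the values are illustrative):

{
  "llm_labeling": {
    "provider": "openai",
    "model_name": "gpt-4o-mini",
    "max_samples_per_cluster": 10,
    "sample_text_max_length": 2000,
    "use_embedding_dedup": true,
    "embedding_similarity_threshold": 0.85
  }
}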

Execution & Triggers

  • Manual run: POST /v1/clusters/{id}/execute
  • Submit asynchronous job: POST /v1/clusters/{id}/execute/submit (see the sketch after this list).
  • Automated triggers: create cron, interval, or event-based triggers under /v1/clusters/triggers. Execution history is accessible via trigger endpoints.
  • Every run yields a run_id, exposes status via /v1/clusters/{id}/executions, and can be monitored through task polling.
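For example, submitting an asynchronous run (a sketch; the endpoint is listed above, and the body is assumed to accept the same mode parameter as the synchronous execute call):

curl -sS -X POST "$MP_API_URL/v1/clusters/{cluster_id}/execute/submit" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "mode": "full" }'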

Artifacts

  • Centroids – /executions/{run_id}/artifacts?include_centroids=true – cluster ID, centroid vectors, counts, labels, summaries, keywords.
  • Members – /executions/{run_id}/artifacts?include_members=true – point IDs, reduced coordinates (x, y, z), cluster assignment.
  • Streaming data – /executions/{run_id}/data – stream centroids and members (Parquet-backed) for visualization.
Artifacts are stored as Parquet in S3 for efficient downstream analytics and visualization.
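For example, streaming a run's data for visualization (a sketch; the /data path comes from the table above, and the /v1/clusters/{cluster_id} prefix is an assumption consistent with the other endpoints on this page):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}/executions/{run_id}/data" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"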

3D Visualization

When dimension_reduction.components is set to 3, the API returns three coordinates per member (x, y, z). In Studio, the third dimension is rendered as dot size — points with a higher z value appear larger, creating a depth cue on the 2D scatter plot. This avoids the complexity of a full 3D renderer while still surfacing the additional structure captured by the third principal component or UMAP axis.
{
  "dimension_reduction": {
    "method": "umap",
    "components": 3
  }
}

Enrichment

Apply cluster membership back to collections:
curl -sS -X POST "$MP_API_URL/v1/clusters/{cluster_id}/enrich" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "run_id": "run_xyz789",
    "target_collection_id": "col_products_enriched",
    "fields": ["cluster_id", "label", "summary", "keywords"]
  }'
Enrichment writes cluster_id (and optionally labels/summaries) into document payloads, enabling cluster-based filters and facets.

Monitoring & Management

  • GET /v1/clusters/{id} – inspect definition, latest run, enrichment status (example after this list).
  • POST /v1/clusters/list – search and filter cluster definitions.
  • GET /v1/clusters/{id}/executions – view execution history and metrics.
  • DELETE /v1/clusters/{id} – remove obsolete definitions (artifacts remain unless deleted separately).
  • Webhooks notify you when clustering jobs complete; integrate with alerting or automation.
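For example, inspecting a definition and its latest run (per the first bullet above, with the same headers used throughout):

curl -sS "$MP_API_URL/v1/clusters/{cluster_id}" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"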

Best Practices

  1. Preprocess high-dimensional embeddings – use preprocessing_steps with whitening + UMAP for embeddings above 512 dimensions. This dramatically improves HDBSCAN cluster quality.
  2. Prototype on samples – tune algorithm parameters using a small sample_size before running at scale.
  3. Use incremental assignment for streaming data – run mode: "full" periodically, then mode: "assign" for new documents between full runs.
  4. Monitor quality metrics – check silhouette_score, noise_ratio, and mean_cosine_to_centroid after each execution. Recluster when should_recluster fires.
  5. Use hierarchical clustering for large collections – enable hierarchical: true when top-level clusters are too broad to be actionable.
  6. Label efficiently – enable LLM labeling once clusters look coherent; store labels with confidence scores.
  7. Close the loop – evaluate clusters as candidate taxonomy nodes or enrichments for retrievers.
Clustering is the bridge between raw embeddings and structured understanding. Use it to discover themes, power analytics, and bootstrap taxonomies that feed retrieval.