Workflow Overview
- Define cluster (POST /v1/clusters) – choose source collections, feature URIs, algorithm, and optional labeling strategy.
- Execute – run manually (POST /v1/clusters/{id}/execute) or schedule via triggers (/v1/clusters/triggers/...).
- Inspect artifacts – fetch centroids, members, or reduced coordinates (/v1/clusters/{id}/artifacts).
- Enrich documents – write cluster_id, labels, and keywords back into collections.
- Promote to taxonomy (optional) – convert stable clusters into reference nodes.
Configuration Highlights
| Setting | Description |
|---|---|
| feature_addresses | One or more feature URIs to cluster on (dense, sparse, multi-vector). |
| algorithm | kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, or optics. |
| dimension_reduction | Optional UMAP / PCA for visualization coordinates. |
| llm_labeling | Generate cluster labels, summaries, and keywords using configured LLM providers. |
| hierarchical | Enable to compute parent-child cluster relationships. |
| sample_size | Run on a subset before clustering the full dataset. |
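Putting these settings together, a cluster definition body for POST /v1/clusters might look like the sketch below. The collection name, feature URI, and the exact shape of the nested objects are placeholders, not confirmed field values:

```json
{
  "name": "support-tickets-topics",
  "source_collections": ["support_tickets"],
  "feature_addresses": ["feature://text_embedding/dense"],
  "algorithm": "hdbscan",
  "dimension_reduction": { "method": "umap", "n_components": 2 },
  "sample_size": 5000,
  "llm_labeling": { "enabled": true }
}
```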
LLM Labeling
LLM labeling generates human-readable names, summaries, and keywords for each cluster by sending representative documents to an LLM. You control exactly which document fields the LLM sees using input mappings — the same system used by retrievers and taxonomies.
Input Mappings
Each input mapping tells the labeling engine how to extract a value from a document payload and pass it to the LLM. You specify:
| Field | Description |
|---|---|
| input_key | The key the LLM receives — e.g. text, image_url, video_url, audio_url. |
| source_type | Where to pull the value from: payload, blob, or literal. |
| path | Dot-notation path into the document (for payload and blob source types). |
| override | Static value (only used with literal source type). |
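As an illustration, a labeling_inputs array combining these fields might look like the following sketch (the payload paths and the literal text are placeholders):

```json
{
  "labeling_inputs": [
    { "input_key": "text", "source_type": "payload", "path": "title" },
    { "input_key": "image_url", "source_type": "blob", "path": "document_blobs.0.url" },
    { "input_key": "text", "source_type": "literal", "override": "Domain: customer support tickets" }
  ]
}
```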
Without labeling_inputs, the full document payload is serialized as JSON and sent to the LLM. This works but is noisy — input mappings let you send only the fields that matter.
Source Types
payload — Read from document fields
Pull a value from any field in the document payload using dot-notation; nested paths work too. Multiple text mappings with input_key: "text" are concatenated automatically.
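A sketch of two payload mappings (field names are assumptions): the first reads a top-level field, the second uses a nested dot-notation path, and because both use input_key "text" their values are concatenated before being sent to the LLM:

```json
{
  "labeling_inputs": [
    { "input_key": "text", "source_type": "payload", "path": "title" },
    { "input_key": "text", "source_type": "payload", "path": "metadata.description" }
  ]
}
```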
blob — Read from document blobs (images, videos, audio)
Pull a URL from the document’s stored blobs. Blobs are assets generated during processing — thumbnails, scene frames, source files, etc. The path navigates into blob arrays:
- document_blobs.0.url – first derived blob (e.g. scene thumbnail)
- source_blobs.0.url – original source file
- 0.url – first blob from either array (document_blobs checked first)
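For example, a blob mapping that sends the first derived thumbnail to a vision-capable model could be sketched as:

```json
{
  "labeling_inputs": [
    { "input_key": "image_url", "source_type": "blob", "path": "document_blobs.0.url" }
  ]
}
```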
literal — Static value
Pass a fixed value, useful for adding context or instructions:
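A literal mapping sketch that injects fixed context alongside the document fields (the override text is only an example):

```json
{
  "labeling_inputs": [
    { "input_key": "text", "source_type": "literal", "override": "These documents are e-commerce product listings." }
  ]
}
```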
Text-Only Labeling
For collections with text content, map the relevant text fields.
Multimodal Labeling
Send images or video alongside text for richer labels. Use a vision-capable model like Gemini 2.5 Flash.
Custom Prompts and Response Shapes
Override the default labeling prompt for domain-specific terminology, and set response_shape to get structured output beyond the default label/keywords.
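The pieces above can be combined into one llm_labeling configuration. The sketch below pairs a text field and an image blob with a custom prompt and a structured response shape; the "model", "prompt", and response_shape notation shown here are assumptions about the config format, and the paths are placeholders:

```json
{
  "llm_labeling": {
    "model": "gemini-2.5-flash",
    "labeling_inputs": [
      { "input_key": "text", "source_type": "payload", "path": "title" },
      { "input_key": "image_url", "source_type": "blob", "path": "document_blobs.0.url" }
    ],
    "prompt": "You are labeling clusters of fashion product photos. Prefer industry terminology (e.g. 'athleisure', 'outerwear').",
    "response_shape": {
      "label": "string",
      "summary": "string",
      "keywords": ["string"]
    }
  }
}
```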
Advanced Settings
| Setting | Default | Description |
|---|---|---|
| max_samples_per_cluster | 5 | Number of representative documents sent to the LLM per cluster. More samples = better labels but higher cost. |
| sample_text_max_length | — | Truncate text inputs to this character length. |
| include_summary | true | Generate a longer summary alongside the label. |
| include_keywords | true | Generate keyword tags for each cluster. |
| use_embedding_dedup | false | Deduplicate similar labels across clusters using embedding similarity. |
| embedding_similarity_threshold | 0.8 | Threshold for dedup — labels above this similarity are merged. |
| cache_ttl_seconds | 604800 | Cache labels for this duration (default: 7 days). Set to 0 to disable. |
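A sketch of how these knobs might be tuned while iterating on label quality — more samples per cluster, truncated text, embedding dedup on, and caching disabled so each run regenerates labels (the surrounding llm_labeling nesting is an assumption):

```json
{
  "llm_labeling": {
    "max_samples_per_cluster": 8,
    "sample_text_max_length": 2000,
    "use_embedding_dedup": true,
    "embedding_similarity_threshold": 0.85,
    "cache_ttl_seconds": 0
  }
}
```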
Execution & Triggers
- Manual run: POST /v1/clusters/{id}/execute
- Submit asynchronous job: POST /v1/clusters/{id}/execute/submit
- Automated triggers: create cron, interval, or event-based triggers under /v1/clusters/triggers. Execution history is accessible via trigger endpoints.
- Every run yields a run_id, exposes status via /v1/clusters/{id}/executions, and can be monitored through task polling.
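As an example, a nightly cron trigger posted to /v1/clusters/triggers might look like the sketch below; the field names are assumptions, not a confirmed schema:

```json
{
  "cluster_id": "c-123",
  "trigger_type": "cron",
  "schedule": "0 3 * * *",
  "enabled": true
}
```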
Artifacts
| Artifact | Endpoint | Contents |
|---|---|---|
| Centroids | /executions/{run_id}/artifacts?include_centroids=true | Cluster ID, centroid vectors, counts, labels, summaries, keywords |
| Members | /executions/{run_id}/artifacts?include_members=true | Point IDs, reduced coordinates (x, y, z), cluster assignment |
| Streaming data | /executions/{run_id}/data | Stream centroids and members (Parquet-backed) for visualization |
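An illustrative artifact response shape, combining the centroid and member fields listed in the table (the exact JSON layout is an assumption; values are placeholders):

```json
{
  "run_id": "run-42",
  "centroids": [
    {
      "cluster_id": 0,
      "vector": [0.12, -0.08, 0.33],
      "count": 1284,
      "label": "Shipping delays",
      "keywords": ["delivery", "late", "tracking"]
    }
  ],
  "members": [
    { "point_id": "doc-1", "cluster_id": 0, "x": 1.4, "y": -0.7 }
  ]
}
```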
Enrichment
Apply cluster membership back to collections: enrichment writes cluster_id (and optionally labels/summaries) into document payloads, enabling cluster-based filters and facets.
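After enrichment, a document payload might look like the sketch below; the names of the written-back fields (cluster_label, cluster_keywords) are assumptions for illustration:

```json
{
  "id": "doc-1",
  "payload": {
    "title": "Package stuck in transit for two weeks",
    "cluster_id": 0,
    "cluster_label": "Shipping delays",
    "cluster_keywords": ["delivery", "late", "tracking"]
  }
}
```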
Monitoring & Management
- GET /v1/clusters/{id} – inspect definition, latest run, enrichment status.
- POST /v1/clusters/list – search and filter cluster definitions.
- GET /v1/clusters/{id}/executions – view execution history and metrics.
- DELETE /v1/clusters/{id} – remove obsolete definitions (artifacts remain unless deleted separately).
- Webhooks notify you when clustering jobs complete; integrate with alerting or automation.
Best Practices
- Prototype on samples – tune algorithm parameters using a small sample_size before running at scale.
- Automate freshness – use triggers (cron or event-based) to keep clusters aligned with new data.
- Label efficiently – enable LLM labeling once clusters look coherent; store labels with confidence scores.
- Close the loop – evaluate clusters as candidate taxonomy nodes or enrichments for retrievers.
- Watch metrics – use execution statistics (duration, member counts) to detect drift or parameter issues.

