# Cluster

> Group documents by embedding similarity into semantic clusters

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/retrievers/cluster.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=2e8b5f4a714a1beefd9733f7fbb0acbd" alt="Cluster stage showing semantic grouping of search results" width="900" height="350" data-path="assets/retrievers/cluster.svg" />
</Frame>

The Cluster stage groups documents based on embedding similarity, creating semantic clusters of related content. This helps organize search results into meaningful groups and discover themes within your results.

<Note>
  **Stage Category**: GROUP (Groups documents)

  **Transformation**: N documents → K clusters with documents
</Note>

## When to Use

| Use Case                | Description                       |
| ----------------------- | --------------------------------- |
| **Theme discovery**     | Find topics within search results |
| **Result organization** | Group similar items together      |
| **Deduplication**       | Find near-duplicate content       |
| **Exploration**         | Understand result diversity       |

## When NOT to Use

| Scenario                | Recommended Alternative |
| ----------------------- | ----------------------- |
| Grouping by field value | `group_by`              |
| Removing duplicates     | `deduplicate`           |
| Pre-defined categories  | `taxonomy_enrich`       |
| Single representative   | `sample` per group      |

## Parameters

| Parameter          | Type    | Default    | Description                                                        |
| ------------------ | ------- | ---------- | ------------------------------------------------------------------ |
| `n_clusters`       | integer | `5`        | Number of clusters (kmeans/agglomerative)                          |
| `feature_uri`      | string  | *auto*     | Embedding to cluster; auto-inherited from a prior `feature_search` |
| `output_mode`      | string  | `clusters` | `clusters`, `labeled`, or `representatives`                        |
| `algorithm`        | string  | `kmeans`   | Clustering algorithm                                               |
| `min_cluster_size` | integer | `2`        | Minimum documents per cluster                                      |
| `include_outliers` | boolean | `true`     | Include documents that don't fit clusters                          |
| `label_clusters`   | boolean | `false`    | Generate cluster labels with LLM                                   |

## Clustering Algorithms

| Algorithm       | Description                                | Best For                                   |
| --------------- | ------------------------------------------ | ------------------------------------------ |
| `auto`          | LLM picks the algorithm from dataset shape | You don't know the right algorithm upfront |
| `kmeans`        | K-means clustering                         | Fixed number of clusters                   |
| `hdbscan`       | Density-based clustering                   | Unknown cluster count                      |
| `dbscan`        | Density-based with fixed epsilon           | Noisy data with clear density separation   |
| `agglomerative` | Hierarchical clustering                    | Nested clusters                            |
| `spectral`      | Graph-based clustering                     | Non-convex cluster shapes                  |

<Info>
  `auto` makes one lightweight LLM call on the first batch, passing the sample size, dimensionality, variance, and intended cluster count, and then locks in a concrete algorithm. Because selection is delegated to the LLM, upgrading the model improves algorithm selection without code changes. If the LLM call fails, the stage falls back to `kmeans`.
</Info>

## Configuration Examples

<CodeGroup>
  ```json Basic Clustering theme={null}
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "n_clusters": 5
      }
    }
  }
  ```

  ```json KMeans With Known K theme={null}
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "algorithm": "kmeans",
        "n_clusters": 8
      }
    }
  }
  ```

  ```json Density-Based Clustering theme={null}
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "algorithm": "hdbscan",
        "min_cluster_size": 3
      }
    }
  }
  ```

  ```json Custom Embedding Field theme={null}
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "n_clusters": 6,
        "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1"
      }
    }
  }
  ```

  ```json Fine-Grained Clustering theme={null}
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "n_clusters": 15,
        "min_cluster_size": 2,
        "algorithm": "kmeans"
      }
    }
  }
  ```

  ```json Representative Documents theme={null}
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "algorithm": "kmeans",
        "n_clusters": 8,
        "output_mode": "representatives"
      }
    }
  }
  ```
</CodeGroup>
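If you assemble these configurations in code rather than by hand, a small helper can catch typos before the request is sent. The sketch below is illustrative only: `cluster_stage` is a hypothetical helper, not part of any SDK, and it validates `algorithm` and `output_mode` only against the values listed in the tables on this page.

```python
import json

# Values taken from the Parameters and Clustering Algorithms tables above.
ALGORITHMS = {"auto", "kmeans", "hdbscan", "dbscan", "agglomerative", "spectral"}
OUTPUT_MODES = {"clusters", "labeled", "representatives"}


def cluster_stage(**parameters):
    """Build a Cluster stage config dict (hypothetical helper, not an SDK call)."""
    algorithm = parameters.get("algorithm", "kmeans")
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    output_mode = parameters.get("output_mode", "clusters")
    if output_mode not in OUTPUT_MODES:
        raise ValueError(f"unknown output_mode: {output_mode!r}")
    return {
        "stage_name": "cluster",
        "stage_type": "group",
        "config": {"stage_id": "cluster", "parameters": parameters},
    }


# Same shape as the "Representative Documents" example.
stage = cluster_stage(algorithm="kmeans", n_clusters=8, output_mode="representatives")
stage_json = json.dumps(stage, indent=2)
```

Only the parameters you pass are written into `parameters`, so omitted values fall back to the stage defaults.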

## How Clustering Works

1. **Extract Embeddings**: Read the embedding vector from each document.
2. **Apply Algorithm**: Run the configured clustering algorithm (e.g., k-means).
3. **Assign Documents**: Assign each document to its nearest cluster.
4. **Compute Centroids**: Calculate the center of each cluster.
5. **Label (optional)**: Generate human-readable cluster names.
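
To make the assign and recompute steps concrete, here is a minimal pure-Python k-means sketch. It is a generic illustration of the algorithm, not the stage's actual implementation: the evenly spaced initialisation, the Euclidean distance, and the toy 2-D vectors are all assumptions chosen to keep the example deterministic.

```python
import math


def kmeans(vectors, k, max_iterations=50):
    """Plain k-means over a list of equal-length float vectors."""
    # Deterministic initialisation: pick k evenly spaced input vectors.
    # (Production implementations usually use k-means++ instead.)
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    assignments = None
    for _ in range(max_iterations):
        # Assign each vector to its nearest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing: converged
        assignments = new_assignments
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Toy 2-D "embeddings": two tight groups far apart.
vectors = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
           [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
assignments, centroids = kmeans(vectors, k=2)
```

Here `assignments[i]` is the cluster index for `vectors[i]`, and `math.dist(vectors[i], centroids[assignments[i]])` gives a value comparable to `distance_to_centroid` in the output schema.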

## Output Schema

```json theme={null}
{
  "clusters": [
    {
      "cluster_id": 0,
      "label": "Machine Learning Tutorials",
      "centroid": [0.12, -0.34, ...],
      "size": 12,
      "documents": [
        {
          "document_id": "doc_123",
          "content": "Introduction to neural networks...",
          "score": 0.95,
          "cluster": {
            "cluster_id": 0,
            "distance_to_centroid": 0.15
          }
        }
      ]
    },
    {
      "cluster_id": 1,
      "label": "Data Engineering",
      "centroid": [0.45, 0.23, ...],
      "size": 8,
      "documents": [...]
    }
  ],
  "outliers": [
    {
      "document_id": "doc_789",
      "content": "Unrelated content...",
      "outlier_reason": "distance_threshold_exceeded"
    }
  ],
  "metadata": {
    "algorithm": "kmeans",
    "n_clusters": 5,
    "total_documents": 50,
    "clustered_documents": 48,
    "outlier_count": 2
  }
}
```
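
A response in this shape can be reduced to one summary row per cluster. The sketch below assumes the field names shown in the schema above; `summarize_clusters` is a hypothetical helper, and `doc_456` is made-up sample data.

```python
def summarize_clusters(result):
    """Return one summary row per cluster from a Cluster-stage style result."""
    rows = []
    for cluster in result.get("clusters", []):
        docs = cluster.get("documents", [])
        # Pick the member closest to the centroid; documents without a
        # distance sort last. `default=None` covers empty clusters.
        closest = min(
            docs,
            key=lambda d: d.get("cluster", {}).get("distance_to_centroid", float("inf")),
            default=None,
        )
        rows.append({
            "cluster_id": cluster["cluster_id"],
            "label": cluster.get("label"),
            "size": cluster.get("size", len(docs)),
            "closest_document_id": closest["document_id"] if closest else None,
        })
    return rows


# Sample data shaped like the schema above (doc_456 is invented for the example).
example = {
    "clusters": [
        {
            "cluster_id": 0,
            "label": "Machine Learning Tutorials",
            "size": 2,
            "documents": [
                {"document_id": "doc_456", "cluster": {"cluster_id": 0, "distance_to_centroid": 0.40}},
                {"document_id": "doc_123", "cluster": {"cluster_id": 0, "distance_to_centroid": 0.15}},
            ],
        },
        {"cluster_id": 1, "label": "Data Engineering", "size": 0, "documents": []},
    ],
    "outliers": [{"document_id": "doc_789"}],
}
rows = summarize_clusters(example)
```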

## Performance

| Metric          | Value                         |
| --------------- | ----------------------------- |
| **Latency**     | 50-200ms                      |
| **Memory**      | O(N × embedding\_dim)         |
| **Cost**        | Free (+ LLM cost if labeling) |
| **Scalability** | Up to \~10K documents         |

<Warning>
  Clustering large document sets (10K+) can be slow. Consider pre-filtering or sampling before clustering.
</Warning>
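
As a rough worked example of the memory bound and the sampling advice, the sketch below estimates embedding memory and caps a result set before clustering. The 4 bytes per value (float32), the 1024-dimension figure, and both helper names are assumptions for illustration, not documented behavior of the stage.

```python
import random


def embedding_memory_mb(n_documents, embedding_dim, bytes_per_value=4):
    """Approximate size of an N x embedding_dim matrix in megabytes (10^6 bytes)."""
    return n_documents * embedding_dim * bytes_per_value / 1_000_000


def cap_documents(documents, max_documents=10_000, seed=0):
    """Return at most max_documents items, sampled uniformly without replacement."""
    if len(documents) <= max_documents:
        return list(documents)
    return random.Random(seed).sample(documents, max_documents)


# 10,000 documents x 1024 dimensions x 4 bytes = 40,960,000 bytes.
estimate = embedding_memory_mb(10_000, 1024)

# Cap a larger candidate list before handing it to the Cluster stage.
capped = cap_documents(list(range(25_000)), max_documents=10_000)
```

Uniform sampling keeps the cluster structure roughly representative. If relevance matters more than coverage, lowering `final_top_k` in the preceding search stage is a simpler alternative.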

## Common Pipeline Patterns

### Search + Cluster

```json theme={null}
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 100 }
        ],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "n_clusters": 5
      }
    }
  }
]
```

### Cluster + Sample Representatives

```json theme={null}
[
  {
    "stage_name": "hybrid_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 200 }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "n_clusters": 10
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "cluster.cluster_id",
        "min_per_stratum": 2
      }
    }
  }
]
```

### Theme Discovery Pipeline

```json theme={null}
[
  {
    "stage_name": "structured_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "conditions": {
          "field": "metadata.date",
          "operator": "gte",
          "value": "2024-01-01"
        }
      }
    }
  },
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "algorithm": "hdbscan",
        "min_cluster_size": 5
      }
    }
  },
  {
    "stage_name": "summarize",
    "stage_type": "reduce",
    "config": {
      "stage_id": "summarize",
      "parameters": {
        "provider": "google",
        "model_name": "gemini-2.5-flash-lite",
        "prompt": "Summarize the main themes found in these clusters"
      }
    }
  }
]
```

### Diverse Results Pipeline

```json theme={null}
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 100 }
        ],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "n_clusters": 5
      }
    }
  },
  {
    "stage_name": "code_execution",
    "stage_type": "apply",
    "config": {
      "stage_id": "code_execution",
      "parameters": {
        "code": "def transform(doc):\n    # Select top doc from each cluster\n    clusters = doc.get('clusters', [])\n    results = []\n    for c in clusters:\n        if c['documents']:\n            results.append(c['documents'][0])\n    doc['diverse_results'] = results\n    return doc"
      }
    }
  }
]
```
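
For readability, the escaped `code` string in the last stage corresponds to the function below. The sample input is invented for illustration. Note that "top" here simply means the first document in each cluster's list, and this page does not state how documents are ordered within a cluster.

```python
def transform(doc):
    # Select top doc from each cluster
    clusters = doc.get('clusters', [])
    results = []
    for c in clusters:
        if c['documents']:  # skip clusters with no members
            results.append(c['documents'][0])
    doc['diverse_results'] = results
    return doc


# Invented sample input: two populated clusters and one empty cluster.
sample = {
    'clusters': [
        {'cluster_id': 0, 'documents': [{'document_id': 'a'}, {'document_id': 'b'}]},
        {'cluster_id': 1, 'documents': []},
        {'cluster_id': 2, 'documents': [{'document_id': 'c'}]},
    ]
}
diverse = transform(sample)['diverse_results']
```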

## Cluster Labeling

When `label_clusters: true`, an LLM generates descriptive labels:

| Cluster Documents      | Generated Label            |
| ---------------------- | -------------------------- |
| Docs about Python ML   | "Python Machine Learning"  |
| Docs about cloud infra | "Cloud Infrastructure"     |
| Docs about API design  | "REST API Design Patterns" |

## Choosing n\_clusters

| Result Size  | Recommended Clusters |
| ------------ | -------------------- |
| \< 50 docs   | 3-5 clusters         |
| 50-200 docs  | 5-10 clusters        |
| 200-500 docs | 8-15 clusters        |
| 500+ docs    | 10-20 clusters       |

<Tip>
  Start with fewer clusters and increase if clusters are too broad. Use HDBSCAN if you don't know the optimal number.
</Tip>
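
The table above can be encoded as a small lookup for choosing a starting value programmatically. This is a sketch: `recommended_cluster_range` is a hypothetical helper, and because the table's ranges share their endpoints (200 and 500), assigning each boundary to the smaller bucket is an assumption.

```python
def recommended_cluster_range(n_documents):
    """Map a result size to the (low, high) cluster range from the table above."""
    if n_documents < 50:
        return (3, 5)
    if n_documents <= 200:
        return (5, 10)
    if n_documents <= 500:  # boundary values go to the smaller bucket (assumption)
        return (8, 15)
    return (10, 20)


# Follow the tip: start at the low end of the range and increase if needed.
low, high = recommended_cluster_range(120)
starting_n_clusters = low
```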

## Error Handling

| Error              | Behavior                |
| ------------------ | ----------------------- |
| Missing embeddings | Skip document           |
| Too few documents  | Reduce n\_clusters      |
| Clustering fails   | Return unclustered docs |
| Labeling fails     | Use numeric labels      |

## Related

* [Group By](/retrieval/stages/group-by) - Group by field values
* [Sample](/retrieval/stages/sample) - Select representatives
* [MMR](/retrieval/stages/mmr) - Diversity in ranking
* [Deduplicate](/retrieval/stages/deduplicate) - Remove duplicates
