Unsupervised Clustering & Theme Discovery
Clusters content into semantic groups using HDBSCAN, surfacing themes, variants, and outliers. Turns raw corpora into navigable structure without labels.
"Discover hidden themes in unlabeled user-generated content and identify outliers"
Why This Matters
Clustering works as infrastructure for operational insight, not as a one-off analysis. Once computed, clusters become queryable resources for navigation, QA, and theme-based retrieval.
from mixpeek import Mixpeek

client = Mixpeek(api_key="your-api-key")

# Create collection with embeddings
collection = client.collections.create(
    collection_name="unlabeled_corpus",
    feature_extractor={
        "feature_extractor_name": "multimodal_extractor",
        "version": "v1"
    }
)

# Run clustering
clusters = client.analytics.cluster(
    collection_id=collection.id,
    algorithm="hdbscan",
    min_cluster_size=15,
    return_outliers=True
)

# Generate cluster summaries with LLM
for cluster in clusters:
    summary = client.llm.summarize(
        cluster_id=cluster.id,
        sample_size=10
    )

# Search within a cluster
results = client.retrievers.execute(
    retriever_id="cluster-search",
    inputs={
        "cluster_id": "cluster_42",
        "query_text": "specific theme"
    }
)
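To make the "themes and outliers" split concrete, here is a minimal sketch of how per-document cluster labels can be turned into a navigable structure. It does not call the Mixpeek API. It assumes only that each document receives an integer label and that noise points are labeled -1, which is the convention used by common HDBSCAN implementations; whether a Mixpeek response exposes labels in this shape is an assumption. The function name `group_by_cluster` and the sample data are hypothetical.

```python
from collections import defaultdict


def group_by_cluster(doc_ids, labels, noise_label=-1):
    """Split documents into themed clusters and outliers.

    Assumes one integer label per document, with `noise_label`
    (-1 by HDBSCAN convention) marking points that fit no cluster.
    """
    if len(doc_ids) != len(labels):
        raise ValueError("doc_ids and labels must have the same length")

    clusters = defaultdict(list)
    outliers = []
    for doc_id, label in zip(doc_ids, labels):
        if label == noise_label:
            outliers.append(doc_id)
        else:
            clusters[label].append(doc_id)
    return dict(clusters), outliers


# Hypothetical labels, as a density-based clusterer might assign them.
doc_ids = ["a", "b", "c", "d", "e", "f"]
labels = [0, 0, 1, -1, 1, 0]

clusters, outliers = group_by_cluster(doc_ids, labels)

# Rank themes by size so the largest clusters surface first.
ranked = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)
for label, members in ranked:
    print(f"cluster {label}: {members}")
print(f"outliers: {outliers}")
```

Keeping outliers in a separate list rather than discarding them matches the intent of `return_outliers=True` above: documents that fit no theme are often the ones worth reviewing by hand.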
Feature Extractors
Image Embedding
Generate visual embeddings for similarity search and clustering
Text Embedding
Extract semantic embeddings from documents, transcripts and text content
Video Embedding
Generate vector embeddings for video content
Audio Embedding
Extract semantic embeddings from audio content for similarity search
