    Clustering

    Unsupervised Clustering & Theme Discovery

    Clusters content into semantic groups using HDBSCAN, surfacing themes, variants, and outliers. Turns raw corpora into navigable structure without labels.

    Modalities: video, image, text, audio
    Multi-Stage
    54.0K runs

    Why This Matters

    Clustering gives an unlabeled collection a structure you can operate on. Once computed, clusters become queryable resources for navigation, QA, and theme-based retrieval.

    from mixpeek import Mixpeek

    client = Mixpeek(api_key="your-api-key")

    # Create collection with embeddings
    collection = client.collections.create(
        collection_name="unlabeled_corpus",
        feature_extractor={
            "feature_extractor_name": "multimodal_extractor",
            "version": "v1",
        },
    )

    # Run clustering
    clusters = client.analytics.cluster(
        collection_id=collection.id,
        algorithm="hdbscan",
        min_cluster_size=15,
        return_outliers=True,
    )

    # Generate cluster summaries with LLM
    for cluster in clusters:
        summary = client.llm.summarize(
            cluster_id=cluster.id,
            sample_size=10,
        )

    # Search within a cluster
    results = client.retrievers.execute(
        retriever_id="cluster-search",
        inputs={
            "cluster_id": "cluster_42",
            "query_text": "specific theme",
        },
    )
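
To make "clusters become queryable resources" concrete, here is a minimal local sketch of what can be done with cluster assignments once they exist. It does not call Mixpeek. The helper names (`build_cluster_index`, `search_in_cluster`), the toy embeddings, and the label values are all hypothetical. The `-1` outlier label follows the convention of common HDBSCAN implementations and is an assumption here, not something this page confirms about the Mixpeek response format.

```python
from collections import defaultdict
from math import sqrt

# Assumption: noise points carry the label -1, as in common HDBSCAN libraries.
OUTLIER = -1


def build_cluster_index(item_ids, labels):
    """Group item ids by cluster label; outliers are returned separately."""
    index = defaultdict(list)
    outliers = []
    for item_id, label in zip(item_ids, labels):
        if label == OUTLIER:
            outliers.append(item_id)
        else:
            index[label].append(item_id)
    return dict(index), outliers


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_in_cluster(query_vec, cluster_id, index, embeddings, top_k=3):
    """Rank the members of one cluster by cosine similarity to the query."""
    members = index.get(cluster_id, [])
    scored = [(cosine(query_vec, embeddings[m]), m) for m in members]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]


# Toy 2-D embeddings and hypothetical cluster labels, one label per item.
embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
    "doc_d": [0.1, 0.9],
    "doc_e": [-1.0, -1.0],
}
labels = [0, 0, 1, 1, -1]

index, outliers = build_cluster_index(list(embeddings), labels)
hits = search_in_cluster([1.0, 0.0], 0, index, embeddings, top_k=1)
```

Restricting the search to one cluster's members is what makes theme-based retrieval cheap: the query is compared only against items already known to share a theme. The separate outlier list gives QA a ready-made set of items to review.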

    Feature Extractors

    Image Embedding

    Generate visual embeddings for similarity search and clustering

    752K runs

    Text Embedding

    Extract semantic embeddings from documents, transcripts, and other text content

    827K runs

    Video Embedding

    Generate vector embeddings for video content

    610K runs

    Audio Embedding

    Extract semantic embeddings from audio content for similarity search

    420K runs

    Retriever Stages

    Enrichment Resources

    Clustering
    Analytics
