
    Clustering & Theme Discovery

    Unsupervised clustering that groups content into semantic themes using HDBSCAN. Surfaces hidden patterns, content variants, and outliers without requiring predefined labels.
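As a rough illustration of why HDBSCAN can separate dense themes from outliers, the sketch below computes the mutual reachability distance that the algorithm builds its hierarchy on: the larger of the two points' core distances and their direct distance. It uses plain Python on toy 2-D points rather than real embeddings. The helper names and the convention of excluding a point from its own neighbor list are assumptions made for this example, and it is not Mixpeek's implementation.

```python
import math

def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest neighbor.

    Assumption for this sketch: a point is not counted as its own neighbor.
    """
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[min_samples - 1])
    return cores

def mutual_reachability(points, min_samples):
    """Pairwise max(core(a), core(b), dist(a, b)); zero on the diagonal."""
    cores = core_distances(points, min_samples)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(cores[i], cores[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]

# Three points close together plus one isolated point (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
cores = core_distances(points, min_samples=2)
mr = mutual_reachability(points, min_samples=2)

for i, row in enumerate(mr):
    print(i, [round(d, 2) for d in row])
```

The isolated point has a large core distance, so every mutual reachability distance involving it stays large. That is the property that lets a density-based method treat it as an outlier instead of forcing it into a theme.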

    video
    image
    text
    audio
    Multi-Stage
    54.0K runs

    Why This Matters

    You can't search for what you don't know exists. Clustering reveals the natural structure in your content—themes, duplicates, and anomalies—before you even ask.

    import requests

    API_URL = "https://api.mixpeek.com"
    headers = {"Authorization": "Bearer YOUR_API_KEY", "X-Namespace": "your-namespace"}

    # Create cluster configuration
    cluster = requests.post(f"{API_URL}/v1/clusters", headers=headers, json={
        "cluster_name": "content_themes",
        "source_collection_ids": ["col_my_collection"],
        "feature_addresses": ["mixpeek://multimodal_extractor@v1/embedding"],
        "algorithm": "hdbscan",
        "algorithm_config": {"min_cluster_size": 15},
        "llm_labeling": {"provider": "openai_chat_v1", "model": "gpt-4o-mini"}
    }).json()

    # Execute clustering
    execution = requests.post(
        f"{API_URL}/v1/clusters/{cluster['cluster_id']}/execute",
        headers=headers
    ).json()

    # Get cluster artifacts with centroids
    artifacts = requests.get(
        f"{API_URL}/v1/clusters/{cluster['cluster_id']}/executions/{execution['run_id']}/artifacts",
        headers=headers,
        params={"include_centroids": True}
    ).json()

    # Explore discovered themes
    for group in artifacts["clusters"]:
        print(f"Theme: {group['label']}")
        print(f"Size: {group['member_count']} items")
        print(f"Keywords: {', '.join(group.get('keywords', []))}")
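Once the artifacts are in hand, it can help to rank themes by size before reading through them. The following is a minimal sketch that assumes the response has the shape used in the loop above (a `clusters` list whose entries carry `label`, `member_count`, and optionally `keywords`). The `summarize_themes` helper and the sample data are hypothetical and made up for illustration; they are not part of the Mixpeek API.

```python
def summarize_themes(artifacts, top_n=5):
    """Return the largest themes as (label, member_count, keywords) tuples.

    Assumes the response shape used earlier: a "clusters" list of dicts.
    Missing fields fall back to neutral defaults instead of raising.
    """
    clusters = artifacts.get("clusters", [])
    ranked = sorted(clusters, key=lambda c: c.get("member_count", 0), reverse=True)
    return [
        (c.get("label", "unlabeled"), c.get("member_count", 0), c.get("keywords", []))
        for c in ranked[:top_n]
    ]

# Made-up sample data standing in for a real artifacts response.
sample = {
    "clusters": [
        {"label": "Product unboxing", "member_count": 42, "keywords": ["unboxing", "review"]},
        {"label": "Cooking tutorials", "member_count": 87, "keywords": ["recipe", "kitchen"]},
        {"member_count": 15},
    ]
}

for label, size, keywords in summarize_themes(sample, top_n=2):
    print(f"{label}: {size} items ({', '.join(keywords)})")
```

Using `.get` with defaults keeps the summary from failing when a cluster has no label or keywords, which seems plausible if LLM labeling is disabled or incomplete.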

    Feature Extractors

    Image Embedding

    Generate visual embeddings for similarity search and clustering

    752K runs

    Text Embedding

    Extract semantic embeddings from documents, transcripts and text content

    827K runs

    Video Embedding

    Generate vector embeddings for video content

    610K runs

    Audio Embedding

    Extract semantic embeddings from audio content for similarity search

    420K runs

    Retriever Stages
