Content Clustering Pipeline

Automatically group similar content together using embedding-based clustering. Discover themes, identify duplicates, and organize large content libraries.

Content types: text, image, video

Multi-Tier

from mixpeek import Mixpeek

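# Replace the placeholder with your own API key
client = Mixpeek(api_key="YOUR_API_KEY")

# Create a namespace and a collection that embeds text on ingestion
namespace = client.namespaces.create(name="clusters")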
collection = client.collections.create(
    namespace_id=namespace.id,
    name="articles",
    extractors=["text-embedding-v2"]
)

# Ingest the articles stored under an S3 prefix into the collection
client.buckets.upload(
    collection_id=collection.id,
    url="s3://your-bucket/articles/"
)

# Start a k-means clustering job that groups the collection into 20 clusters
cluster_job = client.clusters.create(
    namespace_id=namespace.id,
    collection_ids=[collection.id],
    num_clusters=20,
    algorithm="kmeans"
)

# List the clusters in the namespace with their labels and document counts
clusters = client.clusters.list(namespace_id=namespace.id)
for cluster in clusters:
    print(f"Cluster: {cluster.label} ({cluster.document_count} docs)")
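
The recipe above hands the clustering step to the service through algorithm="kmeans". To make that step less of a black box, here is a minimal sketch of k-means (Lloyd's algorithm) over embedding vectors in plain Python. It is not Mixpeek's implementation: the kmeans helper, the deterministic farthest-point initialization, and the toy 2-D "embeddings" are all assumptions made for illustration, and real embeddings would have hundreds of dimensions.

```python
import math


def kmeans(vectors, num_clusters, iterations=20):
    """Illustrative k-means (Lloyd's algorithm); a sketch, not a library API."""
    # Assumption: farthest-point initialization, chosen here for determinism.
    # Start with the first vector, then repeatedly add the vector that is
    # farthest from every centroid picked so far.
    centroids = [list(vectors[0])]
    while len(centroids) < num_clusters:
        farthest = max(
            vectors, key=lambda v: min(math.dist(v, c) for c in centroids)
        )
        centroids.append(list(farthest))

    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_assignments = [
            min(range(num_clusters), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: no vector changed cluster
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members.
        for c in range(num_clusters):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]

    return assignments, centroids


# Toy stand-ins for document embeddings: two well-separated groups.
embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [4.95, 5.05],
]

labels, centroids = kmeans(embeddings, num_clusters=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each document ends up with the index of its nearest centroid, which is the kind of grouping the cluster listing in the recipe reports per cluster. Production systems typically use randomized or k-means++ initialization and a vectorized library rather than a loop like this.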

Related Recipes & Resources

Explore these related recipes to see how clustering fits alongside other pipelines.

Recipe

Multimodal RAG Pipeline

Build a retrieval-augmented generation system that works with text, images, and video. Feed relevant multimodal context to LLMs for grounded responses.

Recipe

Taxonomy Enrichment Pipeline

Automatically classify and tag content using custom taxonomies. Map your content to IAB categories, custom hierarchies, or industry-specific classifications.

Recipe

Metadata Enrichment Pipeline

Automatically enrich your data with extracted metadata: entities, topics, sentiment, language, and custom attributes. Transform raw content into structured, queryable data.

Recipe

Multimodal Hybrid Search Pipeline

Combine vector search with keyword search (BM25) across text, images, and video to build a more comprehensive multimodal retrieval system.

Recipe

Clinical Documentation Structuring

Production-grade pipeline for ingesting clinical documents (scanned charts, EHR exports, wound photos, and therapy notes) and structuring them into coded fields aligned with MDS 3.0, PDPM, and CMS audit requirements. Combines OCR, clinical NER, taxonomy classification, and hybrid retrieval to turn unstructured bedside documentation into queryable, auditable data.

Recipe

Multimodal Content Moderation

Automated content moderation pipeline that analyzes text, images, and video for policy violations. Uses hierarchical taxonomy classification to label content as safe, sensitive, or prohibited across multiple categories simultaneously.
