Image Deduplication Pipeline
Identify near-duplicate and visually similar images in your collection using embedding-based clustering. The pipeline groups images by visual similarity, flags duplicates, and reports a confidence score for each match, so you can clean up redundant content in media libraries and product catalogs.
from mixpeek import Mixpeek

client = Mixpeek(api_key="YOUR_API_KEY")

# Create image collection with embeddings
collection = client.collections.create(
    namespace_id="ns_your_namespace",
    name="image_library",
    extractors=["multimodal-extractor"]
)

# Upload images
client.buckets.upload(bucket_id="bkt_images", url="s3://your-bucket/images/")

# Create a cluster to find duplicate groups
cluster = client.clusters.create(
    namespace_id="ns_your_namespace",
    name="dedup_clusters",
    collection_ids=["col_image_library"],
    similarity_threshold=0.92,
    vector_path="multimodal_extractor_v1_embedding"
)

# Review duplicate clusters
clusters = client.clusters.list(cluster_id=cluster["cluster_id"])
for group in clusters["results"]:
    print(f"Cluster {group['cluster_id']}: {group['document_count']} duplicates")
    for doc in group["documents"]:
        print(f"  - {doc['document_id']} (similarity: {doc['score']:.3f})")
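To build intuition for what a similarity_threshold of 0.92 means, the following is a minimal local sketch of threshold-based grouping over embeddings. It is not Mixpeek's clustering implementation: the function names, the use of cosine similarity, the union-find grouping, and the toy three-dimensional vectors are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_near_duplicates(embeddings, threshold=0.92):
    """Group ids whose embeddings are linked by similarity >= threshold.

    `embeddings` maps a hypothetical document id to its vector.
    Links are transitive: if A~B and B~C, all three share a group.
    Only groups with more than one member are returned.
    """
    ids = list(embeddings)
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair once; O(n^2), fine for a sketch but not for large libraries.
    for idx, a in enumerate(ids):
        for b in ids[idx + 1:]:
            if cosine_similarity(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Toy vectors standing in for real image embeddings (illustrative only).
sample_embeddings = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.99, 0.05, 0.0],
    "img_c": [0.0, 1.0, 0.0],
    "img_d": [0.02, 0.98, 0.0],
}

print(group_near_duplicates(sample_embeddings, threshold=0.92))
```

Raising the threshold toward 1.0 keeps only near-identical images together, while lowering it merges images that are merely similar, so the value is worth tuning against a hand-checked sample of your own library.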
Feature Extractors
Retriever Stages
aggregate
Compute aggregations (COUNT, SUM, AVG, etc.) on pipeline results
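As a rough illustration of the kind of COUNT, SUM, and AVG summaries described here, the sketch below computes them locally over duplicate groups. It does not call the aggregate stage itself: the `summarize_groups` helper and the sample data are hypothetical, with records shaped like the `documents` entries printed in the example above.

```python
from statistics import mean


def summarize_groups(groups):
    """Compute simple COUNT / SUM / AVG style summaries over duplicate groups."""
    sizes = [len(g["documents"]) for g in groups]
    scores = [d["score"] for g in groups for d in g["documents"]]
    return {
        "group_count": len(groups),                        # COUNT of groups
        "document_sum": sum(sizes),                        # SUM of documents across groups
        "avg_score": mean(scores) if scores else None,     # AVG similarity score
    }


# Hypothetical results, not real API output.
sample_groups = [
    {
        "cluster_id": "cl_1",
        "documents": [
            {"document_id": "doc_1", "score": 0.97},
            {"document_id": "doc_2", "score": 0.95},
        ],
    },
    {
        "cluster_id": "cl_2",
        "documents": [
            {"document_id": "doc_3", "score": 0.93},
            {"document_id": "doc_4", "score": 0.94},
            {"document_id": "doc_5", "score": 0.96},
        ],
    },
]

summary = summarize_groups(sample_groups)
print(summary)
```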
