    Image Deduplication Pipeline

    Identify near-duplicate and visually similar images in your collection using embedding-based clustering. Groups images by visual similarity, flags duplicates, and provides confidence scores to help clean up redundant content in media libraries and product catalogs.

    image
    Multi-Stage
    1.5K runs
    from mixpeek import Mixpeek

    client = Mixpeek(api_key="YOUR_API_KEY")

    # Create image collection with embeddings
    collection = client.collections.create(
        namespace_id="ns_your_namespace",
        name="image_library",
        extractors=["multimodal-extractor"]
    )

    # Upload images
    client.buckets.upload(bucket_id="bkt_images", url="s3://your-bucket/images/")

    # Create a cluster to find duplicate groups
    cluster = client.clusters.create(
        namespace_id="ns_your_namespace",
        name="dedup_clusters",
        collection_ids=["col_image_library"],
        similarity_threshold=0.92,
        vector_path="multimodal_extractor_v1_embedding"
    )

    # Review duplicate clusters
    clusters = client.clusters.list(cluster_id=cluster["cluster_id"])
    for group in clusters["results"]:
        print(f"Cluster {group['cluster_id']}: {group['document_count']} duplicates")
        for doc in group["documents"]:
            print(f" - {doc['document_id']} (similarity: {doc['score']:.3f})")
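    To build intuition for what a similarity threshold such as 0.92 does, here is a minimal local sketch of threshold-based grouping. It is not Mixpeek's implementation: it assumes cosine similarity over embeddings and single-linkage grouping via union-find, and the names `cosine_similarity`, `group_near_duplicates`, and `toy_embeddings` are hypothetical, introduced only for illustration.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_near_duplicates(embeddings, threshold=0.92):
    """Group document ids whose embeddings are linked by similarity >= threshold.

    `embeddings` maps a document id to its vector. Groups are connected
    components (single linkage), so A and C can share a group through B.
    Returns only groups with more than one member, each sorted by id.
    """
    ids = list(embeddings)
    parent = {doc_id: doc_id for doc_id in ids}

    def find(doc_id):
        # Walk up to the root, halving the path as we go.
        while parent[doc_id] != doc_id:
            parent[doc_id] = parent[parent[doc_id]]
            doc_id = parent[doc_id]
        return doc_id

    # Compare every pair once and merge pairs above the threshold.
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            if cosine_similarity(embeddings[left], embeddings[right]) >= threshold:
                parent[find(left)] = find(right)

    groups = defaultdict(list)
    for doc_id in ids:
        groups[find(doc_id)].append(doc_id)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Toy 3-dimensional vectors standing in for real image embeddings (assumed data).
toy_embeddings = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.99, 0.05, 0.0],
    "img_c": [0.0, 1.0, 0.0],
    "img_d": [0.0, 0.98, 0.1],
    "img_e": [0.0, 0.0, 1.0],
}
duplicate_groups = group_near_duplicates(toy_embeddings, threshold=0.92)
print(duplicate_groups)
```

    The pairwise loop is quadratic in the number of images, which is fine for a toy set but would likely need an approximate nearest-neighbor index at catalog scale. Raising the threshold yields fewer, tighter groups; lowering it merges more loosely similar images.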

    Feature Extractors

    Retriever Stages

    aggregate

    Compute aggregations (COUNT, SUM, AVG, etc.) on pipeline results
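    As a rough illustration of what COUNT, SUM, and AVG over pipeline results can look like, the sketch below aggregates similarity scores per cluster in plain Python. It does not use or represent the Mixpeek aggregate stage's actual configuration; `aggregate_scores` and the sample rows are hypothetical, with field names borrowed from the recipe code.

```python
from collections import defaultdict


def aggregate_scores(documents, group_key="cluster_id", value_key="score"):
    """Return COUNT, SUM, and AVG of `value_key` for each `group_key` value."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for doc in documents:
        bucket = totals[doc[group_key]]
        bucket["count"] += 1
        bucket["sum"] += doc[value_key]
    return {
        key: {"count": b["count"], "sum": b["sum"], "avg": b["sum"] / b["count"]}
        for key, b in totals.items()
    }


# Hypothetical result rows (assumed data, not real pipeline output).
results = [
    {"cluster_id": "c1", "document_id": "d1", "score": 0.5},
    {"cluster_id": "c1", "document_id": "d2", "score": 0.25},
    {"cluster_id": "c2", "document_id": "d3", "score": 1.0},
]
summary = aggregate_scores(results)
print(summary)
```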

    reduce