
    Multimodal Deduplication & Near-Duplicate Discovery

    Find visually similar content using embedding similarity and perceptual hashing. Identifies duplicates, near-duplicates, and content reuse across your corpus.

Modalities: image, video | Type: Multi-Stage | 98.0K runs

    Why This Matters

    Critical for content moderation, copyright protection, dataset QA, and preventing duplicate training data.

Quick start with the Python SDK:

from mixpeek import Mixpeek

client = Mixpeek(api_key="your-api-key")

# Create a collection with vision embeddings
collection = client.collections.create(
    collection_name="visual_content",
    feature_extractor={
        "feature_extractor_name": "multimodal_extractor",
        "version": "v1"
    }
)

# Search for near-duplicates of a reference item
# (reference_embedding is the query item's embedding, produced by the
# collection's feature extractor)
results = client.retrievers.execute(
    retriever_id="dedup-retriever",
    inputs={
        "query_embedding": reference_embedding,
        "similarity_threshold": 0.92
    },
    limit=100
)

# Group near-duplicates into clusters via connected components
clusters = client.analytics.cluster(
    collection_id=collection.id,
    algorithm="connected_components",
    similarity_threshold=0.95
)
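The overview above also mentions perceptual hashing as a complementary signal. As an illustration of that side of the approach only, here is a small sketch using the open-source imagehash library (not part of the Mixpeek SDK); the 8-bit Hamming-distance cutoff is an arbitrary example value.

# Illustrative only: perceptual-hash pre-filter with the open-source
# `imagehash` library; not the Mixpeek pipeline.
from PIL import Image
import imagehash

def phash_duplicates(paths, max_distance=8):
    """Group images whose perceptual hashes differ by <= max_distance bits."""
    hashes = {p: imagehash.phash(Image.open(p)) for p in paths}
    groups = []
    for path, h in hashes.items():
        for group in groups:
            # Subtracting two hashes returns their Hamming distance in bits
            if h - hashes[group[0]] <= max_distance:
                group.append(path)
                break
        else:
            groups.append([path])
    # Only groups with more than one member are duplicate clusters
    return [g for g in groups if len(g) > 1]

near_dupes = phash_duplicates(["a.jpg", "b.jpg", "c.jpg"])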

    Retrieval Flow

1. feature search (search): KNN search for similar embeddings
2. score filter (filter): Similarity threshold filtering
3. deduplicate (reduce): Group duplicate clusters
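To make the three stages concrete, the following is a minimal, self-contained sketch of the same flow over in-memory NumPy embeddings: brute-force cosine KNN search, a score-threshold filter, then connected-components grouping with union-find. It mirrors the pipeline's logic only; the actual retriever runs these stages server-side.

# Local sketch of the search -> filter -> reduce flow (not the Mixpeek runtime)
import numpy as np

def near_duplicate_clusters(embeddings, threshold=0.92):
    # Stage 1: KNN search via cosine similarity (brute force)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T

    # Stage 2: score filter, keeping only pairs above the similarity threshold
    np.fill_diagonal(sims, 0.0)
    pairs = np.argwhere(sims >= threshold)

    # Stage 3: deduplicate (reduce), grouping connected items with union-find
    parent = list(range(len(embeddings)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in pairs:
        parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(embeddings)):
        clusters.setdefault(find(i), []).append(i)
    # Only multi-member clusters represent duplicates
    return [c for c in clusters.values() if len(c) > 1]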

    Feature Extractors

Image Embedding (752K runs): Generate visual embeddings for similarity search and clustering.

Video Embedding (610K runs): Generate vector embeddings for video content.
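For local experimentation, an open-source CLIP model can stand in for the Image Embedding extractor. The model choice below ("clip-ViT-B-32" via sentence-transformers) is an assumption for illustration, not the model Mixpeek uses.

# Open-source stand-in for a visual embedding extractor (assumption, not Mixpeek's model)
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

# Encode images into embeddings suitable for cosine-similarity search
image_embeddings = model.encode(
    [Image.open("photo_a.jpg"), Image.open("photo_b.jpg")],
    normalize_embeddings=True,
)
print(image_embeddings.shape)  # (2, 512)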

    Retriever Stages

feature search (search): Search collections using multimodal embeddings.

score filter (filter): Filter documents by relevance score threshold.

deduplicate (reduce): Remove duplicate documents based on field values.
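The deduplicate stage removes documents that share a field value, which amounts to keeping one document per key. A minimal sketch of that reduce step over plain document dicts follows; the "phash" field name is hypothetical.

# Sketch of a field-based deduplicate (reduce) step; the "phash" field is a made-up example
def deduplicate_by_field(documents, field="phash"):
    seen = set()
    kept = []
    for doc in documents:
        key = doc.get(field)
        if key in seen:
            continue  # drop later documents sharing the same field value
        seen.add(key)
        kept.append(doc)
    return kept

docs = [
    {"id": 1, "phash": "a1b2"},
    {"id": 2, "phash": "a1b2"},   # duplicate of id 1 by phash
    {"id": 3, "phash": "ff00"},
]
print(deduplicate_by_field(docs))  # keeps ids 1 and 3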

    Enrichment Resources

Clustering: Documentation