Multimodal Deduplication & Near-Duplicate Discovery
Find visually similar content using embedding similarity and perceptual hashing, and identify duplicates, near-duplicates, and content reuse across your corpus.
image
video
Multi-Stage
98.0K runs
Why This Matters
Critical for content moderation, copyright protection, dataset QA, and preventing duplicate training data.
from mixpeek import Mixpeek

client = Mixpeek(api_key="your-api-key")

# Create collection with vision embeddings
collection = client.collections.create(
    collection_name="visual_content",
    feature_extractor={
        "feature_extractor_name": "multimodal_extractor",
        "version": "v1"
    }
)

# Search for near-duplicates
results = client.retrievers.execute(
    retriever_id="dedup-retriever",
    inputs={
        "query_embedding": reference_embedding,
        "similarity_threshold": 0.92
    },
    limit=100
)

# Get duplicate clusters
clusters = client.analytics.cluster(
    collection_id=collection.id,
    algorithm="connected_components",
    similarity_threshold=0.95
)
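The description mentions perceptual hashing alongside embedding similarity, while the snippet above only exercises the embedding path. As a rough illustration of the hashing side, here is a minimal difference-hash (dHash) sketch in plain Python. It is not Mixpeek's implementation: the function names, the 8x9 grid, and the `max_distance` default are assumptions made for this example, and the grids stand in for images that an imaging library would normally resize and convert to grayscale.

```python
from typing import Sequence


def dhash(pixels: Sequence[Sequence[int]]) -> int:
    """Difference hash of a grayscale grid (rows of equal length).

    Each bit records whether a pixel is brighter than its right-hand
    neighbour, so H rows x W columns yield H * (W - 1) bits.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, max_distance: int = 10) -> bool:
    """Treat two grids as near-duplicates if their hashes are close.

    max_distance is an illustrative default, not a recommended value.
    """
    return hamming_distance(dhash(a), dhash(b)) <= max_distance


# Hypothetical 8x9 grayscale grids standing in for resized images.
original = [[(r * 9 + c) * 3 for c in range(9)] for r in range(8)]
brighter = [[v + 10 for v in row] for row in original]   # uniform brightness shift
mirrored = [list(reversed(row)) for row in original]     # horizontal flip

print(hamming_distance(dhash(original), dhash(brighter)))  # 0
print(hamming_distance(dhash(original), dhash(mirrored)))  # 64
```

Because dHash only compares neighbouring pixels, a uniform brightness change leaves the hash untouched, while a flip inverts every bit. This makes it a cheap first pass for exact and lightly edited copies, with embeddings left to catch the harder cases.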
Retrieval Flow
1. feature search (search): KNN search for similar embeddings
2. score filter (filter): Similarity threshold filtering
3. deduplicate (reduce): Group duplicate clusters
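To make the three stages concrete, the sketch below mimics the flow locally with brute-force cosine similarity. It is an approximation under stated assumptions, not the hosted retriever: the function names mirror the stage names for readability only, the `content_hash` field used for grouping is hypothetical, and a real deployment would use an approximate nearest-neighbour index rather than a full scan.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Hit = Tuple[str, float]  # (document id, similarity score)


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def feature_search(query: Vector, index: Dict[str, Vector], k: int) -> List[Hit]:
    """Stage 1 (search): brute-force KNN by cosine similarity."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda hit: hit[1], reverse=True)
    return scored[:k]


def score_filter(hits: List[Hit], threshold: float) -> List[Hit]:
    """Stage 2 (filter): keep hits at or above the similarity threshold."""
    return [(doc_id, score) for doc_id, score in hits if score >= threshold]


def deduplicate(hits: List[Hit], field: Dict[str, str]) -> Dict[str, List[str]]:
    """Stage 3 (reduce): group surviving hits that share a field value."""
    groups: Dict[str, List[str]] = {}
    for doc_id, _ in hits:
        groups.setdefault(field[doc_id], []).append(doc_id)
    return groups


# Toy 2-D embeddings; real embeddings have hundreds of dimensions.
index = {"a": [1.0, 0.0], "b": [0.99, 0.05], "c": [0.0, 1.0], "d": [1.0, 0.01]}
content_hash = {"a": "h1", "b": "h1", "c": "h2", "d": "h3"}  # hypothetical field

hits = feature_search([1.0, 0.0], index, k=3)
kept = score_filter(hits, threshold=0.92)
groups = deduplicate(kept, content_hash)
print(groups)  # {'h1': ['a', 'b'], 'h3': ['d']}
```

Document `c` drops out at stage 1 because it is not among the three nearest neighbours. The threshold then acts as a precision knob: raising it trades recall of edited copies for fewer false matches.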
Feature Extractors
Image Embedding: Generate visual embeddings for similarity search and clustering (752K runs)
Video Embedding: Generate vector embeddings for video content (610K runs)
Retriever Stages
feature search (search): Search collections using multimodal embeddings
score filter (filter): Filter documents by relevance score threshold
deduplicate (reduce): Remove duplicate documents based on field values
Enrichment Resources
Clustering
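The quick-start code requests clustering with `algorithm="connected_components"`. As a hedged illustration of what that technique generally means (not a description of the service's internals), the sketch below links every pair of items whose cosine similarity meets a threshold and returns the connected groups using union-find. All names and the toy vectors are assumptions for this example.

```python
import math
from itertools import combinations
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def connected_components(vectors: Dict[str, List[float]], threshold: float) -> List[List[str]]:
    """Cluster items by linking pairs with similarity >= threshold."""
    parent = {key: key for key in vectors}

    def find(x: str) -> str:
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # O(n^2) pairwise comparison; fine for a sketch, too slow for large corpora.
    for a, b in combinations(vectors, 2):
        if cosine_similarity(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for key in vectors:
        clusters.setdefault(find(key), []).append(key)
    return sorted(sorted(members) for members in clusters.values())


vectors = {
    "img1": [1.0, 0.0],
    "img2": [0.98, 0.2],
    "img3": [0.9, 0.42],
    "img4": [0.0, 1.0],
}
clusters = connected_components(vectors, threshold=0.95)
print(clusters)  # [['img1', 'img2', 'img3'], ['img4']]
```

Note that `img1` and `img3` fall below the threshold when compared directly (about 0.906) but still share a cluster because both are linked to `img2`. This chaining is inherent to connected components, so a stricter threshold is one way to keep clusters from drifting.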
