Multimodal Deduplication
Find visually similar content using embedding similarity and perceptual hashing. Identifies duplicates, near-duplicates, and content reuse across your corpus.
"Find all near-duplicate images similar to this reference photo with 95% similarity threshold"
Why This Matters
Deduplication is critical for content moderation, copyright protection, dataset QA, and keeping duplicate examples out of training data.
from mixpeek import Mixpeek

client = Mixpeek(api_key="your-api-key")

# Create collection with vision embeddings
collection = client.collections.create(
    collection_name="visual_content",
    feature_extractor={
        "feature_extractor_name": "multimodal_extractor",
        "version": "v1"
    }
)

# Search for near-duplicates
# (reference_embedding is the precomputed vector of the reference image)
results = client.retrievers.execute(
    retriever_id="dedup-retriever",
    inputs={
        "query_embedding": reference_embedding,
        "similarity_threshold": 0.92
    },
    limit=100
)

# Get duplicate clusters
clusters = client.analytics.cluster(
    collection_id=collection.id,
    algorithm="connected_components",
    similarity_threshold=0.95
)
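A common follow-up once clusters come back is to keep one canonical asset per cluster and flag the rest for review or removal. The loop below is a plain-Python sketch that assumes each cluster exposes a member_ids list; the field names in the actual cluster response may differ.

# Sketch: keep one canonical document per duplicate cluster.
# "member_ids" is an assumed field name; adjust to the real response shape.
canonical_ids = []
flagged_duplicates = []
for cluster in clusters:
    member_ids = sorted(cluster["member_ids"])
    canonical_ids.append(member_ids[0])        # keep the first member as canonical
    flagged_duplicates.extend(member_ids[1:])  # flag the rest
print(f"{len(canonical_ids)} canonical assets, {len(flagged_duplicates)} duplicates flagged")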
Retrieval Flow
1. KNN search for similar embeddings
2. Similarity threshold filtering
3. Group duplicate clusters
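To make the flow concrete, here is a self-contained NumPy sketch of the same three steps: a brute-force cosine-similarity search, threshold filtering, and grouping with union-find. Mixpeek runs these steps with its own index and clustering backend, so treat this purely as an illustration of the logic.

import numpy as np

def dedup_clusters(embeddings: np.ndarray, threshold: float = 0.95):
    """Group row vectors whose cosine similarity meets or exceeds `threshold`."""
    # Step 1: brute-force similarity search via a full cosine-similarity matrix.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T

    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Steps 2 and 3: keep only pairs above the threshold and union them into clusters.
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Toy usage: three vectors, the first two nearly identical.
vectors = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(dedup_clusters(vectors, threshold=0.95))  # -> [[0, 1], [2]]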
Feature Extractors
Image Embedding
Generate visual embeddings for similarity search and clustering
Video Embedding
Generate vector embeddings for video content
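Mixpeek computes these embeddings server-side during ingestion. If you want to inspect what a visual embedding looks like locally, the snippet below uses the open-source sentence-transformers CLIP checkpoint as a stand-in; it is an illustration only and not necessarily the model behind Mixpeek's extractors.

# Local illustration of producing a visual embedding with an open-source CLIP model.
# Assumes `pip install sentence-transformers pillow`; not part of the Mixpeek SDK.
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")                  # open-source CLIP checkpoint
image_embedding = model.encode(Image.open("reference.jpg"))   # 512-dimensional vector
print(image_embedding.shape)  # (512,)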
Retriever Stages
feature search
Search collections using multimodal embeddings
score filter
Filter documents by relevance score threshold
deduplicate
Remove duplicate documents based on field values
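The score filter and deduplicate stages are easy to reason about in isolation: one drops hits below a relevance cutoff, the other keeps a single document per value of a chosen field. The sketch below shows that logic on plain dicts; the "score" and "file_hash" keys are illustrative, not a fixed schema, and in practice the stages are configured on the retriever rather than written by hand.

# Sketch of the score-filter and deduplicate stages over plain dicts.
hits = [
    {"id": "a", "score": 0.97, "file_hash": "h1"},
    {"id": "b", "score": 0.96, "file_hash": "h1"},  # same underlying content as "a"
    {"id": "c", "score": 0.40, "file_hash": "h2"},  # falls below the score cutoff
]

# score filter: keep only hits at or above the relevance threshold
filtered = [hit for hit in hits if hit["score"] >= 0.9]

# deduplicate: keep the highest-scoring hit per field value
seen, deduped = set(), []
for hit in sorted(filtered, key=lambda h: h["score"], reverse=True):
    if hit["file_hash"] not in seen:
        seen.add(hit["file_hash"])
        deduped.append(hit)

print([hit["id"] for hit in deduped])  # -> ['a']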
