# Deduplicate

> Remove duplicate documents by field match or content similarity

Deduplicate stage showing removal of duplicate documents

The Deduplicate stage removes duplicate documents from the result set based on exact field matching or content similarity. This is analogous to SQL's `DISTINCT`, MongoDB's `$group` with `$first`, and Elasticsearch's field collapsing.

**Stage Category**: REDUCE (Removes duplicates)

**Transformation**: N documents → M documents (M ≤ N, duplicates removed)

## When to Use

| Use Case                    | Description                                       |
| --------------------------- | ------------------------------------------------- |
| **URL deduplication**       | One result per source URL after web enrichment    |
| **Author collapse**         | Keep one result per author                        |
| **Content dedup**           | Remove near-identical text chunks                 |
| **Multi-source merge**      | Remove overlapping results from multiple searches |
| **Query expansion cleanup** | Remove duplicates from expanded query results     |

## When NOT to Use

| Scenario                   | Recommended Alternative                  |
| -------------------------- | ---------------------------------------- |
| Grouping with aggregation  | `group_by` stage                         |
| Sampling unique categories | `sample` stage with stratified sampling  |
| Limiting result count      | `limit` stage                            |
| Filtering by criteria      | `attribute_filter`                       |

## Parameters

| Parameter              | Type         | Default                     | Description                                                   |
| ---------------------- | ------------ | --------------------------- | ------------------------------------------------------------- |
| `strategy`             | string       | `field`                     | Dedup method: `field` (exact match) or `content` (similarity) |
| `fields`               | list[string] | *required for* `field`      | Field paths to compare for deduplication                      |
| `content_field`        | string       | `content`                   | Text field for content-based dedup                            |
| `similarity_threshold` | float        | `0.95`                      | Similarity threshold for content dedup (0.0-1.0)              |
| `keep`                 | string       | `first`                     | Which duplicate to keep: `first` or `last`                    |
| `case_sensitive`       | boolean      | `true`                      | Whether string comparisons are case-sensitive                 |

## Deduplication Strategies

| Strategy  | Performance     | Best For                              |
| --------- | --------------- | ------------------------------------- |
| `field`   | O(N) hash-based | Exact field matching (URL, ID, title) |
| `content` | O(N²) pairwise  | Near-duplicate text detection         |

## Configuration Examples

```json Deduplicate by URL
{
  "stage_name": "deduplicate",
  "stage_type": "reduce",
  "config": {
    "stage_id": "deduplicate",
    "parameters": {
      "strategy": "field",
      "fields": ["metadata.source_url"],
      "keep": "first"
    }
  }
}
```

```json Case-Insensitive Author Dedup
{
  "stage_name": "deduplicate",
  "stage_type": "reduce",
  "config": {
    "stage_id": "deduplicate",
    "parameters": {
      "strategy": "field",
      "fields": ["metadata.author"],
      "case_sensitive": false
    }
  }
}
```

```json Multi-Field Dedup
{
  "stage_name": "deduplicate",
  "stage_type": "reduce",
  "config": {
    "stage_id": "deduplicate",
    "parameters": {
      "strategy": "field",
      "fields": ["metadata.author", "metadata.title"]
    }
  }
}
```

```json Content Similarity Dedup
{
  "stage_name": "deduplicate",
  "stage_type": "reduce",
  "config": {
    "stage_id": "deduplicate",
    "parameters": {
      "strategy": "content",
      "content_field": "content",
      "similarity_threshold": 0.9,
      "keep": "first"
    }
  }
}
```

For best results, place `deduplicate` after sorting or reranking so that `keep: "first"` retains the highest-scored duplicate, which is the most relevant version of each document.
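To make the two strategies above concrete, here is a minimal Python sketch of how a hash-based field pass and a pairwise content pass could behave. It is not the stage's actual implementation. The helper names (`get_path`, `dedup_by_fields`, `dedup_by_content`) are hypothetical, `difflib.SequenceMatcher` is only a stand-in for whatever similarity measure the stage really uses, and the ordering of results for `keep="last"` is an assumption.

```python
from difflib import SequenceMatcher
from typing import Any


def get_path(doc: dict, path: str) -> Any:
    """Resolve a dotted path like 'metadata.author'; a missing field yields None."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def dedup_by_fields(docs, fields, keep="first", case_sensitive=True):
    """Hash-based pass: keep one document per distinct tuple of field values.

    Assumes the field values are hashable (strings, numbers, None).
    """
    def key(doc):
        values = [get_path(doc, field) for field in fields]
        if not case_sensitive:
            values = [v.lower() if isinstance(v, str) else v for v in values]
        return tuple(values)

    # For keep="last", scan from the end so the last occurrence is seen first.
    ordered = list(docs) if keep == "first" else list(reversed(docs))
    seen, kept = set(), []
    for doc in ordered:
        k = key(doc)
        if k not in seen:
            seen.add(k)
            kept.append(doc)
    # Restore the original relative order (an assumption of this sketch).
    return kept if keep == "first" else list(reversed(kept))


def dedup_by_content(docs, content_field="content", similarity_threshold=0.95):
    """Pairwise pass: drop a document if it is too similar to one already kept.

    SequenceMatcher.ratio() is a stand-in similarity measure for this sketch,
    and only keep="first" behavior is modeled.
    """
    kept, kept_texts = [], []
    for doc in docs:
        text = str(get_path(doc, content_field) or "")
        is_duplicate = any(
            SequenceMatcher(None, text, other).ratio() >= similarity_threshold
            for other in kept_texts
        )
        if not is_duplicate:
            kept.append(doc)
            kept_texts.append(text)
    return kept


# Illustrative documents (made up for this example).
docs = [
    {"id": 1, "metadata": {"author": "Ada"}, "content": "Vector search finds similar items."},
    {"id": 2, "metadata": {"author": "ada"}, "content": "Vector search finds similar items!"},
    {"id": 3, "metadata": {"author": "Grace"}, "content": "Rerankers reorder candidates."},
    {"id": 4, "metadata": {}, "content": "Completely different text here."},
]

print([d["id"] for d in dedup_by_fields(docs, ["metadata.author"], case_sensitive=False)])
print([d["id"] for d in dedup_by_fields(docs, ["metadata.author"], keep="last", case_sensitive=False)])
print([d["id"] for d in dedup_by_content(docs, similarity_threshold=0.9)])
```

In this sketch, documents that lack the field all share the key `None`, so they collapse into a single result. The field pass touches each document once, while the content pass compares each document against every document kept so far, which is where the O(N) versus O(N²) difference in the strategy table comes from.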
## Performance

| Metric         | Value                                                |
| -------------- | ---------------------------------------------------- |
| **Latency**    | < 5 ms (field) / 10-100 ms (content)                 |
| **Memory**     | O(N) hash set (field) / O(N) content cache (content) |
| **Cost**       | Free                                                 |
| **Complexity** | O(N) field / O(N²) content                           |

## Common Pipeline Patterns

### Web Search Deduplication

```json
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
            "top_k": 50
          }
        ],
        "final_top_k": 50
      }
    }
  },
  {
    "stage_name": "external_web_search",
    "stage_type": "apply",
    "config": {
      "stage_id": "external_web_search",
      "parameters": {
        "query": "{{INPUT.query}}",
        "num_results": 10
      }
    }
  },
  {
    "stage_name": "deduplicate",
    "stage_type": "reduce",
    "config": {
      "stage_id": "deduplicate",
      "parameters": {
        "strategy": "field",
        "fields": ["metadata.source_url"]
      }
    }
  }
]
```

### Cross-Collection Dedup

```json
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": {"input_mode": "text", "value": "{{INPUT.query}}"},
            "top_k": 100
          }
        ],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "query": "{{INPUT.query}}",
        "document_field": "content"
      }
    }
  },
  {
    "stage_name": "deduplicate",
    "stage_type": "reduce",
    "config": {
      "stage_id": "deduplicate",
      "parameters": {
        "strategy": "content",
        "content_field": "content",
        "similarity_threshold": 0.85
      }
    }
  }
]
```

## Error Handling

| Error                | Behavior                                               |
| -------------------- | ------------------------------------------------------ |
| Field doesn't exist  | Documents with missing fields have `None` as key value |
| All unique documents | Returns all documents unchanged                        |
| Empty input          | Returns empty result set                               |
| Single document      | Returned as-is (no duplicates possible)                |

## Related

* [Group By](/retrieval/stages/group-by) - Group documents with aggregation
* [Limit](/retrieval/stages/limit) - Truncate results after deduplication
* [Sample](/retrieval/stages/sample) - Random sampling (different from dedup)
* [Unwind](/retrieval/stages/unwind) - Inverse: expand grouped items