# Deduplicate

> Remove duplicate documents by field match or content similarity

<Frame>
  <img src="https://mintcdn.com/mixpeek/TwtTrae3Fi3EFJ72/assets/retrievers/deduplicate.svg?fit=max&auto=format&n=TwtTrae3Fi3EFJ72&q=85&s=38b32d27d3be3567bcb3599d6630cbff" alt="Deduplicate stage showing removal of duplicate documents" width="1000" height="380" data-path="assets/retrievers/deduplicate.svg" />
</Frame>

The Deduplicate stage removes duplicate documents from the result set, using either exact field matching or content similarity. It is comparable to SQL's `DISTINCT`, MongoDB's `$group` with `$first`, and Elasticsearch's field collapsing.

<Note>
  **Stage Category**: REDUCE (Removes duplicates)

  **Transformation**: N documents → M documents (M ≤ N, duplicates removed)
</Note>

## When to Use

| Use Case                    | Description                                       |
| --------------------------- | ------------------------------------------------- |
| **URL deduplication**       | One result per source URL after web enrichment    |
| **Author collapse**         | Keep one result per author                        |
| **Content dedup**           | Remove near-identical text chunks                 |
| **Multi-source merge**      | Remove overlapping results from multiple searches |
| **Query expansion cleanup** | Remove duplicates from expanded query results     |

## When NOT to Use

| Scenario                   | Recommended Alternative    |
| -------------------------- | -------------------------- |
| Grouping with aggregation  | `group_by` stage           |
| Sampling unique categories | `sample` stage, stratified |
| Limiting result count      | `limit` stage              |
| Filtering by criteria      | `attribute_filter` stage   |

## Parameters

| Parameter              | Type          | Default              | Description                                                   |
| ---------------------- | ------------- | -------------------- | ------------------------------------------------------------- |
| `strategy`             | string        | `field`              | Dedup method: `field` (exact match) or `content` (similarity) |
| `fields`               | list\[string] | *required for field* | Field paths to compare for deduplication                      |
| `content_field`        | string        | `content`            | Text field for content-based dedup                            |
| `similarity_threshold` | float         | `0.95`               | Similarity threshold for content dedup (0.0-1.0)              |
| `keep`                 | string        | `first`              | Which duplicate to keep: `first` or `last`                    |
| `case_sensitive`       | boolean       | `true`               | Whether string comparisons are case-sensitive                 |
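
As a rough illustration of how `fields`, `keep`, and `case_sensitive` interact under the `field` strategy, here is a minimal Python sketch. It is not Mixpeek's implementation: the helper names `get_path` and `dedup_by_fields` are hypothetical, field values are assumed to be hashable scalars, and leaving a `keep: "last"` survivor in the position of the first occurrence is an assumption.

```python
from typing import Any


def get_path(doc: dict, path: str) -> Any:
    """Resolve a dotted path such as 'metadata.author'; None if any part is missing."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def dedup_by_fields(docs, fields, keep="first", case_sensitive=True):
    """Keep one document per unique combination of the given field values."""

    def key_for(doc):
        values = []
        for path in fields:
            value = get_path(doc, path)
            # Only strings are normalized when comparisons are case-insensitive.
            if isinstance(value, str) and not case_sensitive:
                value = value.casefold()
            values.append(value)
        return tuple(values)

    kept = {}  # dicts preserve insertion order, so output order is stable
    for doc in docs:
        key = key_for(doc)
        if key not in kept:
            kept[key] = doc
        elif keep == "last":
            # Assumption: the later duplicate takes over the earlier slot.
            kept[key] = doc
    return list(kept.values())


docs = [
    {"id": 1, "metadata": {"author": "Ada"}},
    {"id": 2, "metadata": {"author": "ada"}},
    {"id": 3, "metadata": {"author": "Bob"}},
]

case_sensitive_ids = [d["id"] for d in dedup_by_fields(docs, ["metadata.author"])]
folded_ids = [
    d["id"] for d in dedup_by_fields(docs, ["metadata.author"], case_sensitive=False)
]
```

With the default case-sensitive comparison, "Ada" and "ada" count as different authors, so all three documents survive. With `case_sensitive=False` the second document is dropped as a duplicate of the first.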

## Deduplication Strategies

| Strategy  | Performance     | Best For                              |
| --------- | --------------- | ------------------------------------- |
| `field`   | O(N) hash-based | Exact field matching (URL, ID, title) |
| `content` | O(N²) pairwise  | Near-duplicate text detection         |
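
The pairwise nature of the `content` strategy can be sketched as follows. This page does not say which similarity metric the stage uses, so the example substitutes `difflib.SequenceMatcher` from the Python standard library as a stand-in. The function name `dedup_by_content`, the `>=` comparison against the threshold, and the handling of missing text as an empty string are all assumptions.

```python
from difflib import SequenceMatcher


def dedup_by_content(docs, content_field="content", similarity_threshold=0.95):
    """Drop documents whose text is near-identical to an already kept document."""
    kept = []
    for doc in docs:
        text = doc.get(content_field) or ""
        # Pairwise check against every survivor so far: O(N^2) in the worst case.
        is_duplicate = any(
            SequenceMatcher(None, text, other.get(content_field) or "").ratio()
            >= similarity_threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(doc)
    return kept


docs = [
    {"id": 1, "content": "The quick brown fox jumps over the lazy dog."},
    {"id": 2, "content": "The quick brown fox jumps over the lazy dog!"},
    {"id": 3, "content": "A completely different sentence about databases."},
]

default_ids = [d["id"] for d in dedup_by_content(docs)]
strict_ids = [d["id"] for d in dedup_by_content(docs, similarity_threshold=0.99)]
```

The first two texts differ by one character, so under this stand-in metric they fall above the default threshold of `0.95` and the second is removed. Raising the threshold to `0.99` keeps both, which shows how sensitive the result is to `similarity_threshold`.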

## Configuration Examples

<CodeGroup>
  ```json Deduplicate by URL theme={null}
  {
    "stage_name": "deduplicate",
    "stage_type": "reduce",
    "config": {
      "stage_id": "deduplicate",
      "parameters": {
        "strategy": "field",
        "fields": ["metadata.source_url"],
        "keep": "first"
      }
    }
  }
  ```

  ```json Case-Insensitive Author Dedup theme={null}
  {
    "stage_name": "deduplicate",
    "stage_type": "reduce",
    "config": {
      "stage_id": "deduplicate",
      "parameters": {
        "strategy": "field",
        "fields": ["metadata.author"],
        "case_sensitive": false
      }
    }
  }
  ```

  ```json Multi-Field Dedup theme={null}
  {
    "stage_name": "deduplicate",
    "stage_type": "reduce",
    "config": {
      "stage_id": "deduplicate",
      "parameters": {
        "strategy": "field",
        "fields": ["metadata.author", "metadata.title"]
      }
    }
  }
  ```

  ```json Content Similarity Dedup theme={null}
  {
    "stage_name": "deduplicate",
    "stage_type": "reduce",
    "config": {
      "stage_id": "deduplicate",
      "parameters": {
        "strategy": "content",
        "content_field": "content",
        "similarity_threshold": 0.9,
        "keep": "first"
      }
    }
  }
  ```
</CodeGroup>

<Tip>
  Place `deduplicate` after sorting or reranking. With `keep: "first"`, the stage then retains the highest-scored document in each duplicate group, so the most relevant version survives.
</Tip>
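
The effect of stage ordering in the tip above can be seen in a small Python sketch. The helper `keep_first_per_key` and the sample `url` and `score` fields are hypothetical and only mimic the `keep: "first"` behavior described on this page.

```python
def keep_first_per_key(docs, key):
    """Keep the first document seen for each value of `key` (hypothetical helper)."""
    seen = set()
    kept = []
    for doc in docs:
        if doc[key] not in seen:
            seen.add(doc[key])
            kept.append(doc)
    return kept


results = [
    {"url": "https://example.com/a", "score": 0.41},
    {"url": "https://example.com/b", "score": 0.77},
    {"url": "https://example.com/a", "score": 0.93},
]

# Deduplicating first keeps whichever copy happened to arrive first.
dedup_then_nothing = keep_first_per_key(results, "url")

# Sorting by score first means "first" is also "best" within each group.
ranked = sorted(results, key=lambda d: d["score"], reverse=True)
sort_then_dedup = keep_first_per_key(ranked, "url")
```

Without sorting, the copy of `https://example.com/a` scored `0.41` survives. After sorting, the copy scored `0.93` survives instead.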

## Performance

| Metric         | Value                                                |
| -------------- | ---------------------------------------------------- |
| **Latency**    | \< 5ms (field) / 10-100ms (content)                  |
| **Memory**     | O(N) hash set (field) / O(N) content cache (content) |
| **Cost**       | Free                                                 |
| **Complexity** | O(N) field / O(N²) content                           |
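
To put the O(N²) figure in perspective, the worked calculation below counts comparisons for a pairwise scan in the worst case, where no document is a duplicate and each one is compared against every earlier document. The function name is hypothetical, and the real stage may short-circuit or batch comparisons differently.

```python
def worst_case_comparisons(n: int) -> int:
    """Number of unordered pairs among n documents: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Input sizes matching typical top_k values, plus a larger set for contrast.
for n in (10, 50, 100, 1000):
    print(f"{n:>5} documents -> {worst_case_comparisons(n):>7} comparisons")
```

Going from 100 to 1000 documents multiplies the input by 10 but the pair count by roughly 100 (4950 versus 499500), so the size of the candidate set dominates the cost of content dedup.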

## Common Pipeline Patterns

### Web Search Deduplication

```json theme={null}
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "value": "{{INPUT.query}}"}, "top_k": 50}],
        "final_top_k": 50
      }
    }
  },
  {
    "stage_name": "external_web_search",
    "stage_type": "apply",
    "config": {
      "stage_id": "external_web_search",
      "parameters": {
        "query": "{{INPUT.query}}",
        "num_results": 10
      }
    }
  },
  {
    "stage_name": "deduplicate",
    "stage_type": "reduce",
    "config": {
      "stage_id": "deduplicate",
      "parameters": {
        "strategy": "field",
        "fields": ["metadata.source_url"]
      }
    }
  }
]
```

### Cross-Collection Dedup

```json theme={null}
[
  {
    "stage_name": "feature_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [{"feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": {"input_mode": "text", "value": "{{INPUT.query}}"}, "top_k": 100}],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "query": "{{INPUT.query}}",
        "document_field": "content"
      }
    }
  },
  {
    "stage_name": "deduplicate",
    "stage_type": "reduce",
    "config": {
      "stage_id": "deduplicate",
      "parameters": {
        "strategy": "content",
        "content_field": "content",
        "similarity_threshold": 0.85
      }
    }
  }
]
```

## Error Handling

| Error                | Behavior                                               |
| -------------------- | ------------------------------------------------------ |
| Field doesn't exist  | Missing field is treated as `None` in the dedup key    |
| All unique documents | Returns all documents unchanged                        |
| Empty input          | Returns empty result set                               |
| Single document      | Returned as-is (no duplicates possible)                |
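
The edge cases in the table can be exercised with a short Python sketch. The helper `dedup_on_field` is hypothetical. The table only states that a missing field yields `None` as the key value. That documents sharing a `None` key then collapse into one is an assumption of this sketch, not confirmed behavior.

```python
def dedup_on_field(docs, path):
    """Keep the first document per value of a dotted field path (hypothetical helper)."""
    seen = set()
    kept = []
    for doc in docs:
        value = doc
        for part in path.split("."):
            # Assumption: a missing segment yields None instead of raising.
            value = value.get(part) if isinstance(value, dict) else None
        if value not in seen:
            seen.add(value)
            kept.append(doc)
    return kept


path = "metadata.source_url"

empty_result = dedup_on_field([], path)

single = [{"id": 1, "metadata": {"source_url": "https://example.com/a"}}]
single_result = dedup_on_field(single, path)

missing = [
    {"id": 1},                                   # no metadata at all
    {"id": 2, "metadata": {}},                   # metadata without source_url
    {"id": 3, "metadata": {"source_url": "https://example.com/a"}},
]
missing_ids = [d["id"] for d in dedup_on_field(missing, path)]
```

Under these assumptions, empty and single-document inputs pass through unchanged, while the two documents that lack `metadata.source_url` share the key `None` and only the first is kept. If that collapse is unwanted, filtering out documents without the field before this stage avoids it.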

## Related

* [Group By](/retrieval/stages/group-by) - Group documents with aggregation
* [Limit](/retrieval/stages/limit) - Truncate results after deduplication
* [Sample](/retrieval/stages/sample) - Random sampling (different from dedup)
* [Unwind](/retrieval/stages/unwind) - Inverse: expand grouped items
