
# Sample

> Select a random or stratified sample of documents from results

<Frame>
  <img src="https://mintcdn.com/mixpeek/TmiAqiYj-LwmWL2a/assets/retrievers/sample.svg?fit=max&auto=format&n=TmiAqiYj-LwmWL2a&q=85&s=19f8bd5a9bd579e271fe9afaed60a975" alt="Sample stage showing random and stratified sampling of results" width="800" height="300" data-path="assets/retrievers/sample.svg" />
</Frame>

The Sample stage selects a subset of documents from your results using random, stratified, or reservoir sampling. Use it to draw a representative sample, shrink a result set, or ensure diversity across categories.

<Note>
  **Stage Category**: REDUCE (Samples documents)

  **Transformation**: N documents → S sampled documents (where S ≤ N)
</Note>

## When to Use

| Use Case                   | Description                                      |
| -------------------------- | ------------------------------------------------ |
| **Representative samples** | Draw a manageable subset from a large result set |
| **A/B testing**            | Random document selection                        |
| **Stratified selection**   | Guaranteed representation from every category    |
| **Cost reduction**         | Sample before expensive operations               |

## When NOT to Use

| Scenario                | Recommended Alternative |
| ----------------------- | ----------------------- |
| Top-N by relevance      | `limit` or `rerank`     |
| Diversity by similarity | `mmr`                   |
| Remove duplicates       | `deduplicate`           |
| All results needed      | Skip sampling           |

## Parameters

| Parameter         | Type    | Default  | Description                                            |
| ----------------- | ------- | -------- | ------------------------------------------------------ |
| `strategy`        | string  | `random` | Sampling strategy: `random`, `stratified`, `reservoir` |
| `count`           | integer | `10`     | Number of documents to sample                          |
| `seed`            | integer | *random* | Random seed for reproducibility                        |
| `stratify_by`     | string  | *none*   | Field for stratified sampling                          |
| `min_per_stratum` | integer | `1`      | Minimum samples per stratum                            |
| `preserve_top_k`  | integer | `0`      | Always keep top K by score, sample the rest            |

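The parameter table describes `preserve_top_k` as keeping the top K documents by score and sampling the rest. A minimal Python sketch of that idea follows. It is not the stage's actual implementation: the function name is hypothetical, and it assumes the preserved documents count toward `count`, which this page does not specify.

```python
import random

def sample_preserving_top_k(docs, count, preserve_top_k=0, seed=None):
    """Keep the highest-scoring docs, then fill the remaining slots at random.

    Assumption: preserved documents count toward `count`.
    """
    rng = random.Random(seed)
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    kept = ranked[:min(preserve_top_k, count)]        # always-kept head
    rest = ranked[len(kept):]                         # candidates for sampling
    n_extra = min(count - len(kept), len(rest))       # never ask for more than exist
    return kept + rng.sample(rest, n_extra)

docs = [{"document_id": f"d{i}", "score": i / 10} for i in range(10)]
picked = sample_preserving_top_k(docs, count=5, preserve_top_k=2, seed=1)
print([d["document_id"] for d in picked])
```

The first two entries of `picked` are always the two highest-scoring documents; the other three vary with the seed.
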
## Sampling Strategies

| Strategy     | Description                    | Best For               |
| ------------ | ------------------------------ | ---------------------- |
| `random`     | Uniform random selection       | General sampling       |
| `stratified` | Proportional samples per group | Category balance       |
| `reservoir`  | Memory-efficient sampling      | Very large result sets |

## Configuration Examples

<CodeGroup>
  ```json Random Sample theme={null}
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 20
      }
    }
  }
  ```

  ```json Reproducible Random Sample theme={null}
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 10,
        "seed": 42
      }
    }
  }
  ```

  ```json Stratified by Category theme={null}
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.category",
        "count": 30,
        "min_per_stratum": 3
      }
    }
  }
  ```

  ```json Reservoir Sampling theme={null}
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "reservoir",
        "count": 25
      }
    }
  }
  ```

  ```json Stratified with Total Limit theme={null}
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.source",
        "count": 30
      }
    }
  }
  ```
</CodeGroup>
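If you assemble these configurations in code, a small helper can keep the stage envelope consistent. The Python sketch below only builds the JSON shape shown in the examples above. The helper name `build_sample_stage` is hypothetical, and the client-side check that `stratified` requires `stratify_by` is an assumption of this sketch, not documented API behavior.

```python
import json

STRATEGIES = {"random", "stratified", "reservoir"}

def build_sample_stage(strategy="random", count=10, **extra):
    """Return a Sample stage config dict in the shape used on this page."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Assumption: fail early instead of relying on a server-side fallback.
    if strategy == "stratified" and "stratify_by" not in extra:
        raise ValueError("stratified sampling needs stratify_by")
    parameters = {"strategy": strategy, "count": count, **extra}
    return {
        "stage_name": "sample",
        "stage_type": "reduce",
        "config": {"stage_id": "sample", "parameters": parameters},
    }

stage = build_sample_stage(
    "stratified", 30, stratify_by="metadata.category", min_per_stratum=3
)
print(json.dumps(stage, indent=2))
```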

## How Sampling Works

### Random Sampling

Selects documents with uniform probability:

```
Input: [A, B, C, D, E, F, G, H, I, J] (10 docs)
Sample(count=3): [D, G, B] (random selection)
```
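The behavior above can be approximated in a few lines of Python. This is an illustrative sketch using the standard library, not the stage's actual implementation; the function name is hypothetical. It also mirrors two edge cases listed under Error Handling: a `count` larger than the input returns everything, and a `count` of zero returns nothing.

```python
import random

def random_sample(docs, count, seed=None):
    """Uniform sampling without replacement."""
    if count <= 0:
        return []
    if count >= len(docs):
        return list(docs)              # nothing to drop
    rng = random.Random(seed)          # private generator, no global state
    return rng.sample(docs, count)

docs = list("ABCDEFGHIJ")
print(random_sample(docs, 3, seed=42))
```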

### Stratified Sampling

Ensures representation from each group:

```
Input:
  - Category A: [A1, A2, A3, A4, A5]
  - Category B: [B1, B2, B3]
  - Category C: [C1, C2]

Stratified(min_per_stratum=2):
  [A1, A3, B2, B1, C1, C2]
```
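The example above takes two documents from every category. A minimal Python sketch of that per-group step follows, under stated assumptions: the function name is hypothetical, documents are plain dicts, and the stratum is read from a flat key rather than a dotted path such as `metadata.category`.

```python
import random
from collections import defaultdict

def stratified_min_sample(docs, stratify_by, min_per_stratum=1, seed=None):
    """Draw up to min_per_stratum documents from every group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[doc[stratify_by]].append(doc)   # group documents by stratum value
    sampled = []
    for key in sorted(strata):                 # stable stratum order
        group = strata[key]
        k = min(min_per_stratum, len(group))   # small groups give what they have
        sampled.extend(rng.sample(group, k))
    return sampled

docs = (
    [{"id": f"A{i}", "category": "A"} for i in range(1, 6)]
    + [{"id": f"B{i}", "category": "B"} for i in range(1, 4)]
    + [{"id": f"C{i}", "category": "C"} for i in range(1, 3)]
)
print([d["id"] for d in stratified_min_sample(docs, "category", 2, seed=0)])
```

The result always contains six documents, two per category; which two come from A and B depends on the seed.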

### Reservoir Sampling

Memory-efficient uniform sampling for very large or streaming result sets:

```
Input: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] (processed one at a time)
Reservoir(count=3): [2, 6, 9] (uniform random, single pass)
```
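This page does not say which reservoir algorithm the stage uses. One classic single-pass approach is Algorithm R, sketched below in Python as an illustration only; the function name is hypothetical. It holds at most `count` items in memory regardless of stream length.

```python
import random

def reservoir_sample(stream, count, seed=None):
    """Algorithm R: uniform sample of `count` items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < count:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < count:
                reservoir[j] = item      # replace with probability count / (i + 1)
    return reservoir

print(reservoir_sample(iter(range(1, 11)), 3, seed=7))
```

After processing item `i` (zero-based), every item seen so far has the same probability `count / (i + 1)` of being in the reservoir, which is what makes the final sample uniform.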

## Output Schema

```json theme={null}
{
  "documents": [
    {
      "document_id": "doc_123",
      "content": "Sampled document content...",
      "score": 0.85,
      "sample": {
        "method": "stratified",
        "stratum": "electronics",
        "sample_index": 0
      }
    }
  ],
  "metadata": {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
      "electronics": {"input": 45, "sampled": 5},
      "clothing": {"input": 35, "sampled": 5},
      "books": {"input": 20, "sampled": 5}
    }
  }
}
```
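The `metadata.strata` block makes it easy to check how evenly each group was sampled. The Python sketch below computes the per-stratum sampling rate and verifies that the stratum counts add up to the reported totals, as they do in the example above. The helper name is hypothetical, and whether the totals always reconcile in real responses is an assumption.

```python
def summarize_strata(metadata):
    """Return (sampling rate per stratum, whether stratum counts match the totals)."""
    strata = metadata.get("strata", {})
    rates = {
        name: counts["sampled"] / counts["input"]
        for name, counts in strata.items()
        if counts["input"] > 0                      # avoid dividing by zero
    }
    totals_match = (
        sum(c["input"] for c in strata.values()) == metadata["total_input"]
        and sum(c["sampled"] for c in strata.values()) == metadata["sample_size"]
    )
    return rates, totals_match

metadata = {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
        "electronics": {"input": 45, "sampled": 5},
        "clothing": {"input": 35, "sampled": 5},
        "books": {"input": 20, "sampled": 5},
    },
}
rates, totals_match = summarize_strata(metadata)
print(totals_match, rates)
```

In this example the smallest group, `books`, is sampled at the highest rate (5 of 20) because every stratum contributes the same number of documents.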

## Performance

| Metric         | Value                              |
| -------------- | ---------------------------------- |
| **Latency**    | \< 5ms                             |
| **Memory**     | O(N)                               |
| **Cost**       | Free                               |
| **Complexity** | O(N) random, O(N log N) stratified |

## Common Pipeline Patterns

### Search + Sample for Testing

```json theme={null}
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 1000 }
        ],
        "final_top_k": 1000
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 50,
        "seed": 42
      }
    }
  }
]
```

### Balanced Category Sample

```json theme={null}
[
  {
    "stage_name": "hybrid_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 500 }
        ],
        "final_top_k": 500
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.category",
        "count": 25,
        "min_per_stratum": 5
      }
    }
  }
]
```

### Sample Before LLM Processing

```json theme={null}
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 200 }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "top_k": 50
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 10
      }
    }
  },
  {
    "stage_name": "llm_enrichment",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "model": "gpt-4o",
        "prompt": "Extract key insights",
        "output_field": "insights"
      }
    }
  }
]
```

### Cluster + Sample Representatives

```json theme={null}
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 200 }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "n_clusters": 10
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "cluster.cluster_id",
        "count": 20,
        "min_per_stratum": 2
      }
    }
  }
]
```

### Multi-Source Balanced Sample

```json theme={null}
[
  {
    "stage_name": "structured_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "conditions": {
          "field": "metadata.date",
          "operator": "gte",
          "value": "2024-01-01"
        }
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.source",
        "count": 30,
        "min_per_stratum": 10
      }
    }
  },
  {
    "stage_name": "summarize",
    "stage_type": "reduce",
    "config": {
      "stage_id": "summarize",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o",
        "prompt": "Compare perspectives from different sources on this topic\n\n{{DOCUMENTS}}"
      }
    }
  }
]
```

## Stratified Sampling Details

### Minimum Per Stratum

```json theme={null}
{
  "strategy": "stratified",
  "stratify_by": "metadata.category",
  "count": 30,
  "min_per_stratum": 5
}
```

Each group is guaranteed at least 5 samples, or all of its documents if it has fewer than 5. The remaining slots are allocated proportionally until `count` is reached.

### Proportional Allocation

```json theme={null}
{
  "strategy": "stratified",
  "stratify_by": "metadata.category",
  "count": 30
}
```

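By default, stratified sampling allocates samples in proportion to each group's size. Because `min_per_stratum` defaults to `1`, every non-empty group still contributes at least one document.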
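To make the allocation arithmetic concrete, here is a Python sketch of one way to combine a per-stratum minimum with proportional allocation. It is an assumption-laden illustration, not the stage's documented algorithm: the function name is hypothetical, the leftover slots are split in proportion to each group's remaining documents, and rounding uses the largest-remainder method.

```python
def allocate(sizes, count, min_per_stratum=1):
    """Map each stratum to a quota: a guaranteed floor plus a proportional share."""
    # Floor: every stratum gets min_per_stratum, capped by its own size.
    quotas = {k: min(min_per_stratum, n) for k, n in sizes.items()}
    remaining = count - sum(quotas.values())
    if remaining <= 0:
        return quotas                     # assumption: the floor wins over count
    spare = {k: sizes[k] - quotas[k] for k in sizes}
    total_spare = sum(spare.values())
    if total_spare == 0:
        return quotas                     # every document is already selected
    remaining = min(remaining, total_spare)
    exact = {k: remaining * spare[k] / total_spare for k in sizes}
    extra = {k: int(exact[k]) for k in sizes}          # round down first
    leftover = remaining - sum(extra.values())
    # Largest-remainder rounding: hand the last slots to the biggest fractions.
    by_fraction = sorted(sizes, key=lambda k: exact[k] - extra[k], reverse=True)
    for k in by_fraction[:leftover]:
        extra[k] += 1
    return {k: quotas[k] + extra[k] for k in sizes}

sizes = {"electronics": 45, "clothing": 35, "books": 20}
print(allocate(sizes, count=30, min_per_stratum=5))
# {'electronics': 12, 'clothing': 10, 'books': 8}
```

With a floor of 5, the three groups first receive 15 slots in total. The other 15 are split over the 40, 30, and 15 documents still unselected, giving 7, 5, and 3 after rounding.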

## Reproducibility

Use `seed` for reproducible results:

```json theme={null}
{
  "strategy": "random",
  "count": 20,
  "seed": 12345
}
```

Same seed + same input = same output.
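"Same input" plausibly includes document order, since a seeded generator picks positions, not identities. The Python sketch below illustrates the principle with the standard library and is not a statement about the stage's internals. The helper name is hypothetical; it sorts by `document_id` first so the result no longer depends on arrival order.

```python
import random

def seeded_sample(docs, count, seed):
    """Seeded sampling that is independent of the input order."""
    ordered = sorted(docs, key=lambda d: d["document_id"])   # canonical order
    return random.Random(seed).sample(ordered, min(count, len(ordered)))

docs = [{"document_id": f"doc_{i:03d}"} for i in range(100)]
first = seeded_sample(docs, 20, seed=12345)
second = seeded_sample(list(reversed(docs)), 20, seed=12345)
assert first == second   # same seed, same documents, same sample
```

If a pipeline's upstream stages can return documents in a varying order, sorting on a stable key before a seeded sample is one way to keep runs comparable.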

## Error Handling

| Condition             | Behavior             |
| --------------------- | -------------------- |
| `count` > input size  | Return all documents |
| Empty stratum         | Skip that stratum    |
| Invalid `stratify_by` | Fall back to random  |
| `count` = 0           | Return empty result  |

## Sample vs Other Reduction Stages

| Stage         | Selection Basis       | Deterministic |
| ------------- | --------------------- | ------------- |
| `sample`      | Random/Stratified     | With seed     |
| `limit`       | Position              | Yes           |
| `mmr`         | Diversity + relevance | Yes           |
| `deduplicate` | Uniqueness            | Yes           |

## Related

* [Aggregate](/retrieval/stages/aggregate) - Statistical analysis
* [Group By](/retrieval/stages/group-by) - Group before sampling
* [Cluster](/retrieval/stages/cluster) - Semantic grouping
* [MMR](/retrieval/stages/mmr) - Diversity-based selection
