# Sample

> Select a random or stratified sample of documents from results

Sample stage showing random and stratified sampling of results

The Sample stage selects a subset of documents from your results using random, stratified, or reservoir sampling. This is useful for creating representative samples, reducing result sets, or ensuring diversity across categories.

**Stage Category**: REDUCE (Samples documents)

**Transformation**: N documents → S sampled documents (where S ≤ N)

## When to Use

| Use Case                   | Description                            |
| -------------------------- | -------------------------------------- |
| **Representative samples** | Draw a sample from a large result set  |
| **A/B testing**            | Random document selection              |
| **Stratified selection**   | Guaranteed representation per category |
| **Cost reduction**         | Sample before expensive operations     |

## When NOT to Use

| Scenario                | Recommended Alternative |
| ----------------------- | ----------------------- |
| Top-N by relevance      | `limit` or `rerank`     |
| Diversity by similarity | `mmr`                   |
| Remove duplicates       | `deduplicate`           |
| All results needed      | Skip sampling           |

## Parameters

| Parameter         | Type    | Default  | Description                                            |
| ----------------- | ------- | -------- | ------------------------------------------------------ |
| `strategy`        | string  | `random` | Sampling strategy: `random`, `stratified`, `reservoir` |
| `count`           | integer | `10`     | Number of documents to sample                          |
| `seed`            | integer | *random* | Random seed for reproducibility                        |
| `stratify_by`     | string  | *none*   | Field to group by for stratified sampling              |
| `min_per_stratum` | integer | `1`      | Minimum samples per stratum                            |
| `preserve_top_k`  | integer | `0`      | Always keep top K by score, sample the rest            |

## Sampling Strategies

| Strategy     | Description                    | Best For               |
| ------------ | ------------------------------ | ---------------------- |
| `random`     | Uniform random selection       | General sampling       |
| `stratified` | Proportional samples per group | Category balance       |
| `reservoir`  | Memory-efficient sampling      | Very large result sets |

## Configuration Examples

```json Random Sample
{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "random",
      "count": 20
    }
  }
}
```

```json Reproducible Random Sample
{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "random",
      "count": 10,
      "seed": 42
    }
  }
}
```

```json Stratified by Category
{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "stratified",
      "stratify_by": "metadata.category",
      "count": 30,
      "min_per_stratum": 3
    }
  }
}
```

```json Reservoir Sampling
{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "reservoir",
      "count": 25
    }
  }
}
```

```json Stratified with Total Limit
{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "stratified",
      "stratify_by": "metadata.source",
      "count": 30
    }
  }
}
```

## How Sampling Works

### Random Sampling

Selects documents with uniform probability:

```
Input: [A, B, C, D, E, F, G, H, I, J] (10 docs)
Sample(count=3): [D, G, B] (random selection)
```

### Stratified Sampling

Ensures representation from each group:

```
Input:
  - Category A: [A1, A2, A3, A4, A5]
  - Category B: [B1, B2, B3]
  - Category C: [C1, C2]

Stratified(min_per_stratum=2): [A1, A3, B2, B1, C1, C2]
```

### Reservoir Sampling

Memory-efficient uniform sampling for very large or streaming result sets:

```
Input: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] (processed one at a time)
Reservoir(count=3): [2, 6, 9] (uniform random, single pass)
```

## Output Schema

```json
{
  "documents": [
    {
      "document_id": "doc_123",
      "content": "Sampled document content...",
      "score": 0.85,
      "sample": {
        "method": "stratified",
        "stratum": "electronics",
        "sample_index": 0
      }
    }
  ],
  "metadata": {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
      "electronics": {"input": 45, "sampled": 5},
      "clothing": {"input": 35, "sampled": 5},
      "books": {"input": 20, "sampled": 5}
    }
  }
}
```

## Performance

| Metric         | Value                              |
| -------------- | ---------------------------------- |
| **Latency**    | \< 5ms                             |
| **Memory**     | O(N)                               |
| **Cost**       | Free                               |
| **Complexity** | O(N) random, O(N log N) stratified |

## Common Pipeline Patterns

### Search + Sample for Testing

```json
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
            "top_k": 1000
          }
        ],
        "final_top_k": 1000
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 50,
        "seed": 42
      }
    }
  }
]
```

### Balanced Category Sample

```json
[
  {
    "stage_name": "hybrid_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
            "top_k": 500
          }
        ],
        "final_top_k": 500
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.category",
        "count": 25,
        "min_per_stratum": 5
      }
    }
  }
]
```

### Sample Before LLM Processing

```json
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
            "top_k": 200
          }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "top_k": 50
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 10
      }
    }
  },
  {
    "stage_name": "llm_enrichment",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "model": "gpt-4o",
        "prompt": "Extract key insights",
        "output_field": "insights"
      }
    }
  }
]
```

### Cluster + Sample Representatives

```json
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
            "top_k": 200
          }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "n_clusters": 10
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "cluster.cluster_id",
        "count": 20,
        "min_per_stratum": 2
      }
    }
  }
]
```

### Multi-Source Balanced Sample

```json
[
  {
    "stage_name": "structured_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "conditions": {
          "field": "metadata.date",
          "operator": "gte",
          "value": "2024-01-01"
        }
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.source",
        "count": 30,
        "min_per_stratum": 10
      }
    }
  },
  {
    "stage_name": "summarize",
    "stage_type": "reduce",
    "config": {
      "stage_id": "summarize",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o",
        "prompt": "Compare perspectives from different sources on this topic\n\n{{DOCUMENTS}}"
      }
    }
  }
]
```

## Stratified Sampling Details

### Minimum Per Stratum

```json
{
  "strategy": "stratified",
  "stratify_by": "metadata.category",
  "count": 30,
  "min_per_stratum": 5
}
```

Each group is guaranteed at least 5 samples (if available), with the remainder
allocated proportionally up to `count`.

### Proportional Allocation

```json
{
  "strategy": "stratified",
  "stratify_by": "metadata.category",
  "count": 30
}
```

By default, stratified sampling allocates samples in proportion to group size.

## Reproducibility

Use `seed` for reproducible results:

```json
{
  "strategy": "random",
  "count": 20,
  "seed": 12345
}
```

The same seed applied to the same input yields the same output.

## Error Handling

| Error                 | Behavior             |
| --------------------- | -------------------- |
| `count` > input size  | Return all documents |
| Empty stratum         | Skip that stratum    |
| Invalid `stratify_by` | Fall back to random  |
| `count` = 0           | Return empty         |

## Sample vs Other Reduction Stages

| Stage         | Selection Basis       | Deterministic |
| ------------- | --------------------- | ------------- |
| `sample`      | Random/Stratified     | With seed     |
| `limit`       | Position              | Yes           |
| `mmr`         | Diversity + relevance | Yes           |
| `deduplicate` | Uniqueness            | Yes           |

## Related

* [Aggregate](/retrieval/stages/aggregate) - Statistical analysis
* [Group By](/retrieval/stages/group-by) - Group before sampling
* [Cluster](/retrieval/stages/cluster) - Semantic grouping
* [MMR](/retrieval/stages/mmr) - Diversity-based selection
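
## Appendix: Sampling Sketch

The behavior described under "How Sampling Works" and "Stratified Sampling Details" can be illustrated with a short Python sketch. This is not Mixpeek's implementation. The function names `reservoir_sample` and `stratified_sample` and the `key` callable are hypothetical, the reservoir step uses the textbook Algorithm R, and the rule that `min_per_stratum` takes priority when the minimums add up to more than `count` is an assumption, since this page does not say how that case is resolved.

```python
import random
from collections import defaultdict


def reservoir_sample(docs, count, seed=None):
    """Uniform sample of up to `count` items in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, doc in enumerate(docs):
        if i < count:
            reservoir.append(doc)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < count:
                reservoir[j] = doc     # replace with probability count / (i + 1)
    return reservoir


def stratified_sample(docs, key, count, min_per_stratum=1, seed=None):
    """Give each stratum its minimum, then share remaining slots by group size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[key(doc)].append(doc)

    # Step 1: guaranteed minimum, capped by what each stratum actually holds.
    # Assumption: the minimum wins even if the total then exceeds `count`.
    quota = {s: min(min_per_stratum, len(m)) for s, m in strata.items()}

    # Step 2: hand out leftover slots one at a time to the stratum that is
    # furthest below its proportional share of `count`.
    total = len(docs)
    remaining = count - sum(quota.values())

    def deficit(stratum):
        return count * len(strata[stratum]) / total - quota[stratum]

    while remaining > 0:
        open_strata = [s for s in strata if quota[s] < len(strata[s])]
        if not open_strata:
            break                      # every stratum is exhausted
        quota[max(open_strata, key=deficit)] += 1
        remaining -= 1

    sampled = []
    for s, members in strata.items():
        sampled.extend(rng.sample(members, quota[s]))
    return sampled


if __name__ == "__main__":
    # Mirrors the stratified example: 5 docs in A, 3 in B, 2 in C.
    docs = [{"id": f"{c}{i}", "category": c}
            for c, n in (("A", 5), ("B", 3), ("C", 2))
            for i in range(1, n + 1)]
    picked = stratified_sample(docs, key=lambda d: d["category"],
                               count=6, min_per_stratum=2, seed=42)
    print([d["id"] for d in picked])
    print(reservoir_sample(range(1, 11), count=3, seed=42))
```

Passing the same `seed` with the same input order reproduces the same selection in this sketch, which matches the reproducibility guarantee above. When `count` exceeds the input size, both functions return every document, in line with the Error Handling table.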