Sample

Sample stage showing random and stratified sampling of results

The Sample stage selects a subset of documents from your results using random, stratified, or reservoir sampling. This is useful for creating representative samples, reducing result sets, or ensuring diversity across categories.

Stage Category: REDUCE (Samples documents)

Transformation: N documents → S sampled documents (where S ≤ N)

When to Use

Use Case	Description
Representative samples	Get a sample of large result sets
A/B testing	Random document selection
Stratified selection	Guaranteed representation for every category
Cost reduction	Sample before expensive operations

When NOT to Use

Scenario	Recommended Alternative
Top-N by relevance	`limit` or `rerank`
Diversity by similarity	`mmr`
Remove duplicates	`deduplicate`
All results needed	Skip sampling

Parameters

Parameter	Type	Default	Description
`strategy`	string	`random`	Sampling strategy: `random`, `stratified`, `reservoir`
`count`	integer	`10`	Number of documents to sample
`seed`	integer	random	Random seed for reproducibility
`stratify_by`	string	none	Field for stratified sampling
`min_per_stratum`	integer	`1`	Minimum samples per stratum
`preserve_top_k`	integer	`0`	Always keep top K by score, sample the rest
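The table does not spell out how `preserve_top_k` interacts with `count`. The Python sketch below shows one plausible reading, under the assumption that preserved documents count toward `count`. The function name `sample_with_preserved_top_k` and the document shape are illustrative, not part of the stage's API.

```python
import random

def sample_with_preserved_top_k(docs, count, preserve_top_k=0, seed=None):
    """Keep the top-K documents by score, then randomly sample the rest.

    Assumption (not confirmed by the docs): preserved documents count
    toward `count`, so the result never exceeds `count` documents.
    """
    rng = random.Random(seed)
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    keep = ranked[:min(preserve_top_k, count)]   # always-kept head
    rest = ranked[len(keep):]                    # candidates for sampling
    remaining = min(count - len(keep), len(rest))
    return keep + rng.sample(rest, remaining)

docs = [{"document_id": f"doc_{i}", "score": i / 10} for i in range(10)]
result = sample_with_preserved_top_k(docs, count=5, preserve_top_k=2, seed=42)
print([d["document_id"] for d in result])
```

The first two entries are always the highest-scoring documents; only the remaining three slots are filled at random.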

Sampling Strategies

Strategy	Description	Best For
`random`	Uniform random selection	General sampling
`stratified`	Proportional samples per group	Category balance
`reservoir`	Memory-efficient sampling	Very large result sets

Configuration Examples

Basic random sample of 20 documents:

{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "random",
      "count": 20
    }
  }
}

Reproducible random sample with a fixed seed:

{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "random",
      "count": 10,
      "seed": 42
    }
  }
}

Stratified sample by category with a per-group minimum:

{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "stratified",
      "stratify_by": "metadata.category",
      "count": 30,
      "min_per_stratum": 3
    }
  }
}

Reservoir sample for very large result sets:

{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "reservoir",
      "count": 25
    }
  }
}

Stratified sample by source with default proportional allocation:

{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "stratified",
      "stratify_by": "metadata.source",
      "count": 30
    }
  }
}
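If you generate these stage definitions programmatically, a small builder keeps the boilerplate in one place. The sketch below is a hypothetical helper (`sample_stage` is not part of any SDK); it only reproduces the JSON shape used in the examples above, and its validation rules are choices made for this sketch.

```python
import json

def sample_stage(strategy="random", count=10, **extra):
    """Build a Sample stage definition shaped like the examples above.

    Hypothetical helper: the validation below is this sketch's own choice.
    """
    allowed = {"seed", "stratify_by", "min_per_stratum", "preserve_top_k"}
    unknown = set(extra) - allowed
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    if strategy not in {"random", "stratified", "reservoir"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if strategy == "stratified" and "stratify_by" not in extra:
        raise ValueError("stratified sampling needs `stratify_by`")
    return {
        "stage_name": "sample",
        "stage_type": "reduce",
        "config": {
            "stage_id": "sample",
            "parameters": {"strategy": strategy, "count": count, **extra},
        },
    }

stage = sample_stage(
    "stratified", count=30, stratify_by="metadata.category", min_per_stratum=3
)
print(json.dumps(stage, indent=2))
```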

How Sampling Works

Random Sampling

Selects documents with uniform probability:

Input: [A, B, C, D, E, F, G, H, I, J] (10 docs)
Sample(count=3): [D, G, B] (random selection)
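For intuition, uniform sampling without replacement can be sketched in a few lines of Python. This is an illustration using the standard library, not the stage's actual implementation; `random_sample` is a name chosen for this example.

```python
import random

def random_sample(docs, count, seed=None):
    """Pick `count` documents uniformly at random, without replacement."""
    rng = random.Random(seed)      # a private generator, so global state is untouched
    k = min(count, len(docs))      # never ask for more than is available
    return rng.sample(docs, k)

docs = list("ABCDEFGHIJ")
picked = random_sample(docs, count=3, seed=42)
print(picked)
```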

Stratified Sampling

Ensures representation from each group:

Input:
  - Category A: [A1, A2, A3, A4, A5]
  - Category B: [B1, B2, B3]
  - Category C: [C1, C2]

Stratified(min_per_stratum=2):
  [A1, A3, B2, B1, C1, C2]
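The example above can be reproduced with a short Python sketch: group documents by the stratification field, then draw from each group separately. The helpers `get_field` and `sample_per_stratum` are hypothetical, and treating `stratify_by` as a dotted path into the document is an assumption based on values such as `metadata.category`.

```python
import random
from collections import defaultdict

def get_field(doc, path):
    """Resolve a dotted path such as "metadata.category"."""
    value = doc
    for key in path.split("."):
        value = value[key]
    return value

def sample_per_stratum(docs, stratify_by, per_stratum, seed=None):
    """Draw up to `per_stratum` documents from every group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for doc in docs:
        groups[get_field(doc, stratify_by)].append(doc)
    sampled = []
    for stratum in sorted(groups):           # fixed order keeps runs comparable
        members = groups[stratum]
        sampled.extend(rng.sample(members, min(per_stratum, len(members))))
    return sampled

# Same shape as the example: 5 docs in A, 3 in B, 2 in C.
docs = [
    {"document_id": f"{cat}{i}", "metadata": {"category": cat}}
    for cat, size in (("A", 5), ("B", 3), ("C", 2))
    for i in range(1, size + 1)
]
picked = sample_per_stratum(docs, "metadata.category", per_stratum=2, seed=7)
print([d["document_id"] for d in picked])
```

Every category contributes two documents, so the small category C is fully represented even though it makes up only a fifth of the input.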

Reservoir Sampling

Memory-efficient uniform sampling for very large or streaming result sets:

Input: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] (processed one at a time)
Reservoir(count=3): [2, 6, 9] (uniform random, single pass)
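A common way to get a uniform sample in a single pass is Algorithm R, the classic reservoir method. The Python sketch below illustrates that algorithm; whether the stage uses this exact variant is an assumption.

```python
import random

def reservoir_sample(stream, count, seed=None):
    """Algorithm R: uniform sample of `count` items from a stream, one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < count:
            reservoir.append(item)     # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < count:
                reservoir[j] = item    # replace with probability count / (i + 1)
    return reservoir

picked = reservoir_sample(iter(range(1, 11)), count=3, seed=42)
print(picked)
```

Only `count` items are held in memory at any time, which is why this approach suits inputs whose total size is unknown or too large to buffer.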

Output Schema

{
  "documents": [
    {
      "document_id": "doc_123",
      "content": "Sampled document content...",
      "score": 0.85,
      "sample": {
        "method": "stratified",
        "stratum": "electronics",
        "sample_index": 0
      }
    }
  ],
  "metadata": {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
      "electronics": {"input": 45, "sampled": 5},
      "clothing": {"input": 35, "sampled": 5},
      "books": {"input": 20, "sampled": 5}
    }
  }
}
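The `strata` block in `metadata` makes it easy to check how heavily each group was sampled. The Python sketch below uses only the field names from the schema above; the helper `sampling_rates` is hypothetical.

```python
# Metadata copied from the schema example above.
metadata = {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
        "electronics": {"input": 45, "sampled": 5},
        "clothing": {"input": 35, "sampled": 5},
        "books": {"input": 20, "sampled": 5},
    },
}

def sampling_rates(metadata):
    """Fraction of each stratum that made it into the sample."""
    return {
        name: stats["sampled"] / stats["input"]
        for name, stats in metadata["strata"].items()
    }

rates = sampling_rates(metadata)
total_sampled = sum(s["sampled"] for s in metadata["strata"].values())
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

Equal counts per stratum imply unequal sampling rates, which matters if you later extrapolate statistics from the sample back to the full result set.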

Performance

Metric	Value
Latency	< 5ms
Memory	O(N)
Cost	Free
Complexity	O(N) random, O(N log N) stratified

Common Pipeline Patterns

Search + Sample for Testing

[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 1000 }
        ],
        "final_top_k": 1000
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 50,
        "seed": 42
      }
    }
  }
]

Balanced Category Sample

[
  {
    "stage_name": "hybrid_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 500 }
        ],
        "final_top_k": 500
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.category",
        "count": 25,
        "min_per_stratum": 5
      }
    }
  }
]

Sample Before LLM Processing

[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 200 }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "top_k": 50
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 10
      }
    }
  },
  {
    "stage_name": "llm_enrichment",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "model": "gpt-4o",
        "prompt": "Extract key insights",
        "output_field": "insights"
      }
    }
  }
]

Cluster + Sample Representatives

[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 200 }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "n_clusters": 10
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "cluster.cluster_id",
        "count": 20,
        "min_per_stratum": 2
      }
    }
  }
]

Multi-Source Balanced Sample

[
  {
    "stage_name": "structured_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "conditions": {
          "field": "metadata.date",
          "operator": "gte",
          "value": "2024-01-01"
        }
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.source",
        "count": 30,
        "min_per_stratum": 10
      }
    }
  },
  {
    "stage_name": "summarize",
    "stage_type": "reduce",
    "config": {
      "stage_id": "summarize",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o",
        "prompt": "Compare perspectives from different sources on this topic\n\n{{DOCUMENTS}}"
      }
    }
  }
]

Stratified Sampling Details

Minimum Per Stratum

{
  "strategy": "stratified",
  "stratify_by": "metadata.category",
  "count": 30,
  "min_per_stratum": 5
}

Each group is guaranteed at least 5 samples, or all of its documents if it has fewer than 5. The remaining slots, up to `count`, are allocated proportionally.
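One way to turn this rule into per-group quotas is sketched below in Python. The details are assumptions made for illustration: leftover slots are split in proportion to each group's remaining documents using largest-remainder rounding, and the per-group minimum wins if the minimums alone exceed `count`. The function name `allocate` is hypothetical.

```python
def allocate(sizes, count, min_per_stratum=1):
    """Compute how many documents to draw from each stratum.

    Assumptions of this sketch: minimums are honored first, then leftover
    slots are split in proportion to each group's remaining documents.
    """
    # Step 1: every group gets its minimum, capped at the group's size.
    quota = {name: min(min_per_stratum, n) for name, n in sizes.items()}
    capacity = {name: sizes[name] - quota[name] for name in sizes}
    total_capacity = sum(capacity.values())
    remaining = min(count - sum(quota.values()), total_capacity)
    if remaining <= 0:
        return quota
    # Step 2: proportional split with integer arithmetic (no float rounding).
    parts = {
        name: divmod(remaining * cap, total_capacity)
        for name, cap in capacity.items()
    }
    for name, (whole, _) in parts.items():
        quota[name] += whole
    # Step 3: hand out slots lost to rounding, largest remainder first.
    leftover = remaining - sum(whole for whole, _ in parts.values())
    by_remainder = sorted(parts, key=lambda n: parts[n][1], reverse=True)
    for name in by_remainder[:leftover]:
        quota[name] += 1
    return quota

sizes = {"electronics": 45, "clothing": 35, "books": 20}
quotas = allocate(sizes, count=30, min_per_stratum=5)
print(quotas)  # {'electronics': 12, 'clothing': 10, 'books': 8}
```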

Proportional Allocation

{
  "strategy": "stratified",
  "stratify_by": "metadata.category",
  "count": 30
}

By default, stratified sampling allocates samples in proportion to each group's size, subject to `min_per_stratum` (default `1`).

Reproducibility

Set `seed` to get reproducible results:

{
  "strategy": "random",
  "count": 20,
  "seed": 12345
}

The same seed applied to the same input produces the same output.
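The guarantee depends on the input being identical, and in many implementations that includes document order. The Python sketch below illustrates this with the standard library's generator; the stage's own random number generator may behave differently, so treat the ordering caveat as something to verify rather than a documented property.

```python
import random

def seeded_sample(docs, count, seed):
    """Uniform sample driven entirely by `seed`."""
    return random.Random(seed).sample(docs, min(count, len(docs)))

docs = [f"doc_{i}" for i in range(100)]

first = seeded_sample(docs, 20, seed=12345)
second = seeded_sample(docs, 20, seed=12345)
print(first == second)      # same seed, same input order

# Same documents and seed, but the upstream stage returned them reversed.
reordered = seeded_sample(list(reversed(docs)), 20, seed=12345)
print(first == reordered)   # the selection follows positions, not document IDs
```

If an upstream stage can return results in a nondeterministic order, sorting by a stable key such as `document_id` before sampling is one way to keep seeded runs comparable.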

Error Handling

Condition	Behavior
`count` > input size	Return all documents
Empty stratum	Skip that stratum
Invalid `stratify_by`	Fall back to random sampling
`count` = 0	Return an empty result
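These rules can be expressed as a small pre-check that decides the effective strategy and sample size before any sampling happens. The Python sketch below is illustrative only: `plan_sampling` and `resolve` are hypothetical names, and treating "invalid `stratify_by`" as "the field resolves on no document" is an assumption.

```python
def resolve(doc, path):
    """Follow a dotted path; return None if any segment is missing."""
    value = doc
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def plan_sampling(docs, count, strategy="random", stratify_by=None):
    """Return (effective_strategy, effective_count) per the table above."""
    # count > input returns everything; count = 0 returns nothing.
    effective_count = max(0, min(count, len(docs)))
    if strategy == "stratified":
        usable = bool(stratify_by) and any(
            resolve(d, stratify_by) is not None for d in docs
        )
        if not usable:
            strategy = "random"   # invalid stratify_by: fall back to random
    return strategy, effective_count

docs = [
    {"document_id": "a", "metadata": {"category": "books"}},
    {"document_id": "b", "metadata": {"category": "clothing"}},
    {"document_id": "c", "metadata": {}},
]
print(plan_sampling(docs, 50, "stratified", "metadata.category"))
print(plan_sampling(docs, 2, "stratified", "metadata.missing"))
print(plan_sampling(docs, 0))
```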

Sample vs Other Reduction Stages

Stage	Selection Basis	Deterministic
`sample`	Random/Stratified	With seed
`limit`	Position	Yes
`mmr`	Diversity + relevance	Yes
`deduplicate`	Uniqueness	Yes

Related

Aggregate - Statistical analysis
Group By - Group before sampling
Cluster - Semantic grouping
MMR - Diversity-based selection
