The Sample stage selects a subset of documents from your results using random, stratified, or reservoir sampling. This is useful for creating representative samples, reducing result sets, or ensuring diversity across categories.
Stage Category: REDUCE (Samples documents)
Transformation: N documents → S sampled documents (where S ≤ N)
When to Use
| Use Case | Description |
| --- | --- |
| Representative samples | Get a sample of large result sets |
| A/B testing | Random document selection |
| Stratified selection | Equal representation per category |
| Cost reduction | Sample before expensive operations |
When NOT to Use
| Scenario | Recommended Alternative |
| --- | --- |
| Top-N by relevance | limit or rerank |
| Diversity by similarity | mmr |
| Remove duplicates | deduplicate |
| All results needed | Skip sampling |
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| strategy | string | random | Sampling strategy: random, stratified, reservoir |
| count | integer | 10 | Number of documents to sample |
| seed | integer | random | Random seed for reproducibility |
| stratify_by | string | none | Field for stratified sampling |
| min_per_stratum | integer | 1 | Minimum samples per stratum |
| preserve_top_k | integer | 0 | Always keep top K by score, sample the rest |
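The preserve_top_k description can be read as "rank by score, keep the first K, sample from whatever is left". The Python sketch below illustrates that reading only. The helper name sample_with_preserved_top_k is hypothetical, and treating count as the total output size (preserved documents included) is an assumption, not documented behavior.

```python
import random

def sample_with_preserved_top_k(docs, count, preserve_top_k=0, seed=None):
    """Keep the top-K documents by score, then fill the remaining slots at random.

    Assumption: `count` is the total output size, preserved documents included.
    """
    rng = random.Random(seed)
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    kept = ranked[:min(preserve_top_k, count)]      # always-kept head of the ranking
    rest = ranked[len(kept):]                       # candidates for random sampling
    remaining = min(count - len(kept), len(rest))   # never ask for more than exists
    return kept + rng.sample(rest, remaining)

# Ten documents with scores 0.0 .. 0.9
docs = [{"document_id": f"doc_{i}", "score": i / 10} for i in range(10)]
result = sample_with_preserved_top_k(docs, count=5, preserve_top_k=2, seed=42)
```

Here the two highest-scoring documents always lead the result, and the other three slots are drawn at random from the remaining eight.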
Sampling Strategies
| Strategy | Description | Best For |
| --- | --- | --- |
| random | Uniform random selection | General sampling |
| stratified | Proportional samples per group | Category balance |
| reservoir | Memory-efficient sampling | Very large result sets |
Configuration Examples
Random Sample
{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "random",
      "count": 20
    }
  }
}
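When stage definitions are generated from code rather than written by hand, a small builder can catch typos in the strategy name early. The following is a sketch only: build_sample_stage is a hypothetical helper, not part of any SDK, and the validation rules are assumptions drawn from the parameter table above.

```python
import json

VALID_STRATEGIES = {"random", "stratified", "reservoir"}

def build_sample_stage(strategy="random", count=10, **extra):
    """Return a Sample stage definition as a plain dict (hypothetical helper)."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if count < 0:
        raise ValueError("count must be non-negative")
    # Extra keyword arguments (seed, stratify_by, ...) pass through unchanged.
    parameters = {"strategy": strategy, "count": count, **extra}
    return {
        "stage_name": "sample",
        "stage_type": "reduce",
        "config": {"stage_id": "sample", "parameters": parameters},
    }

stage = build_sample_stage("random", count=20)
print(json.dumps(stage, indent=2))
```

Calling build_sample_stage("random", count=20) produces the same structure as the Random Sample configuration shown above.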
How Sampling Works
Random Sampling
Selects documents with uniform probability:
Input: [A, B, C, D, E, F, G, H, I, J] (10 docs)
Sample(count=3): [D, G, B] (random selection)
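Uniform selection of this kind can be sketched in a few lines of Python. This is an illustration of the idea, not the stage's implementation; the function name random_sample is hypothetical, and sampling without replacement is an assumption based on the example output.

```python
import random

def random_sample(docs, count, seed=None):
    """Pick `count` documents uniformly at random, without replacement."""
    rng = random.Random(seed)
    # Clamp so that asking for more than is available returns everything.
    return rng.sample(docs, min(count, len(docs)))

docs = list("ABCDEFGHIJ")
picked = random_sample(docs, count=3)
print(picked)  # three distinct letters; the exact choice varies per run
```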
Stratified Sampling
Ensures representation from each group:
Input:
- Category A: [A1, A2, A3, A4, A5]
- Category B: [B1, B2, B3]
- Category C: [C1, C2]
Stratified(count=6, min_per_stratum=2):
[A1, A3, B2, B1, C1, C2]
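The six-document outcome above (two per category) can be reproduced with a short grouping sketch. It covers only the per-stratum minimum, not proportional allocation of any remainder. The function name is hypothetical, and a flat category key stands in for a dotted path such as metadata.category to keep the example small.

```python
import random
from collections import defaultdict

def stratified_sample(docs, stratify_by, min_per_stratum=1, seed=None):
    """Take up to `min_per_stratum` random documents from every group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[doc[stratify_by]].append(doc)   # bucket documents by field value
    sampled = []
    for members in strata.values():
        take = min(min_per_stratum, len(members))  # small groups give what they have
        sampled.extend(rng.sample(members, take))
    return sampled

docs = (
    [{"id": f"A{i}", "category": "A"} for i in range(1, 6)]
    + [{"id": f"B{i}", "category": "B"} for i in range(1, 4)]
    + [{"id": f"C{i}", "category": "C"} for i in range(1, 3)]
)
picked = stratified_sample(docs, "category", min_per_stratum=2)
```

Every category contributes exactly two documents here because each group holds at least two.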
Reservoir Sampling
Memory-efficient uniform sampling for very large or streaming result sets:
Input: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] (processed one at a time)
Reservoir(count=3): [2, 6, 9] (uniform random, single pass)
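A classic way to get a uniform sample in a single pass with O(count) memory is Algorithm R. Whether the stage uses this exact algorithm is an assumption; the sketch below only shows the technique, with a hypothetical function name.

```python
import random

def reservoir_sample(stream, count, seed=None):
    """Algorithm R: uniform sample of `count` items in one pass over `stream`."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < count:
            # Fill the reservoir with the first `count` items.
            reservoir.append(item)
        else:
            # Keep the new item with probability count / (index + 1)
            # by drawing a slot from 0..index inclusive.
            slot = rng.randint(0, index)
            if slot < count:
                reservoir[slot] = item
    return reservoir

# The input is an iterator, so it is consumed one item at a time.
picked = reservoir_sample(iter(range(1, 11)), count=3)
```

Only the reservoir is held in memory, which is why the approach suits very large or streaming inputs.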
Output Schema
{
  "documents": [
    {
      "document_id": "doc_123",
      "content": "Sampled document content...",
      "score": 0.85,
      "sample": {
        "method": "stratified",
        "stratum": "electronics",
        "sample_index": 0
      }
    }
  ],
  "metadata": {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
      "electronics": { "input": 45, "sampled": 5 },
      "clothing": { "input": 35, "sampled": 5 },
      "books": { "input": 20, "sampled": 5 }
    }
  }
}
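The metadata block makes it easy to check how heavily each stratum was sampled. The Python sketch below parses the metadata from the example above and computes per-stratum sampling rates; the helper name stratum_rates is hypothetical and the field names are taken from the schema as shown.

```python
import json

raw = """
{
  "method": "stratified",
  "total_input": 100,
  "sample_size": 15,
  "strata": {
    "electronics": {"input": 45, "sampled": 5},
    "clothing": {"input": 35, "sampled": 5},
    "books": {"input": 20, "sampled": 5}
  }
}
"""

def stratum_rates(metadata):
    """Return the fraction of each stratum that ended up in the sample."""
    return {
        name: stats["sampled"] / stats["input"]
        for name, stats in metadata.get("strata", {}).items()
        if stats["input"] > 0  # guard against division by zero
    }

metadata = json.loads(raw)
rates = stratum_rates(metadata)
total_sampled = sum(s["sampled"] for s in metadata["strata"].values())
```

In this example the smaller books stratum is sampled at a higher rate than electronics, which is the expected effect of a per-stratum minimum.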
Performance
| Metric | Value |
| --- | --- |
| Latency | < 5ms |
| Memory | O(N) |
| Cost | Free |
| Complexity | O(N) random, O(N log N) stratified |
Common Pipeline Patterns
Search + Sample for Testing
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
            "top_k": 1000
          }
        ],
        "final_top_k": 1000
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 50,
        "seed": 42
      }
    }
  }
]
Balanced Category Sample
[
  {
    "stage_name": "hybrid_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
            "top_k": 500
          }
        ],
        "final_top_k": 500
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.category",
        "count": 25,
        "min_per_stratum": 5
      }
    }
  }
]
Sample Before LLM Processing
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
            "top_k": 200
          }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "top_k": 50
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 10
      }
    }
  },
  {
    "stage_name": "llm_enrichment",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "model": "gpt-4o",
        "prompt": "Extract key insights",
        "output_field": "insights"
      }
    }
  }
]
Cluster + Sample Representatives
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
            "top_k": 200
          }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "n_clusters": 10
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "cluster.cluster_id",
        "count": 20,
        "min_per_stratum": 2
      }
    }
  }
]
Multi-Source Balanced Sample
[
  {
    "stage_name": "structured_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "conditions": {
          "field": "metadata.date",
          "operator": "gte",
          "value": "2024-01-01"
        }
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.source",
        "count": 30,
        "min_per_stratum": 10
      }
    }
  },
  {
    "stage_name": "summarize",
    "stage_type": "reduce",
    "config": {
      "stage_id": "summarize",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o",
        "prompt": "Compare perspectives from different sources on this topic \n\n {{DOCUMENTS}}"
      }
    }
  }
]
Stratified Sampling Details
Minimum Per Stratum
{
  "strategy": "stratified",
  "stratify_by": "metadata.category",
  "count": 30,
  "min_per_stratum": 5
}
Each group is guaranteed at least 5 samples (if available), with the remainder allocated proportionally up to count.
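To make the arithmetic concrete, the sketch below first hands every stratum its minimum and then splits whatever is left of count with largest-remainder rounding. Several points are assumptions rather than documented behavior: the function name allocate, the choice to split the remainder in proportion to each stratum's remaining documents, and letting the minimums win when they already exceed count.

```python
def allocate(strata_sizes, count, min_per_stratum=1):
    """Return how many documents to draw from each stratum."""
    # Step 1: guaranteed minimum, capped by what each stratum actually holds.
    alloc = {name: min(min_per_stratum, size) for name, size in strata_sizes.items()}
    capacity = {name: strata_sizes[name] - alloc[name] for name in strata_sizes}
    total_capacity = sum(capacity.values())
    remaining = min(count - sum(alloc.values()), total_capacity)
    if remaining <= 0:
        return alloc  # minimums already use up (or exceed) the budget

    # Step 2: proportional split of the remainder, using exact integer maths.
    fractions = {}
    handed_out = 0
    for name, cap in capacity.items():
        whole, fractions[name] = divmod(remaining * cap, total_capacity)
        alloc[name] += whole
        handed_out += whole

    # Step 3: leftover units go to the largest fractional remainders.
    leftover = remaining - handed_out
    for name in sorted(fractions, key=fractions.get, reverse=True)[:leftover]:
        alloc[name] += 1
    return alloc

sizes = {"electronics": 45, "clothing": 35, "books": 20}
plan = allocate(sizes, count=30, min_per_stratum=5)
print(plan)  # {'electronics': 12, 'clothing': 10, 'books': 8}
```

With count=15 and the same minimum, the budget is consumed entirely by the guarantees and every stratum receives exactly 5. With min_per_stratum=0 the function reduces to plain proportional allocation.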
Proportional Allocation
{
  "strategy": "stratified",
  "stratify_by": "metadata.category",
  "count": 30
}
Stratified sampling allocates samples proportional to group size by default.
Reproducibility
Use seed for reproducible results:
{
  "strategy": "random",
  "count": 20,
  "seed": 12345
}
Same seed + same input = same output.
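The guarantee is the usual property of a seeded pseudo-random generator. A minimal Python sketch, assuming the stage behaves like a privately seeded generator (the function name seeded_sample is hypothetical):

```python
import random

def seeded_sample(docs, count, seed):
    """Sample with a private generator so global random state is untouched."""
    rng = random.Random(seed)
    return rng.sample(docs, min(count, len(docs)))

docs = [f"doc_{i}" for i in range(100)]
first = seeded_sample(docs, 20, seed=12345)
second = seeded_sample(docs, 20, seed=12345)
print(first == second)  # True
```

In this sketch selection is position-based, so "same input" includes document order: if an upstream stage returns the same documents in a different order, the sample can change even with an identical seed.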
Error Handling
| Error | Behavior |
| --- | --- |
| count > input | Return all documents |
| Empty stratum | Skip that stratum |
| Invalid stratify_by | Fall back to random |
| count = 0 | Return empty |
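The fallback rules in the table can be expressed as two small guards applied before sampling. This is a sketch under assumptions: the helper names are hypothetical, and "invalid stratify_by" is interpreted here as a field that no input document exposes.

```python
def resolve_field(doc, path):
    """Follow a dotted path such as 'metadata.category'; return None if missing."""
    value = doc
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def choose_strategy(docs, strategy, stratify_by=None):
    """Fall back to random when stratification cannot be applied."""
    if strategy == "stratified":
        unusable = not stratify_by or all(
            resolve_field(doc, stratify_by) is None for doc in docs
        )
        if unusable:
            return "random"
    return strategy

def clamp_count(count, total):
    """count > input returns everything; count = 0 returns nothing."""
    return max(0, min(count, total))

docs = [{"metadata": {"category": "books"}}, {"metadata": {"category": "clothing"}}]
print(choose_strategy(docs, "stratified", "metadata.missing"))  # random
print(clamp_count(50, len(docs)))                               # 2
```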
Sample vs Other Reduction Stages
| Stage | Selection Basis | Deterministic |
| --- | --- | --- |
| sample | Random/Stratified | With seed |
| limit | Position | Yes |
| mmr | Diversity + relevance | Yes |
| deduplicate | Uniqueness | Yes |