Sample stage showing random and stratified sampling of results
The Sample stage selects a subset of documents from your results using random or stratified sampling. This is useful for creating representative samples, reducing result sets, or ensuring diversity across categories.
Stage Category: REDUCE (Samples documents)
Transformation: N documents → S sampled documents (where S ≤ N)

When to Use

| Use Case | Description |
| --- | --- |
| Representative samples | Get a sample of large result sets |
| A/B testing | Random document selection |
| Stratified selection | Equal representation per category |
| Cost reduction | Sample before expensive operations |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Top-N by relevance | limit or rerank |
| Diversity by similarity | mmr |
| Remove duplicates | deduplicate |
| All results needed | Skip sampling |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| strategy | string | random | Sampling strategy: random, stratified, reservoir |
| count | integer | 10 | Number of documents to sample |
| seed | integer | random | Random seed for reproducibility |
| stratify_by | string | none | Field for stratified sampling |
| min_per_stratum | integer | 1 | Minimum samples per stratum |
| preserve_top_k | integer | 0 | Always keep top K by score, sample the rest |
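
To make the preserve_top_k description concrete, here is a minimal Python sketch of "keep the top K by score, sample the rest". It is an illustration, not the stage's actual implementation. The function name `sample_with_preserved_top` is hypothetical, and the sketch assumes that count is the total output size (so preserved documents use up part of it), which the parameter table does not state.

```python
import random

def sample_with_preserved_top(docs, count, preserve_top_k=0, seed=None):
    """Keep the top-K documents by score, then randomly sample the rest."""
    rng = random.Random(seed)  # private RNG, global random state is untouched
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    kept, pool = ranked[:preserve_top_k], ranked[preserve_top_k:]
    # ASSUMPTION: count is the total output size, so kept documents consume part of it.
    slots = max(0, min(count - len(kept), len(pool)))
    return kept + rng.sample(pool, slots)

# Ten documents with distinct, descending scores (1.0, 0.9, ..., 0.1).
docs = [{"document_id": f"doc_{i}", "score": 1.0 - i / 10} for i in range(10)]
result = sample_with_preserved_top(docs, count=5, preserve_top_k=2, seed=7)
print([d["document_id"] for d in result])
```

In this sketch the two highest-scoring documents always lead the output, and the remaining three slots are filled at random from the other eight.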

Sampling Strategies

| Strategy | Description | Best For |
| --- | --- | --- |
| random | Uniform random selection | General sampling |
| stratified | Proportional samples per group | Category balance |
| reservoir | Memory-efficient sampling | Very large result sets |

Configuration Examples

{
  "stage_name": "sample",
  "stage_type": "reduce",
  "config": {
    "stage_id": "sample",
    "parameters": {
      "strategy": "random",
      "count": 20
    }
  }
}

How Sampling Works

Random Sampling

Selects documents with uniform probability:
Input: [A, B, C, D, E, F, G, H, I, J] (10 docs)
Sample(count=3): [D, G, B] (random selection)
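
The behavior above can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than the stage's real code: `random_sample` is a hypothetical helper, sampling is assumed to be without replacement, and the handling of count larger than the input follows the Error Handling table (return all documents).

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")

def random_sample(docs: Sequence[T], count: int, seed: Optional[int] = None) -> list:
    """Uniformly sample up to `count` documents without replacement."""
    rng = random.Random(seed)          # seeded RNG makes the draw repeatable
    k = max(0, min(count, len(docs)))  # count > input: return every document
    return rng.sample(list(docs), k)   # note: output order is random, not input order

docs = list("ABCDEFGHIJ")
picked = random_sample(docs, count=3, seed=42)
print(picked)
```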

Stratified Sampling

Ensures representation from each group:
Input:
  - Category A: [A1, A2, A3, A4, A5]
  - Category B: [B1, B2, B3]
  - Category C: [C1, C2]

Stratified(count=6, min_per_stratum=2):
  [A1, A3, B2, B1, C1, C2]
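
One way to read the example above is: group documents by the stratify field, reserve min_per_stratum slots in each group, spread any remaining slots proportionally to what is left in each group, then sample inside each group. The Python sketch below follows that reading. The function name, the flat `key` lookup (the stage itself takes a dotted path such as metadata.category), the largest-remainder rounding, and the choice to let the minimums win when they alone exceed count are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(docs, key, count, min_per_stratum=1, seed=None):
    """Sample per group: a guaranteed minimum first, then a proportional remainder."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[doc[key]].append(doc)

    # Step 1: reserve the minimum for each stratum, capped by its size.
    quota = {name: min(min_per_stratum, len(group)) for name, group in strata.items()}
    spare = {name: len(strata[name]) - quota[name] for name in strata}
    total_spare = sum(spare.values())
    remaining = min(max(0, count - sum(quota.values())), total_spare)

    # Step 2: split the remaining slots proportionally (largest-remainder rounding,
    # done in integer arithmetic so the quotas always add up exactly).
    if remaining:
        shares = {name: remaining * spare[name] for name in strata}
        for name in strata:
            quota[name] += shares[name] // total_spare
        leftover = remaining - sum(shares[n] // total_spare for n in strata)
        by_remainder = sorted(strata, key=lambda n: shares[n] % total_spare, reverse=True)
        for name in by_remainder[:leftover]:
            quota[name] += 1

    # Step 3: uniform sampling inside each stratum.
    sampled = []
    for name, group in strata.items():
        sampled.extend(rng.sample(group, quota[name]))
    return sampled

docs = [{"id": f"{c}{i}", "category": c}
        for c, n in (("A", 5), ("B", 3), ("C", 2)) for i in range(1, n + 1)]
sample = stratified_sample(docs, key="category", count=6, min_per_stratum=2, seed=1)
print(Counter(d["category"] for d in sample))
```

With count=6 and min_per_stratum=2 every slot is used by the minimums, so each category contributes exactly two documents, as in the example above.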

Reservoir Sampling

Memory-efficient uniform sampling for very large or streaming result sets:
Input: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] (processed one at a time)
Reservoir(count=3): [2, 6, 9] (uniform random, single pass)
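
The page does not say which reservoir algorithm the stage uses. As an illustration of single-pass uniform sampling, the sketch below implements the classic Algorithm R, which keeps only count items in memory regardless of stream length. The function name `reservoir_sample` is hypothetical.

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], count: int, seed: Optional[int] = None) -> list:
    """Algorithm R: uniform sample of `count` items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < count:
            reservoir.append(item)   # fill the reservoir with the first `count` items
        else:
            j = rng.randint(0, i)    # inclusive bounds: item i survives with prob. count/(i+1)
            if j < count:
                reservoir[j] = item  # evict a uniformly chosen resident
    return reservoir

# A generator stands in for a result set that is processed one document at a time.
picked = reservoir_sample((n for n in range(1, 11)), count=3, seed=42)
print(picked)
```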

Output Schema

{
  "documents": [
    {
      "document_id": "doc_123",
      "content": "Sampled document content...",
      "score": 0.85,
      "sample": {
        "method": "stratified",
        "stratum": "electronics",
        "sample_index": 0
      }
    }
  ],
  "metadata": {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
      "electronics": {"input": 45, "sampled": 5},
      "clothing": {"input": 35, "sampled": 5},
      "books": {"input": 20, "sampled": 5}
    }
  }
}
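
The metadata block can be used to check what the sampler actually did. The following Python sketch, which assumes only the field names shown in the schema above, computes the per-stratum sampling rate and confirms that the strata totals match total_input and sample_size. The helper name `sampling_rates` is hypothetical.

```python
import json

# Metadata copied from the Output Schema example; the documents list is omitted.
payload = json.loads("""
{
  "metadata": {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
      "electronics": {"input": 45, "sampled": 5},
      "clothing": {"input": 35, "sampled": 5},
      "books": {"input": 20, "sampled": 5}
    }
  }
}
""")

def sampling_rates(metadata):
    """Fraction of each stratum's input documents that ended up in the sample."""
    return {name: s["sampled"] / s["input"] for name, s in metadata["strata"].items()}

meta = payload["metadata"]
rates = sampling_rates(meta)
total_sampled = sum(s["sampled"] for s in meta["strata"].values())
total_input = sum(s["input"] for s in meta["strata"].values())
print(rates)
print(total_sampled == meta["sample_size"], total_input == meta["total_input"])
```

Equal sampled counts across strata of different sizes mean unequal sampling rates, which is worth keeping in mind when aggregating over the sample.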

Performance

| Metric | Value |
| --- | --- |
| Latency | < 5 ms |
| Memory | O(N) |
| Cost | Free |
| Complexity | O(N) random, O(N log N) stratified |

Common Pipeline Patterns

Search + Sample for Testing

[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 1000 }
        ],
        "final_top_k": 1000
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 50,
        "seed": 42
      }
    }
  }
]

Balanced Category Sample

[
  {
    "stage_name": "hybrid_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 500 }
        ],
        "final_top_k": 500
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.category",
        "count": 25,
        "min_per_stratum": 5
      }
    }
  }
]

Sample Before LLM Processing

[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 200 }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "rerank",
    "stage_type": "sort",
    "config": {
      "stage_id": "rerank",
      "parameters": {
        "inference_name": "BAAI__bge_reranker_v2_m3",
        "top_k": 50
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "random",
        "count": 10
      }
    }
  },
  {
    "stage_name": "llm_enrichment",
    "stage_type": "enrich",
    "config": {
      "stage_id": "llm_enrich",
      "parameters": {
        "model": "gpt-4o",
        "prompt": "Extract key insights",
        "output_field": "insights"
      }
    }
  }
]

Cluster + Sample Representatives

[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 200 }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "cluster",
    "stage_type": "group",
    "config": {
      "stage_id": "cluster",
      "parameters": {
        "n_clusters": 10
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "cluster.cluster_id",
        "count": 20,
        "min_per_stratum": 2
      }
    }
  }
]

Multi-Source Balanced Sample

[
  {
    "stage_name": "structured_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "conditions": {
          "field": "metadata.date",
          "operator": "gte",
          "value": "2024-01-01"
        }
      }
    }
  },
  {
    "stage_name": "sample",
    "stage_type": "reduce",
    "config": {
      "stage_id": "sample",
      "parameters": {
        "strategy": "stratified",
        "stratify_by": "metadata.source",
        "count": 30,
        "min_per_stratum": 10
      }
    }
  },
  {
    "stage_name": "summarize",
    "stage_type": "reduce",
    "config": {
      "stage_id": "summarize",
      "parameters": {
        "provider": "openai",
        "model_name": "gpt-4o",
        "prompt": "Compare perspectives from different sources on this topic\n\n{{DOCUMENTS}}"
      }
    }
  }
]

Stratified Sampling Details

Minimum Per Stratum

{
  "strategy": "stratified",
  "stratify_by": "metadata.category",
  "count": 30,
  "min_per_stratum": 5
}
Each group is guaranteed at least 5 samples (if available), with the remainder allocated proportionally up to count.

Proportional Allocation

{
  "strategy": "stratified",
  "stratify_by": "metadata.category",
  "count": 30
}
Stratified sampling allocates samples proportional to group size by default.
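
As a worked example of proportional allocation, the Python sketch below computes each group's ideal share as count × group size / total size, using the stratum sizes from the Output Schema example (45, 35, 20) and count=30. The helper name `proportional_shares` is hypothetical, and how the stage rounds fractional shares to whole documents is not documented here, so the sketch leaves them unrounded.

```python
def proportional_shares(stratum_sizes, count):
    """Ideal (possibly fractional) number of samples per stratum."""
    total = sum(stratum_sizes.values())
    return {name: count * size / total for name, size in stratum_sizes.items()}

sizes = {"electronics": 45, "clothing": 35, "books": 20}
shares = proportional_shares(sizes, count=30)
print(shares)  # {'electronics': 13.5, 'clothing': 10.5, 'books': 6.0}
```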

Reproducibility

Use seed for reproducible results:
{
  "strategy": "random",
  "count": 20,
  "seed": 12345
}
Same seed + same input = same output.
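
The "same input" condition is worth illustrating. In the Python sketch below, which uses the standard library's `random.Random` as a stand-in for the stage's sampler (an assumption, since the actual random number generator is not documented), the same seed reproduces the same sample, and sorting by document ID first makes the result independent of the order in which documents arrive.

```python
import random

def sample_ids(ids, count, seed):
    """Seeded uniform sample; the result depends on the seed and on input order."""
    return random.Random(seed).sample(ids, count)

ids = [f"doc_{i:03d}" for i in range(100)]
shuffled = list(ids)
random.Random(0).shuffle(shuffled)  # same documents, different arrival order

first = sample_ids(ids, 20, seed=12345)
second = sample_ids(ids, 20, seed=12345)
print(first == second)  # True: same seed + same input

# Normalizing the order before sampling removes the dependence on arrival order.
stable_a = sample_ids(sorted(ids), 20, seed=12345)
stable_b = sample_ids(sorted(shuffled), 20, seed=12345)
print(stable_a == stable_b)  # True
```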

Error Handling

| Error | Behavior |
| --- | --- |
| count > input | Return all documents |
| Empty stratum | Skip that stratum |
| Invalid stratify_by | Fall back to random |
| count = 0 | Return empty |

Sample vs Other Reduction Stages

| Stage | Selection Basis | Deterministic |
| --- | --- | --- |
| sample | Random/Stratified | With seed |
| limit | Position | Yes |
| mmr | Diversity + relevance | Yes |
| deduplicate | Uniqueness | Yes |