The Group By stage aggregates documents that share the same value for a specified field, creating logical groups. This is useful for organizing results by category, author, date, or any other attribute.
Stage Category: GROUP (Groups documents)
Transformation: N documents → G groups (where G = number of unique field values)
When to Use
| Use Case | Description |
| --- | --- |
| Category grouping | Group products by category |
| Author aggregation | Group articles by author |
| Date grouping | Group by day/month/year |
| Source organization | Group by data source |
When NOT to Use
| Scenario | Recommended Alternative |
| --- | --- |
| Semantic similarity grouping | cluster |
| Statistical aggregations only | aggregate |
| Removing duplicates | deduplicate |
| Top-N per group | Use with sample |
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| group_by_field | string | source_object_id | Field to group by (dot notation supported) |
| max_per_group | integer | 10 | Maximum documents to keep per group |
| output_mode | string | all | first (top doc per group), all (grouped), flatten (flat list) |
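The dot notation accepted by group_by_field can be pictured as a walk through nested objects. The sketch below is only an illustration of that idea: `get_by_path` is a hypothetical helper, not part of the stage's API, and it assumes documents are plain nested dictionaries.

```python
from typing import Any


def get_by_path(doc: dict, path: str, default: Any = None) -> Any:
    """Resolve a dot-separated path such as 'metadata.category' in a nested dict."""
    current: Any = doc
    for part in path.split("."):
        # Stop as soon as the path leaves the dict structure or a key is absent.
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


doc = {"document_id": "doc_123", "metadata": {"category": "electronics", "price": 999}}

category = get_by_path(doc, "metadata.category")  # nested lookup
missing = get_by_path(doc, "metadata.brand")      # absent key falls back to the default
```

Returning a default instead of raising mirrors the documented behaviour of collecting documents without the field under a null key.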
Configuration Examples
Basic Group By
{
  "stage_name": "group_by",
  "stage_type": "group",
  "config": {
    "stage_id": "group_by",
    "parameters": {
      "group_by_field": "metadata.category"
    }
  }
}
Output Schema
{
  "groups": [
    {
      "key": "electronics",
      "count": 25,
      "documents": [
        {
          "document_id": "doc_123",
          "content": "Latest smartphone review...",
          "score": 0.95,
          "metadata": { "category": "electronics", "price": 999 }
        },
        {
          "document_id": "doc_456",
          "content": "Laptop comparison guide...",
          "score": 0.89,
          "metadata": { "category": "electronics", "price": 1299 }
        }
      ]
    },
    {
      "key": "clothing",
      "count": 18,
      "documents": [ ... ]
    }
  ],
  "metadata": {
    "total_groups": 5,
    "total_documents": 100,
    "field": "metadata.category"
  }
}
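As a rough illustration of how a client might read this structure, the sketch below parses a trimmed response and picks the first listed document of each group. The sample payload is made up for the example (including doc_789 and the counts), and `top_document_ids` is a hypothetical helper rather than part of any SDK.

```python
import json

# Trimmed, made-up payload that follows the shape shown above.
response = json.loads("""
{
  "groups": [
    {"key": "electronics", "count": 2, "documents": [
      {"document_id": "doc_123", "score": 0.95},
      {"document_id": "doc_456", "score": 0.89}
    ]},
    {"key": "clothing", "count": 1, "documents": [
      {"document_id": "doc_789", "score": 0.80}
    ]}
  ],
  "metadata": {"total_groups": 2, "total_documents": 3, "field": "metadata.category"}
}
""")


def top_document_ids(resp: dict) -> dict:
    """Map each group key to the document_id of its first listed document."""
    return {
        group["key"]: group["documents"][0]["document_id"]
        for group in resp["groups"]
        if group["documents"]  # skip groups that carry no documents
    }


top_ids = top_document_ids(response)
```

Because documents inside a group arrive sorted by score, the first entry is the highest-scoring one for that key.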
Performance
| Metric | Value |
| --- | --- |
| Latency | 5-20ms |
| Memory | O(N) |
| Cost | Free |
| Scalability | Efficient |
Common Pipeline Patterns
Search + Group by Category
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
            "top_k": 100
          }
        ],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "group_by",
    "stage_type": "group",
    "config": {
      "stage_id": "group_by",
      "parameters": {
        "group_by_field": "metadata.category",
        "max_per_group": 5
      }
    }
  }
]
Grouped Results with Aggregations
[
  {
    "stage_name": "hybrid_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
            "top_k": 200
          }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "group_by",
    "stage_type": "group",
    "config": {
      "stage_id": "group_by",
      "parameters": {
        "group_by_field": "metadata.brand",
        "max_per_group": 10
      }
    }
  },
  {
    "stage_name": "aggregate",
    "stage_type": "reduce",
    "config": {
      "stage_id": "aggregate",
      "parameters": {
        "aggregations": [
          { "type": "avg", "field": "metadata.price", "name": "avg_price" },
          { "type": "avg", "field": "metadata.rating", "name": "avg_rating" }
        ],
        "group_by": "metadata.brand"
      }
    }
  }
]
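To make the intent of the two avg aggregations concrete, here is a small local sketch of a per-brand average. It assumes the aggregate stage computes one value per distinct metadata.brand, which the configuration suggests but does not spell out. `average_per_brand` and the sample documents are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample documents carrying the fields used in the pipeline above.
docs = [
    {"metadata": {"brand": "acme", "price": 100.0, "rating": 4.0}},
    {"metadata": {"brand": "acme", "price": 300.0, "rating": 5.0}},
    {"metadata": {"brand": "globex", "price": 50.0, "rating": 3.0}},
]


def average_per_brand(documents: list, value_key: str) -> dict:
    """Average metadata[value_key] over documents that share metadata.brand."""
    buckets = defaultdict(list)
    for d in documents:
        meta = d.get("metadata") or {}
        if value_key in meta:  # documents without the value are left out of the average
            buckets[meta.get("brand")].append(meta[value_key])
    return {brand: mean(values) for brand, values in buckets.items()}


avg_price = average_per_brand(docs, "price")
avg_rating = average_per_brand(docs, "rating")
```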
Author-Grouped Search
[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          {
            "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
            "query": { "input_mode": "text", "value": "{{INPUT.query}}" },
            "top_k": 100
          }
        ],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "document_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "document_enrich",
      "parameters": {
        "target_collection_id": "authors",
        "target_field": "author_id",
        "source_field": "metadata.author_id",
        "output_field": "author"
      }
    }
  },
  {
    "stage_name": "group_by",
    "stage_type": "group",
    "config": {
      "stage_id": "group_by",
      "parameters": {
        "group_by_field": "author.name",
        "max_per_group": 3
      }
    }
  }
]
Time-Based Grouping
[
  {
    "stage_name": "structured_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "conditions": {
          "field": "metadata.date",
          "operator": "gte",
          "value": "2024-01-01"
        }
      }
    }
  },
  {
    "stage_name": "code_execution",
    "stage_type": "apply",
    "config": {
      "stage_id": "code_execution",
      "parameters": {
        "code": "def transform(doc):\n    meta = doc.setdefault('metadata', {})\n    date = meta.get('date') or ''\n    meta['month'] = str(date)[:7]  # YYYY-MM\n    return doc"
      }
    }
  },
  {
    "stage_name": "group_by",
    "stage_type": "group",
    "config": {
      "stage_id": "group_by",
      "parameters": {
        "group_by_field": "metadata.month"
      }
    }
  }
]
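The month-derivation step in the pipeline above can be tried locally before it is embedded as a string. The following is a defensive variant of that transform, written as a sketch: it assumes metadata.date is an ISO-style string beginning with YYYY-MM, and it treats a missing or null date as an empty string so the function never raises.

```python
def transform(doc: dict) -> dict:
    """Copy the YYYY-MM prefix of metadata.date into metadata.month."""
    meta = doc.setdefault("metadata", {})  # create metadata if the document lacks it
    date = meta.get("date") or ""          # missing or null date becomes ""
    meta["month"] = str(date)[:7]          # "2024-03-15" -> "2024-03"
    return doc


with_date = transform({"metadata": {"date": "2024-03-15"}})
without_date = transform({"document_id": "doc_1"})
```

Documents without a date end up with an empty month value, so they fall into a single group of their own rather than failing the stage.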
Document Sorting Within Groups
Documents within each group are automatically sorted by relevance score (highest first), then limited to max_per_group.
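The sort-then-limit behaviour described above can be sketched in a few lines. This is an illustrative model only, not the stage's actual implementation: `group_sorted` is a hypothetical function, and the sample documents use a flat category field for brevity.

```python
from collections import defaultdict


def group_sorted(docs: list, field: str, max_per_group: int = 10) -> dict:
    """Group docs by a top-level field, sort each group by score, keep the top entries."""
    groups = defaultdict(list)
    for d in docs:
        groups[d.get(field)].append(d)
    return {
        key: sorted(members, key=lambda d: d.get("score", 0.0), reverse=True)[:max_per_group]
        for key, members in groups.items()
    }


docs = [
    {"document_id": "a", "category": "electronics", "score": 0.70},
    {"document_id": "b", "category": "electronics", "score": 0.95},
    {"document_id": "c", "category": "electronics", "score": 0.89},
    {"document_id": "d", "category": "clothing", "score": 0.60},
]

grouped = group_sorted(docs, "category", max_per_group=2)
electronics_ids = [d["document_id"] for d in grouped["electronics"]]
```

With max_per_group set to 2, the lowest-scoring electronics document is dropped after sorting, so the limit always removes the least relevant members.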
Output Modes
| output_mode | Description |
| --- | --- |
| all (default) | Return all documents (up to max_per_group) grouped by field |
| first | Return only the top-scoring document per group (deduplication) |
| flatten | Return all documents as a flat list (drops group structure) |
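One way to picture the three modes is as different projections of the same grouped data. The sketch below assumes the groups have already been sorted by score and trimmed. `shape_output` is a hypothetical function, and the exact response shape for first and flatten is an assumption based on the descriptions above.

```python
def shape_output(groups: dict, output_mode: str = "all") -> list:
    """Project {key: [docs sorted by score]} according to output_mode."""
    if output_mode == "all":
        # Keep the group structure: one entry per key with its documents.
        return [{"key": key, "documents": docs} for key, docs in groups.items()]
    if output_mode == "first":
        # One document per group: the top-scoring one.
        return [docs[0] for docs in groups.values() if docs]
    if output_mode == "flatten":
        # Drop the group structure and concatenate the documents.
        return [doc for docs in groups.values() for doc in docs]
    raise ValueError(f"unknown output_mode: {output_mode!r}")


groups = {
    "electronics": [{"document_id": "b", "score": 0.95}, {"document_id": "c", "score": 0.89}],
    "clothing": [{"document_id": "d", "score": 0.60}],
}

first_ids = [d["document_id"] for d in shape_output(groups, "first")]
flat_ids = [d["document_id"] for d in shape_output(groups, "flatten")]
all_keys = [g["key"] for g in shape_output(groups, "all")]
```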
Handling Missing Values
Documents missing the group_by_field value are grouped under a null key.
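A minimal sketch of this behaviour, assuming documents are nested dictionaries: a missing field resolves to Python's None, which the standard json module serializes as null. `group_with_null` and the sample documents are hypothetical and exist only for this illustration.

```python
import json


def group_with_null(docs: list, field: str) -> dict:
    """Group document ids by metadata[field]; documents without it share the None key."""
    groups: dict = {}
    for d in docs:
        key = (d.get("metadata") or {}).get(field)  # None when metadata or the field is absent
        groups.setdefault(key, []).append(d["document_id"])
    return groups


docs = [
    {"document_id": "doc_1", "metadata": {"category": "electronics"}},
    {"document_id": "doc_2", "metadata": {}},
    {"document_id": "doc_3"},
]

groups = group_with_null(docs, "category")

# Emit the key as a value so that None is rendered as JSON null.
payload = json.dumps([{"key": k, "document_ids": ids} for k, ids in groups.items()])
```

Checking for the null group explicitly is a cheap way to spot documents whose metadata was never populated.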
Error Handling
| Error | Behavior |
| --- | --- |
| Missing field | Documents grouped under "null" key |
| Empty results | Return empty groups array |
| Invalid field path | Stage fails |
Group By vs Cluster
| Aspect | Group By | Cluster |
| --- | --- | --- |
| Grouping basis | Field value | Embedding similarity |
| Groups known | Yes (field values) | No (discovered) |
| Speed | Fast | Slower |
| Use case | Category organization | Theme discovery |