The Group By stage aggregates documents that share the same value for a specified field, creating logical groups. This is useful for organizing results by category, author, date, or any other attribute.
Stage Category: GROUP (Groups documents)
Transformation: N documents → G groups (where G = number of unique field values)
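
The N documents → G groups transformation can be approximated locally with a few lines of Python. This is only a sketch of the idea, not the stage's actual implementation: the helper names get_by_path and group_documents are hypothetical, and the sample documents are made up for illustration.

```python
from collections import defaultdict
from typing import Any


def get_by_path(doc: dict, path: str) -> Any:
    """Resolve a dot-notation path such as 'metadata.category'; None if absent."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def group_documents(docs: list[dict], field: str) -> dict[Any, list[dict]]:
    """Bucket documents by the value found at `field` (hypothetical helper)."""
    groups: dict[Any, list[dict]] = defaultdict(list)
    for doc in docs:
        groups[get_by_path(doc, field)].append(doc)
    return dict(groups)


# Illustrative documents, not real data.
docs = [
    {"document_id": "doc_1", "metadata": {"category": "electronics"}},
    {"document_id": "doc_2", "metadata": {"category": "clothing"}},
    {"document_id": "doc_3", "metadata": {"category": "electronics"}},
]
groups = group_documents(docs, "metadata.category")
print({key: len(members) for key, members in groups.items()})
# {'electronics': 2, 'clothing': 1}
```

Here N = 3 documents collapse into G = 2 groups, one per distinct value of metadata.category.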

When to Use

| Use Case | Description |
| --- | --- |
| Category grouping | Group products by category |
| Author aggregation | Group articles by author |
| Date grouping | Group by day/month/year |
| Source organization | Group by data source |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Semantic similarity grouping | cluster |
| Statistical aggregations only | aggregate |
| Removing duplicates | deduplicate |
| Top-N per group | Use with sample |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| group_by_field | string | source_object_id | Field to group by (dot notation supported) |
| max_per_group | integer | 10 | Maximum documents to keep per group |
| output_mode | string | all | first (top doc per group), all (grouped), flatten (flat list) |

Configuration Examples

{
  "stage_name": "group_by",
  "stage_type": "group",
  "config": {
    "stage_id": "group_by",
    "parameters": {
      "group_by_field": "metadata.category"
    }
  }
}
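
The example above sets only group_by_field and relies on the defaults for the other two parameters. If stage definitions are generated from code, a small builder along these lines could fill in all three. The function make_group_by_stage is a hypothetical helper, not part of any SDK; the defaults and allowed output_mode values are taken from the Parameters table.

```python
import json

# Allowed values as listed in the Parameters table.
VALID_OUTPUT_MODES = ("first", "all", "flatten")


def make_group_by_stage(
    group_by_field: str = "source_object_id",
    max_per_group: int = 10,
    output_mode: str = "all",
) -> dict:
    """Build a group_by stage definition (hypothetical helper)."""
    if output_mode not in VALID_OUTPUT_MODES:
        raise ValueError(f"output_mode must be one of {VALID_OUTPUT_MODES}")
    return {
        "stage_name": "group_by",
        "stage_type": "group",
        "config": {
            "stage_id": "group_by",
            "parameters": {
                "group_by_field": group_by_field,
                "max_per_group": max_per_group,
                "output_mode": output_mode,
            },
        },
    }


stage = make_group_by_stage("metadata.category", max_per_group=3, output_mode="first")
print(json.dumps(stage, indent=2))
```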

Output Schema

{
  "groups": [
    {
      "key": "electronics",
      "count": 25,
      "documents": [
        {
          "document_id": "doc_123",
          "content": "Latest smartphone review...",
          "score": 0.95,
          "metadata": {"category": "electronics", "price": 999}
        },
        {
          "document_id": "doc_456",
          "content": "Laptop comparison guide...",
          "score": 0.89,
          "metadata": {"category": "electronics", "price": 1299}
        }
      ]
    },
    {
      "key": "clothing",
      "count": 18,
      "documents": [...]
    }
  ],
  "metadata": {
    "total_groups": 5,
    "total_documents": 100,
    "field": "metadata.category"
  }
}
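
A client consuming this output might walk the groups array as sketched below. The function summarize_groups is a hypothetical helper, and the sample response reuses only the electronics group from the schema above, trimmed to the fields the helper reads.

```python
def summarize_groups(response: dict) -> list[tuple]:
    """Return (key, count, top document id) for each group (hypothetical helper)."""
    summary = []
    for group in response.get("groups", []):
        documents = group.get("documents", [])
        # Assumes the first document is the top-ranked one in its group.
        top_id = documents[0]["document_id"] if documents else None
        summary.append((group["key"], group["count"], top_id))
    return summary


# Trimmed copy of the schema example; only one group is included here.
response = {
    "groups": [
        {
            "key": "electronics",
            "count": 25,
            "documents": [
                {"document_id": "doc_123", "score": 0.95},
                {"document_id": "doc_456", "score": 0.89},
            ],
        }
    ],
    "metadata": {"field": "metadata.category"},
}
print(summarize_groups(response))
# [('electronics', 25, 'doc_123')]
```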

Performance

| Metric | Value |
| --- | --- |
| Latency | 5-20 ms |
| Memory | O(N) |
| Cost | Free |
| Scalability | Efficient |

Common Pipeline Patterns

Search + Group by Category

[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 100 }
        ],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "group_by",
    "stage_type": "group",
    "config": {
      "stage_id": "group_by",
      "parameters": {
        "group_by_field": "metadata.category",
        "max_per_group": 5
      }
    }
  }
]

Grouped Results with Aggregations

[
  {
    "stage_name": "hybrid_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 200 }
        ],
        "final_top_k": 200
      }
    }
  },
  {
    "stage_name": "group_by",
    "stage_type": "group",
    "config": {
      "stage_id": "group_by",
      "parameters": {
        "group_by_field": "metadata.brand",
        "max_per_group": 10
      }
    }
  },
  {
    "stage_name": "aggregate",
    "stage_type": "reduce",
    "config": {
      "stage_id": "aggregate",
      "parameters": {
        "aggregations": [
          {"type": "avg", "field": "metadata.price", "name": "avg_price"},
          {"type": "avg", "field": "metadata.rating", "name": "avg_rating"}
        ],
        "group_by": "metadata.brand"
      }
    }
  }
]
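
To sanity-check what the avg aggregations in this pipeline should produce, the per-group averages can be computed locally. This is a rough approximation under stated assumptions, not the aggregate stage's implementation: average_by_group is a hypothetical helper, it takes plain metadata keys rather than dot-notation paths, and the brands and prices are invented sample values.

```python
from collections import defaultdict
from statistics import mean


def average_by_group(docs: list[dict], group_key: str, value_key: str) -> dict:
    """Average metadata[value_key] per distinct metadata[group_key] (hypothetical helper)."""
    values = defaultdict(list)
    for doc in docs:
        metadata = doc.get("metadata", {})
        # Skip documents that lack either field rather than guessing a value.
        if group_key in metadata and value_key in metadata:
            values[metadata[group_key]].append(metadata[value_key])
    return {key: mean(vals) for key, vals in values.items()}


# Invented sample data for illustration only.
docs = [
    {"metadata": {"brand": "acme", "price": 100.0, "rating": 4.0}},
    {"metadata": {"brand": "acme", "price": 200.0, "rating": 5.0}},
    {"metadata": {"brand": "globex", "price": 50.0, "rating": 3.0}},
]
print(average_by_group(docs, "brand", "price"))
# {'acme': 150.0, 'globex': 50.0}
```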

Enrich + Group by Author

[
  {
    "stage_name": "semantic_search",
    "stage_type": "filter",
    "config": {
      "stage_id": "feature_search",
      "parameters": {
        "searches": [
          { "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1", "query": { "input_mode": "text", "value": "{{INPUT.query}}" }, "top_k": 100 }
        ],
        "final_top_k": 100
      }
    }
  },
  {
    "stage_name": "document_enrich",
    "stage_type": "enrich",
    "config": {
      "stage_id": "document_enrich",
      "parameters": {
        "target_collection_id": "authors",
        "target_field": "author_id",
        "source_field": "metadata.author_id",
        "output_field": "author"
      }
    }
  },
  {
    "stage_name": "group_by",
    "stage_type": "group",
    "config": {
      "stage_id": "group_by",
      "parameters": {
        "group_by_field": "author.name",
        "max_per_group": 3
      }
    }
  }
]

Time-Based Grouping

[
  {
    "stage_name": "structured_filter",
    "stage_type": "filter",
    "config": {
      "stage_id": "attribute_filter",
      "parameters": {
        "conditions": {
          "field": "metadata.date",
          "operator": "gte",
          "value": "2024-01-01"
        }
      }
    }
  },
  {
    "stage_name": "code_execution",
    "stage_type": "apply",
    "config": {
      "stage_id": "code_execution",
      "parameters": {
        "code": "def transform(doc):\n    metadata = doc.setdefault('metadata', {})\n    date = metadata.get('date') or ''\n    metadata['month'] = str(date)[:7]  # YYYY-MM\n    return doc"
      }
    }
  },
  {
    "stage_name": "group_by",
    "stage_type": "group",
    "config": {
      "stage_id": "group_by",
      "parameters": {
        "group_by_field": "metadata.month"
      }
    }
  }
]
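
The transform in this pipeline is stored as an escaped string, which makes it hard to read and test. Written out as a standalone function, a variant that also guards against documents with no metadata might look like the following. It assumes metadata.date is an ISO 8601 string (as in the filter value "2024-01-01"), so the first seven characters are the YYYY-MM month.

```python
def transform(doc: dict) -> dict:
    """Add metadata.month (YYYY-MM) derived from metadata.date."""
    # setdefault creates the metadata dict if the document has none.
    metadata = doc.setdefault("metadata", {})
    date = metadata.get("date") or ""
    metadata["month"] = str(date)[:7]
    return doc


# Illustrative document, not real data.
sample = {"document_id": "doc_1", "metadata": {"date": "2024-03-15"}}
print(transform(sample)["metadata"]["month"])
# 2024-03
```

A document without a date ends up with an empty string as its month, so such documents would share one group.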

Document Sorting Within Groups

Documents within each group are automatically sorted by relevance score (highest first), then limited to max_per_group.
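
The sort-then-limit behavior described above can be pictured as follows. This is a minimal sketch: top_documents is a hypothetical helper, doc_789 is an invented document, and treating a missing score as 0.0 is an assumption.

```python
def top_documents(group_docs: list[dict], max_per_group: int = 10) -> list[dict]:
    """Sort one group's documents by score (highest first), then truncate."""
    ranked = sorted(group_docs, key=lambda doc: doc.get("score", 0.0), reverse=True)
    return ranked[:max_per_group]


electronics = [
    {"document_id": "doc_456", "score": 0.89},
    {"document_id": "doc_789", "score": 0.42},
    {"document_id": "doc_123", "score": 0.95},
]
print([doc["document_id"] for doc in top_documents(electronics, max_per_group=2)])
# ['doc_123', 'doc_456']
```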

Output Modes

| output_mode | Description |
| --- | --- |
| all (default) | Return all documents (up to max_per_group) grouped by field |
| first | Return only the top-scoring document per group (deduplication) |
| flatten | Return all documents as a flat list (drops group structure) |
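
The three modes differ only in how already-grouped, already-sorted documents are shaped for output. The sketch below illustrates that difference; apply_output_mode is a hypothetical helper, the exact response shape for first is an assumption, and doc_789 is an invented document.

```python
def apply_output_mode(groups: list[dict], output_mode: str = "all") -> list[dict]:
    """Shape sorted groups according to output_mode (hypothetical helper)."""
    if output_mode == "all":
        return groups
    if output_mode == "first":
        # Keep only the top document of each group.
        return [{**group, "documents": group["documents"][:1]} for group in groups]
    if output_mode == "flatten":
        # Drop the group structure and return one flat list of documents.
        return [doc for group in groups for doc in group["documents"]]
    raise ValueError(f"unknown output_mode: {output_mode!r}")


groups = [
    {"key": "electronics", "documents": [{"document_id": "doc_123"}, {"document_id": "doc_456"}]},
    {"key": "clothing", "documents": [{"document_id": "doc_789"}]},
]
print(len(apply_output_mode(groups, "first")[0]["documents"]))
# 1
print([doc["document_id"] for doc in apply_output_mode(groups, "flatten")])
# ['doc_123', 'doc_456', 'doc_789']
```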

Handling Missing Values

Documents missing the group_by_field value are grouped under a null key.
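
In Python terms, a missing value resolves to None, which serializes to null in JSON. The sketch below shows documents with an empty metadata object and with no metadata at all landing in the same group. The helper field_value and the sample documents are hypothetical, and the output shape is simplified to key plus document ids.

```python
import json


def field_value(doc: dict, path: str):
    """Resolve a dot-notation path; return None when any segment is missing."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(part)
    return current


# Illustrative documents, not real data.
docs = [
    {"document_id": "doc_1", "metadata": {"category": "electronics"}},
    {"document_id": "doc_2", "metadata": {}},
    {"document_id": "doc_3"},
]
buckets: dict = {}
for doc in docs:
    buckets.setdefault(field_value(doc, "metadata.category"), []).append(doc["document_id"])

groups = [{"key": key, "documents": ids} for key, ids in buckets.items()]
print(json.dumps(groups))
# [{"key": "electronics", "documents": ["doc_1"]}, {"key": null, "documents": ["doc_2", "doc_3"]}]
```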

Error Handling

| Error | Behavior |
| --- | --- |
| Missing field | Documents grouped under “null” key |
| Empty results | Return empty groups array |
| Invalid field path | Stage fails |

Group By vs Cluster

| Aspect | Group By | Cluster |
| --- | --- | --- |
| Grouping basis | Field value | Embedding similarity |
| Groups known | Yes (field values) | No (discovered) |
| Speed | Fast | Slower |
| Use case | Category organization | Theme discovery |