[Figure: Group By stage showing document aggregation by field values]

The Group By stage aggregates documents that share the same value for a specified field, creating logical groups. This is useful for organizing results by category, author, date, or any other attribute.
Stage Category: GROUP (Groups documents)
Transformation: N documents → G groups (where G = number of unique field values)
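
The N documents → G groups transformation can be pictured with a minimal Python sketch. This is an illustration only, not the stage's actual implementation: `get_path` and `group_by` are hypothetical helpers, and the assumption that dotted names such as metadata.category resolve through nested dictionaries is inferred from the examples on this page.

```python
from collections import defaultdict

def get_path(doc, path):
    """Resolve a dotted path such as 'metadata.category'; return None if absent."""
    value = doc
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

def group_by(docs, field):
    """Bucket documents by the value at `field`, keeping first-seen key order."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[get_path(doc, field)].append(doc)
    return [{"key": k, "count": len(v), "documents": v} for k, v in buckets.items()]

docs = [
    {"document_id": "doc_1", "metadata": {"category": "electronics"}},
    {"document_id": "doc_2", "metadata": {"category": "clothing"}},
    {"document_id": "doc_3", "metadata": {"category": "electronics"}},
]
groups = group_by(docs, "metadata.category")
print([(g["key"], g["count"]) for g in groups])
# [('electronics', 2), ('clothing', 1)]
```

Here N = 3 documents collapse into G = 2 groups, one per distinct category value.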

When to Use

| Use Case | Description |
|---|---|
| Category grouping | Group products by category |
| Author aggregation | Group articles by author |
| Date grouping | Group by day, month, or year |
| Source organization | Group by data source |

When NOT to Use

| Scenario | Recommended Alternative |
|---|---|
| Semantic similarity grouping | cluster |
| Statistical aggregations only | aggregate |
| Removing duplicates | deduplicate |
| Top-N per group | Use with sample |

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| field | string | Required | Field to group by |
| max_groups | integer | 100 | Maximum number of groups |
| sort_groups_by | string | count | Sort groups: count, field, score |
| sort_order | string | desc | Group sort order: asc, desc |
| docs_per_group | integer | all | Limit documents per group |
| sort_docs_by | string | score | Sort docs within group |
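
To make the two limit parameters concrete, the sketch below applies max_groups and docs_per_group to groups that are already sorted. `apply_limits` is a hypothetical helper, and whether count keeps reporting the full group size after truncation is an assumption, not something this page states.

```python
def apply_limits(groups, max_groups=100, docs_per_group=None):
    """Keep at most `max_groups` groups and `docs_per_group` documents in each.

    Assumption: `count` still reflects the full group size after truncation.
    """
    limited = []
    for group in groups[:max_groups]:
        docs = group["documents"]
        if docs_per_group is not None:  # None stands in for the "all" default
            docs = docs[:docs_per_group]
        limited.append({**group, "documents": docs})
    return limited

groups = [
    {"key": "a", "count": 3, "documents": ["d1", "d2", "d3"]},
    {"key": "b", "count": 2, "documents": ["d4", "d5"]},
    {"key": "c", "count": 1, "documents": ["d6"]},
]
limited = apply_limits(groups, max_groups=2, docs_per_group=2)
print([(g["key"], g["documents"]) for g in limited])
# [('a', ['d1', 'd2']), ('b', ['d4', 'd5'])]
```

Because truncation happens after sorting, the sort settings decide which groups and documents survive the limits.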

Configuration Examples

{
  "stage_type": "group",
  "stage_id": "group_by",
  "parameters": {
    "field": "metadata.category"
  }
}

Output Schema

{
  "groups": [
    {
      "key": "electronics",
      "count": 25,
      "documents": [
        {
          "document_id": "doc_123",
          "content": "Latest smartphone review...",
          "score": 0.95,
          "metadata": {"category": "electronics", "price": 999}
        },
        {
          "document_id": "doc_456",
          "content": "Laptop comparison guide...",
          "score": 0.89,
          "metadata": {"category": "electronics", "price": 1299}
        }
      ]
    },
    {
      "key": "clothing",
      "count": 18,
      "documents": [...]
    }
  ],
  "metadata": {
    "total_groups": 5,
    "total_documents": 100,
    "field": "metadata.category"
  }
}
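
A consumer of this output typically walks the groups array. The sketch below does that over a small hand-written response that follows the schema above; the sample values and the `summarize` helper are illustrative assumptions, not output from a real pipeline run.

```python
# Hand-written sample that mirrors the output schema (values are made up).
response = {
    "groups": [
        {
            "key": "electronics",
            "count": 2,
            "documents": [
                {"document_id": "doc_123", "score": 0.95},
                {"document_id": "doc_456", "score": 0.89},
            ],
        },
        {
            "key": "clothing",
            "count": 1,
            "documents": [{"document_id": "doc_789", "score": 0.72}],
        },
    ],
    "metadata": {"total_groups": 2, "total_documents": 3, "field": "metadata.category"},
}

def summarize(resp):
    """Return (key, count, best_score) for each group, in response order."""
    rows = []
    for group in resp["groups"]:
        scores = [doc["score"] for doc in group["documents"]]
        rows.append((group["key"], group["count"], max(scores, default=None)))
    return rows

summary = summarize(response)
for key, count, best in summary:
    print(f"{key}: {count} docs, best score {best}")
```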

Performance

| Metric | Value |
|---|---|
| Latency | 5-20ms |
| Memory | O(N) |
| Cost | Free |
| Scalability | Efficient |

Common Pipeline Patterns

Search + Group by Category

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "group",
    "stage_id": "group_by",
    "parameters": {
      "field": "metadata.category",
      "docs_per_group": 5
    }
  }
]

Grouped Results with Aggregations

[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 200
    }
  },
  {
    "stage_type": "group",
    "stage_id": "group_by",
    "parameters": {
      "field": "metadata.brand",
      "sort_groups_by": "count",
      "max_groups": 10
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "aggregate",
    "parameters": {
      "aggregations": [
        {"type": "avg", "field": "metadata.price", "name": "avg_price"},
        {"type": "avg", "field": "metadata.rating", "name": "avg_rating"}
      ],
      "group_by": "metadata.brand"
    }
  }
]

Enrich + Group by Author

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "enrich",
    "stage_id": "document_enrich",
    "parameters": {
      "collection_id": "authors",
      "lookup_field": "author_id",
      "source_field": "metadata.author_id",
      "result_field": "author"
    }
  },
  {
    "stage_type": "group",
    "stage_id": "group_by",
    "parameters": {
      "field": "author.name",
      "docs_per_group": 3,
      "sort_groups_by": "count"
    }
  }
]

Time-Based Grouping

[
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "metadata.date",
        "operator": "gte",
        "value": "2024-01-01"
      }
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "code_execution",
    "parameters": {
      "code": "def transform(doc):\n    meta = doc.setdefault('metadata', {})  # tolerate docs without metadata\n    date = meta.get('date') or ''\n    meta['month'] = date[:7]  # YYYY-MM\n    return doc"
    }
  },
  {
    "stage_type": "group",
    "stage_id": "group_by",
    "parameters": {
      "field": "metadata.month",
      "sort_groups_by": "field",
      "sort_order": "desc"
    }
  }
]
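
The month-derivation step can be checked locally before it goes into a pipeline. The sketch below uses a defensive variant of the transform function (it tolerates documents with no metadata or no date) and assumes dates are ISO-style YYYY-MM-DD strings, as in the filter condition above.

```python
def transform(doc):
    # Defensive variant of the apply-stage function: documents without
    # 'metadata' or 'date' get an empty month instead of raising KeyError.
    meta = doc.setdefault("metadata", {})
    date = meta.get("date") or ""
    meta["month"] = date[:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
    return doc

docs = [
    {"metadata": {"date": "2024-03-15"}},
    {"metadata": {"date": "2024-01-02"}},
    {"metadata": {"date": "2024-03-01"}},
    {},  # no metadata at all
]
months = [transform(d)["metadata"]["month"] for d in docs]
print(months)  # ['2024-03', '2024-01', '2024-03', '']

# Zero-padded YYYY-MM strings sort lexicographically in chronological order,
# so ordering the group keys as plain strings gives newest-first with desc.
ordered = sorted({m for m in months if m}, reverse=True)
print(ordered)  # ['2024-03', '2024-01']
```

Documents that end up with an empty month would form their own group, so filtering them out beforehand may be preferable.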

Group Sorting Options

By Count (default)

{"sort_groups_by": "count", "sort_order": "desc"}
Groups with most documents first.

By Field Value

{"sort_groups_by": "field", "sort_order": "asc"}
Alphabetical or chronological ordering.

By Best Score

{"sort_groups_by": "score", "sort_order": "desc"}
Groups containing highest-scoring documents first.
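
The three sort modes can be compared side by side with a short sketch. `sort_groups` is a hypothetical helper; treating a group's score as the highest score among its documents is an assumption based on the description above.

```python
def sort_groups(groups, sort_groups_by="count", sort_order="desc"):
    """Order groups by document count, key value, or best document score."""
    key_funcs = {
        "count": lambda g: g["count"],
        "field": lambda g: g["key"],
        # Assumption: a group's score is the best score among its documents.
        "score": lambda g: max(
            (d["score"] for d in g["documents"]), default=float("-inf")
        ),
    }
    if sort_groups_by not in key_funcs:
        raise ValueError(f"unknown sort_groups_by: {sort_groups_by!r}")
    if sort_order not in ("asc", "desc"):
        raise ValueError(f"unknown sort_order: {sort_order!r}")
    return sorted(groups, key=key_funcs[sort_groups_by], reverse=(sort_order == "desc"))

groups = [
    {"key": "books", "count": 3, "documents": [{"score": 0.70}]},
    {"key": "audio", "count": 1, "documents": [{"score": 0.95}]},
    {"key": "games", "count": 2, "documents": [{"score": 0.80}]},
]
by_count = [g["key"] for g in sort_groups(groups, "count", "desc")]
by_field = [g["key"] for g in sort_groups(groups, "field", "asc")]
by_score = [g["key"] for g in sort_groups(groups, "score", "desc")]
print(by_count)  # ['books', 'games', 'audio']
print(by_field)  # ['audio', 'books', 'games']
print(by_score)  # ['audio', 'games', 'books']
```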

Document Sorting Within Groups

| Sort By | Description |
|---|---|
| score | Relevance score (default) |
| metadata.date | Any metadata field |
| _random | Random order |
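
As a rough illustration of these options, the sketch below orders the documents of a single group. `sort_docs` is a hypothetical helper, and its `descending` and `seed` arguments are assumptions added for the example; this page does not say which direction the stage uses or whether random order can be seeded.

```python
import random

def sort_docs(docs, sort_docs_by="score", descending=True, seed=None):
    """Order one group's documents by score, a dotted field path, or at random."""
    if sort_docs_by == "_random":
        shuffled = list(docs)  # copy so the caller's list is left untouched
        random.Random(seed).shuffle(shuffled)
        return shuffled

    def key(doc):
        # Walk a dotted path such as 'metadata.date'; assumes the field is
        # present on every document so the values are mutually comparable.
        value = doc
        for part in sort_docs_by.split("."):
            value = value.get(part) if isinstance(value, dict) else None
        return value

    return sorted(docs, key=key, reverse=descending)

docs = [
    {"document_id": "a", "score": 0.5, "metadata": {"date": "2024-02-01"}},
    {"document_id": "b", "score": 0.9, "metadata": {"date": "2024-01-01"}},
    {"document_id": "c", "score": 0.7, "metadata": {"date": "2024-03-01"}},
]
by_score = [d["document_id"] for d in sort_docs(docs)]
by_date = [d["document_id"] for d in sort_docs(docs, "metadata.date", descending=False)]
shuffled = [d["document_id"] for d in sort_docs(docs, "_random", seed=0)]
print(by_score)  # ['b', 'c', 'a']
print(by_date)   # ['b', 'a', 'c']
```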

Handling Missing Values

| Behavior | Description |
|---|---|
| null key | Documents with missing field grouped as “null” |
| Exclude | Set exclude_null: true to skip |

{
  "stage_type": "group",
  "stage_id": "group_by",
  "parameters": {
    "field": "metadata.category",
    "exclude_null": true
  }
}
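
The two behaviors can be contrasted in a few lines. This sketch groups flat dictionaries for brevity; `group_with_missing` is a hypothetical helper, and treating an explicit None the same as a missing field is an assumption.

```python
from collections import defaultdict

def group_with_missing(docs, field, exclude_null=False):
    """Group flat dicts by `field`; missing or None values share the "null" key."""
    buckets = defaultdict(list)
    for doc in docs:
        value = doc.get(field)
        if value is None:
            if exclude_null:
                continue  # drop the document instead of grouping it
            value = "null"
        buckets[value].append(doc)
    return dict(buckets)

docs = [
    {"id": 1, "category": "electronics"},
    {"id": 2},                      # field missing
    {"id": 3, "category": None},    # field present but empty
]
kept = group_with_missing(docs, "category")
skipped = group_with_missing(docs, "category", exclude_null=True)
print({k: len(v) for k, v in kept.items()})     # {'electronics': 1, 'null': 2}
print({k: len(v) for k, v in skipped.items()})  # {'electronics': 1}
```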

Error Handling

| Error | Behavior |
|---|---|
| Missing field | Group as “null” or exclude |
| Too many groups | Truncate to max_groups |
| Empty results | Return empty groups array |
| Invalid field path | Stage fails |

Group By vs Cluster

| Aspect | Group By | Cluster |
|---|---|---|
| Grouping basis | Field value | Embedding similarity |
| Groups known | Yes (field values) | No (discovered) |
| Speed | Fast | Slower |
| Use case | Category organization | Theme discovery |