Skip to content

[Feature Request] Implement "multivalue_doc_count" aggregation for counting documents with multi-valued fieldsΒ #20329

@LantaoJin

Description

@LantaoJin

Is your feature request related to a problem? Please describe

We need a new type of aggregation called "multivalue_doc_count" that calculates the number of documents that have a multi-valued field. This is different from the existing "value_count" aggregation, which counts the total number of values across all documents.

User Case:

Imagine a scenario where we have an e-commerce website that stores product data in an OpenSearch index. Each product document has a field called "categories" that can contain multiple values (e.g., ["Electronics", "Computers", "Laptops"]). We want to analyze the number of products that belong to multiple categories.

Currently, if we use the "value_count" aggregation on the "categories" field, it will give us the total count of all category values across all product documents. However, this doesn't provide insight into how many products actually have multiple categories assigned.

With the "multivalue_doc_count" aggregation, we can directly get the count of documents (products) that have more than one value in the "categories" field. This information can be useful for various purposes, such as:

  1. Identifying popular product categories that often overlap with other categories.
  2. Optimizing product categorization and navigation on the website.
  3. Analyzing the complexity of product categorization for inventory management or search relevance tuning.

Example:

Given an index with the following documents:

{"id": 1, "categories": ["Electronics", "Computers"]}
{"id": 2, "categories": ["Books"]}
{"id": 3, "categories": ["Electronics", "Phones", "Accessories"]}
{"id": 4, "categories": ["Clothing"]}

The "value_count" aggregation on the "categories" field would return 7 (total values across all documents).
The "multivalue_doc_count" aggregation on the "categories" field would return 2 (two documents have multiple category values).

Describe the solution you'd like

Please consider implementing this new aggregation to enhance the analytical capabilities of OpenSearch for multi-valued fields.

{
  "size": 0,
   "aggs": {
    "number_of_array_docs": {
      "multivalue_doc_count": {
        "field": "categories"
      }
    }
  }
}

Acceptance Criteria:

  • The "multivalue_doc_count" aggregation should work with both single-valued and multi-valued fields.
  • It should correctly count the number of documents that have more than one value in the specified field.
  • The aggregation should support nested fields and scripted fields.
  • Appropriate documentation and test cases should be included.

Related component

Search:Aggregations

Describe alternatives you've considered

The alternative of multivalue_doc_count is calculated by totalCount - missingCount - totalValues

{
  "size": 0,
  "aggs": {
    "totalCount": {
      "value_count": {
        "field": "_index"
      }
    },
    "missingCount": {
      "missing": {
        "field": "categories"
      }
    },
    "totalValues":{
      "value_count": {
        "field": "categories"
      }
    }
  }
}

Additional context

SQL plugin can leverage the multivalue_doc_count aggregation to determine a field is an array or not, Since from the index mapping API, SQL plugin cannot tell the field in SQL schema is an array. In the above example, SQL plugin identified the categories as String type instead of Array[String].

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    Status

    πŸ†• New

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions