-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Description
Is your feature request related to a problem? Please describe
We need a new type of aggregation called "multivalue_doc_count" that calculates the number of documents that have a multi-valued field. This is different from the existing "value_count" aggregation, which counts the total number of values across all documents.
User Case:
Imagine a scenario where we have an e-commerce website that stores product data in an OpenSearch index. Each product document has a field called "categories" that can contain multiple values (e.g., ["Electronics", "Computers", "Laptops"]). We want to analyze the number of products that belong to multiple categories.
Currently, if we use the "value_count" aggregation on the "categories" field, it will give us the total count of all category values across all product documents. However, this doesn't provide insight into how many products actually have multiple categories assigned.
With the "multivalue_doc_count" aggregation, we can directly get the count of documents (products) that have more than one value in the "categories" field. This information can be useful for various purposes, such as:
- Identifying popular product categories that often overlap with other categories.
- Optimizing product categorization and navigation on the website.
- Analyzing the complexity of product categorization for inventory management or search relevance tuning.
Example:
Given an index with the following documents:
{"id": 1, "categories": ["Electronics", "Computers"]}
{"id": 2, "categories": ["Books"]}
{"id": 3, "categories": ["Electronics", "Phones", "Accessories"]}
{"id": 4, "categories": ["Clothing"]}
The "value_count" aggregation on the "categories" field would return 7 (total values across all documents).
The "multivalue_doc_count" aggregation on the "categories" field would return 2 (two documents have multiple category values).
Describe the solution you'd like
Please consider implementing this new aggregation to enhance the analytical capabilities of OpenSearch for multi-valued fields.
{
"size": 0,
"aggs": {
"number_of_array_docs": {
"multivalue_doc_count": {
"field": "categories"
}
}
}
}
Acceptance Criteria:
- The "multivalue_doc_count" aggregation should work with both single-valued and multi-valued fields.
- It should correctly count the number of documents that have more than one value in the specified field.
- The aggregation should support nested fields and scripted fields.
- Appropriate documentation and test cases should be included.
Related component
Search:Aggregations
Describe alternatives you've considered
The alternative of multivalue_doc_count is calculated by totalCount - missingCount - totalValues
{
"size": 0,
"aggs": {
"totalCount": {
"value_count": {
"field": "_index"
}
},
"missingCount": {
"missing": {
"field": "categories"
}
},
"totalValues":{
"value_count": {
"field": "categories"
}
}
}
}
Additional context
SQL plugin can leverage the multivalue_doc_count aggregation to determine a field is an array or not, Since from the index mapping API, SQL plugin cannot tell the field in SQL schema is an array. In the above example, SQL plugin identified the categories as String type instead of Array[String].
Metadata
Metadata
Assignees
Labels
Type
Projects
Status