-
Notifications
You must be signed in to change notification settings - Fork 558
Add rare_terms bucket aggregation #9826
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ve aggregations/pipeline-agg.md; Add aggregations/pipeline/index.md. Individual aggregation files added in other PRs. Signed-off-by: Dave Welsch <[email protected]>
Signed-off-by: Dave Welsch <[email protected]>
Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged. Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer. When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review. |
@sandeshkr419 Could you review this PR? Thanks! |
@@ -59,6 +59,9 @@ GET opensearch_dashboards_sample_data_logs/_search | |||
The values are returned with the key `key`. | |||
`doc_count` specifies the number of documents in each bucket. By default, the buckets are sorted in descending order of `doc-count`. | |||
|
|||
It is possible to use `terms` to search for infrequent values by ordering returned values by ascending count ( `"order": {"count": "asc")` ). We strongly discourage this practice since doing so can cause large unknown errors if multiple shards are involved. We recommend using `rare_terms` instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add some more clarification here:
The issue arises because a term that is globally infrequent might not appear as infrequent on every individual shard, or it might be entirely absent from the top (or, in this case, bottom) results returned by some shards. Conversely, a term might seem infrequent on one shard but be quite common on another. In both the scenarios, the rare term might be missed in individual shard results, which can lead to incorrect overall results.
|
||
A sibling aggregation must be a multi-bucket aggregation (have multiple grouped values for a certain field) and the metric must be a numeric value. | ||
|
||
`min_bucket`, `max_bucket`, `sum_bucket`, and `avg_bucket` are common sibling aggregations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think OpenSearch has documentation to these sibling aggregations. Might be good to list the available sibling aggregations out in a tabular format with small description.
Same applies for parent aggregations as well.
This folder wil probably have all pipeline aggregators: https://github.com/opensearch-project/OpenSearch/tree/main/server/src/main/java/org/opensearch/search/aggregations/pipeline
For each aggregator builder classes, look for the NAME
property, which should tell you that the corresponding name is one of the pipeline aggregation. Example: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/aggregations/pipeline/AvgBucketPipelineAggregationBuilder.java#L48 tells that avg_bucket
is one of the pipeline aggregation.
If you think there is a chance to improve java docs on the same, please open up an issue in core listing out the aggregations, where you think we need to improve on java docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the plan is to include all parent and all sibling aggregations in detail like #9796, then probably a table of links might be a good follow-up instead of a tabular format.
|
||
The following parameters are optional: | ||
|
||
- `gap_policy`: Real-world data can contain gaps or null values. You can specify the policy to deal with such missing data with the `gap_policy` property. You can either set the `gap_policy` property to `skip` to skip the missing data and continue from the next available value, or `insert_zeros` to replace the missing values with zero and continue running. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets just say this is used to handle data gaps
and then link it to bottom section of the page where you talk about data gaps in detail.
It feels too much detail here when it is explained in more detail in another section.
I see pipeline aggregations is also rewritten (which looks very neat now, kudos) in the same PR. |
Description
Add rare_terms bucket aggregation.
Issues Resolved
Version
Frontend features
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.