Skip to content

Add rare_terms bucket aggregation #9826

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

dwelsch-esi
Copy link
Contributor

Description

Add rare_terms bucket aggregation.

Issues Resolved

Version

Frontend features

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

github-actions bot commented May 6, 2025

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

@kolchfa-aws
Copy link
Collaborator

@sandeshkr419 Could you review this PR? Thanks!

@@ -59,6 +59,9 @@ GET opensearch_dashboards_sample_data_logs/_search
The values are returned with the key `key`.
`doc_count` specifies the number of documents in each bucket. By default, the buckets are sorted in descending order of `doc-count`.

It is possible to use `terms` to search for infrequent values by ordering returned values by ascending count ( `"order": {"count": "asc")` ). We strongly discourage this practice since doing so can cause large unknown errors if multiple shards are involved. We recommend using `rare_terms` instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some more clarification here:

The issue arises because a term that is globally infrequent might not appear as infrequent on every individual shard, or it might be entirely absent from the top (or, in this case, bottom) results returned by some shards. Conversely, a term might seem infrequent on one shard but be quite common on another. In both the scenarios, the rare term might be missed in individual shard results, which can lead to incorrect overall results.


A sibling aggregation must be a multi-bucket aggregation (have multiple grouped values for a certain field) and the metric must be a numeric value.

`min_bucket`, `max_bucket`, `sum_bucket`, and `avg_bucket` are common sibling aggregations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think OpenSearch has documentation to these sibling aggregations. Might be good to list the available sibling aggregations out in a tabular format with small description.

Same applies for parent aggregations as well.

This folder wil probably have all pipeline aggregators: https://github.com/opensearch-project/OpenSearch/tree/main/server/src/main/java/org/opensearch/search/aggregations/pipeline

For each aggregator builder classes, look for the NAME property, which should tell you that the corresponding name is one of the pipeline aggregation. Example: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/aggregations/pipeline/AvgBucketPipelineAggregationBuilder.java#L48 tells that avg_bucket is one of the pipeline aggregation.

If you think there is a chance to improve java docs on the same, please open up an issue in core listing out the aggregations, where you think we need to improve on java docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the plan is to include all parent and all sibling aggregations in detail like #9796, then probably a table of links might be a good follow-up instead of a tabular format.


The following parameters are optional:

- `gap_policy`: Real-world data can contain gaps or null values. You can specify the policy to deal with such missing data with the `gap_policy` property. You can either set the `gap_policy` property to `skip` to skip the missing data and continue from the next available value, or `insert_zeros` to replace the missing values with zero and continue running.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets just say this is used to handle data gaps and then link it to bottom section of the page where you talk about data gaps in detail.

It feels too much detail here when it is explained in more detail in another section.

@sandeshkr419
Copy link
Contributor

I see pipeline aggregations is also rewritten (which looks very neat now, kudos) in the same PR.
A minor suggestion is to make independent changes in independent PRs to avoid confusion, plus quicker to review small PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Tech review PR: Tech review in progress Content gap
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants