Add rare_terms bucket aggregation #9826

dwelsch-esi · 2025-05-06T16:18:03Z

Description

Add rare_terms bucket aggregation.

Issues Resolved

Version

Frontend features

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ve aggregations/pipeline-agg.md; Add aggregations/pipeline/index.md. Individual aggregation files added in other PRs. Signed-off-by: Dave Welsch <[email protected]>

Signed-off-by: Dave Welsch <[email protected]>

github-actions · 2025-05-06T16:18:12Z

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

kolchfa-aws · 2025-05-06T20:24:53Z

@sandeshkr419 Could you review this PR? Thanks!

sandeshkr419 · 2025-05-16T17:38:01Z

_aggregations/bucket/terms.md

@@ -59,6 +59,9 @@ GET opensearch_dashboards_sample_data_logs/_search
 The values are returned with the key `key`.
 `doc_count` specifies the number of documents in each bucket. By default, the buckets are sorted in descending order of `doc-count`.

+It is possible to use `terms` to search for infrequent values by ordering returned values by ascending count ( `"order": {"count": "asc")` ). We strongly discourage this practice since doing so can cause large unknown errors if multiple shards are involved. We recommend using `rare_terms` instead. 


Let's add some more clarification here:

The issue arises because a term that is globally infrequent might not appear as infrequent on every individual shard, or it might be entirely absent from the top (or, in this case, bottom) results returned by some shards. Conversely, a term might seem infrequent on one shard but be quite common on another. In both scenarios, the rare term might be missed in individual shard results, which can lead to incorrect overall results.
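To make the failure mode concrete, here is a toy model of the scenario described above. It is only a sketch: the shard contents, the `shard_size` value, and the helper names (`bottom_n`, `merge_shard_results`, `rarest`) are made up for illustration, and the merge step is a simplification, not OpenSearch's actual reduce logic. Each shard returns only its own least frequent terms, and the coordinating step adds up whatever it receives.

```python
from collections import Counter


def bottom_n(counts, n):
    """Return the n least frequent terms on one shard (ties broken by term)."""
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return dict(ordered[:n])


def merge_shard_results(shard_counts, shard_size):
    """Sum the per-shard bottom-n lists, as a simplified coordinating step."""
    merged = Counter()
    for counts in shard_counts:
        merged.update(bottom_n(counts, shard_size))
    return merged


def rarest(counts):
    """Return the term with the lowest count (ties broken by term)."""
    return min(counts, key=lambda term: (counts[term], term))


# Hypothetical per-shard term counts (assumed data, not from a real index).
shards = [
    Counter({"a": 1, "b": 50, "c": 2, "d": 40}),
    Counter({"a": 90, "b": 1, "c": 3, "d": 40}),
]

true_counts = sum(shards, Counter())
approx_counts = merge_shard_results(shards, shard_size=2)

print("true:  ", dict(true_counts), "rarest:", rarest(true_counts))
print("merged:", dict(approx_counts), "rarest:", rarest(approx_counts))
```

In this toy data, term `a` looks rare on the first shard but is common on the second, so the merged view reports it with a count of 1 even though its true count is 91. The term that is actually rarest overall, `c`, ends up ranked behind it.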

sandeshkr419 · 2025-05-16T17:48:49Z

_aggregations/pipeline/index.md

+
+A sibling aggregation must be a multi-bucket aggregation (have multiple grouped values for a certain field) and the metric must be a numeric value.
+
+`min_bucket`, `max_bucket`, `sum_bucket`, and `avg_bucket` are common sibling aggregations.


I don't think OpenSearch has documentation for these sibling aggregations. It might be good to list the available sibling aggregations in a table with a short description of each.

Same applies for parent aggregations as well.

This folder will probably have all pipeline aggregators: https://github.com/opensearch-project/OpenSearch/tree/main/server/src/main/java/org/opensearch/search/aggregations/pipeline

For each aggregator builder class, look for the NAME property, which gives the name of the corresponding pipeline aggregation. Example: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/aggregations/pipeline/AvgBucketPipelineAggregationBuilder.java#L48 shows that avg_bucket is one of the pipeline aggregations.

If you think there is a chance to improve the Javadocs for these, please open an issue in core listing the aggregations where you think the Javadocs need improvement.

If the plan is to document all parent and all sibling aggregations in detail, like #9796, then a table of links might be a better follow-up than a table of descriptions.
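For reference while reviewing the sibling aggregation wording in the diff above, here is a rough sketch of what a sibling aggregation such as `avg_bucket` does: it reads one numeric metric from every bucket of a multi-bucket aggregation and reduces those values to a single number. The bucket layout, the `total_bytes` metric name, and the Python `avg_bucket` helper are assumptions made for illustration; this is not the code behind `AvgBucketPipelineAggregationBuilder`.

```python
def avg_bucket(buckets, metric_name):
    """Average one numeric metric across the buckets of a multi-bucket aggregation.

    Buckets whose metric value is missing (None) are left out of the average.
    """
    values = [
        bucket[metric_name]["value"]
        for bucket in buckets
        if bucket[metric_name]["value"] is not None
    ]
    return sum(values) / len(values) if values else None


# Hypothetical buckets, shaped like a date_histogram with a nested sum metric.
monthly_buckets = [
    {"key_as_string": "2025-01", "doc_count": 3, "total_bytes": {"value": 300.0}},
    {"key_as_string": "2025-02", "doc_count": 2, "total_bytes": {"value": 200.0}},
    {"key_as_string": "2025-03", "doc_count": 4, "total_bytes": {"value": 400.0}},
]

result = avg_bucket(monthly_buckets, "total_bytes")
print(result)
```

The same shape applies to `min_bucket`, `max_bucket`, and `sum_bucket` if the averaging step is swapped for `min`, `max`, or `sum`.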

sandeshkr419 · 2025-05-16T17:58:45Z

_aggregations/pipeline/index.md

+
+The following parameters are optional:
+
+- `gap_policy`: Real-world data can contain gaps or null values. You can specify the policy to deal with such missing data with the `gap_policy` property. You can either set the `gap_policy` property to `skip` to skip the missing data and continue from the next available value, or `insert_zeros` to replace the missing values with zero and continue running.


Let's just say this is used to handle data gaps, and then link to the section at the bottom of the page where data gaps are covered in detail.

It feels like too much detail here when it is explained more fully in another section.
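For whichever section ends up carrying the detail, the difference between the two policies can be shown with a small sketch. This is a simplified model under stated assumptions: the `apply_gap_policy` and `differences` helpers and the sample series are hypothetical, and the bucket-to-bucket difference only stands in for a pipeline computation such as a derivative. It does not reproduce how OpenSearch handles gaps internally.

```python
def apply_gap_policy(values, gap_policy="skip"):
    """Resolve missing values (None) before a pipeline-style computation."""
    if gap_policy == "skip":
        # Drop the gap and continue from the next available value.
        return [v for v in values if v is not None]
    if gap_policy == "insert_zeros":
        # Treat the gap as a zero and keep its position in the series.
        return [0 if v is None else v for v in values]
    raise ValueError(f"unsupported gap_policy: {gap_policy!r}")


def differences(values):
    """Difference between each value and the one before it."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


# Hypothetical per-bucket metric values with one gap.
series = [10, None, 30, 40]

skipped = differences(apply_gap_policy(series, "skip"))
zero_filled = differences(apply_gap_policy(series, "insert_zeros"))

print("skip:        ", skipped)
print("insert_zeros:", zero_filled)
```

With `skip`, the gap disappears and the series yields `[20, 10]`. With `insert_zeros`, the gap becomes a real data point, so the result `[-10, 30, 10]` shows an artificial drop and rebound around the missing bucket.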

sandeshkr419 · 2025-05-16T18:03:01Z

I see that the pipeline aggregations page is also rewritten in this PR (it looks very neat now, kudos).
A minor suggestion: make independent changes in independent PRs to avoid confusion. Small PRs are also quicker to review.

dwelsch-esi and others added 12 commits April 10, 2025 16:00

Refactor pipeline aggregations to match other aggregation types. Remo…

8bb98c1

…ve aggregations/pipeline-agg.md; Add aggregations/pipeline/index.md. Individual aggregation files added in other PRs. Signed-off-by: Dave Welsch <[email protected]>

Merge branch 'opensearch-project:main' into main

2d0fafa

Merge branch 'main' of github.com:dwelsch-esi/opensearch-doc-website

6116325

Merge branch 'main' of github.com:dwelsch-esi/opensearch-doc-website

1de3636

Merge branch 'main' of github.com:dwelsch-esi/opensearch-doc-website

39fc601

Merge branch 'main' of github.com:dwelsch-esi/opensearch-doc-website

1000a4a

Merge branch 'main' of github.com:dwelsch-esi/opensearch-doc-website

afd39fc

Merge branch 'main' of github.com:dwelsch-esi/opensearch-doc-website

64732ce

Merge branch 'main' of github.com:dwelsch-esi/opensearch-doc-website

9fb2937

Merge branch 'main' of github.com:dwelsch-esi/opensearch-doc-website

a4cbc7e

Merge branch 'main' of github.com:dwelsch-esi/opensearch-doc-website

db06782

Add rare_terms bucket aggregation.

e4f9597

Signed-off-by: Dave Welsch <[email protected]>

dwelsch-esi requested review from kolchfa-aws, Naarcha-AWS, AMoo-Miki, natebower, dlvenable and epugh as code owners May 6, 2025 16:18

github-actions bot assigned kolchfa-aws May 6, 2025

kolchfa-aws added 3 - Tech review PR: Tech review in progress Content gap labels May 6, 2025

sandeshkr419 added this to Performance Roadmap May 12, 2025

github-project-automation bot moved this to Todo in Performance Roadmap May 12, 2025

sandeshkr419 self-assigned this May 12, 2025

sandeshkr419 moved this from Todo to In Progress in Performance Roadmap May 12, 2025

sandeshkr419 reviewed May 16, 2025


sandeshkr419 mentioned this pull request May 16, 2025

Add moving_fn pipeline aggregation #9796

Open

1 task
