-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Fix cluster health logic and update corresponding integration test #20352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughOn cluster-health timeouts, the transport action computes the final observed state and, when indices were provided, strictly resolves them; if any requested index is missing it now fails with IndexNotFoundException (404). Tests and client tests were updated to expect these exceptions and a new timeout-missing-index test was added. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Client
participant TransportAction as TransportClusterHealthAction
participant ClusterState
participant IndexResolver
participant Response
Client->>TransportAction: _cluster/health request (maybe with indices)
TransportAction->>ClusterState: wait until condition or timeout
ClusterState-->>TransportAction: final observed state (on timeout)
alt indices specified
TransportAction->>IndexResolver: resolve indices (strict expand)
alt missing index
IndexResolver-->>TransportAction: throws IndexNotFoundException
TransportAction-->>Client: Error (404 IndexNotFoundException)
else all indices present
IndexResolver-->>TransportAction: resolved names
TransportAction-->>Client: ClusterHealthResponse (timed_out=true, final state)
end
else no indices specified
TransportAction-->>Client: ClusterHealthResponse (timed_out=true, final state)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing touches
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (5)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (1)
100-100: Consider breaking this long line for readability.The line is functional, but at ~140+ characters, it may benefit from line breaks similar to the pattern used elsewhere in the file (e.g., lines 87-96).
🔎 Suggested formatting
- ClusterHealthResponse healthResponse = client().admin().cluster().prepareHealth().setWaitForGreenStatus().setTimeout("10s").execute().actionGet(); + ClusterHealthResponse healthResponse = client().admin() + .cluster() + .prepareHealth() + .setWaitForGreenStatus() + .setTimeout("10s") + .execute() + .actionGet();
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.javaserver/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java (1)
libs/core/src/main/java/org/opensearch/core/common/util/CollectionUtils.java (1)
CollectionUtils(61-344)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: gradle-check
🔇 Additional comments (4)
server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java (1)
320-336: Logic correctly implements the PR objective for non-existent indices.The approach of checking index existence after timeout and returning
IndexNotFoundException(404) instead of a misleading timeout response is sound.One consideration:
IndicesOptions.strictExpand()is hardcoded rather than using the request's originalIndicesOptions. This means:
- Wildcard patterns matching no indices (e.g.,
_cluster/health/nonexistent-*) will also return 404- Requests with lenient options (allowing missing indices) will still get 404 on timeout
If this is intentional behavior, this is fine. Otherwise, consider checking against the request's
IndicesOptions:indexNameExpressionResolver.concreteIndexNames(finalState, request.indicesOptions(), request);server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (3)
49-49: LGTM!Import correctly added for the new
IndexNotFoundExceptionassertions.
87-97: Test correctly validates the new 404 behavior.The test properly verifies that requesting cluster health for a non-existent index now throws
IndexNotFoundExceptionwith the expected message format.
115-125: Test correctly covers the mixed indices scenario.This validates that when querying health for both an existing index (
test1) and a non-existent index (test2), the API returnsIndexNotFoundExceptionspecifically identifying the missing index.
|
❌ Gradle check result for ff4cc79: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for aa4abf0: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
aa4abf0 to
a816d3f
Compare
|
❌ Gradle check result for a816d3f: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
a816d3f to
1848ea3
Compare
|
❌ Gradle check result for 1848ea3: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
1848ea3 to
7fb4ca2
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #20352 +/- ##
============================================
- Coverage 73.27% 73.15% -0.12%
+ Complexity 71739 71665 -74
============================================
Files 5785 5785
Lines 328143 328150 +7
Branches 47270 47271 +1
============================================
- Hits 240445 240065 -380
- Misses 68397 68870 +473
+ Partials 19301 19215 -86 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
7fb4ca2 to
22e1595
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (1)
124-148: Potential test flakiness with timing assertion.The elapsed time assertion
greaterThanOrEqualTo(1500L)with a 2s timeout may be flaky in CI environments. On a system under load, thread scheduling delays could cause the test to start measuring time slightly before the actual timeout begins, or the exception handling could complete faster than expected.Consider using a more lenient threshold (e.g., 1000ms or 50% of timeout) to reduce flakiness while still verifying the timeout behavior:
🔎 Suggested adjustment
- // Verify that we actually waited for the timeout (should be close to 2 seconds) - // This confirms the onTimeout() callback was triggered - assertThat("Expected timeout to be triggered", elapsedTime, greaterThanOrEqualTo(1500L)); + // Verify that we actually waited for the timeout (should be close to 2 seconds) + // This confirms the onTimeout() callback was triggered. Use 50% threshold to avoid flakiness. + assertThat("Expected timeout to be triggered", elapsedTime, greaterThanOrEqualTo(1000L));
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.javaserver/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.javaserver/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java
🚧 Files skipped from review as they are similar to previous changes (2)
- server/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java
- client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: gradle-check
🔇 Additional comments (4)
server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (4)
49-62: LGTM!The new imports are correctly added and all are used in the test code below.
89-122: LGTM!The test correctly validates the new behavior where cluster health requests for non-existent indices throw
IndexNotFoundExceptionwith an appropriate message, instead of returning a misleading timeout with RED status. The test flow covers:
- Non-existent index before creation
- Cluster-wide health check (no specific index)
- Existing index after creation
- Mixed existing/non-existing indices
368-371: LGTM!Minor comment reformatting for improved readability.
384-404: LGTM!Comment reformatting improves readability without changing test logic.
|
❌ Gradle check result for 22e1595: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Prakash Bhardwaj <[email protected]>
22e1595 to
f8e7d67
Compare
|
❌ Gradle check result for f8e7d67: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
Fixed cluster health API to return 404 Not Found instead of misleading 408 Request Timeout with cluster RED status when we query non-existent indices.
Problem
When requesting cluster health for a non-existent index (e.g GET _cluster/health/fake-index-doesnt-exist), the API would:
Solution
TransportClusterHealthAction to check if indices exist after the timeout expires. If indices are still missing after the wait period:
Related Issues
Resolves #19022
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.