Fix cluster health logic and update corresponding integration test #20352

ogprakash · 2026-01-01T12:45:20Z

Description

Fixed the cluster health API to return 404 Not Found, instead of a misleading 408 Request Timeout with cluster RED status, when querying non-existent indices.

Problem

When requesting cluster health for a non-existent index (e.g., GET _cluster/health/fake-index-doesnt-exist), the API would:

- Wait for the configured timeout period (default 30 seconds)
- Return a 408 response with "status": "red" and "timed_out": true
- Misleadingly suggest the cluster is unhealthy when the actual issue is that the index doesn't exist

Solution

Updated TransportClusterHealthAction to check whether the requested indices exist after the timeout expires. If any requested index is still missing after the wait period, the action returns IndexNotFoundException (404 status code) with a clear error message.
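
The timeout-path check described above can be sketched in plain Java. This is a toy model under stated assumptions, not the actual OpenSearch code: `MissingIndexException`, `TimedOutHealth`, and `onTimeout` are hypothetical stand-ins for `IndexNotFoundException`, `ClusterHealthResponse`, and the transport action's timeout callback, and the cluster state is reduced to a plain set of index names.

```java
import java.util.List;
import java.util.Set;

/** Toy model of the post-timeout index check; every name here is an illustrative stand-in. */
final class ClusterHealthTimeoutSketch {

    /** Stand-in for IndexNotFoundException, which surfaces as HTTP 404. */
    static final class MissingIndexException extends RuntimeException {
        MissingIndexException(String index) {
            super("no such index [" + index + "]");
        }
    }

    /** Stand-in for a health response that reports a genuine timeout. */
    static final class TimedOutHealth {
        final boolean timedOut = true;
        final List<String> indices;

        TimedOutHealth(List<String> indices) {
            this.indices = indices;
        }
    }

    /**
     * Called once the wait has expired. Throws if any requested index is absent
     * from the final observed state; otherwise reports an ordinary timeout.
     */
    static TimedOutHealth onTimeout(Set<String> indicesInFinalState, List<String> requestedIndices) {
        for (String index : requestedIndices) {
            if (!indicesInFinalState.contains(index)) {
                throw new MissingIndexException(index);
            }
        }
        return new TimedOutHealth(requestedIndices);
    }

    public static void main(String[] args) {
        Set<String> finalState = Set.of("test1");

        // All requested indices exist: the caller still sees a timed-out response.
        System.out.println(onTimeout(finalState, List.of("test1")).timedOut);

        // One requested index is missing: the caller sees a not-found error instead.
        try {
            onTimeout(finalState, List.of("test1", "test2"));
        } catch (MissingIndexException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model an empty `requestedIndices` list never throws, which mirrors the cluster-wide case where no index is specified and a plain timeout is still possible.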

Related Issues

Resolves #19022

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

Bug Fixes
- Cluster health now returns a clear "index not found" error when one or more queried indices are missing at timeout; genuine timeouts still occur only when all specified indices exist.
Tests
- Tests updated to expect explicit not-found errors for missing indices; client and routing tests adjusted accordingly.
- Added a test verifying the wait duration approaches the configured timeout when waiting for a missing index.

coderabbitai · 2026-01-01T12:45:42Z

Walkthrough

On cluster-health timeouts, the transport action computes the final observed state and, when indices were provided, strictly resolves them; if any requested index is missing it now fails with IndexNotFoundException (404). Tests and client tests were updated to expect these exceptions and a new timeout-missing-index test was added.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Transport timeout handling `server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java` | On timeout, compute final observed state and, when indices were provided, strictly resolve them; if any index is missing, fail with `IndexNotFoundException` (404) instead of returning a generic timed-out health response. |
| Server integration tests `server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java`, `server/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java` | Replace assertions that expected timed-out/RED responses for non-existent indices with expectations that `IndexNotFoundException` is thrown (messages containing `no such index [<name>]`); add `testHealthOnMissingIndexTimeout()` and adjust imports/comments. |
| High-level client tests `client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.java` | Update test to expect an exception path for non-existent-index cluster-health calls: assert `OpenSearchException` / NOT_FOUND and presence of `index_not_found_exception` in the message. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant TransportAction as TransportClusterHealthAction
    participant ClusterState
    participant IndexResolver
    participant Response

    Client->>TransportAction: _cluster/health request (maybe with indices)
    TransportAction->>ClusterState: wait until condition or timeout
    ClusterState-->>TransportAction: final observed state (on timeout)

    alt indices specified
        TransportAction->>IndexResolver: resolve indices (strict expand)
        alt missing index
            IndexResolver-->>TransportAction: throws IndexNotFoundException
            TransportAction-->>Client: Error (404 IndexNotFoundException)
        else all indices present
            IndexResolver-->>TransportAction: resolved names
            TransportAction-->>Client: ClusterHealthResponse (timed_out=true, final state)
        end
    else no indices specified
        TransportAction-->>Client: ClusterHealthResponse (timed_out=true, final state)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through shards and names to peep,
Found an empty hole where indexes sleep,
No louder red—just a crisp "not found" call,
I thumped my foot and laughed with a small hop-pawl,
404 tidy — I twitched and leapt, that's all.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly and specifically describes the main change: fixing cluster health logic and updating integration tests. |
| Description check | ✅ Passed | The PR description includes all required sections: problem statement, solution, related issues (#19022), and checklist confirmation that functionality includes testing. |
| Linked Issues check | ✅ Passed | Code changes align with issue #19022: cluster health now returns 404 IndexNotFoundException for non-existent indices instead of 408 timeout with red status. |
| Out of Scope Changes check | ✅ Passed | All changes focus on fixing cluster health behavior for non-existent indices and updating related tests; no unrelated modifications detected. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22e1595 and f8e7d67.

📒 Files selected for processing (4)

client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.java
server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java
server/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java
server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java

🚧 Files skipped from review as they are similar to previous changes (2)

client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.java
server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: gradle-check

🔇 Additional comments (5)

server/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java (2)

21-21: LGTM!

The new imports (IndexNotFoundException and containsString) are correctly added to support the updated test assertions that validate the new 404 behavior for non-existent indices.

Also applies to: 43-43

668-682: Excellent update to validate the 404 behavior for non-existent indices.

The test now correctly expects IndexNotFoundException when querying a non-existent index, aligning with the PR's objective to return HTTP 404 instead of HTTP 408 with RED status. The assertion on the exception message ensures the error clearly indicates the missing index.

server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (3)

49-49: LGTM!

The new imports (IndexNotFoundException, containsString, and greaterThanOrEqualTo) are correctly added to support the updated test assertions and the new timeout validation test.

Also applies to: 59-59, 61-61

87-122: Excellent comprehensive test coverage for the new 404 behavior.

The updated testHealth() method now validates multiple scenarios:

- Non-existent index "test1" → expects IndexNotFoundException
- Cluster-wide health (no index specified) → expects GREEN status
- Existing index "test1" → expects GREEN status
- Mix of existing "test1" and non-existent "test2" → expects IndexNotFoundException

This ensures the new behavior is consistent across different query patterns and properly distinguishes between cluster-level and index-specific health checks.

124-148: Excellent addition: validates timeout behavior for non-existent indices.

This new test specifically verifies the core fix described in the PR:

Confirms that when a timeout occurs while waiting for a non-existent index, the API returns IndexNotFoundException (404) instead of a timeout response with RED status (408)

The elapsed time check validates that the onTimeout() callback was actually triggered and properly returned the 404 exception

Good use of random index name to avoid test interference

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (1)
100-100: Consider breaking this long line for readability.

The line is functional, but at ~140+ characters, it may benefit from line breaks similar to the pattern used elsewhere in the file (e.g., lines 87-96).
🔎 Suggested formatting
-        ClusterHealthResponse healthResponse = client().admin().cluster().prepareHealth().setWaitForGreenStatus().setTimeout("10s").execute().actionGet();
+        ClusterHealthResponse healthResponse = client().admin()
+            .cluster()
+            .prepareHealth()
+            .setWaitForGreenStatus()
+            .setTimeout("10s")
+            .execute()
+            .actionGet();

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e9b5bd1 and ff4cc79.

📒 Files selected for processing (2)

server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java
server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java

🧰 Additional context used

🧬 Code graph analysis (1)

server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java (1)

libs/core/src/main/java/org/opensearch/core/common/util/CollectionUtils.java (1)

CollectionUtils (61-344)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: gradle-check

🔇 Additional comments (4)

server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java (1)
320-336: Logic correctly implements the PR objective for non-existent indices.

The approach of checking index existence after timeout and returning IndexNotFoundException (404) instead of a misleading timeout response is sound.

One consideration: IndicesOptions.strictExpand() is hardcoded rather than using the request's original IndicesOptions. This means:

Wildcard patterns matching no indices (e.g., _cluster/health/nonexistent-*) will also return 404

Requests with lenient options (allowing missing indices) will still get 404 on timeout

If this is intentional behavior, this is fine. Otherwise, consider checking against the request's IndicesOptions:
indexNameExpressionResolver.concreteIndexNames(finalState, request.indicesOptions(), request);
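
The difference between a hardcoded strict resolution and one that honors the caller's options can be illustrated with a toy resolver. This is a sketch under assumptions: `resolve` and its `ignoreUnavailable` flag are hypothetical and only loosely modeled on the idea of lenient index options. Wildcard expansion is deliberately left out, and nothing here reproduces the actual `IndexNameExpressionResolver` behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy resolver contrasting strict and lenient handling of missing index names. */
final class IndexResolutionSketch {

    /**
     * Resolves concrete names against the known indices.
     * With ignoreUnavailable=false (strict), a missing name is an error.
     * With ignoreUnavailable=true (lenient), a missing name is silently skipped.
     */
    static List<String> resolve(Set<String> existing, List<String> requested, boolean ignoreUnavailable) {
        List<String> resolved = new ArrayList<>();
        for (String name : requested) {
            if (existing.contains(name)) {
                resolved.add(name);
            } else if (!ignoreUnavailable) {
                throw new IllegalArgumentException("no such index [" + name + "]");
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Set<String> existing = Set.of("test1");
        List<String> requested = List.of("test1", "test2");

        // Lenient: the missing name is dropped and only the existing index is returned.
        System.out.println(resolve(existing, requested, true));

        // Strict: the same request fails, regardless of what the caller asked for.
        try {
            resolve(existing, requested, false);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the strict path is always taken on timeout, a caller who passed lenient options gets the error branch anyway, which is the trade-off the comment above asks the author to confirm.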
server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (3)

49-49: LGTM!

Import correctly added for the new IndexNotFoundException assertions.

87-97: Test correctly validates the new 404 behavior.

The test properly verifies that requesting cluster health for a non-existent index now throws IndexNotFoundException with the expected message format.

115-125: Test correctly covers the mixed indices scenario.

This validates that when querying health for both an existing index (test1) and a non-existent index (test2), the API returns IndexNotFoundException specifically identifying the missing index.

github-actions · 2026-01-01T12:56:19Z

❌ Gradle check result for ff4cc79: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2026-01-01T13:49:20Z

❌ Gradle check result for aa4abf0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2026-01-01T20:44:56Z

❌ Gradle check result for a816d3f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2026-01-02T08:19:57Z

❌ Gradle check result for 1848ea3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2026-01-02T09:59:07Z

✅ Gradle check result for 7fb4ca2: SUCCESS

codecov · 2026-01-02T09:59:30Z

Codecov Report

❌ Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.15%. Comparing base (e9b5bd1) to head (7fb4ca2).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...n/cluster/health/TransportClusterHealthAction.java | 0.00% | 8 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #20352      +/-   ##
============================================
- Coverage     73.27%   73.15%   -0.12%     
+ Complexity    71739    71665      -74     
============================================
  Files          5785     5785              
  Lines        328143   328150       +7     
  Branches      47270    47271       +1     
============================================
- Hits         240445   240065     -380     
- Misses        68397    68870     +473     
+ Partials      19301    19215      -86

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (1)
124-148: Potential test flakiness with timing assertion.

The elapsed time assertion greaterThanOrEqualTo(1500L) with a 2s timeout may be flaky in CI environments. On a system under load, thread scheduling delays could cause the test to start measuring time slightly before the actual timeout begins, or the exception handling could complete faster than expected.

Consider using a more lenient threshold (e.g., 1000ms or 50% of timeout) to reduce flakiness while still verifying the timeout behavior:
🔎 Suggested adjustment
-        // Verify that we actually waited for the timeout (should be close to 2 seconds)
-        // This confirms the onTimeout() callback was triggered
-        assertThat("Expected timeout to be triggered", elapsedTime, greaterThanOrEqualTo(1500L));
+        // Verify that we actually waited for the timeout (should be close to 2 seconds)
+        // This confirms the onTimeout() callback was triggered. Use 50% threshold to avoid flakiness.
+        assertThat("Expected timeout to be triggered", elapsedTime, greaterThanOrEqualTo(1000L));
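
The idea of asserting a lenient lower bound on elapsed time can be sketched outside the test framework. This is a minimal illustration, assuming a `CountDownLatch` that is never released as a stand-in for the health request's wait; `elapsedMillisWaiting` and `waitedAtLeast` are hypothetical helper names, and the 200 ms timeout is chosen only to keep the example fast.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Measures how long a bounded wait really took and checks it against a fraction of the timeout. */
final class TimeoutElapsedSketch {

    /** Blocks on a latch that is never released, so the wait ends only by timing out. */
    static long elapsedMillisWaiting(long timeoutMillis) {
        CountDownLatch neverReleased = new CountDownLatch(1);
        long start = System.nanoTime();
        try {
            boolean released = neverReleased.await(timeoutMillis, TimeUnit.MILLISECONDS);
            if (released) {
                throw new IllegalStateException("latch was released unexpectedly");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    /** True if the wait covered at least the given fraction of the configured timeout. */
    static boolean waitedAtLeast(long elapsedMillis, long timeoutMillis, double fraction) {
        return elapsedMillis >= (long) (timeoutMillis * fraction);
    }

    public static void main(String[] args) {
        long timeoutMillis = 200L;
        long elapsed = elapsedMillisWaiting(timeoutMillis);
        // A 50% threshold still proves the timeout path ran, with slack for timer granularity.
        System.out.println(waitedAtLeast(elapsed, timeoutMillis, 0.5));
    }
}
```

Using `System.nanoTime()` rather than wall-clock time keeps the measurement monotonic, so clock adjustments on a CI host cannot shorten the observed interval.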

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7fb4ca2 and 22e1595.

📒 Files selected for processing (3)

client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.java
server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java
server/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java

🚧 Files skipped from review as they are similar to previous changes (2)

server/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java
client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.java

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: gradle-check

🔇 Additional comments (4)

server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (4)

49-62: LGTM!

The new imports are correctly added and all are used in the test code below.

89-122: LGTM!

The test correctly validates the new behavior where cluster health requests for non-existent indices throw IndexNotFoundException with an appropriate message, instead of returning a misleading timeout with RED status. The test flow covers:

- Non-existent index before creation
- Cluster-wide health check (no specific index)
- Existing index after creation
- Mixed existing/non-existing indices

368-371: LGTM!

Minor comment reformatting for improved readability.

384-404: LGTM!

Comment reformatting improves readability without changing test logic.

github-actions · 2026-01-02T10:16:03Z

❌ Gradle check result for 22e1595: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Prakash Bhardwaj <[email protected]>

github-actions · 2026-01-02T13:10:57Z

❌ Gradle check result for f8e7d67: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

ogprakash requested a review from a team as a code owner January 1, 2026 12:45

github-actions bot added bug Something isn't working Other labels Jan 1, 2026

coderabbitai bot reviewed Jan 1, 2026

ogprakash force-pushed the cluster_health branch from aa4abf0 to a816d3f Compare January 1, 2026 20:39

ogprakash force-pushed the cluster_health branch from a816d3f to 1848ea3 Compare January 2, 2026 07:05

ogprakash force-pushed the cluster_health branch from 1848ea3 to 7fb4ca2 Compare January 2, 2026 08:33

ogprakash force-pushed the cluster_health branch from 7fb4ca2 to 22e1595 Compare January 2, 2026 10:05

coderabbitai bot reviewed Jan 2, 2026

Fix cluster health logic and update corresponding integration test

f8e7d67

Signed-off-by: Prakash Bhardwaj <[email protected]>

ogprakash force-pushed the cluster_health branch from 22e1595 to f8e7d67 Compare January 2, 2026 11:58
