@ogprakash ogprakash commented Jan 1, 2026

Description

Fixed the cluster health API to return 404 Not Found when the request names non-existent indices, instead of a misleading 408 Request Timeout with cluster status RED.

Problem

When requesting cluster health for a non-existent index (e.g., GET _cluster/health/fake-index-doesnt-exist), the API would:

  1. Wait for the configured timeout period (default 30 seconds)
  2. Return a 408 response with "status": "red" and "timed_out": true
  3. Misleadingly suggest the cluster is unhealthy when the actual issue is that the index doesn't exist

Solution

TransportClusterHealthAction now checks whether the requested indices exist once the timeout expires. If any index is still missing after the wait period, the action:

  • Returns IndexNotFoundException (404 status code) with a clear error message
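
The decision described above can be sketched roughly as follows. This is a simplified model, not the code in this PR: `MissingIndexException`, `HealthTimeoutSketch`, `TimedOutHealth`, and `onTimeout(...)` are hypothetical stand-ins, and index resolution is reduced to a plain set lookup rather than OpenSearch's real resolver.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for IndexNotFoundException (reported as HTTP 404).
class MissingIndexException extends RuntimeException {
    MissingIndexException(String index) {
        super("no such index [" + index + "]");
    }
}

final class HealthTimeoutSketch {

    // Hypothetical, heavily reduced health result; the real response carries far more fields.
    static final class TimedOutHealth {
        final List<String> indices;
        final boolean timedOut = true;

        TimedOutHealth(List<String> indices) {
            this.indices = indices;
        }
    }

    // Called once the wait has timed out: fail fast if any requested index is absent
    // from the final observed state, otherwise report an ordinary timed-out result.
    static TimedOutHealth onTimeout(Set<String> existingIndices, List<String> requested) {
        for (String name : requested) {
            if (!existingIndices.contains(name)) {
                throw new MissingIndexException(name);
            }
        }
        return new TimedOutHealth(new ArrayList<>(requested));
    }

    public static void main(String[] args) {
        Set<String> existing = Set.of("test1");

        // All requested indices exist: a genuine timeout is still reported.
        System.out.println(onTimeout(existing, List.of("test1")).timedOut);

        // One requested index is missing: the not-found error wins over the timeout.
        try {
            onTimeout(existing, List.of("test1", "test2"));
        } catch (MissingIndexException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A request with no indices passes an empty list and always takes the timed-out path, which matches the cluster-wide case.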

Related Issues

Resolves #19022

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Bug Fixes
    • Cluster health now returns a clear "index not found" error when one or more queried indices are missing at timeout; genuine timeouts still occur only when all specified indices exist.
  • Tests
    • Tests updated to expect explicit not-found errors for missing indices; client and routing tests adjusted accordingly.
    • Added a test verifying the wait duration approaches the configured timeout when waiting for a missing index.

@ogprakash ogprakash requested a review from a team as a code owner January 1, 2026 12:45
@github-actions github-actions bot added the bug (Something isn't working) and Other labels Jan 1, 2026
coderabbitai bot commented Jan 1, 2026

Walkthrough

On cluster-health timeouts, the transport action computes the final observed state and, when indices were provided, strictly resolves them; if any requested index is missing it now fails with IndexNotFoundException (404). Tests and client tests were updated to expect these exceptions and a new timeout-missing-index test was added.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Transport timeout handling: server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java | On timeout, compute final observed state and, when indices were provided, strictly resolve them; if any index is missing, fail with IndexNotFoundException (404) instead of returning a generic timed-out health response. |
| Server integration tests: server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java, server/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java | Replace assertions that expected timed-out/RED responses for non-existent indices with expectations that IndexNotFoundException is thrown (messages containing no such index [<name>]); add testHealthOnMissingIndexTimeout() and adjust imports/comments. |
| High-level client tests: client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.java | Update test to expect an exception path for non-existent-index cluster-health calls: assert OpenSearchException / NOT_FOUND and presence of index_not_found_exception in the message. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant TransportAction as TransportClusterHealthAction
    participant ClusterState
    participant IndexResolver
    participant Response

    Client->>TransportAction: _cluster/health request (maybe with indices)
    TransportAction->>ClusterState: wait until condition or timeout
    ClusterState-->>TransportAction: final observed state (on timeout)

    alt indices specified
        TransportAction->>IndexResolver: resolve indices (strict expand)
        alt missing index
            IndexResolver-->>TransportAction: throws IndexNotFoundException
            TransportAction-->>Client: Error (404 IndexNotFoundException)
        else all indices present
            IndexResolver-->>TransportAction: resolved names
            TransportAction-->>Client: ClusterHealthResponse (timed_out=true, final state)
        end
    else no indices specified
        TransportAction-->>Client: ClusterHealthResponse (timed_out=true, final state)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through shards and names to peep,
Found an empty hole where indexes sleep,
No louder red—just a crisp "not found" call,
I thumped my foot and laughed with a small hop-pawl,
404 tidy — I twitched and leapt, that's all.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and specifically describes the main change: fixing cluster health logic and updating integration tests. |
| Description check | ✅ Passed | The PR description includes all required sections: problem statement, solution, related issues (#19022), and checklist confirmation that functionality includes testing. |
| Linked Issues check | ✅ Passed | Code changes align with issue #19022: cluster health now returns 404 IndexNotFoundException for non-existent indices instead of 408 timeout with red status. |
| Out of Scope Changes check | ✅ Passed | All changes focus on fixing cluster health behavior for non-existent indices and updating related tests; no unrelated modifications detected. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22e1595 and f8e7d67.

📒 Files selected for processing (4)
  • client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.java
  • server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java
  • server/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java
  • server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java
🚧 Files skipped from review as they are similar to previous changes (2)
  • client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.java
  • server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: gradle-check
🔇 Additional comments (5)
server/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java (2)

21-21: LGTM!

The new imports (IndexNotFoundException and containsString) are correctly added to support the updated test assertions that validate the new 404 behavior for non-existent indices.

Also applies to: 43-43


668-682: Excellent update to validate the 404 behavior for non-existent indices.

The test now correctly expects IndexNotFoundException when querying a non-existent index, aligning with the PR's objective to return HTTP 404 instead of HTTP 408 with RED status. The assertion on the exception message ensures the error clearly indicates the missing index.

server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (3)

49-49: LGTM!

The new imports (IndexNotFoundException, containsString, and greaterThanOrEqualTo) are correctly added to support the updated test assertions and the new timeout validation test.

Also applies to: 59-59, 61-61


87-122: Excellent comprehensive test coverage for the new 404 behavior.

The updated testHealth() method now validates multiple scenarios:

  1. Non-existent index "test1" → expects IndexNotFoundException
  2. Cluster-wide health (no index specified) → expects GREEN status
  3. Existing index "test1" → expects GREEN status
  4. Mix of existing "test1" and non-existent "test2" → expects IndexNotFoundException

This ensures the new behavior is consistent across different query patterns and properly distinguishes between cluster-level and index-specific health checks.


124-148: Excellent addition: validates timeout behavior for non-existent indices.

This new test specifically verifies the core fix described in the PR:

  • Confirms that when a timeout occurs while waiting for a non-existent index, the API returns IndexNotFoundException (404) instead of a timeout response with RED status (408)
  • The elapsed time check validates that the onTimeout() callback was actually triggered and properly returned the 404 exception
  • Good use of random index name to avoid test interference


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (1)

100-100: Consider breaking this long line for readability.

The line is functional, but at ~140+ characters, it may benefit from line breaks similar to the pattern used elsewhere in the file (e.g., lines 87-96).

🔎 Suggested formatting
-        ClusterHealthResponse healthResponse = client().admin().cluster().prepareHealth().setWaitForGreenStatus().setTimeout("10s").execute().actionGet();
+        ClusterHealthResponse healthResponse = client().admin()
+            .cluster()
+            .prepareHealth()
+            .setWaitForGreenStatus()
+            .setTimeout("10s")
+            .execute()
+            .actionGet();
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e9b5bd1 and ff4cc79.

📒 Files selected for processing (2)
  • server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java
  • server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java (1)
libs/core/src/main/java/org/opensearch/core/common/util/CollectionUtils.java (1)
  • CollectionUtils (61-344)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: gradle-check
🔇 Additional comments (4)
server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java (1)

320-336: Logic correctly implements the PR objective for non-existent indices.

The approach of checking index existence after timeout and returning IndexNotFoundException (404) instead of a misleading timeout response is sound.

One consideration: IndicesOptions.strictExpand() is hardcoded rather than using the request's original IndicesOptions. This means:

  • Wildcard patterns matching no indices (e.g., _cluster/health/nonexistent-*) will also return 404
  • Requests with lenient options (allowing missing indices) will still get 404 on timeout

If this is intentional behavior, this is fine. Otherwise, consider checking against the request's IndicesOptions:

indexNameExpressionResolver.concreteIndexNames(finalState, request.indicesOptions(), request);
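
To illustrate why the choice of options matters, here is a toy model of strict versus lenient resolution. It is only a sketch under stated assumptions: `ResolveOptionsSketch`, `resolve(...)`, and the `ignoreUnavailable` flag are hypothetical, the model handles concrete names only (no wildcards), and it does not reproduce the actual semantics of IndicesOptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class ResolveOptionsSketch {

    // Toy resolver over concrete index names (hypothetical; no wildcard expansion).
    // With ignoreUnavailable=false a missing name is an error (strict behaviour);
    // with ignoreUnavailable=true a missing name is silently dropped (lenient behaviour).
    static List<String> resolve(Set<String> existing, List<String> requested, boolean ignoreUnavailable) {
        List<String> resolved = new ArrayList<>();
        for (String name : requested) {
            if (existing.contains(name)) {
                resolved.add(name);
            } else if (!ignoreUnavailable) {
                throw new IllegalArgumentException("no such index [" + name + "]");
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Set<String> existing = Set.of("test1");
        List<String> requested = List.of("test1", "test2");

        // Lenient: the missing index is skipped and the remaining names are returned.
        System.out.println(resolve(existing, requested, true));

        // Strict: the same request fails on the first missing index.
        try {
            resolve(existing, requested, false);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, hardcoding the strict path means a caller who asked for lenient handling would still see an error on timeout, which is the concern raised above.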
server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (3)

49-49: LGTM!

Import correctly added for the new IndexNotFoundException assertions.


87-97: Test correctly validates the new 404 behavior.

The test properly verifies that requesting cluster health for a non-existent index now throws IndexNotFoundException with the expected message format.


115-125: Test correctly covers the mixed indices scenario.

This validates that when querying health for both an existing index (test1) and a non-existent index (test2), the API returns IndexNotFoundException specifically identifying the missing index.

github-actions bot commented Jan 1, 2026

❌ Gradle check result for ff4cc79: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented Jan 1, 2026

❌ Gradle check result for aa4abf0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented Jan 1, 2026

❌ Gradle check result for a816d3f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented Jan 2, 2026

❌ Gradle check result for 1848ea3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented Jan 2, 2026

✅ Gradle check result for 7fb4ca2: SUCCESS

codecov bot commented Jan 2, 2026

Codecov Report

❌ Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.15%. Comparing base (e9b5bd1) to head (7fb4ca2).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...n/cluster/health/TransportClusterHealthAction.java | 0.00% | 8 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20352      +/-   ##
============================================
- Coverage     73.27%   73.15%   -0.12%     
+ Complexity    71739    71665      -74     
============================================
  Files          5785     5785              
  Lines        328143   328150       +7     
  Branches      47270    47271       +1     
============================================
- Hits         240445   240065     -380     
- Misses        68397    68870     +473     
+ Partials      19301    19215      -86     

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (1)

124-148: Potential test flakiness with timing assertion.

The elapsed time assertion greaterThanOrEqualTo(1500L) with a 2s timeout may be flaky in CI environments. On a system under load, thread scheduling delays could cause the test to start measuring time slightly before the actual timeout begins, or the exception handling could complete faster than expected.

Consider using a more lenient threshold (e.g., 1000ms or 50% of timeout) to reduce flakiness while still verifying the timeout behavior:

🔎 Suggested adjustment
-        // Verify that we actually waited for the timeout (should be close to 2 seconds)
-        // This confirms the onTimeout() callback was triggered
-        assertThat("Expected timeout to be triggered", elapsedTime, greaterThanOrEqualTo(1500L));
+        // Verify that we actually waited for the timeout (should be close to 2 seconds)
+        // This confirms the onTimeout() callback was triggered. Use 50% threshold to avoid flakiness.
+        assertThat("Expected timeout to be triggered", elapsedTime, greaterThanOrEqualTo(1000L));
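
The threshold arithmetic behind this suggestion can be sketched as below. The names `TimeoutThresholdSketch`, `minExpectedElapsedMillis`, and `elapsedMillis` are hypothetical helpers for illustration and are not part of the test file; the idea is simply to derive the lower bound from the configured timeout rather than hardcoding it.

```java
import java.util.concurrent.TimeUnit;

final class TimeoutThresholdSketch {

    // Lower bound on elapsed time, expressed as a fraction of the configured timeout.
    static long minExpectedElapsedMillis(long timeoutMillis, double fraction) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        return (long) (timeoutMillis * fraction);
    }

    // Measures wall-clock time of an action using the monotonic clock.
    static long elapsedMillis(Runnable action) {
        long start = System.nanoTime();
        action.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        long timeoutMillis = 2000L;

        // 75% of a 2s timeout corresponds to the original 1500 ms bound;
        // 50% corresponds to the more lenient 1000 ms bound suggested above.
        System.out.println(minExpectedElapsedMillis(timeoutMillis, 0.75));
        System.out.println(minExpectedElapsedMillis(timeoutMillis, 0.5));

        // Timing a no-op just demonstrates the measurement helper.
        System.out.println(elapsedMillis(() -> {}) >= 0);
    }
}
```

Deriving the bound from the timeout keeps the assertion valid if the timeout in the test is later changed.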
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7fb4ca2 and 22e1595.

📒 Files selected for processing (3)
  • client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.java
  • server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java
  • server/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java
🚧 Files skipped from review as they are similar to previous changes (2)
  • server/src/internalClusterTest/java/org/opensearch/cluster/routing/WeightedRoutingIT.java
  • client/rest-high-level/src/test/java/org/opensearch/client/ClusterClientIT.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: gradle-check
🔇 Additional comments (4)
server/src/internalClusterTest/java/org/opensearch/cluster/ClusterHealthIT.java (4)

49-62: LGTM!

The new imports are correctly added and all are used in the test code below.


89-122: LGTM!

The test correctly validates the new behavior where cluster health requests for non-existent indices throw IndexNotFoundException with an appropriate message, instead of returning a misleading timeout with RED status. The test flow covers:

  1. Non-existent index before creation
  2. Cluster-wide health check (no specific index)
  3. Existing index after creation
  4. Mixed existing/non-existing indices

368-371: LGTM!

Minor comment reformatting for improved readability.


384-404: LGTM!

Comment reformatting improves readability without changing test logic.

github-actions bot commented Jan 2, 2026

❌ Gradle check result for 22e1595: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented Jan 2, 2026

❌ Gradle check result for f8e7d67: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Labels

bug (Something isn't working), Other

Development

Successfully merging this pull request may close these issues.

[BUG] 408 response with incorrect information returned from GET _cluster/health/<index> given nonexistent index

1 participant