Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove node IP from excluded list when draining fails #423

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

A-Kamaee
Copy link
Contributor

@A-Kamaee A-Kamaee commented Jun 11, 2024

In case of unsuccessful drain of a node we remove the node IP from the elastic search excluded IPs.

Issue : #128 #404

Description

A few sentences describing the overall goals of the pull request's
commits.

Types of Changes

What types of changes does your code introduce? Keep the ones that apply:

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Configuration change
  • Refactor/improvements
  • Documentation / non-code

Tasks

List of tasks you will do to complete the PR

  • Created Task 1
  • Created Task 2
  • To-do Task 3

Review

List of tasks the reviewer must do to review the PR

  • Tests
  • Documentation
  • CHANGELOG

Deployment Notes

These should highlight any db migrations, feature toggles, etc.

@girishc13
Copy link

The draining annotation needs to be removed otherwise the next run will start draining the pod again.

@juhovuori
Copy link

This fixes the clean variant of the described bug. However, we are still left with dirty exclude list in a messy case, when the process just crashes during the draining loop. It's a fairly realistic scenario as the draining loop potentially runs for hours.

@otrosien
Copy link
Member

Would it make sense to have es-operator do log a warning and kill of the node in case each of the shards have another copy outside this node?

…e elastic search excluded IPs.

Signed-off-by: Abouzar Kamaee <[email protected]>
@A-Kamaee A-Kamaee force-pushed the eviction-error-cleanup branch from a3aa704 to ea7ab5f Compare June 12, 2024 13:00
@A-Kamaee
Copy link
Contributor Author

Would it make sense to have es-operator do log a warning and kill of the node in case each of the shards have another copy outside this node?

I don't think it's a good idea. There's a reason Elasticsearch can't empty the node. If es-operator abruptly kills the node, even if there are shards available in other nodes, the cluster might end up in an unknown state. WDYT?

@A-Kamaee
Copy link
Contributor Author

This fixes the clean variant of the described bug. However, we are still left with dirty exclude list in a messy case, when the process just crashes during the draining loop. It's a fairly realistic scenario as the draining loop potentially runs for hours.

Can you elaborate on the specific scenario you have in mind? Please also consider that in the event of a context Done, we will clean up the exclude list.

@A-Kamaee A-Kamaee force-pushed the eviction-error-cleanup branch 2 times, most recently from 067b299 to 0367cbf Compare June 12, 2024 16:29
…y and add tests for ESClient.excludePodIP function

Signed-off-by: Abouzar Kamaee <[email protected]>
@A-Kamaee A-Kamaee force-pushed the eviction-error-cleanup branch from 0367cbf to b1eb694 Compare June 12, 2024 16:30
@juhovuori
Copy link

This fixes the clean variant of the described bug. However, we are still left with dirty exclude list in a messy case, when the process just crashes during the draining loop. It's a fairly realistic scenario as the draining loop potentially runs for hours.

Actually, I just learned this is not an issue in es-operator exactly due to the mentioned annotation. Ignore my previous comment, the PR looks good IMO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants