How to categorize this issue?
/area control-plane
/kind enhancement
/priority 3
What would you like to be added:
DWD today checks whether the % of expired node leases is above a configured threshold and, if so, scales down the configured dependent resources (today these are KCM, MCM and CA). What it lacks is the ability to distinguish whether the kubelet failed to renew its lease because of network problems or because the kubelet or the node itself is unhealthy. If the kubelet or the node is unhealthy, DWD should not scale down MCM and KCM, so that they can collaborate in replacing the unhealthy node.
Whether a Node is unhealthy is determined by looking at its node conditions. Some of these conditions are listed below:
- DiskPressure
- KernelDeadlock
- ReadonlyFilesystem
- KubeletUnavailable (set by node-problem-detector; the health checker plugin that sets this condition is currently not enabled in g/g)
The above list is not comprehensive and therefore this list should be made configurable.
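A minimal sketch of how a configurable list could drive the health check, written in Python for brevity. The function name, the `unhealthy_condition_types` parameter and the dict layout are illustrative assumptions, not part of DWD. The sketch also assumes that a listed condition signals a problem when its status is "True", as is the case for DiskPressure.

```python
from typing import Iterable, Mapping


def is_node_unhealthy(
    conditions: Iterable[Mapping[str, str]],
    unhealthy_condition_types: Iterable[str],
) -> bool:
    """Return True if any configured condition type is reported with status "True"."""
    configured = set(unhealthy_condition_types)
    return any(
        c.get("type") in configured and c.get("status") == "True"
        for c in conditions
    )


# Hypothetical configuration value; operators could extend this list.
configured_types = ["DiskPressure", "KernelDeadlock"]

node_conditions = [
    {"type": "Ready", "status": "Unknown"},
    {"type": "DiskPressure", "status": "True"},
]
print(is_node_unhealthy(node_conditions, configured_types))  # True
```

Because the condition types are passed in rather than hard-coded, adding a new condition would only require a configuration change.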
We therefore wish to introduce the following:
For all leases that are about to expire, check if the respective Node is present. If it is present, check the Conditions on the Node object. If the node conditions indicate an unhealthy node, that node should not be counted in the set of nodes which have expired leases.
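The counting rule described above could look roughly like the following sketch. All names are hypothetical, and nodes are modelled as a plain dict from node name to a list of conditions. The proposal does not say what to do when no Node exists for a lease; the sketch assumes such a lease is still counted.

```python
from typing import Dict, Iterable, List, Mapping, Set

Condition = Mapping[str, str]


def count_expiring_leases(
    expiring_lease_node_names: Iterable[str],
    nodes: Dict[str, List[Condition]],
    unhealthy_condition_types: Set[str],
) -> int:
    """Count expiring leases, skipping those whose Node reports an unhealthy condition."""
    counted = 0
    for name in expiring_lease_node_names:
        conditions = nodes.get(name)
        if conditions is None:
            # Assumption: a lease without a matching Node is still counted.
            counted += 1
            continue
        unhealthy = any(
            c.get("type") in unhealthy_condition_types and c.get("status") == "True"
            for c in conditions
        )
        if not unhealthy:
            counted += 1
    return counted


nodes = {
    "node-a": [{"type": "DiskPressure", "status": "True"}],
    "node-b": [{"type": "DiskPressure", "status": "False"}],
}
# node-a is skipped (unhealthy), node-b and the unknown node-c are counted.
print(count_expiring_leases(["node-a", "node-b", "node-c"], nodes, {"DiskPressure"}))  # 2
```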
Let us take an example to explain this better:
- Consider a total of 10 nodes in a shoot cluster.
- Threshold number of expired leases: 60% = 6 nodes (if 6 or more nodes have leases that are about to expire, DWD should consider scaling down the dependent resources).
- Let us assume that 7 nodes have leases that are about to expire.
- 2 out of these 7 nodes are unhealthy.
What happens today: Since 7 nodes have leases that are about to expire, all dependent resources are scaled down (this includes MCM and KCM). As a result, the 2 unhealthy nodes are not replaced by MCM.
What we wish to have: The 2 unhealthy nodes should be replaced by MCM, but the remaining 5 should not be replaced.
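The arithmetic of this example can be checked with a small sketch. The function and parameter names are hypothetical, and the comparison uses "at or above the threshold" as stated in the example. Under the proposed rule only 5 of the 7 expiring leases are counted, which is below the threshold of 6, so the dependent resources would stay up and MCM would remain available to replace the 2 unhealthy nodes.

```python
def should_scale_down(
    total_nodes: int,
    expiring_leases: int,
    unhealthy_among_expiring: int,
    threshold_percent: int,
) -> bool:
    """Return True if the counted expiring leases reach the threshold."""
    counted = expiring_leases - unhealthy_among_expiring
    # Integer comparison avoids rounding: counted / total >= percent / 100.
    return counted * 100 >= total_nodes * threshold_percent


# Today: unhealthy nodes are not excluded, so all 7 leases count.
print(should_scale_down(10, 7, 0, 60))  # True

# Proposed: the 2 unhealthy nodes are excluded, leaving 5 counted leases.
print(should_scale_down(10, 7, 2, 60))  # False
```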