[Enhancement] Consider node conditions apart from node leases to take more informed decisions for scale down #110

@unmarshall

How to categorize this issue?

/area control-plane
/kind enhancement
/priority 3

What would you like to be added:
Today, DWD checks whether the percentage of expired node leases is at or above a configured threshold and, if so, scales down the configured dependent resources (currently KCM, MCM and CA). What it lacks is the ability to distinguish whether the kubelet failed to renew its lease because of network problems or because the kubelet or the node itself is unhealthy. If the kubelet or the node is unhealthy, DWD should not scale down MCM and KCM, so that they can collaborate in replacing the unhealthy node.

Whether a node is unhealthy is determined by looking at its node conditions. Some of the relevant conditions are:

  • DiskPressure
  • KernelDeadlock
  • ReadOnlyFileSystem
  • KubeletUnavailable (set by node-problem-detector; the health checker plugin that sets this condition is currently not enabled in g/g)

The above list is not comprehensive, so the set of conditions should be made configurable.
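A minimal sketch of what a configurable condition list could look like. This is not DWD's actual code: `NodeCondition` is a simplified stand-in for the Kubernetes node condition, and `UnhealthyConditionTypes`, `NewUnhealthyConditionTypes` and `IsUnhealthy` are hypothetical names used only for illustration. It assumes a listed condition marks the node as unhealthy when its status is "True".

```go
package main

// NodeCondition is a simplified stand-in for a Kubernetes node condition:
// a condition type plus a status of "True", "False" or "Unknown".
type NodeCondition struct {
	Type   string
	Status string
}

// UnhealthyConditionTypes is a hypothetical, configurable set of condition
// types that should be treated as indicating an unhealthy node.
type UnhealthyConditionTypes map[string]struct{}

// NewUnhealthyConditionTypes builds the set from a configured list of names.
func NewUnhealthyConditionTypes(types ...string) UnhealthyConditionTypes {
	set := make(UnhealthyConditionTypes, len(types))
	for _, t := range types {
		set[t] = struct{}{}
	}
	return set
}

// IsUnhealthy reports whether any configured condition type is present on
// the node with status "True" (assumption: "True" means the problem exists).
func (u UnhealthyConditionTypes) IsUnhealthy(conds []NodeCondition) bool {
	for _, c := range conds {
		if _, configured := u[c.Type]; configured && c.Status == "True" {
			return true
		}
	}
	return false
}
```

With the four conditions listed above configured, a node reporting `DiskPressure` with status "True" would be classified as unhealthy, while a condition outside the configured set would be ignored.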

We therefore wish to introduce the following:
For all leases that are about to expire, check if the respective Node is present. If it is present, check the Conditions on the Node object. If the node conditions indicate an unhealthy node, that node should not be counted in the set of nodes which have expired leases.
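The proposed check can be sketched roughly as follows. All names here (`Lease`, `Node`, `CountExpiringLeases`) are hypothetical and the types are simplified; real code would work with the Kubernetes Lease and Node objects. The issue does not say how to treat an expiring lease whose Node is absent, so this sketch assumes such a lease is still counted, as it is today.

```go
package main

// Lease is a simplified view of a node lease. AboutToExpire is assumed to be
// computed elsewhere; the issue does not define the exact expiry criterion.
type Lease struct {
	NodeName      string
	AboutToExpire bool
}

// Node is a simplified view of a Node object: condition type -> status.
type Node struct {
	Name       string
	Conditions map[string]string
}

// CountExpiringLeases counts leases that are about to expire, skipping those
// whose Node is present and classified as unhealthy by isUnhealthy.
func CountExpiringLeases(leases []Lease, nodes map[string]Node, isUnhealthy func(Node) bool) int {
	count := 0
	for _, lease := range leases {
		if !lease.AboutToExpire {
			continue
		}
		node, present := nodes[lease.NodeName]
		if present && isUnhealthy(node) {
			// Unhealthy node: leave it to MCM and KCM, do not count it.
			continue
		}
		// Assumption: a lease without a Node object is counted as before.
		count++
	}
	return count
}
```

Passing the health check as a function keeps the counting logic independent of which node conditions are configured as unhealthy.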

Let us take an example to explain this better:

  • Consider a total of 10 nodes in a shoot cluster.
  • Threshold number of expired leases: 60% = 6 nodes (If there are 6 or more nodes having leases that are about to expire then DWD should consider scale down of dependent resources).
  • Let us assume that 7 nodes have leases that are about to expire.
  • 2 out of these 7 nodes are unhealthy.

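What happens today: Since 7 nodes have leases that are about to expire, all dependent resources are scaled down (this includes MCM and KCM). As a result, the 2 unhealthy nodes are not replaced by MCM.

What we wish to have: The 2 unhealthy nodes are excluded from the count, leaving 5 nodes with leases that are about to expire, which is below the threshold of 6. The dependent resources are therefore not scaled down, so the 2 unhealthy nodes can be replaced by MCM, while the remaining 5 should not be replaced.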
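The arithmetic of this example can be written down as a small sketch. `ShouldScaleDown` and its parameters are hypothetical names, not part of DWD; the comparison uses "at or above the threshold", following the "6 or more nodes" wording of the example.

```go
package main

// ShouldScaleDown reports whether the number of expiring leases, after
// excluding unhealthy nodes, reaches the threshold percentage of all nodes.
// Integer arithmetic avoids floating-point rounding at the boundary.
func ShouldScaleDown(totalNodes, expiringLeases, unhealthyAmongExpiring, thresholdPercent int) bool {
	effective := expiringLeases - unhealthyAmongExpiring
	return effective*100 >= thresholdPercent*totalNodes
}
```

For the numbers above, `ShouldScaleDown(10, 7, 0, 60)` models today's behaviour (7 of 10 is at or above 60%, so scale down), while `ShouldScaleDown(10, 7, 2, 60)` models the proposal (5 of 10 is below 60%, so no scale down).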

Metadata

Assignees

No one assigned

    Labels

    • area/control-plane: Control plane related
    • kind/enhancement: Enhancement, improvement, extension
    • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
    • priority/1: Priority (lower number equals higher priority)
