Description
How to categorize this issue?
/area control-plane
/kind enhancement
/priority 3
What would you like to be added:
Background
Currently, the machine-controller-manager moves Running machines to Unknown phase in case of errors and then to Failed phase after the configured machine-health-timeout. Failed machines are swiftly moved to the Terminating phase, the node is drained and the machine object deleted.
Need
There is a need to preserve the VMs corresponding to failed Machines so that the operator/support/SRE can analyze and diagnose the root cause of the failure. However, there should be a limit on the number of machines that are preserved per worker pool. There should also be a configurable timeout beyond which the MCM goes ahead with Machine termination.
We propose enhancing `MachineConfiguration` with `FailedMachineTimeout *metav1.Duration` and `MachineDeploymentSpec` with `FailedMachinePreserveMax *int32`. (Exact field names/locations are subject to change after design.)
In addition, we will enhance the gardener machineControllerManager settings in the shoot spec to support operator configuration of the above fields in the worker pool in a separate gardener PR.
    machineControllerManager:
      failedMachinePreserveMax: 2
      failedMachinePreserveTimeout: 3h

- The MCM will annotate all preserved failed machines with `node.machine.sapcloud.io/preserve-when-failed=true`.
- The user/operator can also explicitly mark a `Machine` or its associated `Node` with the annotation `node.machine.sapcloud.io/preserve-when-failed=true`.
- If the current count of preserved `Failed` machines has reached or exceeds `failedMachinePreserveMax`, then the annotation will not be accepted. (The annotation will be deleted.)
- If the current count of preserved `Failed` machines has reached or exceeds `failedMachinePreserveMax`, then any `Unknown` machines that move to the `Failed` phase will not be preserved and will be terminated.
- `failedMachinePreserveMax` MUST be set in the shoot spec; otherwise the annotation `node.machine.sapcloud.io/preserve-when-failed=true` added by operator/support to a `Machine` has no effect.
- Preserved failed machines can be removed before the `failedMachinePreserveTimeout` expires by setting the annotation `node.machine.sapcloud.io/preserve-when-failed=false` on the machine.
Limitations
- During rolling updates we will NOT honor preserving Machines. The Machine will be replaced with a healthy one if it moves to the `Failed` phase. Otherwise the logic becomes overly complicated.
- Since a gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, we will need to distribute `failedMachinePreserveMax` across the `N` machine deployments. So the number should be chosen appropriately.
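
The issue does not specify how `failedMachinePreserveMax` would be distributed. As one possible sketch, assuming an even split in which the first deployments absorb the remainder, a hypothetical helper `distributePreserveMax` could look like this:

```go
package main

import "fmt"

// distributePreserveMax splits a non-negative total across n machine
// deployments as evenly as possible. The first total%n deployments get one
// extra slot. The even-split policy is an assumption made for illustration.
func distributePreserveMax(total int32, n int) []int32 {
	if n <= 0 {
		return nil
	}
	shares := make([]int32, n)
	base := total / int32(n)
	extra := int(total % int32(n))
	for i := range shares {
		shares[i] = base
		if i < extra {
			shares[i]++
		}
	}
	return shares
}

func main() {
	// A pool with failedMachinePreserveMax: 2 spread over 3 zones.
	fmt.Println(distributePreserveMax(2, 3)) // [1 1 0]
}
```

With a limit of 2 and three zones, one MachineDeployment ends up with no preservation slots under this policy, which illustrates why the value should be chosen with the zone count in mind.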
Why is this needed:
For operator/support/SRE diagnosis of VMs/Nodes.