Support Preservation of Failed Machines for diagnostics #1008

@elankath

Description

How to categorize this issue?

/area control-plane
/kind enhancement
/priority 3

What would you like to be added:

Background

Currently, the machine-controller-manager moves Running machines to the Unknown phase when errors occur, and then to the Failed phase after the configured machine-health-timeout. Failed machines are swiftly moved to the Terminating phase: the node is drained and the machine object is deleted.

Need

There is a need to preserve the VMs corresponding to Machines so that the operator/support/SRE can analyze and diagnose the root cause of failure. However, there should be a limit on the number of machines preserved per worker pool, as well as a configurable timeout beyond which the MCM proceeds with Machine termination.

We propose enhancing MachineConfiguration with a FailedMachineTimeout *metav1.Duration field and MachineDeploymentSpec with a FailedMachinePreserveMax *int32 field. (Exact field names/locations are subject to change after design.)

In addition, in a separate gardener PR, we will enhance the gardener machineControllerManager settings in the shoot spec so that operators can configure the above fields per worker pool:

machineControllerManager:
  failedMachinePreserveMax: 2
  failedMachinePreserveTimeout: 3h
  • The MCM will annotate all preserved failed machines with node.machine.sapcloud.io/preserve-when-failed=true.
  • The user/operator can also explicitly mark a Machine or its associated Node with the annotation node.machine.sapcloud.io/preserve-when-failed=true.
  • If the current count of preserved Failed machines is at or above failedMachinePreserveMax, the annotation will not be accepted (it will be deleted).
  • If the current count of preserved Failed machines is at or above failedMachinePreserveMax, any Unknown Machines that move to the Failed phase will not be preserved and will be terminated.
  • failedMachinePreserveMax MUST be set in the shoot spec; otherwise, the annotation node.machine.sapcloud.io/preserve-when-failed=true added by operator/support to a Machine has no effect.
  • Preserved failed machines can be released before failedMachinePreserveTimeout expires by setting the annotation node.machine.sapcloud.io/preserve-when-failed=false on the Machine.

Limitations

  • During rolling updates, preservation will NOT be honored: a Machine that moves to the Failed phase will be replaced with a healthy one, since honoring preservation during a roll-out would make the logic overly complicated.
  • Since a gardener worker pool can correspond to 1..N MachineDeployments depending on the number of zones, failedMachinePreserveMax will need to be distributed across the N machine deployments, so the number should be chosen appropriately.

Why is this needed:

For operator/support/SRE diagnosis of VMs/Nodes.

Metadata

Labels

  • area/control-plane: Control plane related
  • effort/3m: Effort for issue is around 3 months
  • kind/enhancement: Enhancement, improvement, extension
  • priority/3: Priority (lower number equals higher priority)
  • size/xl: Size of pull request is huge (see gardener-robot robot/bots/size.py)
