Support Preservation of Failed Machines for diagnostics #1008

@elankath

Description

How to categorize this issue?

/area control-plane
/kind enhancement
/priority 3

What would you like to be added:

Background

Currently, the machine-controller-manager moves Running machines to the Unknown phase when errors occur, and then to the Failed phase after the configured machine-health-timeout. Failed machines are swiftly moved to the Terminating phase: the node is drained and the machine object is deleted.

Need

There is a need to preserve the VMs corresponding to Machines so that the operator/support/SRE can analyze and diagnose the root cause of failure. However, there should be a limit on the number of machines preserved per worker pool, as well as a configurable timeout beyond which the MCM proceeds with Machine termination.

We propose enhancing MachineConfiguration with a FailedMachineTimeout *metav1.Duration field and MachineDeploymentSpec with a FailedMachinePreserveMax *int32 field. (Exact field names/locations are subject to change after design.)

In addition, in a separate gardener PR, we will enhance the gardener machineControllerManager settings in the shoot spec so that operators can configure the above fields per worker pool:

machineControllerManager:
  failedMachinePreserveMax: 2
  failedMachinePreserveTimeout: 3h
  • The MCM will annotate all preserved failed machines with node.machine.sapcloud.io/preserve-when-failed=true.
  • The user/operator can also explicitly mark a Machine or its associated Node with the annotation node.machine.sapcloud.io/preserve-when-failed=true.
  • If the current count of preserved Failed machines is at or above failedMachinePreserveMax, the annotation will not be accepted (it will be deleted).
  • If the current count of preserved Failed machines is at or above failedMachinePreserveMax, any Unknown Machines that move to the Failed phase will not be preserved and will be terminated.
  • failedMachinePreserveMax MUST be set in the shoot spec; otherwise, the annotation node.machine.sapcloud.io/preserve-when-failed=true added by operator/support to a Machine has no effect.
  • Preserved failed machines can be released before failedMachinePreserveTimeout expires by setting the annotation node.machine.sapcloud.io/preserve-when-failed=false on the Machine.

Limitations

  • During rolling updates, preservation will NOT be honored: a Machine that moves to the Failed phase will be replaced with a healthy one, since honoring preservation during a roll-out would make the logic overly complicated.
  • Since a gardener worker pool can correspond to 1..N MachineDeployments depending on the number of zones, failedMachinePreserveMax will need to be distributed across the N machine deployments, so the number should be chosen appropriately.

Why is this needed:

For operator/support/SRE diagnosis of VMs/Nodes.

Metadata

Labels

  • area/control-plane: Control plane related
  • effort/3m: Effort for issue is around 3 months
  • kind/enhancement: Enhancement, improvement, extension
  • priority/3: Priority (lower number equals higher priority)
  • size/xl: Size of pull request is huge (see gardener-robot robot/bots/size.py)
