✨ MachineHealthCheck supports checking Machine conditions #12275

Open
wants to merge 4 commits into
base: main
from

Conversation

justinmir
@justinmir justinmir commented May 23, 2025

What this PR does / why we need it:
MachineHealthCheck currently only checks Node conditions to determine whether a machine is healthy. However, Machine conditions capture state that does not exist on Nodes, for example control plane conditions such as EtcdPodHealthy and SchedulerPodHealthy, which can indicate whether a control plane machine has been created correctly.

Adding support for Machine conditions enables us to perform remediation during control plane upgrades.

This PR introduces a new field as part of the MachineHealthCheckSpec:
UnhealthyMachineConditions

This mirrors the behavior of UnhealthyNodeConditions, but the MachineHealthCheck controller checks the Machine's conditions instead of the Node's.

Which issue(s) this PR fixes:
Fixes #5450

Label(s) to be applied
/kind feature
/area machinehealthcheck

Notes for Reviewers
We updated the tests to validate the new MachineHealthCheck code paths for UnhealthyMachineConditions in the following ways:

  • internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go includes a new envtest
  • internal/controllers/machinehealthcheck/machinehealthcheck_targets_test.go includes a unit test verifying that a machine matching an unhealthy machine condition is flagged as needing remediation.
  • The remaining test changes are boilerplate to ensure that this doesn't break existing functionality: everywhere we use UnhealthyNodeConditions, we also specify UnhealthyMachineConditions.
  • Fuzz tests for the old API versions are updated to drop the UnhealthyMachineConditions field, since it does not exist in those versions.
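
The reason for dropping the field in the fuzz tests can be sketched as follows. This is a deliberately simplified model, not the project's conversion code: `hubSpec`, `oldSpec`, `toOld`, `toHub`, and `normalizeForOldAPI` are hypothetical names, and the sketch assumes that a field absent from the old API is simply lost during conversion.

```go
package main

import (
	"fmt"
	"reflect"
)

// hubSpec is a hypothetical stand-in for the newest API version, which has the new field.
type hubSpec struct {
	UnhealthyNodeConditions    []string
	UnhealthyMachineConditions []string
}

// oldSpec is a hypothetical stand-in for an older API version without the new field.
type oldSpec struct {
	UnhealthyConditions []string
}

// toOld converts to the old version; UnhealthyMachineConditions has no counterpart and is lost.
func toOld(h hubSpec) oldSpec { return oldSpec{UnhealthyConditions: h.UnhealthyNodeConditions} }

// toHub converts back to the newest version.
func toHub(o oldSpec) hubSpec { return hubSpec{UnhealthyNodeConditions: o.UnhealthyConditions} }

// normalizeForOldAPI clears the field the old version cannot represent,
// so that a round trip through the old version can be compared for equality.
func normalizeForOldAPI(h *hubSpec) { h.UnhealthyMachineConditions = nil }

// roundTrips reports whether converting to the old version and back preserves the spec.
func roundTrips(h hubSpec) bool {
	return reflect.DeepEqual(h, toHub(toOld(h)))
}

func main() {
	h := hubSpec{
		UnhealthyNodeConditions:    []string{"Ready"},
		UnhealthyMachineConditions: []string{"EtcdPodHealthy"},
	}
	fmt.Println(roundTrips(h)) // false: the new field is lost on the way through oldSpec
	normalizeForOldAPI(&h)
	fmt.Println(roundTrips(h)) // true: nothing unrepresentable is left to lose
}
```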

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 23, 2025
@k8s-ci-robot k8s-ci-robot requested a review from JoelSpeed May 23, 2025 01:39
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sbueringer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from vincepri May 23, 2025 01:39
@k8s-ci-robot k8s-ci-robot added the do-not-merge/needs-area PR is missing an area label label May 23, 2025
@k8s-ci-robot
Contributor

Welcome @justinmir!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 23, 2025
@k8s-ci-robot
Contributor

Hi @justinmir. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@justinmir justinmir force-pushed the mhc-checks-machine-conditions branch from 00e4c09 to b7b110b Compare May 23, 2025 01:44
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 23, 2025
@justinmir justinmir force-pushed the mhc-checks-machine-conditions branch from b7b110b to 56608f2 Compare May 23, 2025 14:45
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 23, 2025
@sbueringer
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 23, 2025
@@ -69,6 +69,14 @@ type MachineHealthCheckSpec struct {
// +kubebuilder:validation:MaxItems=100
UnhealthyConditions []UnhealthyCondition `json:"unhealthyConditions,omitempty"`

// unhealthyMachineConditions contains a list of the machine conditions that determine
Member

I currently don't have enough bandwidth for a full review, but here are a few quick comments:

  • We should only add this to v1beta2.
  • We have to add this in a few places. I would recommend rebasing on top of main, searching for all occurrences of "nodeUnhealthyConditions", and checking in which places we also need "unhealthyMachineConditions" (it will be in a lot of these cases :))

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 25, 2025
justinmir added 2 commits May 25, 2025 12:18
MachineHealthCheck currently only checks Node conditions to determine
whether a machine is healthy. However, Machine conditions capture state
that does not exist on Nodes, for example control plane conditions such
as EtcdPodHealthy and SchedulerPodHealthy, which can indicate whether a
control plane machine has been created correctly.

Adding support for Machine conditions enables us to perform remediation
during control plane upgrades.

This PR introduces a new field as part of the MachineHealthCheckSpec:
  - `UnhealthyMachineConditions`

This will mirror the behavior of `UnhealthyNodeConditions` but the
MachineHealthCheck controller will instead check the machine conditions.
@justinmir justinmir force-pushed the mhc-checks-machine-conditions branch from 023c191 to 097384d Compare May 25, 2025 17:30
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 25, 2025
@justinmir
Author

/area machinehealthcheck

@k8s-ci-robot k8s-ci-robot added area/machinehealthcheck Issues or PRs related to machinehealthchecks and removed do-not-merge/needs-area PR is missing an area label labels May 25, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 27, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Labels
  • area/machinehealthcheck (Issues or PRs related to machinehealthchecks)
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)
Development

Successfully merging this pull request may close these issues.

MHC should provide support for checking Machine conditions
3 participants