# RFC: Degraded NodePool Status Condition (#1885)

## Overview

Karpenter can launch nodes with a NodePool that will never join a cluster when a NodeClass is misconfigured.

For example, when a network path does not exist due to a misconfigured VPC (network access control lists, subnets, route tables), Karpenter cannot provision compute with that NodeClass that successfully joins the cluster until the error is fixed. Crucially, users continue to be charged for compute that can never be used in the cluster.

To improve visibility into these failure modes, this RFC proposes adding a `Degraded` status condition that indicates to cluster admins that there may be a problem with a NodePool that needs to be investigated and corrected.

## Options

### [Recommended] Option 1: Introduce a Generalized `Degraded` Status Condition

```go
// ConditionTypeDegraded = "Degraded" condition indicates that a misconfiguration
// exists that prevents the normal, successful use of a Karpenter resource
ConditionTypeDegraded = "Degraded"
```

This option would set `Degraded: true` on a NodePool whenever Karpenter suspects something is wrong with the launch path but isn't sure. In this case, if 3 or more NodeClaims fail to launch with a NodePool, the NodePool is marked as degraded. The retries are included to account for transient errors. The number of failed launches is stored as a status on the NodePool and reset to zero after the NodePool is edited or a sufficient amount of time has passed.

```go
// NodePoolStatus defines the observed state of NodePool
type NodePoolStatus struct {
	// Resources is the list of resources that have been provisioned.
	// +optional
	Resources v1.ResourceList `json:"resources,omitempty"`
	// FailedLaunches tracks the number of times a nodepool failed before being marked degraded
	// +optional
	FailedLaunches int `json:"failedLaunches,omitempty"`
	// Conditions contains signals for health and readiness
	// +optional
	Conditions []status.Condition `json:"conditions,omitempty"`
}
```

> **Review comment (on `FailedLaunches`):** How can we indicate that this is "failed launches in a row"? Does this mechanically reset if we get a successful launch?

> **Review comment:** This is probably actually …

> **Review comment:** So the number of times we hit the registration TTL? The Azure provider will use the native controller retries for successive launches of a NodeClaim. Attempt 1: try with the cheapest SKU → fails with a quota error (need a better quota API), cache the SKU as lacking quota. So sometimes with our provider we will have an expected launch failure that we know will succeed on the retry. I like `FailedRegistrations` better, but also for some of these LRO operations, being able to report back from the cloud provider side when a launch fails would be useful.

> **Review comment:** Maybe this should be … Imagine if the NodePool spans multiple zones but is misconfigured, except in a zone that the workload hardly uses (e.g. the workload uses zones 1 and 2; the cluster supports zones 1, 2, and 3). Launching in zone 3 might make things look better than they are.

Once a NodePool is `Degraded`, it recovers with `Degraded: false` after an update to the NodePool or when the NodeClaim registration expiration TTL (currently 15 minutes) has passed since the `lastTransitionTime` of the status condition, whichever comes first. A `Degraded` NodePool is not passed over when provisioning and may continue to be chosen during scheduling. A successful provisioning could also remove the status condition, but this may cause more apiserver and metric churn than is necessary.

> **Review comment:** Doesn't this defeat the purpose? If it's 15m, then I imagine that if I'm having bootstrapping errors I'll almost never hit this feature unless I'm scaling up 3 nodes at a time.

> **Review comment:** Maybe eventually configurable if we get it right. I could imagine users wanting to orchestrate a signal (or maybe leverage this one) to tell Karpenter when they think something's wrong.

> **Review comment:** This seems like a natural time to unmark the status condition. This only causes churn if the issue is ephemeral or racy; in the ephemeral case, I'd imagine the NodePool isn't degraded. I'd like to understand what some of the racy cases would be, though.

> **Review comment:** Yeah, I was initially thinking it was the same as NodeClaim registration, but it's probably 3x it to address this issue. I think for introducing this we don't want to make functional changes yet, but it could be done in the future. One thing I want to sort out is how to appropriately add/filter metrics for NodeClaims being provisioned for NodePools that are degraded.

As additional misconfigurations are handled, they can be added to the `Degraded` status condition and the `Degraded` controller expanded to handle automated recovery efforts. This is probably most simply achieved by changing the status condition metrics to use comma-delimited `Reason`s, with the most recent change present in the `Message`.

> **Review comment:** Is there prior art for doing this? I'm wondering if we're actually going to be able to bubble up these failures correctly anyway, or if this is work we're carving out that wouldn't even be usable.

> **Review comment:** If we wanted multiple reasons, the API would support a list here.

> **Review comment:** We can fire Events, though, and I think we can even store annotations on those Events if we want to.

> **Review comment:** Agree that the CSV-style reasons are a bad idea. I think it's possible we fire an event for the NodeClaim if the backing compute fails to join the cluster. The only time these events would spam is if the NodePool/NodeClass is so misconfigured that every initialization attempt fails. It looks like we currently only emit events for initial NodeClaim launch failures and not for failures in Registration and Initialization.

```yaml
- lastTransitionTime: "2024-12-16T12:34:56Z"
  message: "FizzBuzz component was misconfigured"
  observedGeneration: 1
  reason: FizzBuzzFailure,FooBarFailure
  status: "True"
  type: Degraded
```

This introduces challenges in determining when to evaluate contributors to the status condition, but since the `Degraded` status condition currently has a single contributor, this decision can be deferred. When the time comes to implement multiple contributors, this probably looks like a `Degraded` controller that acts as a "heartbeat" and evaluates each of the contributors.

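One hedged sketch of what such a multi-contributor evaluation might look like; the `contributor` abstraction and the reason names are hypothetical, and the comma-joined reason string simply mirrors the example status above:

```go
package main

import (
	"fmt"
	"strings"
)

// contributor is a hypothetical check that reports whether it currently
// considers the NodePool degraded, and why.
type contributor func() (degraded bool, reason string)

// evaluate runs every contributor on each heartbeat and aggregates the
// firing reasons; the condition is true if any contributor fires.
func evaluate(contributors []contributor) (bool, string) {
	var reasons []string
	for _, c := range contributors {
		if degraded, reason := c(); degraded {
			reasons = append(reasons, reason)
		}
	}
	return len(reasons) > 0, strings.Join(reasons, ",")
}

func main() {
	checks := []contributor{
		func() (bool, string) { return true, "FailedLaunches" },
		func() (bool, string) { return false, "NodeClassNotReady" },
	}
	degraded, reason := evaluate(checks)
	fmt.Println(degraded, reason) // only the firing contributor's reason is reported
}
```

As the review discussion notes, a list-valued field or per-failure Events may be preferable to comma-delimited reasons in practice.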
Finally, this status condition would not be a precondition for NodePool `Readiness`, because the NodePool should still be considered for scheduling purposes.

> **Review comment:** But this would need to be rolled up somehow from the NodeClass too, right? So we'd need a new NodePool `Degraded` condition which relies on the NodeClass `Degraded` condition.

> **Review comment:** Can we actually assume that the NodeClass has a …

> **Review comment:** We should probably just track on the NodePool, but we can do this in a way where the NodeClass that was used is stored as a hash on the NodePool (see approach 4).

#### Considerations

1. 👎 Three retries can still be a long time to wait on compute that will never succeed

   > **Review comment:** Maybe I misunderstood, but resetting the timer on the registration timeout means we never hit this, right?

2. 👍 Karpenter continues to try to launch with other potentially valid NodePools
3. 👍 Observability improvements so that users can begin triaging misconfigurations
4. 👍 `Degraded` is not a precondition for NodePool readiness

### Option 2: Expand `Validated` Status Condition and Use Reasons

The implementation is roughly the same, except that validation is a precondition for `Readiness`. This has impact on a larger portion of the code because `Validated` would no longer block provisioning or `Readiness`. However, it is still an option that Karpenter could expand the `Validated` status condition so that any time a misconfiguration is encountered, the NodePool is treated as having failed validation.

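Under Option 2, a runtime misconfiguration would flip the existing `Validated` condition rather than set a new one. A minimal sketch, assuming a simplified condition type with `metav1.Condition`-style fields and an illustrative (not actual) reason string:

```go
package main

import (
	"fmt"
	"time"
)

// condition is a minimal stand-in for the status.Condition type used by
// Karpenter; field names follow metav1.Condition conventions.
type condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// markValidationFailed shows how Option 2 would reuse the existing
// Validated condition: a misconfiguration observed at runtime flips it
// to False with a reason describing the degradation.
func markValidationFailed(conds []condition, reason, message string) []condition {
	for i := range conds {
		if conds[i].Type == "Validated" {
			conds[i].Status = "False"
			conds[i].Reason = reason
			conds[i].Message = message
			conds[i].LastTransitionTime = time.Now()
			return conds
		}
	}
	return conds
}

func main() {
	conds := []condition{{Type: "Validated", Status: "True", Reason: "Validated"}}
	conds = markValidationFailed(conds, "NodeRegistrationFailures", "nodes failed to register")
	fmt.Println(conds[0].Status, conds[0].Reason)
}
```

The sketch makes the first consideration below concrete: once `Validated` can be flipped by runtime signals, its `False` state no longer distinguishes "spec was never valid" from "spec was valid but degraded".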
#### Considerations

1. 👎👎 Validation implies the original state of the NodePool was correct and is something Karpenter can vet with certainty. A NodePool could have been correctly validated and then degraded.

   > **Review comment:** False positives. I think the severity of this con depends on what signals we're choosing to include.

2. 👎👎 Changes the meaning of `Validated` in terms of `Readiness`
3. 👎 Relies on statuses that were not part of the original validation of a NodePool.

   > **Review comment:** I'm not sure I get what this con is.

   > **Review comment:** I was thinking of …

   > **Review comment:** I think a NodePool can start valid and become invalid (e.g. due to someone writing a new spec, or because they deleted the NodeClass object, or potentially even by the cloud provider changing its offering).

4. 👍 Status condition already exists

### Further Discussion Needed

Should this status condition be used to affect Karpenter functionality/scheduling, or should it exist only to improve observability? For example, NodePools with this status condition could be given a lower weight than normal so that other NodePools are prioritized. This is probably more surprising than not for most users and should not be pursued.

> **Review comment:** Have you thought of something we could do to make it scale to the size of the cluster? And which errors are included in "failing to launch"?

> **Review comment:** I worry about false positives for "waiting to edit" the NodePool. I'm guessing this would be bubbled up from the NodeClass, but I think having the "amount of time" as a backstop makes sense. What amount of time are you thinking?

> **Review comment:** I was thinking that an edit to a NodePool would be a signal that a user updated the NodePool and is expecting the issue to be resolved. To keep it simple, 3x the NodeClaim registration TTL.

> **Review comment:** Not really, the number of NodePools/NodeClasses wouldn't affect the number of failures since we're looking at the state of a single NodePool. I think the cluster size would just affect how quickly a NodePool is considered degraded. This currently includes node registration failures.