-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: disruption.terminationGracePeriod #916
feat: disruption.terminationGracePeriod #916
Conversation
Hi @wmgroot. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
e94c2e4
to
47ba900
Compare
47ba900
to
fb8ad85
Compare
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity. |
Small comment is it possible to update the status of Nodeclaim when a node start getting forcibly drained? |
fb8ad85
to
f4ae490
Compare
f4ae490
to
ff8ed69
Compare
ff8ed69
to
95303b7
Compare
c7298be
to
c18da4e
Compare
949cb93
to
390a8b1
Compare
390a8b1
to
8dfd5f9
Compare
8a58b20
to
50c89c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more comments on the test coverage. I think there's the 2x2x2 combinatorics of graceful vs forceful, no tgp vs tgp, and pdb vs do-not-disrupt that made it hard to track the tests. you could consider doing a DescribeTable if that helps. Otherwise, just some comments on error wrapping.
50c89c5
to
7dc0512
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Excited to get this in, going to test on my own end before merging.
7dc0512
to
5423764
Compare
5423764
to
965cfbf
Compare
…s with do-not-disrupt pods and blocking PDBs Signed-off-by: wmgroot <[email protected]>
…odeClaim.spec Signed-off-by: wmgroot <[email protected]>
965cfbf
to
3e6aafc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: njtran, wmgroot The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Fixes #743
Description
This is a PR to add support for an equivalent to CAPI's
nodeDrainTimeout
feature for Karpenter.I've taken the feedback provided in #834.
This implementation relies on K8s 1.26 for default support of the non-graceful node shutdown Taint.
https://kubernetes.io/docs/concepts/architecture/nodes/#non-graceful-node-shutdown
Additionally, this PR implements the behavior described in https://github.com/jmdeal/karpenter/blob/disruption-grace-design/designs/termination-grace-period-extension.md. This allows the feature to be a comprehensive solution that addresses issues with disruption logic that previously prevented cluster administrators from enforcing drain timeout behavior for eventual disruption cases (such as node expiration and drift, but not consolidation).
Pods are also deleted preemptively at time T = node.expirationTime - pod.terminationGracePeriodSeconds. This attempts to ensure that a pod always has its full terminationGracePeriodSeconds worth of time to terminate gracefully before the node expires.
How was this change tested?
The PR includes unit tests for all added logic.
I've also tested this by building a custom Karpenter image and running it in our live clusters with various test cases by deleting nodes and nodeclaims directly to verify pods are are successfully deleted and the node is reaped when it reaches the desired expiration time.
Example
Initial Resources
Results
If the pods don't terminate cleanly, the terminator will eventually attempt a delete on all pods running on the node based on their terminationGracePeriodSeconds, including daemonsets and other infrastructure pods.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.