Add consolidationPolicy: Underweight #1829

Open
koreyGambill opened this issue Nov 19, 2024 · 5 comments
Labels: kind/feature, triage/needs-information

Comments

@koreyGambill

Description

What problem are you trying to solve?
We've created fallback on-demand NodePools with a lower scheduling weight than our spot-instance NodePools (on AWS). When spot capacity is hard to find, Karpenter schedules our fallback on-demand EC2 instances, but it never consolidates back to spot instances, so it ends up being very expensive. I would love an official setting that allows Karpenter to consolidate based on weighted preferences rather than just utilization.

In this feature, if all the pods on a low-weight node are compatible with a higher-weight NodePool, Karpenter should provision the higher-weight node and reschedule the pods onto it. For us it would help reduce costs, but in general it makes sense that users would care about running on higher-weight nodes. I would expect this to still obey the consolidateAfter setting.
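
For context, our setup looks roughly like this. Names and requirements are illustrative, and the exact schema depends on the Karpenter API version (for example, in karpenter.sh/v1 the nodeClassRef also needs group and kind):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot
spec:
  weight: 100  # preferred: Karpenter considers this pool first when scheduling
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      nodeClassRef:
        name: default
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-fallback
spec:
  weight: 10  # fallback: only used when the spot pool cannot satisfy pending pods
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        name: default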

Something like this could work in the YAML:

disruption:
    # Changed to a list type for clearer YAML now that there are 3 options
    consolidationPolicy:
      # If both Underutilized and Underweight are set, Karpenter will disrupt
      # a node when some of its pods can move to a higher-weight node and the
      # rest fit on other existing nodes of the same weight
      - Empty
      - Underutilized
      - Underweight  # Allow consolidation when all pods could move to a higher-weight node

How important is this feature to you?
Low-Medium - I have a workaround (setting the on-demand NodePool to expire after 4 hours; see the sketch after the drawbacks below), but it has a couple of drawbacks:

  1. We wait up to 4 hours to get back to our optimal state (everything back on spot).
  2. We cannot reuse the NodePool for workloads that need a long lifespan. With this feature it would be possible, since Karpenter would not be able to consolidate the node while those pods are on it (due to their taints/tolerations/affinities).
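
The workaround itself is just an expiry on the fallback pool, roughly like the sketch below (illustrative; in the v1beta1 API expireAfter sits under spec.disruption, while in karpenter.sh/v1 it moves to spec.template.spec.expireAfter):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-fallback
spec:
  weight: 10
  disruption:
    consolidationPolicy: WhenEmpty
    expireAfter: 4h  # force fallback nodes to be replaced after 4 hours
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        name: default
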
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@koreyGambill koreyGambill added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 19, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 19, 2024
@jonathan-innis
Member

In general, this is the way that Karpenter should perform. Can you walk through the exact scenario that you are seeing? When Karpenter performs its simulations, it's going to consider the highest weight NodePool first (which should be your spot NodePool). Once it finds a scheduling decision, so long as the newer instance type is cheaper, it will consolidate it. All of this should be true in your scenario where you are moving back from on-demand to spot.

@jonathan-innis
Member

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 10, 2024
@koreyGambill
Author

The behavior we observed was that the lower-weighted nodes (the on-demand ones) were never consolidated. Over time we end up with more lower-weighted nodes than higher-weighted ones, since the on-demand nodes don't experience spot terminations, so most of our nodes are actually the lower-weight on-demand ones.

I wouldn't expect consolidation to work as-is, though: our consolidationPolicy is set to "Empty", and it would actually seem like a bug to me if Karpenter were consolidating those non-empty nodes just to save cost under that policy. We cannot use the "Underutilized" policy because it results in massive node turnover, which drives our AWS Config costs way up.
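
Concretely, the disruption block on these NodePools is effectively the following (values illustrative; the actual enum in the current API is WhenEmpty rather than Empty):

disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 30s  # illustrative; with this policy only completely empty nodes are removed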

I'm proposing a third reason to consolidate: a higher-weight node being available to schedule on. That seems distinctly different to me from "Underutilized", which should consolidate when the total requested CPU/memory could be reduced by shuffling pods and shutting down nodes.


This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 25, 2024
@mariuskimmina

mariuskimmina commented Dec 25, 2024

@koreyGambill I think we have been running into a similar situation. What seems to work for us at the moment is to use WhenEmpty on the NodePool that has only spot instances and WhenUnderutilized on the NodePool that serves as the on-demand fallback (see the sketch below). This way we avoid disruption while workloads are running on spot, but allow it when they are scheduled on-demand.

That said, we also feel that a consolidation policy covering the case of "I have both spot and on-demand instances in this NodePool and I only ever want to consolidate on-demand nodes back onto spot instances" is missing.
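
For anyone else landing here, the split is roughly the following (only the disruption blocks are shown; WhenUnderutilized is the v1beta1 value, which I believe becomes WhenEmptyOrUnderutilized in karpenter.sh/v1):

# spot-only NodePool: only remove nodes once they are empty
disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 30s

# on-demand fallback NodePool: allow active consolidation so workloads drain back to spot
disruption:
  consolidationPolicy: WhenUnderutilized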

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2024