
[WIP] Extending the semantics of nominated node name #5329


Draft
wants to merge 2 commits into base: master
Conversation

wojtek-t (Member)

  • One-line PR description:
  • Issue link:
  • Other comments:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels May 22, 2025
@k8s-ci-robot k8s-ci-robot requested review from macsko and palnabarun May 22, 2025 09:09
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wojtek-t
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling May 22, 2025
@wojtek-t wojtek-t marked this pull request as draft May 22, 2025 09:09
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 22, 2025
@@ -205,21 +205,22 @@ based on the expected pod placement.

### External components want to specify a preferred pod placement

The cluster autoscaler or Kueue internally calculates the pod placement,
and create new nodes or un-gate pods based on the calculation result.
The ClusterAutoscaler or Karpenter internally calculate the pod placement,
wojtek-t (Member, Author)

@sanposhiho - there are multiple changes including:

  • small updates to motivation and user stories
  • rewrite of risks section based on the end state we propose
  • discussion of new risk
  • addressing some comments discussed on the original PR

I think you should take the adjustments from my commit that you think make sense and just incorporate them into your PR, and I will close this one after that.

be treated as a hint to scheduler and not as an official scheduler's plan of record
- if `NominatedNodeName` was set by an external component (based on ManagedFields), this
component should clear or update it (using PUT or APPLY to ensure it won't conflict with
potential update from scheduler) to reflect the new hint
wojtek-t (Member, Author)

@dom4ha - in the context of the different phases of reservation/allocation of resources that we were talking about, I think that what I'm proposing above paves the way towards an additive extension in the future.

With such an approach, we're describing how the semantics should work, and eventually we can add a stage field that would reflect that more explicitly.
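For illustration, a minimal sketch of how an external component might set `NominatedNodeName` the way the quoted KEP text suggests, i.e. via server-side apply with its own field manager, so that a concurrent write by the scheduler surfaces as a conflict instead of being silently overwritten. The manager name, namespace, and pod/node names below are placeholders, not part of the KEP.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1apply "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// nominate sets status.nominatedNodeName on a pending pod via server-side
// apply. With Force=false, the request fails with a conflict if another
// field manager (e.g. the scheduler) currently owns the field.
func nominate(ctx context.Context, cs kubernetes.Interface, ns, podName, node string) error {
	pac := corev1apply.Pod(podName, ns).
		WithStatus(corev1apply.PodStatus().WithNominatedNodeName(node))
	_, err := cs.CoreV1().Pods(ns).ApplyStatus(ctx, pac, metav1.ApplyOptions{
		FieldManager: "example-provisioner", // placeholder manager name
		Force:        false,
	})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	if err := nominate(context.Background(), cs, "default", "pending-pod", "node-a"); err != nil {
		fmt.Println("nomination not applied:", err)
	}
}
```

Updating the hint later is the same apply call with a new node name; the component remains the field manager of the value it wrote, so its own follow-up writes don't conflict.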

Comment on lines +365 to +366
- [to be confirmed during API review] we fail the validation when someone attempts to set
`NNN` for an already bound pod
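Purely to make the rule under discussion concrete, here is a hypothetical sketch of what such a validation could look like; this is not the actual kube-apiserver validation code, and the function name and error text are made up.

```go
package validation

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// validateNominatedNodeNameUpdate illustrates the proposed rule: reject an
// update that sets (or changes) NominatedNodeName on a pod that is already
// bound, i.e. whose spec.nodeName is non-empty. Illustrative only.
func validateNominatedNodeNameUpdate(oldPod, newPod *v1.Pod) error {
	changed := newPod.Status.NominatedNodeName != oldPod.Status.NominatedNodeName
	if changed && newPod.Status.NominatedNodeName != "" && newPod.Spec.NodeName != "" {
		return fmt.Errorf("nominatedNodeName cannot be set for a pod already bound to node %q",
			newPod.Spec.NodeName)
	}
	return nil
}
```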
@sanposhiho (Member), May 22, 2025

Do we actually need this? From our perspective it does not make sense to set NNN for bound pods, but I'm afraid this validation would just break some external components unnecessarily, ones that have already utilized NNN and have a (weird?) behavior of setting NNN for bound pods for some purpose.
I'm not sure we should take such a weird case into account, since we're not even sure it exists; at the same time, I'm not sure we should explicitly try to ban those API operations on purpose from now on.

@sanposhiho (Member), May 22, 2025

Probably, what I wanted to say is: do we gain any advantage from doing this that justifies the risk of breaking existing components that might exist out there? If someone does this, then it might make sense to them for their specific use cases. And it's just their fault that, as a result, NNN vs NodeName could be different and might be confusing. But they should already know that, and still do it on purpose for some reason. I'm not sure we should explicitly forbid it from now on.

wojtek-t (Member, Author)

My motivation was that it's misleading when NodeName and NominatedNodeName are different, no matter when they were set. This second part was addressing the case when NNN is set later.

I'm fine with saying that it's such a niche thing that we don't do that for the reasons you mentioned.

@sanposhiho (Member), May 22, 2025

My point is that if NNN is set later, that might be because of some use case we haven't considered. Of course, we can say that using NNN for such unknown use cases isn't supported, and just ban such requests for bound pods. But I guess we don't have a strong reason to do that deliberately, so I think we can hold off on the second one for now.

OTOH, doing the first one sounds OK, in order to reduce the side effect caused by this KEP: after this proposal, we believe more components will start using NNN, and as a result, more pods could eventually be bound to nodes different from their NNN, which is the "side effect" I meant.

wojtek-t (Member, Author)

I went ahead and already adjusted it accordingly.

should try to modify it
- if `NominatedNodeName` was set by an external component (based on ManagedFields), it should
be treated as a hint to scheduler and not as an official scheduler's plan of record
- if `NominatedNodeName` was set by an external component (based on ManagedFields), this
Member

+1

- introducing a new field (e.g. `NodeNameHint` or sth like that for that purpose)
- using `pod.ObjectMeta.ManagedFields` to detect who set the field

Given that, even as of now arbitrary component or actor can already set `NNN`, we suggest to
@sanposhiho (Member), May 22, 2025

I keep arguing in GH threads and Slack DMs, but I still don't see why we need to distinguish who set NNN right now. I don't see any use case or problem that requires us to be conscious of who set NNN (and I've asked what use case/problem you both are seeing). Unless we find one, I'd prefer simply treating NNN equally, regardless of who set it.

First point: you said "NominatedNodeName was set by scheduler, no external component should try to modify it". Why does the scheduler have to be special? IMO, no external component should try to modify NNN set by others, regardless of who set it. Otherwise, two external components could keep trying to update NNN over each other's NNN.

Second point: it should be treated as a hint to the scheduler and not as an official scheduler's plan of record. What does that specifically mean? "Should be treated" by whom? External components?
Does it mean, for example, that CA is allowed to delete a node that pending pods have in their NNN if NNN was set by an external component, because NNN is just a hint in that case? (If yes,) what if the external component is trying to do something similar to preemption, that is, taking some time to make the node available for the pod with NNN?

My alternative rules here are:

  • Regardless of who set NNN for what purpose, external components shouldn't update/clear NNN set by others.
    • Only the scheduler is allowed to overwrite any NNN, in case of preemption or at the beginning of binding cycles. This is based on the assumption that the reliability of the scheduler's preemption decision is higher than other components' hints because the scheduler always simulates the scheduling perfectly. But, if we don't agree on this assumption, then I'm even OK not to execute the preemptions when a pod has NNN, so that even the preemption isn't allowed to overwrite NNN from other components.
  • Anyone (including the scheduler) who set NNN on the pod should be responsible for clearing NNN or updating it with a new hint.
  • Regardless of who set NNN for what purpose, NNN readers (e.g., the cluster autoscaler when it's trying to scale down the nodes) have to always take NNN into consideration.

wojtek-t (Member, Author)

First point, you said "NominatedNodeName was set by scheduler, no external component should try to modify it". Why the scheduler has to be special; IMO, no external component should try to modify NNN, regardless of who set it. Otherwise, two external components could keep trying to update NNN over the other's NNN.

I should have said: "no external component should try to modify NNN if it was set by someone else".
I agree with that.
But any component can modify it if it was previously set by them.
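A rough sketch of what this "modify it only if you set it" rule could look like for a component that checks ownership via ManagedFields, as the KEP text suggests. The helper names and the kube-scheduler manager prefix are assumptions for illustration, not something the KEP prescribes.

```go
package nnn

import (
	"encoding/json"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// ownsNominatedNodeName reports whether the given field manager currently
// owns status.nominatedNodeName according to the pod's managedFields.
// Minimal sketch: it just looks for the field path in the raw FieldsV1 JSON.
func ownsNominatedNodeName(pod *v1.Pod, manager string) bool {
	for _, mf := range pod.ManagedFields {
		if mf.Manager != manager || mf.FieldsV1 == nil {
			continue
		}
		var fields map[string]any
		if err := json.Unmarshal(mf.FieldsV1.Raw, &fields); err != nil {
			continue
		}
		if status, ok := fields["f:status"].(map[string]any); ok {
			if _, ok := status["f:nominatedNodeName"]; ok {
				return true
			}
		}
	}
	return false
}

// mayUpdateNNN encodes the rule discussed here: the scheduler may always
// overwrite NominatedNodeName; any other component may update or clear it
// only if the field is unset or it set the current value itself.
func mayUpdateNNN(pod *v1.Pod, manager string) bool {
	if strings.HasPrefix(manager, "kube-scheduler") { // assumed manager name
		return true
	}
	return pod.Status.NominatedNodeName == "" || ownsNominatedNodeName(pod, manager)
}
```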

Second point, it should be treated as a hint to scheduler and not as an official scheduler's plan of record. What does it specifically mean? "Should be treated" by who? External components?

By everyone (including scheduler itself).
In the end it's scheduler that's making the final decision, and up until then it can be arbitrary garbage in the NNN field.

Does it mean, for example, CA is allowed to delete a node that pending pods have on NNN if NNN is set by the external component because NNN is just a hint in that case? (If Yes,) What if the external component is trying to do something similar to the preemption, that is, taking some time to make the node available for the pod with NNN?

We're entering the playground of multiple schedulers with this (if other components can do preemption).
I wanted to avoid that.

My alternative rules here is: [...]

I think we're both saying a very similar thing:

  1. You're saying that external components can't update/clear NNN set by others.
    I'm saying there is an exception: if it's you who set it, then you can change it later.
    But you're saying that in your second point - so we're saying the same thing.

  2. You're saying that the scheduler can change it anytime - this was implicit in my point.
    So again, we're saying the same thing.

  3. You're saying that regardless of who sets it, it should be treated the same.

This seems to be the only real difference. If it was set by the scheduler, we know that the scheduler believes it can reach that state. If it was set by some other component, we may not even have a path to get there.

So it boils down to the question about use cases. I think the future use case is "external schedulers".
We want to allow external schedulers to make scheduling decisions, but at the same time, to make it really work we need to provide a coordination mechanism.

If I have scheduler A putting pod X on node N and at the same time scheduler B putting pod Y on the same node N, then assuming those two can't fit together, one of them will need to be rejected.
Now let's assume that there is a lower-priority pod already running on that node. Someone would need to preempt it. To ensure some coordination, that would be the scheduler.
So, depending on who put the nomination, the difference is:

  • if it's external component - it's a proposal, and I want scheduler to preempt the other pod for me
  • if it's scheduler - it's effectively my decision for now, and I'm already preempting the pods

However, as I wrote in 192a9a1#r2102065375, I think that by introducing these rules (and I think we're effectively saying the same thing), we're making it possible to extend this in the future in an additive way.

I like your wording though. I will try to somehow emphasize the difference and switch to your wording today/tomorrow.

wojtek-t (Member, Author)

So I went ahead and tried to update it - PTAL

@sanposhiho (Member), May 22, 2025

I think the future usecase is "external schedulers".
...
If I have scheduler A putting pod X on node N and at the same time scheduler B putting pod Y on the same node N then assuming those two can't fit together, one of these will need to be rejected.

This "external schedulers" use case is actually close to Kueue use case, isn't it?

  1. If they have to force the main scheduler to pick up their decision, they have to use a required node selector with NNN, and then un-gate pods.
  2. If they're ok with the main scheduler changing the decision if NNN isn't valid, then they can just set NNN only.

And, regardless of whether NNN is set by the scheduler (performed the preemption for the pod) or NNN is set by an external scheduler, the scheduling cycle simply ignores/honors those two NNNs based on the pod's priority.

If there's no space on the NNN node, the scheduler proceeds to preemption. As mentioned first, if an "external scheduler" needs the scheduler to pick up their NNN as the final result, then they must put a node selector on the pod, so that preemption will try to make space on the nominated/selected node. In this way, it satisfies your example scenario: if it's an external component, it's a proposal, and I want the scheduler to preempt the other pod for me.

So, TL;DR, I don't see "who sets NNN" as necessary for your "external scheduler" use case either. It's all about the pod priority; the scheduling cycle doesn't (have to) understand who NNN came from.
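As a rough sketch of the first option from that list (the "force the main scheduler to pick up the decision" flow), assuming the pod is still scheduling-gated, since node selectors can only be added while scheduling gates are present. The gate name and client plumbing are illustrative, not prescribed by the KEP.

```go
package externalsched

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pinAndRelease sketches the flow: pin the still-gated pod with a required
// node selector, record the choice as a hint in status.nominatedNodeName,
// then remove the scheduling gate so the main scheduler picks the pod up.
func pinAndRelease(ctx context.Context, cs kubernetes.Interface, pod *v1.Pod, node string) (*v1.Pod, error) {
	pods := cs.CoreV1().Pods(pod.Namespace)

	// (1) Node selectors may still be added while the pod carries scheduling gates.
	if pod.Spec.NodeSelector == nil {
		pod.Spec.NodeSelector = map[string]string{}
	}
	pod.Spec.NodeSelector["kubernetes.io/hostname"] = node
	pod, err := pods.Update(ctx, pod, metav1.UpdateOptions{})
	if err != nil {
		return nil, err
	}

	// (2) Record the hint; NominatedNodeName is part of the status subresource.
	pod.Status.NominatedNodeName = node
	pod, err = pods.UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	if err != nil {
		return nil, err
	}

	// (3) Drop the (hypothetical) gate owned by the external scheduler.
	kept := pod.Spec.SchedulingGates[:0]
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != "example.io/external-scheduler" {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept
	return pods.Update(ctx, pod, metav1.UpdateOptions{})
}
```

The second option from the list is step (2) alone: set only the hint and let the main scheduler confirm or override it.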

@sanposhiho (Member) left a comment

Thanks for working on the improvement/clarification of the KEP; overall I agree, except for a few points where I left comments.

@@ -402,21 +444,34 @@ Higher-priority pods can ignore it, but pods with equal or lower priority don't
This allows us to prioritize nominated pods when nomination was done by external components.
We just need to ensure that when NominatedNodeName was assigned by an external component, this nomination gets reflected in the scheduler's memory.

TODO: We need to ensure that this works for non-existing nodes too, and that if those nodes never appear, it won't leak memory.
Member

How does that scenario cause the memory increase and leak in the scheduler?

wojtek-t (Member, Author)

It's a pure implementation comment.
If a pod has NNN set and that node doesn't exist, we need to ensure that the scheduler cache will be appropriately cleared when NNN changes (i.e., the trigger for deleting the cached information about the node is no longer only the node deletion).

Member

If a pod has NNN set and the NNN node doesn't exist, the scheduler should do nothing with the cache, i.e., no memory increase in the scheduler cache from NNN pointing to non-existing nodes.

wojtek-t (Member, Author)

My point is that we have to cache that information.

Because as soon as the node comes up (cluster-autoscaler case), we need to know from the beginning that some resources on it are already "booked" via pods nominated to it. Or do you want to search through all pods at that point? That's certainly an option too, it's less performant, but maybe it's still good enough.

Member

Hmm, okay, that is a good point. I agree with you now; we should keep indexing pods by NNN even if the NNN node doesn't exist, and properly drop that index entry when NNN is updated/cleared.
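What was just agreed on roughly amounts to an index like the toy sketch below. This is a simplification for illustration, not the real kube-scheduler nominator: pods are keyed by their NominatedNodeName even when that node does not exist yet, and the entry is moved or dropped whenever NNN is updated or cleared, rather than only on node deletion.

```go
package nominator

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// nominationIndex maps a nominated node name (which may refer to a node that
// does not exist yet) to the set of pods nominated onto it. Entries are
// removed when a pod's NominatedNodeName changes or is cleared, not only
// when a node is deleted.
type nominationIndex struct {
	mu      sync.Mutex
	byNode  map[string]map[types.UID]struct{} // nominated node name -> pod UIDs
	nodeFor map[types.UID]string              // pod UID -> its current nomination
}

func newNominationIndex() *nominationIndex {
	return &nominationIndex{
		byNode:  map[string]map[types.UID]struct{}{},
		nodeFor: map[types.UID]string{},
	}
}

// update records the pod's current NominatedNodeName. Passing "" clears it.
func (n *nominationIndex) update(uid types.UID, nominatedNodeName string) {
	n.mu.Lock()
	defer n.mu.Unlock()

	// Drop the old entry first so stale nominations cannot accumulate
	// (this is the leak the TODO in the KEP text refers to).
	if old, ok := n.nodeFor[uid]; ok {
		delete(n.byNode[old], uid)
		if len(n.byNode[old]) == 0 {
			delete(n.byNode, old)
		}
		delete(n.nodeFor, uid)
	}
	if nominatedNodeName == "" {
		return
	}
	if n.byNode[nominatedNodeName] == nil {
		n.byNode[nominatedNodeName] = map[types.UID]struct{}{}
	}
	n.byNode[nominatedNodeName][uid] = struct{}{}
	n.nodeFor[uid] = nominatedNodeName
}
```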
