
[WIP] Extending the semantics of nominated node name #5329


Draft
wants to merge 2 commits into base: master
Conversation

wojtek-t (Member)

  • One-line PR description:
  • Issue link:
  • Other comments:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels May 22, 2025
@k8s-ci-robot k8s-ci-robot requested review from macsko and palnabarun May 22, 2025 09:09
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wojtek-t
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling May 22, 2025
@wojtek-t wojtek-t marked this pull request as draft May 22, 2025 09:09
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 22, 2025
@@ -205,21 +205,22 @@ based on the expected pod placement.

### External components want to specify a preferred pod placement

The cluster autoscaler or Kueue internally calculates the pod placement,
and create new nodes or un-gate pods based on the calculation result.
The ClusterAutoscaler or Karpenter internally calculate the pod placement,
wojtek-t (Member, Author)

@sanposhiho - there are multiple changes including:

  • small updates to motivation and user stories
  • rewrite of risks section based on the end state we propose
  • discussion of new risk
  • addressing some comments discussed on the original PR

I think you should take the adjustments from my commit that you think make sense and just incorporate them into your PR, and I will close this one after that.

be treated as a hint to scheduler and not as an official scheduler's plan of record
- if `NominatedNodeName` was set by an external component (based on ManagedFields), this
component should clear or update it (using PUT or APPLY to ensure it won't conflict with
potential update from scheduler) to reflect the new hint
wojtek-t (Member, Author)

@dom4ha - in the context of the different phases of reservation/allocation of resources that we were talking about, I think that what I'm proposing above paves the way towards an additive extension in the future.

With such an approach, we're describing how the semantics should work, and eventually we can add a stage field that would reflect that more explicitly.
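For illustration, a minimal sketch of how an external component might set `NominatedNodeName` the way the quoted KEP text suggests, i.e. via server-side apply with its own field manager, so that a concurrent write by the scheduler surfaces as a conflict instead of being silently overwritten. The manager name, namespace, and pod/node names below are placeholders, not part of the KEP.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1apply "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// nominate sets status.nominatedNodeName on a pending pod via server-side
// apply. With Force=false, the request fails with a conflict if another
// field manager (e.g. the scheduler) currently owns the field.
func nominate(ctx context.Context, cs kubernetes.Interface, ns, podName, node string) error {
	pac := corev1apply.Pod(podName, ns).
		WithStatus(corev1apply.PodStatus().WithNominatedNodeName(node))
	_, err := cs.CoreV1().Pods(ns).ApplyStatus(ctx, pac, metav1.ApplyOptions{
		FieldManager: "example-provisioner", // placeholder manager name
		Force:        false,
	})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	if err := nominate(context.Background(), cs, "default", "pending-pod", "node-a"); err != nil {
		fmt.Println("nomination not applied:", err)
	}
}
```

Updating the hint later is the same apply call with a new node name; the component remains the field manager of the value it wrote, so its own follow-up writes don't conflict.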

Comment on lines +365 to +366
- [to be confirmed during API review] we fail the validation when someone attempts to set
`NNN` for an already bound pod
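Purely to make the rule under discussion concrete, here is a hypothetical sketch of what such a validation could look like; this is not the actual kube-apiserver validation code, and the function name and error text are made up.

```go
package validation

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// validateNominatedNodeNameUpdate illustrates the proposed rule: reject an
// update that sets (or changes) NominatedNodeName on a pod that is already
// bound, i.e. whose spec.nodeName is non-empty. Illustrative only.
func validateNominatedNodeNameUpdate(oldPod, newPod *v1.Pod) error {
	changed := newPod.Status.NominatedNodeName != oldPod.Status.NominatedNodeName
	if changed && newPod.Status.NominatedNodeName != "" && newPod.Spec.NodeName != "" {
		return fmt.Errorf("nominatedNodeName cannot be set for a pod already bound to node %q",
			newPod.Spec.NodeName)
	}
	return nil
}
```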
@sanposhiho (Member), May 22, 2025

Do we actually need this? From our perspective it does not make sense to set NNN for bound pods, but I'm afraid this validation would just break some external components unnecessarily, ones that have already utilized NNN and have a (weird?) behavior of setting NNN for bound pods for some purpose.
I'm not sure we should take such a weird case into account, since we're not even sure it exists; at the same time, I'm not sure we should explicitly try to ban those API operations on purpose from now on.

@sanposhiho (Member), May 22, 2025

Probably, what I wanted to say is: do we gain any advantage from doing this that justifies the risk of breaking existing components that might exist out there? If someone does this, then it might make sense to them for their specific use cases. And it's just their fault that, as a result, NNN vs NodeName could be different and might be confusing. But they should already know that, and still do it on purpose for some reason. I'm not sure we should explicitly forbid it from now on.

wojtek-t (Member, Author)

My motivation was that it's misleading when NodeName and NominatedNodeName are different, no matter when they were set. This second part was addressing the case when NNN is set later.

I'm fine with saying that it's such a niche thing that we don't do that for the reasons you mentioned.

@sanposhiho (Member), May 22, 2025

My point is that if NNN is set later, that might be because of some use case we haven't considered. Of course, we can say that using NNN for such unknown use cases isn't supported, and just ban such requests for bound pods. But I guess we don't have a strong reason to do that deliberately, so I think we can hold off on the second one for now.

OTOH, doing the first one sounds OK, in order to reduce the side effect caused by this KEP: after this proposal, we believe more components will start using NNN, and as a result, more pods could eventually be bound to nodes different from their NNN, which is the "side effect" I meant.

wojtek-t (Member, Author)

I went ahead and already adjusted it accordingly.

should try to modify it
- if `NominatedNodeName` was set by an external component (based on ManagedFields), it should
be treated as a hint to scheduler and not as an official scheduler's plan of record
- if `NominatedNodeName` was set by an external component (based on ManagedFields), this
Member

+1

- introducing a new field (e.g. `NodeNameHint` or sth like that for that purpose)
- using `pod.ObjectMeta.ManagedFields` to detect who set the field

Given that, even as of now arbitrary component or actor can already set `NNN`, we suggest to
@sanposhiho (Member), May 22, 2025

I keep arguing in GH threads and Slack DMs, but I still don't see why we need to distinguish who set NNN right now. I don't see any use case or problem that requires us to be conscious of who set NNN (and I've asked what use case/problem you both are seeing). Unless we find one, I'd prefer simply treating NNN equally, regardless of who set it.

First point: you said "NominatedNodeName was set by scheduler, no external component should try to modify it". Why does the scheduler have to be special? IMO, no external component should try to modify NNN set by others, regardless of who set it. Otherwise, two external components could keep trying to update NNN over each other's NNN.

Second point: it should be treated as a hint to the scheduler and not as an official scheduler's plan of record. What does that specifically mean? "Should be treated" by whom? External components?
Does it mean, for example, that CA is allowed to delete a node that pending pods have in their NNN if NNN was set by an external component, because NNN is just a hint in that case? (If yes,) what if the external component is trying to do something similar to preemption, that is, taking some time to make the node available for the pod with NNN?

My alternative rules here are:

  • Regardless of who set NNN for what purpose, external components shouldn't update/clear NNN set by others.
    • Only the scheduler is allowed to overwrite any NNN, in case of preemption or at the beginning of binding cycles. This is based on the assumption that the reliability of the scheduler's preemption decision is higher than other components' hints because the scheduler always simulates the scheduling perfectly. But, if we don't agree on this assumption, then I'm even OK not to execute the preemptions when a pod has NNN, so that even the preemption isn't allowed to overwrite NNN from other components.
  • Anyone (including the scheduler) who set NNN on the pod should be responsible for clearing NNN or updating it with a new hint.
  • Regardless of who set NNN for what purpose, NNN readers (e.g., the cluster autoscaler when it's trying to scale down the nodes) have to always take NNN into consideration.

wojtek-t (Member, Author)

First point, you said "NominatedNodeName was set by scheduler, no external component should try to modify it". Why the scheduler has to be special; IMO, no external component should try to modify NNN, regardless of who set it. Otherwise, two external components could keep trying to update NNN over the other's NNN.

I should have said: "no external component should try to modify NNN if it was set by someone else".
I agree with that.
But any component can modify it if it was previously set by them.
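A rough sketch of what this "modify it only if you set it" rule could look like for a component that checks ownership via ManagedFields, as the KEP text suggests. The helper names and the kube-scheduler manager prefix are assumptions for illustration, not something the KEP prescribes.

```go
package nnn

import (
	"encoding/json"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// ownsNominatedNodeName reports whether the given field manager currently
// owns status.nominatedNodeName according to the pod's managedFields.
// Minimal sketch: it just looks for the field path in the raw FieldsV1 JSON.
func ownsNominatedNodeName(pod *v1.Pod, manager string) bool {
	for _, mf := range pod.ManagedFields {
		if mf.Manager != manager || mf.FieldsV1 == nil {
			continue
		}
		var fields map[string]any
		if err := json.Unmarshal(mf.FieldsV1.Raw, &fields); err != nil {
			continue
		}
		if status, ok := fields["f:status"].(map[string]any); ok {
			if _, ok := status["f:nominatedNodeName"]; ok {
				return true
			}
		}
	}
	return false
}

// mayUpdateNNN encodes the rule discussed here: the scheduler may always
// overwrite NominatedNodeName; any other component may update or clear it
// only if the field is unset or it set the current value itself.
func mayUpdateNNN(pod *v1.Pod, manager string) bool {
	if strings.HasPrefix(manager, "kube-scheduler") { // assumed manager name
		return true
	}
	return pod.Status.NominatedNodeName == "" || ownsNominatedNodeName(pod, manager)
}
```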

Second point, it should be treated as a hint to scheduler and not as an official scheduler's plan of record. What does it specifically mean? "Should be treated" by who? External components?

By everyone (including scheduler itself).
In the end it's scheduler that's making the final decision, and up until then it can be arbitrary garbage in the NNN field.

Does it mean, for example, CA is allowed to delete a node that pending pods have on NNN if NNN is set by the external component because NNN is just a hint in that case? (If Yes,) What if the external component is trying to do something similar to the preemption, that is, taking some time to make the node available for the pod with NNN?

We're entering the playground of multiple schedulers with this (if other components can do preemption).
I wanted to avoid that.

My alternative rules here is: [...]

I think we're both saying a very similar thing:

  1. You're saying that external components can't update/clear NNN set by others.
    I'm saying there is an exception: if it's you who set it, then you can change it later.
    But you're saying that in your second point - so we're saying the same thing.

  2. You're saying that the scheduler can change it anytime - this was implicit in my point.
    So again, we're saying the same thing.

  3. You're saying that regardless of who sets it, it should be treated the same.

This seems to be the only real difference. If it was set by the scheduler, we know that the scheduler believes it can reach that state. If it was set by some other component, we may not even have a path to get there.

So it boils down to the question about use cases. I think the future use case is "external schedulers".
We want to allow external schedulers to make scheduling decisions, but at the same time, to make it really work we need to provide a coordination mechanism.

If I have scheduler A putting pod X on node N and at the same time scheduler B putting pod Y on the same node N, then assuming those two can't fit together, one of them will need to be rejected.
Now let's assume that there is a lower-priority pod already running on that node. Someone would need to preempt it. To ensure some coordination, that would be the scheduler.
So, depending on who put the nomination, the difference is:

  • if it's external component - it's a proposal, and I want scheduler to preempt the other pod for me
  • if it's scheduler - it's effectively my decision for now, and I'm already preempting the pods

However, as I wrote in 192a9a1#r2102065375, I think that by introducing these rules (and I think we're effectively saying the same thing), we're making it possible to extend this in the future in an additive way.

I like your wording though. I will try to somehow emphasize the difference and switch to your wording today/tomorrow.

wojtek-t (Member, Author)

So I went ahead and tried to update it - PTAL

@sanposhiho (Member), May 22, 2025

I think the future usecase is "external schedulers".
...
If I have scheduler A putting pod X on node N and at the same time scheduler B putting pod Y on the same node N then assuming those two can't fit together, one of these will need to be rejected.

This "external schedulers" use case is actually close to Kueue use case, isn't it?

  1. If they have to force the main scheduler to pick up their decision, they have to use a required node selector with NNN, and then un-gate pods.
  2. If they're ok with the main scheduler changing the decision if NNN isn't valid, then they can just set NNN only.

And, regardless of whether NNN is set by the scheduler (performed the preemption for the pod) or NNN is set by an external scheduler, the scheduling cycle simply ignores/honors those two NNNs based on the pod's priority.

If there's no space on the NNN node, the scheduler proceeds to preemption. As mentioned first, if an "external scheduler" needs the scheduler to pick up their NNN as the final result, then they must put a node selector on the pod, so that preemption will try to make space on the nominated/selected node. In this way, it satisfies your example scenario: if it's an external component, it's a proposal, and I want the scheduler to preempt the other pod for me.

So, TL;DR, I don't see "who sets NNN" as necessary for your "external scheduler" use case either. It's all about the pod priority; the scheduling cycle doesn't (have to) understand who NNN came from.
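As a rough sketch of the first option from that list (the "force the main scheduler to pick up the decision" flow), assuming the pod is still scheduling-gated, since node selectors can only be added while scheduling gates are present. The gate name and client plumbing are illustrative, not prescribed by the KEP.

```go
package externalsched

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pinAndRelease sketches the flow: pin the still-gated pod with a required
// node selector, record the choice as a hint in status.nominatedNodeName,
// then remove the scheduling gate so the main scheduler picks the pod up.
func pinAndRelease(ctx context.Context, cs kubernetes.Interface, pod *v1.Pod, node string) (*v1.Pod, error) {
	pods := cs.CoreV1().Pods(pod.Namespace)

	// (1) Node selectors may still be added while the pod carries scheduling gates.
	if pod.Spec.NodeSelector == nil {
		pod.Spec.NodeSelector = map[string]string{}
	}
	pod.Spec.NodeSelector["kubernetes.io/hostname"] = node
	pod, err := pods.Update(ctx, pod, metav1.UpdateOptions{})
	if err != nil {
		return nil, err
	}

	// (2) Record the hint; NominatedNodeName is part of the status subresource.
	pod.Status.NominatedNodeName = node
	pod, err = pods.UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	if err != nil {
		return nil, err
	}

	// (3) Drop the (hypothetical) gate owned by the external scheduler.
	kept := pod.Spec.SchedulingGates[:0]
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != "example.io/external-scheduler" {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept
	return pods.Update(ctx, pod, metav1.UpdateOptions{})
}
```

The second option from the list is step (2) alone: set only the hint and let the main scheduler confirm or override it.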

@sanposhiho (Member) left a comment

Thanks for working on the improvement/clarification of the KEP; overall I agree, except for a few points where I left comments.

@@ -402,21 +444,34 @@ Higher-priority pods can ignore it, but pods with equal or lower priority don't
This allows us to prioritize nominated pods when nomination was done by external components.
We just need to ensure that when NominatedNodeName was assigned by an external component, this nomination gets reflected in the scheduler's memory.

TODO: We need to ensure that this works for non-existing nodes too, and that if those nodes never appear, it won't leak memory.
Member

How does that scenario cause the memory increase and leak in the scheduler?

wojtek-t (Member, Author)

It's a pure implementation comment.
If a pod has NNN set and that node doesn't exist, we need to ensure that the scheduler cache will be appropriately cleared when NNN changes (i.e., the trigger for deleting the cached information about the node is no longer only the node deletion).

Member

If a pod has NNN set and the NNN node doesn't exist, the scheduler should do nothing with the cache, i.e., no memory increase in the scheduler cache from NNN pointing to non-existing nodes.

wojtek-t (Member, Author)

My point is that we have to cache that information.

Because as soon as the node comes up (cluster-autoscaler case), we need to know from the beginning that some resources on it are already "booked" via pods nominated to it. Or do you want to search through all pods at that point? That's certainly an option too, it's less performant, but maybe it's still good enough.

Member

Hmm, okay, that is a good point. I agree with you now; we should keep indexing pods by NNN even if the NNN node doesn't exist, and properly drop that index entry when NNN is updated/cleared.
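What was just agreed on roughly amounts to an index like the toy sketch below. This is a simplification for illustration, not the real kube-scheduler nominator: pods are keyed by their NominatedNodeName even when that node does not exist yet, and the entry is moved or dropped whenever NNN is updated or cleared, rather than only on node deletion.

```go
package nominator

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// nominationIndex maps a nominated node name (which may refer to a node that
// does not exist yet) to the set of pods nominated onto it. Entries are
// removed when a pod's NominatedNodeName changes or is cleared, not only
// when a node is deleted.
type nominationIndex struct {
	mu      sync.Mutex
	byNode  map[string]map[types.UID]struct{} // nominated node name -> pod UIDs
	nodeFor map[types.UID]string              // pod UID -> its current nomination
}

func newNominationIndex() *nominationIndex {
	return &nominationIndex{
		byNode:  map[string]map[types.UID]struct{}{},
		nodeFor: map[types.UID]string{},
	}
}

// update records the pod's current NominatedNodeName. Passing "" clears it.
func (n *nominationIndex) update(uid types.UID, nominatedNodeName string) {
	n.mu.Lock()
	defer n.mu.Unlock()

	// Drop the old entry first so stale nominations cannot accumulate
	// (this is the leak the TODO in the KEP text refers to).
	if old, ok := n.nodeFor[uid]; ok {
		delete(n.byNode[old], uid)
		if len(n.byNode[old]) == 0 {
			delete(n.byNode, old)
		}
		delete(n.nodeFor, uid)
	}
	if nominatedNodeName == "" {
		return
	}
	if n.byNode[nominatedNodeName] == nil {
		n.byNode[nominatedNodeName] = map[types.UID]struct{}{}
	}
	n.byNode[nominatedNodeName][uid] = struct{}{}
	n.nodeFor[uid] = nominatedNodeName
}
```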
