Conversation
✅ Deploy Preview for kubernetes-sigs-kueue ready!
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: PBundyra. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
One concern I didn't see addressed explicitly is the coupling between a Workload and the set of ResourceFlavors once Options are created. Today, ResourceFlavor lifecycle is largely decoupled from workloads: a non-admitted workload is naturally retried against whatever flavors exist at scheduling time, and RF add/remove events are implicitly reflected in subsequent scheduling attempts. With Concurrent Admission, each Option appears to bind a workload to a specific set of ResourceFlavors. Is this coupling intentional (i.e., RF membership is effectively snapshotted per Option), or is the expectation that the Option lifecycle controller reconciles Options in response to RF add/remove events? Either behavior seems reasonable, but the KEP currently doesn't spell this out, and it feels like an observable semantic change compared to today's behavior. It would be helpful to make this assumption explicit in the KEP, both for operator expectations and to avoid ambiguity around RF lifecycle handling.
One additional case I didn't see called out explicitly is how Concurrent Admission is expected to behave for workloads with multiple PodSets. In Kueue today, ResourceFlavor assignment happens per PodSet, and it is possible for different PodSets within the same Workload to be assigned different ResourceFlavors. With Concurrent Admission introducing Option Workloads that appear to model "attempts" against specific flavors or flavor sets, it's not clear whether the intent is primarily whole-workload flavor placement (all PodSets landing on the same RF tier), or whether mixed PodSet → ResourceFlavor assignments within a single Option are an expected and supported outcome. The KEP seems to inherit the existing per-PodSet flavor assignment behavior implicitly, but does not discuss this case explicitly or provide examples. It would be helpful to clarify whether this scenario is in scope for Concurrent Admission, and how it is expected to interact with upgrade semantics.
| 1) Narrowing selection of ResourceFlavors for a given Workload. This can, however, also be used outside of the Concurrent Admission feature, creating more flexibility for Kueue.
| 2) Preempting sibling Options when admitting more preferable ones.
|
| ### Risks and Mitigations
One risk to me is debuggability for the time-based options. It is sometimes hard to know when the accounting period started, etc.
I think this is quite a deep and useful feature. Would you like to present it at the next wg-batch meeting?
💯 in agreement - this is super cool! Would love to see a demo.
| To achieve that, I configure my ClusterQueue to use Concurrent Admission with the `ExplicitOptions` policy.
| I create a configuration for the Reservation Option with `AllowedResourceFlavors=["Reservation", "Default-CPU"]`
| and for the On-Demand Option with `AllowedResourceFlavors=["On-Demand", "Default-CPU"]`.
This story is a good setup to flesh out the parent Workload vs. WorkloadOption cardinality, especially in the presence of multiple PodSets.
Consider a small extension of the example:
- GPU: two flavors, as in the story
  - Reservation
  - On-Demand
- CPU: two flavors
  - Default-CPU
  - Special-CPU
Given a Workload with two PodSets (GPU and CPU), what does the resulting WorkloadOption list look like?
Is it the cross product of flavors across PodSets, for example:
- Reservation-GPU / Default-CPU
- Reservation-GPU / Special-CPU
- On-Demand-GPU / Default-CPU
- On-Demand-GPU / Special-CPU
If so, this would further amplify option fan-out, since the number of WorkloadOptions would grow as the product of flavor choices per PodSet rather than just the number of GPU flavors. It would be helpful to clarify whether this cross-product behavior is intended, and if not, how option generation is constrained in multi-PodSet scenarios.
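To make the fan-out concrete, here is a minimal sketch of the cross-product enumeration described above; the flavor names and the enumeration helper are purely illustrative and not part of the KEP's API:

```go
package main

import "fmt"

// enumerateOptions returns the cross product of allowed flavors per PodSet.
// Each element of the result is one hypothetical WorkloadOption, expressed
// as an ordered flavor choice per PodSet.
func enumerateOptions(flavorsPerPodSet [][]string) [][]string {
	options := [][]string{{}}
	for _, flavors := range flavorsPerPodSet {
		var next [][]string
		for _, opt := range options {
			for _, f := range flavors {
				choice := append(append([]string{}, opt...), f)
				next = append(next, choice)
			}
		}
		options = next
	}
	return options
}

func main() {
	// Two PodSets: GPU with two flavors, CPU with two flavors.
	opts := enumerateOptions([][]string{
		{"Reservation-GPU", "On-Demand-GPU"},
		{"Default-CPU", "Special-CPU"},
	})
	for _, o := range opts {
		fmt.Println(o) // 2 x 2 = 4 combinations
	}
}
```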
Well, it depends on the admin's intention. If we assume that the GPU is the dominant resource—which is often true in real-world setups—and that the migration of jobs is only relevant on the GPU axis (as it is the most expensive and scarce resource), then the number of WorkloadOptions would remain the same. An admin could simply add Special-CPU to the list of AllowedResourceFlavors to allow the workload to be scheduled on it.
If we wanted to migrate on the CPU axis as well, then in this example it would indeed result in a cross-product of RFs. However, I consider this a less likely setup for real-world use cases.
> Well, it depends on the admin's intention.

Here, and in the above comments with a "scalability" context, I am considering the worst-case scenario, i.e., Big-O.
Still, I'd argue the worst-case scenario depends on the use case. With Concurrent Admission, the number of Options per Job is a product over all migration dimensions. If the only dimension we want to migrate along is GPU, and we treat the different CPU flavors as equally good, then the product equals 1 (CPU dimension) x #GPU-flavors. If we want to migrate along both the GPU and CPU dimensions, then indeed the product is #CPU-flavors x #GPU-flavors.
Theoretically, this API allows creating 2^#RF - 2 different Options, because that is the number of all RF subsets excluding the empty one and the one containing all RFs. However, in an environment with a lot of different flavors I treat that more as a misconfiguration than a real-world worst case. Misconfiguration is already one of the points in the Risks section.
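For concreteness, the counting argument above can be summarized as (the numbers are purely illustrative):

$$
\#\text{Options} = \prod_{d \in \text{migration dimensions}} \#\text{flavors}_d,
\qquad
\#\text{Options}_{\max} = 2^{\#RF} - 2 .
$$

For example, migrating only on the GPU axis with 2 GPU flavors gives $1 \times 2 = 2$ Options, migrating on both axes with 2 CPU and 2 GPU flavors gives $2 \times 2 = 4$, and with $\#RF = 4$ the theoretical maximum is $2^4 - 2 = 14$.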
I think I understand your point, and it may be partially related to an earlier issue I reported around the runtime complexity of the flavor assignment logic in Kueue: #6121.
That issue focused on the nested-loop structure in assignFlavor, where flavor resolution scales with the number of PodSets, resources per PodSet, and flavors per resource group. While the report itself may be somewhat outdated, I believe the underlying complexity analysis still holds and is worth keeping in mind as workloads and flavor configurations grow more complex.
From that perspective, the complexity already exists today. Splitting admission into multiple Workload objects (for example, per ResourceFlavor) does not introduce a new class of complexity, but instead helps scope and contain the existing evaluation work. Each Option operates over a narrower flavor set, which can make the admission logic easier to reason about and potentially reduce per-attempt cost.
> Theoretically, this API allows creating 2^#RF - 2 different Options, because that is the number of all RF subsets excluding the empty one and the one containing all RFs. However, in an environment with a lot of different flavors I treat that more as a misconfiguration than a real-world worst case. Misconfiguration is already one of the points in the Risks section.

Yes, but I think it is preferable to guide users away from such misconfigurations. Adding validation to prevent complexity from blowing up is the usual strategy in Kueue (say, with the number of flavors capped at 64, or the number of resources). Similarly, we cap the number of levels in Topology and the number of clusters in MultiKueue.
This allows us to reason about the complexity. Sure, sometimes it means that we need to relax the limits if use cases prove higher numbers to be useful, but it is much easier to relax validation than to strengthen it.
So, what about limiting the feature so that the number of flavors is <= 8, or the number of ExplicitOptions is <= 16?
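As an illustration only, such a cap could be enforced with a simple validation step along these lines; the constants and the function name are hypothetical placeholders, not a proposed API:

```go
package validation

import "fmt"

// Hypothetical caps, mirroring how Kueue bounds other dimensions
// (flavor counts, topology levels, MultiKueue clusters). The values are placeholders.
const (
	maxFlavorsForConcurrentAdmission = 8
	maxExplicitOptions               = 16
)

// validateConcurrentAdmission is an illustrative check; the arguments stand in
// for whatever counts the ClusterQueue webhook would derive from the spec.
func validateConcurrentAdmission(numFlavors, numExplicitOptions int) error {
	if numFlavors > maxFlavorsForConcurrentAdmission {
		return fmt.Errorf("concurrent admission supports at most %d resource flavors, got %d",
			maxFlavorsForConcurrentAdmission, numFlavors)
	}
	if numExplicitOptions > maxExplicitOptions {
		return fmt.Errorf("at most %d explicit options are allowed, got %d",
			maxExplicitOptions, numExplicitOptions)
	}
	return nil
}
```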
| 2) Option Workload: A cloned view of the Parent Workload with specific scheduling constraints. Most notably, an Option is restricted to a subset of ResourceFlavors.
|
| ### Architecture & Cardinality
| The relationship between a Parent and its Options follows a parent–child model with 1:N cardinality (where $N \ge 1$). While the number of Options is typically determined by the variety of PodSets and ClusterQueue ResourceFlavors, each remains a distinct Kubernetes object persisted in etcd.
Thank you for expanding on this construct. I see clear value in the Parent/Option split, not only for flavor-specific admission and migration, but also as a way to reduce pressure on the current Workload object, which today is mutated by multiple concurrent Kueue controllers and serves several roles at once.
The Parent/Option model provides better scope isolation: the Parent acts as a stable definition and aggregation point, while Options encapsulate admission and scheduling context, without requiring changes to the scheduler or quota logic. This separation also hints at benefits beyond flavor-related scenarios, for example around clearer mutability boundaries and reduced update contention.
Looking ahead, this pattern suggests a possible Phase-2 evolution toward a more explicit admission-focused construct with a reduced and well-defined mutability surface, similar to other Kubernetes parent/child models. For now, treating Parent and Option as the same Workload type feels like a pragmatic choice, as long as the design keeps the door open for such an evolution and doesn’t lock us into this specific representation long-term.
| OptionStatePending = "Pending"
|
| // OptionStateAdmitted means the Option has been admitted
| OptionStateAdmitted = "Admitted"
Couldn't the option mechanism be used with AdmissionChecks? If so, we would also need to distinguish the "QuotaReserved" state. Does the design consider such Options as "Pending"? I'm not sure whether we need to make that distinction, but this information seems useful for decision making.
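To illustrate the question, one possible (purely hypothetical) extension of the state set would be an explicit QuotaReserved value between Pending and Admitted; whether that distinction is worth exposing is exactly the open point:

```go
// Sketch only: the KEP proposes Pending and Admitted; QuotaReserved is a
// hypothetical addition that would make the AdmissionCheck phase observable.
const (
	OptionStatePending       = "Pending"
	OptionStateQuotaReserved = "QuotaReserved" // quota held, AdmissionChecks still in progress
	OptionStateAdmitted      = "Admitted"
)
```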
| At any given point in time, only one Option per Parent may be admitted by Kueue.
|
| To support this, we will introduce a new controller and extend the ClusterQueue API with a new `.spec` field to manage Option activation and deactivation.
What is meant by "activation" and "deactivation" here? I'm confused because the policy options below talk about "Remove", while deactivation is a distinct technical term. I think it makes sense to actually use deactivation rather than removal in some cases.
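For the distinction between the two terms, a rough sketch (assuming Options are ordinary kueue Workload objects, as the KEP currently suggests; the helper names are illustrative): deactivation keeps the object but flips `spec.active`, while removal deletes it.

```go
package optionlifecycle

import (
	"context"

	"k8s.io/utils/ptr"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// deactivateOption keeps the Option in etcd (better debuggability) but marks it
// inactive so it stops competing for quota.
func deactivateOption(ctx context.Context, c ctrlclient.Client, opt *kueue.Workload) error {
	opt.Spec.Active = ptr.To(false)
	return c.Update(ctx, opt)
}

// removeOption deletes the Option object outright, offloading the API server at
// the cost of losing the record of the attempt.
func removeOption(ctx context.Context, c ctrlclient.Client, opt *kueue.Workload) error {
	return c.Delete(ctx, opt)
}
```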
| RemoveLower OnSuccessPolicy = "RemoveLower"
|
| // Stop all attempts below a defined target RF.
| RemoveBelowTarget OnSuccessPolicy = "RemoveBelowTarget"
I'm wondering about the naming here. Even if removal is preferred (to offload the API server), I think deactivation should give analogous effects, but could provide better debuggability. So, I'm wondering if we could keep the naming more flexible and use-case oriented, like:
- NoMigration
- AllowUpgrades
- AllowUpgradesAboveTarget
Then it would be a secondary decision whether we use removal or deactivation. It also seems more natural to an admin, who cares about the effect rather than the technical details.
wdyt?
Also, I feel it is unclear what "lower" or "below" means for an option that spans multiple flavors, say one below and one above the selected target. And here maybe the naming also makes a difference.
For example:
- "RemoveLower" would intuitively mean to me: "remove the option as it contains target flavors lower than selected".
- "AllowUpgrades" would mean "keep the option, because it allows an upgrade"
| RemoveBelowTargetConfig *ConcurrentAdmissionRemoveBelowTargetConfig
| }
|
| type OptionCreationCustomization struct {
nit, align the naming with ConcurrentAdmissionExplicitOption
| ### Graduation Criteria
|
| #### Alpha
Is there going to be a feature gate reflecting the maturity level? I know the API is opt-in, but without a feature gate indicating the level, users may have wrong expectations about the maturity and stability of the new feature. Especially for complex features like this one, some indication via a feature gate is usually preferred.
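As an illustration of what that could look like, a gate could be registered through the component-base feature gate machinery; the gate name `ConcurrentAdmission` and its placement are placeholders, not part of the KEP:

```go
package features

import (
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

// ConcurrentAdmission is a placeholder gate name for this feature.
const ConcurrentAdmission featuregate.Feature = "ConcurrentAdmission"

func init() {
	// Alpha: disabled by default, signalling limited maturity and stability.
	utilruntime.Must(utilfeature.DefaultMutableFeatureGate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		ConcurrentAdmission: {Default: false, PreRelease: featuregate.Alpha},
	}))
}
```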
| type WorkloadType string
| const (
| Default WorkloadType = "Default"
| ResourceFlavorOption WorkloadType = "ResourceFlavorOption"
| Parent WorkloadType = "Parent"
| ... // possibly more like WorkloadSlice, PrebuiltWorkload
| )
Let's move this to Alternatives and just link to it from here, to offload the technical details from this section. For large KEPs this section tends to grow.
| ${original_workload_name}-option-${explicit_option_name}
| ```
|
| Note: Option names are designed to be deterministic. If a name collision occurs (due to long Workload/RF names), standard Kubernetes suffix truncation logic will be applied while maintaining the -option- identifier.
What is "standard Kubernetes suffix truncation logic" here? I'm not sure, we implement it in Kueue this ourselves in pkg/controller/jobframework/workload_names.go, so if we don't put an additional truncation it may very well exceed the limit. Not just nit picking - I just don't understand what / how will happen.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Part of #8691
Special notes for your reviewer:
Does this PR introduce a user-facing change?