KEP #8826: Uber Cluster Queues (#8864)
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mwielgus. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
| Benefit: Compliance requirements are met without complex operational runbooks to "clear the | ||
| cluster." | ||
| ### Risks and Mitigations |
One major issue would be abuse of the "Hero" queue.
Sorta related to Resource Starvation, but I think it's more that someone could use this to effectively skip fair sharing.
What type of abuse do you have in mind?
I guess to me every user I have ever worked with treats their jobs as super important and would love to skip the line.
So I guess the main protection would be that admins would only create a LocalQueue pointing to this Hero Queue for users they trust not to submit blindly to it.
We don't really have any enforcement, but it's sorta the Spider-Man adage: "With great power comes great responsibility."
Access to the UCQ can be restricted to only those individuals (namespaces/LQs) whom the corresponding part of the organization trusts. And yes, those people get great power over that set of quotas.
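For illustration, a minimal sketch of that gating, with hypothetical names: the admin creates the only LocalQueue that points at the uber CQ, and creates it solely in a namespace the organization has vetted.

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// heroEntryPoint builds the single LocalQueue that grants access to the
// uber ClusterQueue. Creating it only in vetted namespaces is the admin's
// whole access-control lever; nothing else enforces restraint.
// The names "trusted-team" and "uber-cq" are illustrative.
func heroEntryPoint() *kueue.LocalQueue {
	return &kueue.LocalQueue{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "hero-queue",
			Namespace: "trusted-team",
		},
		Spec: kueue.LocalQueueSpec{
			ClusterQueue: "uber-cq",
		},
	}
}
```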
| * Nominal quotas stop being guaranteed. | ||
| * Well-understood rules start to have exceptions. | ||
| ## Alternatives |
It isn't clear to me why you can't model this as a dedicated ClusterQueue and a WorkloadPriorityClass.
Could you create a ClusterQueue that only special "heroes" submit to, and mark the workload priority class as critical?
No matter how high the priority is, the workload doesn't get into other queues' nominal quota. The whole trick is how to get an outside workload into nominal quota and prevent its reclamation; a critical-priority workload that merely borrows from the cohort can still be reclaimed as soon as the lending CQ needs its nominal quota back.
| Limitations of Current Quota Models | ||
| The necessity for the UberClusterQueue stems from specific rigidities in the current preemption logic. | ||
| * Guaranteed Quota Immunity: In the standard model, a workload running within its ClusterQueue's |
With Uber ClusterQueues, nominal quota is no longer a hard protection boundary but a best-effort guarantee subject to administrative override. That’s a significant change in the guarantees Kueue provides to ClusterQueue owners. I think we should be explicit about this shift, including what guarantees remain, what guarantees are weakened, and how operators are expected to communicate and govern the use of Uber CQs.
Uber CQ doesn’t just add power; it changes the trust model. ClusterQueue owners now have to trust that the override mechanism will be used sparingly and responsibly, because the system itself can no longer enforce absolute protection.
I agree that this is a significant change; however, the alternatives work almost the same way, and the end result is the same.
In order to allow hero jobs, users cannot have absolute, everlasting guarantees, because at some point that hero job might be started, and then their workloads will be interrupted: either because of the UCQ, or because someone temporarily changed quotas, or because they used Fair Sharing with weights and nothing was ever guaranteed.
One concern I have with the Uber ClusterQueue approach is transparency. For the “wartime” scenarios this KEP targets, I actually want disruption to be explicit and explainable. With Uber CQ, the override is implicit: peacetime configuration remains unchanged, yet workloads in other ClusterQueues can be evicted or lose capacity due to someone else’s configuration. From a CQ owner’s perspective, guarantees effectively change without any change to their own spec, which makes reasoning, debugging, and accountability harder. Even if the behavior is correct, it is much less discoverable why it happened. If this proposal moves forward, I think we need first-class signals (status, conditions, events) that explicitly indicate when a cohort and the impacted CQs are effectively in a “wartime” mode, and that clearly attribute preemptions to an Uber CQ override.
Another concern I have is precedent and escalation. Once we introduce a first-class mechanism that allows a ClusterQueue to override nominal quota guarantees, it becomes hard to draw a principled line against “hero’s hero” or multiple Uber ClusterQueues within the same cohort. Even if the intent is rare, one-off use, the API is permanent and will predictably attract requests for additional tiers, ordering between Uber CQs, or broader access over time. At that point, we risk turning a narrowly scoped exception into an implicit hierarchy of dominance between ClusterQueues, which feels at odds with Kueue’s current model where guarantees are explicit, local, and stable. If this proposal moves forward, I think we need very strong guardrails (for example, enforcing a single Uber CQ per cohort subtree, time-bounded activation, or similar) to prevent this kind of escalation.
I seem to remember @dgrove-oss discussing with me how they implement something similar at IBM. Maybe you have some thoughts here?
I see your point. The reason there is no hard Hold/Hold-and-drain is that a CQ may be only partially affected; I try to distribute the impact of a UCQ workload across many CQs, so this is not a binary scenario. A hero job may only need 50% of the subtree/CQ. We can definitely add some status information to the CQ indicating that a UCQ is around and running workloads, so things may look unusual. The KEP already mentions workload observability and metrics.
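For illustration, such status information could be surfaced as a condition on each affected CQ; all names below are hypothetical, not from the KEP:

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// uberOverrideCondition is what the controller could publish on every
// ClusterQueue whose nominal quota is currently held by an uber workload,
// so that "things looking weird" is at least explained in status.
var uberOverrideCondition = metav1.Condition{
	Type:               "UberClusterQueueOverride",
	Status:             metav1.ConditionTrue,
	LastTransitionTime: metav1.Now(),
	Reason:             "UberWorkloadRunning",
	Message:            "part of the nominal quota is temporarily held by workloads admitted via an uber ClusterQueue",
}
```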
I can imagine use cases where there is a top-level UCQ and more local, subordinate UCQs; it would work kind of the same. Everything below or next to a UCQ is treated as expendable, even if it is another UCQ below it. Two UCQs in the same cohort is weird and serves no purpose, so it can be banned. As for time-bounding, I would give full control over execution to the users selected by the organization; someone vetted that they know what they are doing. If they need to run UCQ workloads for 5h, so be it.
I wanted to add a quick note on intent. I didn’t want to spam or monopolize this KEP discussion with a deep dive into alternatives, which is why I moved the detailed exploration of a different approach into a separate issue. To be clear, I’m not trying to block this KEP. I think the motivating use case is real and important. My comments here are mostly about surfacing concerns around guarantees, precedent, and transparency, and exploring whether there might be other ways to address the same operational problem while preserving some of Kueue’s existing invariants. Happy to continue giving feedback on this proposal on its own merits, and equally happy to discuss alternatives in parallel without derailing the main thread.
| * Borrowing Complexity: The current borrowing logic is constrained by fair sharing weights. A Hero | ||
| job should not be constrained by "fairness"—it is inherently unfair. | ||
| ### Goals |
For Hero workloads, what do we do if there is a queue of Hero workloads submitted to the same CQ?
Would it be treated as FIFO?
Regular rules apply: you can have either StrictFIFO or BestEffortFIFO. A superhero job may preempt a regular hero job based on priority; "uberness" applies only between hero and regular workloads.
| ## Design Details | ||
| ### API |
One area I would like to see discussed a bit more in the KEP, maybe in the Notes section, is how we plan to extend the API with new configuration options: for example, if we have use cases for excluding a certain CQ from the mechanism, or for weights between the CQs that balance how much quota is taken from each queue.
I find this one of the main advantages of the alternative ResourceQuotaLease KEP proposal by @ichekrygin, which introduces a dedicated CRD: the place for configuration is natural, and the lifetime of the custom configuration is nicely managed. Here the lifetime of the custom configuration is bound to the CQ, so I think we should think ahead about how to make the configuration intuitive to users.
Fleshing out UCQ notifications to users could be a good mechanism to validate this KEP. By making UCQ impact explicit and user-visible, especially at the ClusterQueue level, we can test whether the proposed behavior is understandable, discoverable, and actionable for workload owners who only have CQ-scoped visibility. If users can clearly see when and why their nominal capacity is affected by a UCQ, it becomes much easier to reason about guarantees, debug unexpected disruption, and build trust in the mechanism. In that sense, well-defined UCQ notifications are not just an observability detail; they are a validation tool for whether the model itself is sound and usable in practice.
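As one concrete, purely illustrative shape for such a notification (the reason and message strings are assumptions), preemptions could be attributed to the UCQ via an event on the evicted workload:

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// notifyUCQPreemption attributes a preemption to the uber CQ so that a
// workload owner with only CQ-scoped visibility can see why capacity was lost.
func notifyUCQPreemption(rec record.EventRecorder, wl runtime.Object, ucqName string) {
	rec.Eventf(wl, corev1.EventTypeWarning, "PreemptedByUberClusterQueue",
		"Preempted to reclaim quota for a workload admitted via uber ClusterQueue %q", ucqName)
}
```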
| - [Goals](#goals) | ||
| - [Non-Goals](#non-goals) | ||
| - [Proposal](#proposal) | ||
| - [User Stories (Optional)](#user-stories-optional) |
I'd be curious if you've seen examples where users are mixing inference and training workloads on a cluster, where in some circumstances inference workloads need to act as an "uber" workload that preempts training workloads. For example, maybe there's spillover from the normal inference clusters during high-traffic times.
That's an interesting use case :). I haven't heard about that need, but sure, one can put whatever they like into a UCQ, inference included.
+1 to @ichekrygin
An overall concern I have for Kueue is instrumentation for scheduling decisions, given that in the past we've seen various bugs where it's difficult to validate what the expected behavior should be. With the more explicit API contract @ichekrygin is proposing with leases, as an in-between step that uber queues could interact with and manipulate, we don't have to wonder as much about Kueue's internal in-memory state. Between this thread and the Lease thread... just want to confirm: it looks like a lease is a pre-req for uber queues?
Yes, some form of lease is required. UCQ does it automatically and kind of behind the scenes, but sure, we could make it more explicit and visible.
It would be very useful to flesh those details out explicitly in the KEP.
@mwielgus More specifically: can leases be used independently of UCQ? That's what I mean by pre-req. Like @mimowo mentions in this comment on the interaction of UCQ & Lease: #8869 (comment)
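To make that concrete, here is a purely hypothetical sketch of what an explicit, standalone lease record could carry; the actual ResourceQuotaLease proposal in #8869 may define a different shape:

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// QuotaLease records quota temporarily moved between queues as a visible
// API object rather than in-memory scheduler state. If it can be created
// and inspected on its own, it is usable independently of UCQ.
type QuotaLease struct {
	// FromClusterQueue is the CQ whose nominal quota is being taken.
	FromClusterQueue string
	// ToClusterQueue is the (uber) CQ that holds the lease.
	ToClusterQueue string
	// Resources lists how much is leased.
	Resources corev1.ResourceList
	// ExpiresAt optionally time-bounds the lease.
	ExpiresAt *metav1.Time
}
```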
| // +optional | ||
| // +kubebuilder:default=DefaultPreemptionRules | ||
| // +kubebuilder:validation:Enum=DefaultPreemptionRules;UberClusterQueueRules | ||
| Rules PreemptionRules `json:"rules,omitempty"` |
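For context, the kubebuilder markers above imply a string enum along these lines (a reconstruction, not copied from the PR diff):

```go
// PreemptionRules selects which admission semantics a ClusterQueue uses.
type PreemptionRules string

const (
	// DefaultPreemptionRules keeps today's behavior.
	DefaultPreemptionRules PreemptionRules = "DefaultPreemptionRules"
	// UberClusterQueueRules lets this CQ's workloads take nominal quota
	// from other CQs in the cohort, as proposed in this KEP.
	UberClusterQueueRules PreemptionRules = "UberClusterQueueRules"
)
```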
I find this "PreemptionRules" a bit misleading, because it dictates not just preemption but also scheduling more broadly; for example, this paragraph shows that borrowing also works differently: https://github.com/kubernetes-sigs/kueue/pull/8864/changes#diff-55ae50e78b080bbdc5104cae08ba81a13caa5853e0094fae8ce10237bc3bec9eR126-R128
Maybe the option should be called "SchedulingRules" instead.
Yeah, the more I think about it, the more I see some need for a CRD, because:
What do you want the CRD for?
What type of PR is this?
/kind documentation
/kind feature
What this PR does / why we need it:
Introduces support for hero (huge) jobs.
Which issue(s) this PR fixes:
Fixes #8826
Special notes for your reviewer:
Does this PR introduce a user-facing change?