KEP #8826: Uber Cluster Queues #8864

Open

mwielgus wants to merge 1 commit into kubernetes-sigs:main from mwielgus:uber

Conversation

@mwielgus
Contributor

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

Introduces support for hero (huge) jobs.

Which issue(s) this PR fixes:

Fixes #8826

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 28, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mwielgus
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 28, 2026
@netlify

netlify bot commented Jan 28, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 4b19785
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/697a7714bad3d000084f000d

Benefit: Compliance requirements are met without complex operational runbooks to "clear the
cluster."

### Risks and Mitigations
Contributor

One major issue would be abuse of the "Hero" queue.

Sorta related to Resource Starvation but I think it's more if someone uses this to effectively skip fair sharing.

Contributor Author

What type of abuse do you have in mind?

Contributor

I guess to me every user I have ever worked with treats their jobs as super important and would love to skip the line.

So I guess the main protection would be that admins would only create a LocalQueue pointing to this Hero Queue for users they trust not to just submit blindly to this Queue.

We don't really have any enforcement, but it is sorta the Spider-Man analogy: "With great power comes great responsibility".

Contributor Author

Access to a UCQ can be restricted to only those individuals (namespaces/LQs) whom the corresponding part of the organization trusts. And yes, these people get great power over that set of quotas.

* Nominal quotas are no longer guaranteed.
* Well-understood rules start to have exceptions.

## Alternatives
Contributor

It isn't clear to me why you can't model this as a dedicated ClusterQueue and a WorkloadPriorityClass.

Could you create a ClusterQueue that only special "heroes" submit to and mark the workload priority class as critical?
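For concreteness, a minimal sketch of the alternative being suggested here, written against Kueue's v1beta1 Go API (exact field names can differ slightly between Kueue versions; all object names below are placeholders):

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// heroAlternative builds the two objects suggested above: a dedicated
// ClusterQueue that only trusted "heroes" get a LocalQueue for, plus a
// high-value WorkloadPriorityClass for their workloads.
func heroAlternative() (*kueue.ClusterQueue, *kueue.WorkloadPriorityClass) {
	cq := &kueue.ClusterQueue{
		ObjectMeta: metav1.ObjectMeta{Name: "hero-cq"},
		Spec: kueue.ClusterQueueSpec{
			Cohort: "team-cohort",
			ResourceGroups: []kueue.ResourceGroup{{
				CoveredResources: []corev1.ResourceName{corev1.ResourceCPU},
				Flavors: []kueue.FlavorQuotas{{
					Name: "default-flavor",
					Resources: []kueue.ResourceQuota{{
						Name: corev1.ResourceCPU,
						// Almost no nominal quota of its own: hero jobs
						// mostly borrow from the rest of the cohort.
						NominalQuota: resource.MustParse("1"),
					}},
				}},
			}},
			Preemption: &kueue.ClusterQueuePreemption{
				// Reclaim borrowed capacity from siblings when admitting.
				ReclaimWithinCohort: kueue.PreemptionPolicyAny,
			},
		},
	}

	wpc := &kueue.WorkloadPriorityClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "hero-critical"},
		Value:       1000000,
		Description: "Highest priority, reserved for hero workloads",
	}
	return cq, wpc
}

As the reply below points out, though, even with ReclaimWithinCohort: Any such a queue can only reclaim capacity that sibling ClusterQueues are borrowing beyond their nominal quota; it cannot preempt workloads admitted within their own nominal quota, regardless of priority.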

Contributor Author

No matter how high the priority is, it doesn't get into nominal quota. The whole trick is how to get into nominal quota with an outside workload and prevent its reclamation.

### Limitations of Current Quota Models

The necessity for the UberClusterQueue stems from specific rigidities in the current preemption logic.

* Guaranteed Quota Immunity: In the standard model, a workload running within its ClusterQueue's
Contributor

With Uber ClusterQueues, nominal quota is no longer a hard protection boundary but a best-effort guarantee subject to administrative override. That’s a significant change in the guarantees Kueue provides to ClusterQueue owners. I think we should be explicit about this shift, including what guarantees remain, what guarantees are weakened, and how operators are expected to communicate and govern the use of Uber CQs.

Uber CQ doesn’t just add power, it changes the trust model. ClusterQueue owners now have to trust that the override mechanism will be used sparingly and responsibly, because the system itself can no longer enforce absolute protection.

Contributor Author

I agree that this is a significant change; however, the alternatives work almost the same and the end result is the same.

In order to allow hero jobs, users cannot have absolute, everlasting guarantees, because at some point in time that hero job might be started. And then their workloads will be interrupted, either because of a UCQ, or because someone temporarily changed quotas, or because they used Fair Sharing with weights and nothing was ever guaranteed.

@ichekrygin
Contributor

One concern I have with the Uber ClusterQueue approach is transparency. For the “wartime” scenarios this KEP targets, I actually want disruption to be explicit and explainable. With Hold / HoldAndDrain, it’s immediately visible why workloads are evicted or not admitted, and the intent is clearly declared by an operator action.

With Uber CQ, the override is implicit: peacetime configuration remains unchanged, yet workloads in other ClusterQueues can be evicted or lose capacity due to someone else’s configuration. From a CQ owner’s perspective, guarantees effectively change without any change to their own spec, which makes reasoning, debugging, and accountability harder. Even if the behavior is correct, it is much less discoverable why it happened.

If this proposal moves forward, I think we need first-class signals (status, conditions, events) that explicitly indicate when a cohort and impacted CQs are effectively in a “wartime” mode and clearly attribute preemptions to an Uber CQ override.
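To make the "first-class signals" suggestion concrete, a minimal sketch of a status condition that could be set on each impacted ClusterQueue; the condition type and reason are invented for illustration, and only the apimachinery helper is an existing API:

package sketch

import (
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical condition type; not part of Kueue's API or this KEP today.
const ConditionQuotaOverriddenByUberCQ = "QuotaOverriddenByUberClusterQueue"

// markUberOverride records on an impacted ClusterQueue's status conditions
// that an Uber CQ is currently consuming part of its nominal quota, so the
// "wartime" state and its cause are discoverable by the CQ owner.
func markUberOverride(conditions *[]metav1.Condition, uberCQ string, observedGeneration int64) {
	apimeta.SetStatusCondition(conditions, metav1.Condition{
		Type:               ConditionQuotaOverriddenByUberCQ,
		Status:             metav1.ConditionTrue,
		Reason:             "UberClusterQueueActive",
		Message:            "part of the nominal quota is currently lent to UberClusterQueue " + uberCQ,
		ObservedGeneration: observedGeneration,
	})
}

An Event emitted on the impacted ClusterQueue and a preemption metric attributed to the Uber CQ could cover the other signal channels mentioned above.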

@ichekrygin
Contributor

Another concern I have is precedent and escalation. Once we introduce a first-class mechanism that allows a ClusterQueue to override nominal quota guarantees, it becomes hard to draw a principled line against “hero’s hero” or multiple Uber ClusterQueues within the same cohort. Even if the intent is rare, one-off use, the API is permanent and will predictably attract requests for additional tiers, ordering between Uber CQs, or broader access over time.

At that point, we risk turning a narrowly scoped exception into an implicit hierarchy of dominance between ClusterQueues, which feels at odds with Kueue’s current model where guarantees are explicit, local, and stable. If this proposal moves forward, I think we need very strong guardrails (for example, enforcing a single Uber CQ per cohort subtree, time-bounded activation, or similar) to prevent this kind of escalation.

@kannon92
Contributor

I seem to remember @dgrove-oss discussing with me how they implement something similar in IBM.

Maybe you have some thoughts here?

@mwielgus
Contributor Author

> One concern I have with the Uber ClusterQueue approach is transparency. For the “wartime” scenarios this KEP targets, I actually want disruption to be explicit and explainable. With Hold / HoldAndDrain, it’s immediately visible why workloads are evicted or not admitted, and the intent is clearly declared by an operator action.
>
> With Uber CQ, the override is implicit: peacetime configuration remains unchanged, yet workloads in other ClusterQueues can be evicted or lose capacity due to someone else’s configuration. From a CQ owner’s perspective, guarantees effectively change without any change to their own spec, which makes reasoning, debugging, and accountability harder. Even if the behavior is correct, it is much less discoverable why it happened.
>
> If this proposal moves forward, I think we need first-class signals (status, conditions, events) that explicitly indicate when a cohort and impacted CQs are effectively in a “wartime” mode and clearly attribute preemptions to an Uber CQ override.

I see your point. The reason why there is no hard Hold/HoldAndDrain is that a CQ may be only partially affected. I try to distribute the impact of the UCQ workload across many CQs, so this is not a binary scenario. A hero job may only need 50% of the subtree/CQ.

We can definitely add some status information to the CQ, indicating that a UCQ is around and running workloads, so things may look weird. The KEP already mentions workload observability and metrics.

@mwielgus
Contributor Author

> Another concern I have is precedent and escalation. Once we introduce a first-class mechanism that allows a ClusterQueue to override nominal quota guarantees, it becomes hard to draw a principled line against “hero’s hero” or multiple Uber ClusterQueues within the same cohort. Even if the intent is rare, one-off use, the API is permanent and will predictably attract requests for additional tiers, ordering between Uber CQs, or broader access over time.
>
> At that point, we risk turning a narrowly scoped exception into an implicit hierarchy of dominance between ClusterQueues, which feels at odds with Kueue’s current model where guarantees are explicit, local, and stable. If this proposal moves forward, I think we need very strong guardrails (for example, enforcing a single Uber CQ per cohort subtree, time-bounded activation, or similar) to prevent this kind of escalation.

I can imagine use cases where there is a top-level UCQ and more local, inferior UCQs; it would work kind of the same. Everything below or next to a UCQ is treated as expendable, even if it is another UCQ below it. Two UCQs in the same cohort is weird and brings no benefit; it can be banned.

Time-bounded: I would give full control of the execution to the users selected by the organization, someone vetted to know what they are doing. If they need to run UCQ workloads for 5h, so be it.

@ichekrygin
Contributor

I wanted to add a quick note on intent. I didn’t want to spam or monopolize this KEP discussion with a deep dive into alternatives, which is why I moved the detailed exploration of a different approach into a separate issue.

To be clear, I’m not trying to block this KEP. I think the motivating use case is real and important. My comments here are mostly about surfacing concerns around guarantees, precedent, and transparency, and exploring whether there might be other ways to address the same operational problem while preserving some of Kueue’s existing invariants.

Happy to continue feedback on this proposal on its own merits, and equally happy to discuss alternatives in parallel without derailing the main thread.

* Borrowing Complexity: The current borrowing logic is constrained by fair sharing weights. A Hero
job should not be constrained by "fairness"—it is inherently unfair.

### Goals
Contributor

For Hero workloads, what do we do if there is a queue of Hero workloads submitted to the same CQ?

Would it be treated as FIFO?

Contributor Author

Regular rules apply. You can have either FIFO or BestEffort. A superhero job may preempt a regular hero job based on priority. "Uberness" applies between hero and regular workloads.
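As a simplified illustration of "regular rules apply" inside the Uber CQ itself (this is a sketch, not Kueue's scheduler code):

package sketch

import (
	"sort"
	"time"
)

// heroWorkload is a stand-in for a workload pending in the Uber CQ.
type heroWorkload struct {
	name     string
	priority int32     // from the workload's priority class
	created  time.Time // queueing timestamp
}

// orderHeroes orders pending hero workloads the way any ClusterQueue would:
// higher priority first, ties broken FIFO by creation time. A "superhero"
// job simply carries a higher priority than a regular hero job.
func orderHeroes(ws []heroWorkload) {
	sort.SliceStable(ws, func(i, j int) bool {
		if ws[i].priority != ws[j].priority {
			return ws[i].priority > ws[j].priority
		}
		return ws[i].created.Before(ws[j].created)
	})
}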

## Design Details


### API
Contributor

One area I would like to see discussed a bit more in the KEP, maybe in the Notes section, is how we plan to extend the API with new configuration options. For example, if we have use cases for excluding a certain CQ from the mechanism, or for having some weights between the CQs which balance the quota taken from the queues.

I find this one of the main advantages of the alternative ResourceQuotaLease KEP proposal by @ichekrygin, which introduces a dedicated CRD: the place for configuration is natural, and the lifetime of the custom configuration is nicely managed. Here the lifetime of the custom configuration is bound to the CQ, so I think we should think ahead about how we make the configuration intuitive to users.

@ichekrygin
Contributor

> We can definitely add some status information to the CQ, indicating that a UCQ is around and running workloads, so things may look weird. The KEP already mentions workload observability and metrics.

Flushing out UCQ notifications to users could be a good mechanism to validate this KEP.

By making UCQ impact explicit and user-visible, especially at the ClusterQueue level, we can test whether the proposed behavior is understandable, discoverable, and actionable for workload owners who only have CQ-scoped visibility. If users can clearly see when and why their nominal capacity is affected by a UCQ, it becomes much easier to reason about guarantees, debug unexpected disruption, and build trust in the mechanism.

In that sense, well-defined UCQ notifications are not just an observability detail; they are a validation tool for whether the model itself is sound and usable in practice.

- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories (Optional)](#user-stories-optional)
Contributor

I'd be curious if you've seen examples where users are mixing inference and training workloads on a cluster, where in some circumstances inference workloads need to act as an "uber" workload that preempts training workloads. So, for example, maybe there's spillover from the normal inference clusters during high-traffic times.

Contributor Author

That's an interesting use case :). I haven't heard about that need, but sure, one can put whatever they like into a UCQ, inference included.

@amy
Contributor

amy commented Feb 5, 2026

+1 to @ichekrygin

> Flushing out UCQ notifications to users could be a good mechanism to validate this KEP.

An overall concern I have for Kueue is instrumentation for scheduling decisions, given that in the past we've seen various bugs where it's difficult to validate what the expected behavior should be. With the more explicit API contract @ichekrygin is proposing with leases, as an in-between step that uber queues could interact with to manipulate, we don't have to wonder as much about Kueue's internal in-memory state.

Between this thread and the Lease thread... just want to confirm: it looks like lease is a pre-req for uber queues?

@mwielgus
Contributor Author

mwielgus commented Feb 5, 2026

> +1 to @ichekrygin
>
> > Flushing out UCQ notifications to users could be a good mechanism to validate this KEP.
>
> An overall concern I have for Kueue is instrumentation for scheduling decisions, given that in the past we've seen various bugs where it's difficult to validate what the expected behavior should be. With the more explicit API contract @ichekrygin is proposing with leases, as an in-between step that uber queues could interact with to manipulate, we don't have to wonder as much about Kueue's internal in-memory state.
>
> Between this thread and the Lease thread... just want to confirm: it looks like lease is a pre-req for uber queues?

Yes, some form of lease is required. UCQ does it automatically and kind of behind the scenes, but sure, we could make it more explicit and visible.

@ichekrygin
Contributor

> Yes, some form of lease is required. UCQ does it automatically and kind of behind the scenes, but sure, we could make it more explicit and visible.

It would be very useful to flesh those details out explicitly in the KEP.

@amy
Contributor

amy commented Feb 5, 2026

> Yes, some form of lease is required. UCQ does it automatically and kind of behind the scenes, but sure, we could make it more explicit and visible.

@mwielgus More specifically: can leases be used independently of UCQ? That's what I mean by pre-req.

Like @mimowo is mentioning in this comment for the interaction of UCQ & Lease: #8869 (comment)

> I think we could somehow achieve it in the ResourceQuotaLease model. Maybe there is room to combine the two ideas: UberCQ configuration can say "Create ResourceQuotaLease when there is a workload pending in the CQ".

// +optional
// +kubebuilder:default=DefaultPreemptionRules
// +kubebuilder:validation:Enum=DefaultPreemptionRules;UberClusterQueueRules
Rules PreemptionRules `json:"rules,omitempty"`
Contributor

I find this "PreemptionRules" a bit misleading, because it dictates not just preemption, but also scheduling more broadly, for example this paragraph shows that also borrowing works different: https://github.com/kubernetes-sigs/kueue/pull/8864/changes#diff-55ae50e78b080bbdc5104cae08ba81a13caa5853e0094fae8ce10237bc3bec9eR126-R128

Maybe this is about calling the options "SchedulingRules"
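For reference, a minimal Go sketch of how the enum behind this field might be declared; the type and value names come from the snippet above, while the doc comments are illustrative rather than taken from the KEP:

package sketch

// PreemptionRules selects which admission and preemption semantics a
// ClusterQueue uses. (As noted above, a broader name such as SchedulingRules
// may fit better, since the setting also changes borrowing behavior.)
type PreemptionRules string

const (
	// DefaultPreemptionRules keeps today's behavior: nominal quota remains a
	// hard protection boundary for the other queues in the cohort.
	DefaultPreemptionRules PreemptionRules = "DefaultPreemptionRules"

	// UberClusterQueueRules marks the queue as an Uber CQ whose workloads may
	// take capacity from the nominal quota of sibling queues.
	UberClusterQueueRules PreemptionRules = "UberClusterQueueRules"
)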

@mimowo
Contributor

mimowo commented Feb 6, 2026

Yeah, the more I think about it the more I see some need for a CRD, because:

  1. there is the "UCQ feels too magical" sentiment
  2. it is unclear to me where the extra configuration will be added, and I imagine the configuration can grow over time, but there is no "natural place", so I'm worried about overloading the spec.
  3. with a CRD we could, for example, activate/deactivate the mode easily. The UCQ requires deleting the CQ, or at least unpinning it.
  4. the extra CRD could evolve to naturally cover the use cases of other users mentioned in Support for Temporary Quota Overrides in ClusterQueue #8654, because it is a more generic model.
  5. I think the "automatic" mode could also exist based on the CRD. The CRD, when present, could appoint a selected CQ as "uber" (inversion of control). The appointed CQ could have exactly the same "automatic" rights as in this KEP. (A rough sketch of such a CRD is shown below.)
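To illustrate point 5 above, a rough Go sketch of what such an appointing CRD could look like; the kind and field names are hypothetical and not part of this KEP or the ResourceQuotaLease proposal:

package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// UberGrant is a hypothetical cluster-scoped CRD that appoints an existing
// ClusterQueue as "uber" for as long as the grant exists (inversion of
// control: the grant, not the CQ spec, carries the override).
type UberGrant struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec UberGrantSpec `json:"spec"`
}

type UberGrantSpec struct {
	// ClusterQueue names the CQ that receives Uber rights while the grant
	// exists. Deleting the grant deactivates the mode without touching the
	// CQ itself.
	ClusterQueue string `json:"clusterQueue"`

	// ExpiresAt optionally time-bounds the activation, one of the guardrails
	// discussed earlier in this thread.
	ExpiresAt *metav1.Time `json:"expiresAt,omitempty"`
}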

@mwielgus
Contributor Author

mwielgus commented Feb 6, 2026

> Yeah, the more I think about it the more I see some need for a CRD, because:
>
> 1. there is the "UCQ feels too magical" sentiment
> 2. it is unclear to me where the extra configuration will be added, and I imagine the configuration can grow over time, but there is no "natural place", so I'm worried about overloading the spec.
> 3. with a CRD we could, for example, activate/deactivate the mode easily. The UCQ requires deleting the CQ, or at least unpinning it.
> 4. the extra CRD could evolve to naturally cover the use cases of other users mentioned in Support for Temporary Quota Overrides in ClusterQueue #8654, because it is a more generic model.
> 5. I think the "automatic" mode could also exist based on the CRD. The CRD, when present, could appoint a selected CQ as "uber" (inversion of control). The appointed CQ could have exactly the same "automatic" rights as in this KEP.

What do you want the CRD for?
