fix(pod-controller): deterministically remove topology scheduling gate #8884

Piyushkhobragade wants to merge 6 commits into kubernetes-sigs:main
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Piyushkhobragade

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Hi @Piyushkhobragade. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
/retest
// Topology gate must be removed unless explicitly re-added by PodSetInfo
utilpod.Ungate(pod, kueue.TopologySchedulingGate)
}
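For context, a minimal, self-contained sketch of what an ungate helper like utilpod.Ungate plausibly does; the signature and the gate name kueue.x-k8s.io/topology are assumptions for illustration, not taken from this PR's diff:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// ungate removes gateName from pod.Spec.SchedulingGates in place and
// reports whether the gate was present, so the caller knows whether an
// API update is needed. Signature is assumed, not Kueue's actual API.
func ungate(pod *corev1.Pod, gateName string) bool {
	gates := pod.Spec.SchedulingGates
	for i, g := range gates {
		if g.Name == gateName {
			// Full-slice expression forces append to allocate instead of
			// overwriting later elements in the shared backing array.
			pod.Spec.SchedulingGates = append(gates[:i:i], gates[i+1:]...)
			return true
		}
	}
	return false
}

func main() {
	pod := &corev1.Pod{}
	// "kueue.x-k8s.io/topology" is an assumed gate name for illustration.
	pod.Spec.SchedulingGates = []corev1.PodSchedulingGate{
		{Name: "kueue.x-k8s.io/topology"},
	}
	fmt.Println(ungate(pod, "kueue.x-k8s.io/topology")) // true
	fmt.Println(len(pod.Spec.SchedulingGates))          // 0
}
```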
I'm not sure this is the right fix. Have you had a chance to dig through the logs yet?
I think the issue is actually test isolation, not PodSetInfo staleness. Unconditionally removing the topology gate would break TAS (Topology-Aware Scheduling): for TAS workloads, that gate needs to stay until topology is actually assigned. I would rather focus on improving test isolation.
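To make the reviewer's point concrete, a hypothetical guard (names are illustrative, not Kueue's API) under which the gate is removed only when that is safe:

```go
package main

import "fmt"

// shouldUngateTopology is a hypothetical guard: non-TAS pods can always
// lose the topology gate, while TAS pods stay gated until a topology
// assignment actually exists.
func shouldUngateTopology(isTASWorkload, topologyAssigned bool) bool {
	if !isTASWorkload {
		return true
	}
	return topologyAssigned
}

func main() {
	fmt.Println(shouldUngateTopology(false, false)) // true: non-TAS, safe to ungate
	fmt.Println(shouldUngateTopology(true, false))  // false: TAS, keep the gate
	fmt.Println(shouldUngateTopology(true, true))   // true: topology assigned
}
```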
Yes, this is not the right fix. I'm wondering about #8824; we should analyze the logs first. My suspicion is that, since we run tests at high parallelism, maybe 10s just wasn't enough.
Thanks for the clarification; agreed. I'll revert the controller-side change and focus on analyzing the logs for #8824. I'll dig into the test failures under high parallelism and see whether increasing timeouts or improving test isolation is the right fix here. Will follow up with findings.
This looks like it's from another PR.
@Piyushkhobragade: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This fixes a race in the pod controller where the topology scheduling
gate could remain on pods even after the workload was admitted.
Gate removal previously relied on PodSetInfo.SchedulingGates, which can
be stale depending on reconciliation order. In some cases this left
pods gated indefinitely and caused the integration test to flake.
Make topology gate removal deterministic by always ungating it unless
it is explicitly re-added during podset merge.
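A minimal sketch of the described merge semantics, assuming a stand-in PodSetInfo struct and illustrative gate names; this models only the gate handling, not Kueue's actual podset merge:

```go
package main

import "fmt"

// PodSetInfo is a stand-in type for illustration, not Kueue's actual struct.
type PodSetInfo struct {
	SchedulingGates []string
}

const topologyGate = "kueue.x-k8s.io/topology" // assumed gate name

// mergeGates models "remove by default, re-add only if explicit": the
// topology gate is dropped unconditionally, then restored only when the
// merged PodSetInfo explicitly lists it. Unrelated gates pass through.
func mergeGates(podGates []string, info PodSetInfo) []string {
	out := make([]string, 0, len(podGates))
	for _, g := range podGates {
		if g != topologyGate {
			out = append(out, g)
		}
	}
	for _, g := range info.SchedulingGates {
		if g == topologyGate {
			out = append(out, topologyGate) // explicit re-add wins
			break
		}
	}
	return out
}

func main() {
	gates := []string{"example.com/other-gate", topologyGate}
	// No explicit re-add: topology gate removed, unrelated gate kept.
	fmt.Println(mergeGates(gates, PodSetInfo{}))
	// Explicit re-add: the topology gate survives the merge.
	fmt.Println(mergeGates(gates, PodSetInfo{SchedulingGates: []string{topologyGate}}))
}
```

The key property is determinism: the resulting gates depend only on the merged PodSetInfo, not on whether a stale snapshot happened to carry the gate.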
Fixes #8824