k8s-infra-prow-builds is frequently failing to schedule due to capacity #32157
Comments
We still have recent failures to schedule, but we're only at 130 nodes currently, and AFAICT we have set the limit to 1-80 nodes per zone ...
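For reference, something like this rough sketch could confirm the per-zone node counts against that limit (assuming the current kubectl context points at the build cluster):

```sh
# Count nodes per zone to compare against the autoscaler's per-zone limit.
# Assumes kubectl is already pointed at the build cluster.
kubectl get nodes -L topology.kubernetes.io/zone --no-headers \
  | awk '{print $NF}' | sort | uniq -c
```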
https://prow.k8s.io/?state=error has dropped off for the moment, possibly following #32157 🤞 The last error pod was scheduled at 3:11 Pacific; the config was updated at 3:24.
Technically unrelated but similar issue: kubernetes/k8s.io#6519 (boskos pool exhausting quota)
Still happening.
Schrödinger's scale-up??
Maybe we need to increase the timeout? It seems like we're scaling up, but not before Prow gives up.
That doesn't make sense. We're requesting 7 cores, the nodes are 8-core, and I don't see that we're anywhere near exhausting GCE CPU quota. And then it did also scale up ...
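A rough sketch of how the regional CPU quota could be sanity-checked (project and region below are placeholders, not taken from this thread):

```sh
# Show CPU quota limit/usage for the region hosting the build cluster.
# Project and region are placeholders.
gcloud compute regions describe us-central1 \
  --project=my-gcp-project \
  --format="yaml(quotas)" | grep -B1 -A1 -E 'metric: CPUS$'
```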
Possibly due to system pods ..? (Is it confusing the autoscaler about whether adding a node would help, since we run right up against the limit? Also, maybe one of them is using more CPU now?) On a node successfully running pull-kubernetes-e2e-kind:
total / allocatable / requested. 7.1 of that is the test pod, 0.1 of which is the sidecar.
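Something like the following sketch could pull those numbers for a given node (the node name is a placeholder):

```sh
# Compare total vs. allocatable CPU on one build node, then list what the
# non-terminated pods on it have requested. Node name is a placeholder.
kubectl get node gke-build-node-example \
  -o jsonpath='{.status.capacity.cpu} capacity / {.status.allocatable.cpu} allocatable{"\n"}'
kubectl describe node gke-build-node-example | grep -A12 'Allocated resources'
```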
This impacts jobs that:
It appears to have gotten bad sometime just before 10 am Pacific.
We could do one of:
We partially did 1), which we would have done for some of these anyhow. Ideally we'd root-cause and resolve the apparent scaling-decision flap; in any case, I'm logging off for the night. All of this infra is managed in this repo or github.com/kubernetes/k8s.io at least, so someone else could pick this up in the interim. I'm going to be in meetings all morning, unfortunately.
This is probably due to kubernetes/k8s.io#6468, which adds a new daemonset requesting a 0.2 CPU limit; meanwhile these jobs request almost 100% of schedulable CPU by design (to avoid noisy neighbors; they're very I/O- and CPU-heavy).
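A sketch of how to check what a daemonset's pods request per node (the namespace and daemonset name below are placeholders, not the actual ones from that PR):

```sh
# Print the resource requests/limits declared by each container in a
# daemonset's pod template. Namespace and daemonset name are placeholders.
kubectl -n kube-system get daemonset example-metrics-agent \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'
```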
Given that code freeze is pending in about one day, we should probably revert for now and then evaluate follow-up options? This is having a significant impact on merging to kubernetes, as required presubmits are failing to schedule.
This should be mitigated now, as I deleted the daemonset.
Yes, it appears to be: https://prow.k8s.io/?state=error
So, to conclude: we schedule many jobs that use ~100% of available CPU (by design, to avoid noisy neighbors). For a long time that has meant requesting 7 cores (+0.1 for prow's sidecar), since we run on 8-core nodes, some system-reserved capacity covers part of the remaining core, and no job requests <1 core. Looking at the nodes currently running these jobs:
So we can't fit the 200m CPU daemonset (kubernetes/k8s.io#6521), and that breaks auto-scaling. Pods for a sample node running a 7.1-core prowjob:
Least-loaded node with a 7.1-core prowjob:
We currently have at most 0.06 CPU of headroom on nodes running these jobs. Pods on that node:
We either have to keep daemonset additions extremely negligible, or we need to reduce the CPU available to these heavy jobs (and that means identifying and updating ALL of them, to avoid leaving jobs failing to schedule). Presumably we have slightly different resources available on the EKS nodes, enough to fit this daemonset alongside while still scheduling 7.1 cores, but we fundamentally have the same risk there. Additionally: we ensure all of our jobs have guaranteed QoS via presubmit tests for the jobs; we should probably at least be doing this manually for anything else we install. The "create dev loops" and "tune sysctls" daemonsets are an exception because they do almost nothing and don't really need guaranteed resources.
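A sketch of how the QoS class of a running prowjob pod could be spot-checked, plus the rough per-node math from this thread (pod name and namespace are placeholders):

```sh
# Spot-check that a prowjob pod landed in the Guaranteed QoS class
# (requests == limits for every container). Names are placeholders.
kubectl -n test-pods get pod example-prowjob-pod -o jsonpath='{.status.qosClass}{"\n"}'

# Rough per-node CPU budget, using only the numbers from this thread:
#   8.0  cores node capacity
# - 7.1  test container + sidecar requests
# - system-reserved + existing daemonsets
# ~ 0.06 cores of slack, so a new 0.2-core daemonset cannot fit.
```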
@upodroid points out in kubernetes/k8s.io#6525 (comment) that we should probably just disable calico network policy and get back 0.4 CPU/node for custom-metrics daemonsets. We are not running it on the old build cluster, and I don't think we need it.
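A sketch for surveying which daemonsets reserve CPU on each node, e.g. to see what dropping the calico components would reclaim (no specific names assumed):

```sh
# List every daemonset and the CPU requests of its containers, to see
# how much per-node CPU each one reserves.
kubectl get daemonsets -A \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQUESTS:.spec.template.spec.containers[*].resources.requests.cpu'
```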
https://prow.k8s.io/?state=error&cluster=k8s-infra-prow-build
failures like:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/122422/pull-kubernetes-verify/1764789037396135936
xref https://kubernetes.slack.com/archives/C09QZ4DQB/p1709577399565409
filed #32156 for quick fix
/sig testing k8s-infra