ensure memory requests/limits are reasonable #32175
Conversation
/approve
FYI @kubernetes/release-engineering
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cpanato, dims, upodroid

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/hold cancel
/hold
Outrageous because they are large, or because we know the jobs have no reason to use that much? Some of these are set very large to help ensure they get the top QoS class (scalability), but some jobs actually do use a LOT of RAM. I would like to make sure we actually check these before merging.
This isn't the only option on the table, though; we also have n2-highmem machines as an option.
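For context on "top QoS": Kubernetes assigns a pod the Guaranteed QoS class, the last to be evicted under node pressure, only when every container's requests equal its limits. A minimal sketch with illustrative values, not the actual job config:

# Illustrative only: requests == limits for every resource => Guaranteed QoS
resources:
  requests:
    cpu: 6
    memory: 39Gi
  limits:
    cpu: 6
    memory: 39Gi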
@@ -23,10 +23,10 @@ presubmits:
       resources:
         limits:
           cpu: 5
-          memory: 32Gi
+          memory: 16Gi
This seems a bit low, considering even the 8-core / 32 GB machine has 4 GiB per core, and typecheck is in fact memory intensive. We should not aim to use 100% of system memory, but we can do, say, 18Gi.
We're usually CPU-bound for autoscaling anyhow (i.e. no jobs are requesting a memory : core ratio in excess of the host's memory : core ratio).
Same for other jobs like this one.
Actually, given we won't use all the cores, we can do 1:1 with the host ratio, which would be 20Gi.
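A quick sketch of that ratio math, using the figures from this thread (the resources stanza is illustrative, not the actual job config):

# 32 GB / 8 cores      = 4 GiB per core on the host
# 5 CPUs x 4 GiB/core  = 20Gi at a 1:1 host ratio
resources:
  limits:
    cpu: 5
    memory: 20Gi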
@@ -51,14 +51,14 @@ periodics:
       resources:
         requests:
           cpu: 6
-          memory: "39Gi"
+          memory: "16Gi"
Scale jobs are over-provisioned to guarantee they're not eviction candidates, because evicting them costs us a ton of wasted external resources. We could use a priority class instead, but this works fine for our purposes.
And again, this is well below the CPU : memory ratio of the target hosts: we're using 50% of the memory but 75% of the cores.
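A sketch of the priority-class alternative mentioned above; the name and value are hypothetical, not from this PR:

# Hypothetical PriorityClass to shield expensive scale jobs from preemption
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: scale-job-high      # hypothetical name
value: 1000000              # higher than the class other jobs run with
preemptionPolicy: Never     # this pod won't preempt others to schedule
globalDefault: false
description: "Keeps long-running scale jobs from being preempted or evicted early."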
I'll apply the standard 7 CPU and 30Gi so it runs on its own node.
Also, we shouldn't be evicting pods when the scheduler can't find a free node; we should spin up new nodes instead.
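For reference, a sketch of that "standard" sizing (the exact stanza isn't shown in this thread). Requesting nearly all of an 8-core / 32 GB node leaves no room for a second job, so the pod effectively gets the node to itself:

resources:
  requests:
    cpu: 7         # of 8 cores; headroom left for system pods
    memory: 30Gi   # of ~32 GB; stays below node allocatable
  limits:
    cpu: 7
    memory: 30Gi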
Preemption happened to a job I was checking yesterday. Scale jobs are some of the few where even rare preemption is really expensive, both in wasted compute and in additional time to get signal.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@@ -92,10 +92,10 @@ presubmits:
       resources:
         requests:
           cpu: "10"
-          memory: "40Gi"
+          memory: "24Gi"
cc @tenzen-y @alculquicondor for kueue.
Thanks for the heads up. We reduced the limits in future versions. Changing it for the older version sgtm.
LGTM
Thanks!
@upodroid: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Required for kubernetes/k8s.io#6525
/cc @BenTheElder @dims @ameukam
Some jobs had an outrageous amount of memory configured. I tweaked those down to sane values.
We want to make sure that jobs don't request more than 30Gi of memory so they can fit on the modern 8-core / 32 GB VMs.
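As a rough sketch of that sizing constraint (the ~2 GB system reservation is an assumption for illustration, not a figure from this PR):

# 8-core / 32 GB VM:
#   total memory                ~32 GB
#   OS + kubelet + system pods  ~ 2 GB  (assumed)
#   allocatable for a job       ~30Gi   => cap job requests at 30Gi
resources:
  requests:
    memory: 30Gi   # hypothetical per-job ceiling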
Before: