ensure memory requests/limits are reasonable #32175
Conversation
/approve
FYI @kubernetes/release-engineering
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cpanato, dims, upodroid

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/hold cancel
/hold
Outrageous because they are large, or because we know the jobs have no reason to use that much? Some of these are set very large to help ensure they get the top QoS class (scalability), but some jobs actually do use a LOT of RAM. I would like to make sure we actually check these before merging.
This isn't the only option on the table, though; we also have n2-highmem machines as an option.
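For context on "top QoS": Kubernetes assigns a pod the Guaranteed QoS class, the last to be evicted under node pressure, only when every container's requests equal its limits. A minimal sketch with illustrative values, not the actual job config:

# Illustrative only: requests == limits for every resource => Guaranteed QoS
resources:
  requests:
    cpu: 6
    memory: 39Gi
  limits:
    cpu: 6
    memory: 39Gi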
@@ -23,10 +23,10 @@ presubmits:
       resources:
         limits:
           cpu: 5
-          memory: 32Gi
+          memory: 16Gi
This seems a bit low, considering even the 8-core / 32 GB machine has 4 GiB per core, and typecheck is in fact memory intensive. We should not aim to use 100% of system memory, but we can do, say, 18Gi.
We're usually CPU-bound for autoscaling anyhow (i.e. no jobs are requesting a memory : core ratio in excess of the host's memory : core ratio).
Same for other jobs like this one.
Actually, given we won't use all the cores, we can do 1:1 with the host ratio, which would be 20Gi.
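A quick sketch of that ratio math, using the figures from this thread (the resources stanza is illustrative, not the actual job config):

# 32 GB / 8 cores      = 4 GiB per core on the host
# 5 CPUs x 4 GiB/core  = 20Gi at a 1:1 host ratio
resources:
  limits:
    cpu: 5
    memory: 20Gi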
@@ -51,14 +51,14 @@ periodics:
       resources:
         requests:
           cpu: 6
-          memory: "39Gi"
+          memory: "16Gi"
Scale jobs are over-provisioned to guarantee they're not eviction candidates, because evicting them costs us a ton of wasted external resources. We could use a priority class instead, but this works fine for our purposes.
And again, this is well below the CPU : memory ratio of the target hosts: we're using 50% of the memory but 75% of the cores.
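A sketch of the priority-class alternative mentioned above; the name and value are hypothetical, not from this PR:

# Hypothetical PriorityClass to shield expensive scale jobs from preemption
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: scale-job-high      # hypothetical name
value: 1000000              # higher than the class other jobs run with
preemptionPolicy: Never     # this pod won't preempt others to schedule
globalDefault: false
description: "Keeps long-running scale jobs from being preempted or evicted early."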
I'll apply the standard 7 CPU and 30Gi so it runs on its own node.
Also, we shouldn't be evicting pods when the scheduler can't find a free node; we should spin up new nodes instead.
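For reference, a sketch of that "standard" sizing (the exact stanza isn't shown in this thread). Requesting nearly all of an 8-core / 32 GB node leaves no room for a second job, so the pod effectively gets the node to itself:

resources:
  requests:
    cpu: 7         # of 8 cores; headroom left for system pods
    memory: 30Gi   # of ~32 GB; stays below node allocatable
  limits:
    cpu: 7
    memory: 30Gi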
Preemption happened to a job I was checking yesterday. Scale jobs are some of the few where even rare preemption is really expensive, both in wasted compute and in additional time to get signal.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@@ -92,10 +92,10 @@ presubmits:
       resources:
         requests:
           cpu: "10"
-          memory: "40Gi"
+          memory: "24Gi"
cc @tenzen-y @alculquicondor for kueue.
Thanks for the heads up. We reduced the limits in future versions. Changing it for the older version sgtm.
LGTM
Thanks!
@upodroid: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Required for kubernetes/k8s.io#6525
/cc @BenTheElder @dims @ameukam
Some jobs had an outrageous amount of memory configured. I tweaked those down to sane values.
We want to make sure that jobs don't request more than 30Gi of memory so they can fit on the modern 8-core / 32 GB VMs.
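As a rough sketch of that sizing constraint (the ~2 GB system reservation is an assumption for illustration, not a figure from this PR):

# 8-core / 32 GB VM:
#   total memory                ~32 GB
#   OS + kubelet + system pods  ~ 2 GB  (assumed)
#   allocatable for a job       ~30Gi   => cap job requests at 30Gi
resources:
  requests:
    memory: 30Gi   # hypothetical per-job ceiling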
Before: