Skip to content

Add some race-condition protection to VPA recommender #8320

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

adrianmoisey
Copy link
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

#7992 introduces goroutines in order to process a batch of VPAs quicker. This inadvertently lead to some race conditions that we're picked up in our tests.
This PR attempts to fix those race conditions

Which issue(s) this PR fixes:

Fixes #8318

Special notes for your reviewer:

Opening this PR early, I still plan to write some tests. I'd like to make a breaking-test that these fixes solve.
/hold

Does this PR introduce a user-facing change?

NONE

Also move it to a private field, so that reads/writes are all via the protected methods
From https://pkg.go.dev/flag#PrintDefaults

> The listed type, here int, can be changed by placing a back-quoted
> name in the flag's usage string; the first such item in the message
> is taken to be a parameter name to show in the message and the back
> quotes are stripped from the message when displayed.
@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area labels Jul 13, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianmoisey

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/needs-area labels Jul 13, 2025
@@ -125,7 +125,7 @@ This document is auto-generated from the flag definitions in the VPA recommender
| `storage` | string | | Specifies storage mode. Supported values: prometheus, checkpoint |
| `target-cpu-percentile` | float | 0.9 | CPU usage percentile that will be used as a base for CPU target recommendation. Doesn't affect CPU lower bound, CPU upper bound nor memory recommendations. |
| `target-memory-percentile` | float | 0.9 | Memory usage percentile that will be used as a base for memory target recommendation. Doesn't affect memory lower bound nor memory upper bound. |
| `update-worker-count` | | 10 | kube-api-qps Number of concurrent workers to update VPA recommendations and checkpoints. When increasing this setting, make sure the client-side rate limits (kube-api-qps and `kube-api-burst`) are either increased or turned off as well. Determines the minimum number of VPA checkpoints written per recommender loop. |
| `update-worker-count` | int | 10 | Number of concurrent workers to update VPA recommendations and checkpoints. When increasing this setting, make sure the client-side rate limits ('kube-api-qps' and 'kube-api-burst') are either increased or turned off as well. Determines the minimum number of VPA checkpoints written per recommender loop. |
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I happen to notice this weirdness in the --update-worker-count due to the back-ticks, so I fixed them while I was here.

Reference: https://pkg.go.dev/flag#PrintDefaults

The listed type, here int, can be changed by placing a back-quoted name in the flag's usage string; the first such item in the message is taken to be a parameter name to show in the message and the back quotes are stripped from the message when displayed

@voelzmo voelzmo mentioned this pull request Jul 14, 2025
Comment on lines +194 to +196
oc.mutex.Lock()
oc.cnt[key]++
oc.mutex.Unlock()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at changing this to use atomic.AddInt64().

It works, but my concern is that the map need to gaurentee that all keys exist in the map. This is possible by pre-filling the map with all possible combinations. However, we need to remember to update this file each time a new type is added. Ie:

modes = []string{
string(vpa_types.UpdateModeOff),
string(vpa_types.UpdateModeInitial),
string(vpa_types.UpdateModeRecreate),
string(vpa_types.UpdateModeAuto),
}

The problem is that there doesn't seem to be a good way to dynamically generate that list in the VPA.
What we may be able to do is write a generate script that will keep this file up to date, which we can run in CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

VPA race conditions
2 participants