Bugfix: deduplicate concurrent resolver cache requests with singleflight.#9365
Conversation
|
@waveywaves check out the fix! 😺 |
aThorp96
left a comment
There was a problem hiding this comment.
Haven't gone through the tests yet but the fix looks good to me!
85f2292 to
ffba4f2
Compare
feea6ff to
db4118b
Compare
|
/retest |
|
/retest |
db4118b to
618f491
Compare
| // Check if user provided explicit -run flag | ||
| // If so, skip category filtering and run normally to respect user's choice | ||
| runFlag := flag.Lookup("test.run") | ||
| if runFlag != nil && runFlag.Value.String() != "" { | ||
| exitCode := m.Run() | ||
| fmt.Fprintf(os.Stderr, "Using kubeconfig at `%s` with cluster `%s`\n", knativetest.Flags.Kubeconfig, knativetest.Flags.Cluster) | ||
| os.Exit(exitCode) | ||
| } | ||
|
|
vdemeester
left a comment
There was a problem hiding this comment.
One comment (nit really) and same comments as @afrittoli, otherwise looks good to me !
test/init_test.go
Outdated
| runFlag := flag.Lookup("test.run") | ||
| if runFlag != nil && runFlag.Value.String() != "" { | ||
| exitCode := m.Run() | ||
| fmt.Fprintf(os.Stderr, "Using kubeconfig at `%s` with cluster `%s`\n", knativetest.Flags.Kubeconfig, knativetest.Flags.Cluster) |
There was a problem hiding this comment.
shouldn't this be "before" m.Run() though ?
There was a problem hiding this comment.
Good point! I moved it after flag.Parse at the top and removed duplication. Now it's shown for every case before the e2e tests run, which is better anyway 😸
Before this change, cache Get and Add were separate operations, creating a TOCTOU race: concurrent requests for the same resource could all miss the cache and each trigger a remote resolution. Merge Get/Add/GetFromCacheOrResolve into a single GetCachedOrResolveFromRemote method on resolverCache that wraps the resolve callback with golang.org/x/sync/singleflight. Only one in-flight resolution per cache key proceeds; all concurrent callers share its result. Other changes in this commit: - Use strings.Builder in generateCacheKey instead of string concatenation - Remove unused resolverType param from ShouldUse - Rework e2e tests to verify caching by counting actual registry GET requests (via logs and metrics) instead of checking resolver log messages - Add multi-replica resolver test (4 replicas, 200 TaskRuns) - Allow -run flag to bypass category filtering in TestMain - Log when resolution was deduped by singleflight Issue tektoncd#9364 Signed-off-by: Stanislav Jakuschevskij <stas@two-giants.com>
Before, a corrupted cache entry (one that fails type assertion) would return an error but stay in the cache, blocking all future lookups for that key until TTL expiration or LRU eviction. Now the corrupted entry is removed immediately so the next request triggers a fresh resolution. Also deduplicate the kubeconfig log line in TestMain by moving it before the category-filtering branches. Issue tektoncd#9364
5d9c078 to
7de253c
Compare
|
@afrittoli @khrm @vdemeester Thank you for taking a look 😸 👍 I addressed your comments in the latest commit 7de253c PTAL. If you approve I squash or you can merge as is => let me know. |
|
/retest |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: afrittoli, vdemeester The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/lgtm |
waveywaves
left a comment
There was a problem hiding this comment.
Thank you for this fix @twoGiants 🤗
|
@twoGiants @waveywaves should we cherry-pick this ? |
|
/cherry-pick release-v1.10.x |
|
/cherry-pick release-v1.9.x |
|
/cherry-pick release-v1.6.x |
|
❌ Cherry-pick to The automatic cherry-pick to Output: Next steps:
|
|
✅ Cherry-pick to A new pull request has been created to cherry-pick this change to PR: #9662 Please review and merge the cherry-pick PR. |
|
✅ Cherry-pick to A new pull request has been created to cherry-pick this change to PR: #9663 Please review and merge the cherry-pick PR. |
Changes
Fix race condition in resolver cache where concurrent requests for the same resource bypass the cache and independently resolve upstream (cache stampede / thundering herd).
What changed
Before this change, cache lookup (
Get) and storage (Add) were separate operations exposed onresolverCache, and the orchestration lived inGetFromCacheOrResolveinoperations.go.Now the entire check-or-resolve flow is a single
GetCachedOrResolveFromRemotemethod onresolverCachethat wraps the resolve callback withgolang.org/x/sync/singleflight. Only one in-flight resolution proceeds per cache key; all concurrent callers share its result. The separateGet,Add,Remove, andGetFromCacheOrResolvepublic methods are removed — the cache surface is now justGetCachedOrResolveFromRemoteandClear.Other changes
strings.BuilderingenerateCacheKeyinstead of repeated string concatenationresolverTypeparameter fromShouldUse/metricsendpoint) instead of checking resolver log messages-runflag to bypass category filtering inTestMainso individual tests can run in isolationtest/README.md@vdemeester also found and fixed a small bug in the e2e test annotations. I wasn't able to run tests in isolation. Now the little fix does the magic 🐱.
Fixes #9364
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tepRelease Notes