Bugfix: deduplicate concurrent resolver cache requests with singleflight. #9365

Merged
tekton-robot merged 2 commits into tektoncd:main from twoGiants:issue-9364-resolver-cache-race-condition on Mar 24, 2026
Conversation

@twoGiants
Member

@twoGiants twoGiants commented Feb 7, 2026

Changes

Fix race condition in resolver cache where concurrent requests for the same resource bypass the cache and independently resolve upstream (cache stampede / thundering herd).

What changed

Before this change, cache lookup (Get) and storage (Add) were separate operations exposed on resolverCache, and the orchestration lived in GetFromCacheOrResolve in operations.go.

Now the entire check-or-resolve flow is a single GetCachedOrResolveFromRemote method on resolverCache that wraps the resolve callback with golang.org/x/sync/singleflight. Only one in-flight resolution proceeds per cache key; all concurrent callers share its result. The separate Get, Add, Remove, and GetFromCacheOrResolve public methods are removed — the cache surface is now just GetCachedOrResolveFromRemote and Clear.
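To illustrate the check-or-resolve flow described above, here is a minimal sketch that uses only the standard library. It is not the PR's code: `dedupCache`, `getOrResolve`, and the cache key are hypothetical names, and the hand-rolled in-flight map stands in for what `golang.org/x/sync/singleflight` provides. It also assumes that failed resolutions are not cached, which the description does not state.

```go
package main

import (
	"fmt"
	"sync"
)

// call tracks one in-flight resolution that other callers can wait on.
type call struct {
	done chan struct{}
	val  string
	err  error
}

// dedupCache is a hypothetical stand-in for the PR's resolverCache.
type dedupCache struct {
	mu       sync.Mutex
	entries  map[string]string
	inflight map[string]*call
}

func newDedupCache() *dedupCache {
	return &dedupCache{entries: map[string]string{}, inflight: map[string]*call{}}
}

// getOrResolve returns the cached value for key. On a miss, only the first
// caller runs resolve; concurrent callers for the same key wait and share
// its result instead of resolving again.
func (c *dedupCache) getOrResolve(key string, resolve func() (string, error)) (string, error) {
	c.mu.Lock()
	if v, ok := c.entries[key]; ok {
		c.mu.Unlock()
		return v, nil
	}
	if cl, ok := c.inflight[key]; ok {
		c.mu.Unlock()
		<-cl.done // wait for the leader to finish
		return cl.val, cl.err
	}
	cl := &call{done: make(chan struct{})}
	c.inflight[key] = cl
	c.mu.Unlock()

	cl.val, cl.err = resolve()

	c.mu.Lock()
	if cl.err == nil {
		c.entries[key] = cl.val // assumption: errors are not cached
	}
	delete(c.inflight, key)
	c.mu.Unlock()
	close(cl.done)
	return cl.val, cl.err
}

// countResolutions fires concurrent requests for one key and reports how
// many times the resolve callback actually ran.
func countResolutions(callers int) int {
	c := newDedupCache()
	var mu sync.Mutex
	resolved := 0
	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = c.getOrResolve("bundle:task@sha256:abc", func() (string, error) {
				mu.Lock()
				resolved++
				mu.Unlock()
				return "resolved-content", nil
			})
		}()
	}
	wg.Wait()
	return resolved
}

func main() {
	fmt.Println(countResolutions(50)) // prints 1
}
```

Because the cache entry is stored before the in-flight record is deleted, and both happen under the same lock, a caller always sees either the cached value or the pending call. With separate Get and Add steps there is a window in which it sees neither.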

Other changes

  • Use strings.Builder in generateCacheKey instead of repeated string concatenation
  • Remove unused resolverType parameter from ShouldUse
  • Rework e2e tests to verify caching by counting actual registry GET requests (via container logs and registry /metrics endpoint) instead of checking resolver log messages
  • Add multi-replica resolver e2e test (4 replicas, 80 TaskRuns) to validate singleflight works per-replica
  • Allow -run flag to bypass category filtering in TestMain so individual tests can run in isolation
  • Minor markdown lint fixes in test/README.md
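As a rough illustration of the `strings.Builder` item in the list above, the sketch below builds a deterministic key from a resolver type and its parameters. The function name `buildCacheKey`, its signature, the separator characters, and the sample values are all assumptions made for this example; the real `generateCacheKey` may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildCacheKey is a hypothetical illustration, not the PR's generateCacheKey.
// It joins the resolver type and the sorted parameters into a single string.
func buildCacheKey(resolverType string, params map[string]string) string {
	names := make([]string, 0, len(params))
	for name := range params {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for a stable key

	// strings.Builder appends into one growing buffer instead of allocating
	// a new string on every concatenation.
	var b strings.Builder
	b.WriteString(resolverType)
	for _, name := range names {
		b.WriteByte('|')
		b.WriteString(name)
		b.WriteByte('=')
		b.WriteString(params[name])
	}
	return b.String()
}

func main() {
	key := buildCacheKey("bundles", map[string]string{
		"name":   "my-task",
		"bundle": "registry.example/tasks:v1",
	})
	fmt.Println(key) // prints bundles|bundle=registry.example/tasks:v1|name=my-task
}
```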

@vdemeester also found and fixed a small bug in the e2e test annotations that kept me from running tests in isolation. With that little fix, it now works 🐱.

Fixes #9364

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Fix resolver cache race condition causing duplicate upstream pulls under concurrent load.

@twoGiants twoGiants added the kind/bug Categorizes issue or PR as related to a bug. label Feb 7, 2026
@tekton-robot tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 7, 2026
@tekton-robot tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 7, 2026
@twoGiants
Member Author

@waveywaves check out the fix! 😺

Member

@aThorp96 aThorp96 left a comment

Haven't gone through the tests yet but the fix looks good to me!

@waveywaves waveywaves self-assigned this Feb 12, 2026
@twoGiants twoGiants marked this pull request as draft February 14, 2026 18:37
@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 14, 2026
@twoGiants twoGiants force-pushed the issue-9364-resolver-cache-race-condition branch from 85f2292 to ffba4f2 Compare February 15, 2026 18:17
@twoGiants twoGiants marked this pull request as ready for review February 15, 2026 18:18
@tekton-robot tekton-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Feb 15, 2026
@twoGiants twoGiants force-pushed the issue-9364-resolver-cache-race-condition branch 2 times, most recently from feea6ff to db4118b Compare February 15, 2026 18:27
@twoGiants twoGiants requested a review from aThorp96 February 15, 2026 18:27
@twoGiants
Member Author

/retest

Member

@vdemeester vdemeester left a comment

Two comments

@waveywaves
Member

/retest

@twoGiants twoGiants force-pushed the issue-9364-resolver-cache-race-condition branch from db4118b to 618f491 Compare February 28, 2026 09:44
Comment on lines +65 to +73
// Check if user provided explicit -run flag
// If so, skip category filtering and run normally to respect user's choice
runFlag := flag.Lookup("test.run")
if runFlag != nil && runFlag.Value.String() != "" {
	exitCode := m.Run()
	fmt.Fprintf(os.Stderr, "Using kubeconfig at `%s` with cluster `%s`\n", knativetest.Flags.Kubeconfig, knativetest.Flags.Cluster)
	os.Exit(exitCode)
}

Member

😻

Member

@vdemeester vdemeester left a comment

One comment (a nit, really) and the same comments as @afrittoli; otherwise looks good to me!

runFlag := flag.Lookup("test.run")
if runFlag != nil && runFlag.Value.String() != "" {
	exitCode := m.Run()
	fmt.Fprintf(os.Stderr, "Using kubeconfig at `%s` with cluster `%s`\n", knativetest.Flags.Kubeconfig, knativetest.Flags.Cluster)
Member

Shouldn't this be before m.Run(), though?

Member Author

Good point! I moved it after flag.Parse at the top and removed duplication. Now it's shown for every case before the e2e tests run, which is better anyway 😸

Before this change, cache Get and Add were separate operations,
creating a TOCTOU race: concurrent requests for the same resource
could all miss the cache and each trigger a remote resolution.

Merge Get/Add/GetFromCacheOrResolve into a single
GetCachedOrResolveFromRemote method on resolverCache that wraps
the resolve callback with golang.org/x/sync/singleflight. Only
one in-flight resolution per cache key proceeds; all concurrent
callers share its result.

Other changes in this commit:
- Use strings.Builder in generateCacheKey instead of
  string concatenation
- Remove unused resolverType param from ShouldUse
- Rework e2e tests to verify caching by counting actual
  registry GET requests (via logs and metrics) instead of
  checking resolver log messages
- Add multi-replica resolver test (4 replicas, 200 TaskRuns)
- Allow -run flag to bypass category filtering in TestMain
- Log when resolution was deduped by singleflight

Issue tektoncd#9364

Signed-off-by: Stanislav Jakuschevskij <stas@two-giants.com>

Before, a corrupted cache entry (one that fails type assertion)
would return an error but stay in the cache, blocking all future
lookups for that key until TTL expiration or LRU eviction.

Now the corrupted entry is removed immediately so the next
request triggers a fresh resolution.

Also deduplicate the kubeconfig log line in TestMain by moving
it before the category-filtering branches.

Issue tektoncd#9364
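
The corrupted-entry handling described in the second commit message above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the `resource` type, the `lookup` function, the sentinel error, and the plain map used as a store are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// resource is a hypothetical placeholder for the cached resolved resource.
type resource struct{ data string }

var errCorrupt = errors.New("cached entry has unexpected type")

// lookup returns the cached resource for key; found is false on a miss.
// If the stored value fails the type assertion, the entry is deleted so the
// next request sees a miss and can trigger a fresh resolution.
func lookup(store map[string]any, key string) (res *resource, found bool, err error) {
	raw, ok := store[key]
	if !ok {
		return nil, false, nil
	}
	res, ok = raw.(*resource)
	if !ok {
		delete(store, key) // evict immediately instead of waiting for TTL or LRU
		return nil, false, errCorrupt
	}
	return res, true, nil
}

func main() {
	store := map[string]any{"k": "not a resource"}

	_, _, err := lookup(store, "k")
	fmt.Println(err != nil, len(store)) // prints true 0

	_, found, err := lookup(store, "k")
	fmt.Println(found, err) // prints false <nil>
}
```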
@twoGiants twoGiants force-pushed the issue-9364-resolver-cache-race-condition branch from 5d9c078 to 7de253c Compare March 21, 2026 10:35
@twoGiants
Member Author

twoGiants commented Mar 21, 2026

@afrittoli @khrm @vdemeester Thank you for taking a look 😸 👍

I addressed your comments in the latest commit, 7de253c. PTAL. If you approve, I can squash, or you can merge as is; let me know.

@twoGiants
Member Author

/retest

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [afrittoli,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vdemeester
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 24, 2026
@tekton-robot tekton-robot merged commit 5df99a0 into tektoncd:main Mar 24, 2026
37 of 39 checks passed
Member

@waveywaves waveywaves left a comment

Thank you for this fix @twoGiants 🤗

@vdemeester vdemeester deleted the issue-9364-resolver-cache-race-condition branch March 25, 2026 08:44
@vdemeester
Member

@twoGiants @waveywaves should we cherry-pick this?

@vdemeester vdemeester added this to the v1.11.0 milestone Mar 25, 2026
@waveywaves
Member

/cherry-pick release-v1.10.x

@waveywaves
Member

/cherry-pick release-v1.9.x

@waveywaves
Member

/cherry-pick release-v1.6.x

@tekton-robot
Collaborator

Cherry-pick to release-v1.6.x failed!

The automatic cherry-pick to release-v1.6.x failed.

Output:

🤖 Starting cherry-pick process...
Fetching PR #9365 information...
Found merge commit: 5df99a05d5fb8fcb13e17103736bfd715a617927
PR title: Bugfix: deduplicate concurrent resolver cache requests with singleflight.
Fetching target branch: release-v1.6.x...
From https://github.com/tektoncd/pipeline
 * branch                release-v1.6.x -> FETCH_HEAD
Checking for existing cherry-pick PR...
Creating cherry-pick branch: cherry-pick-9365-to-release-v1.6.x...
Switched to a new branch 'cherry-pick-9365-to-release-v1.6.x'
branch 'cherry-pick-9365-to-release-v1.6.x' set up to track 'origin/release-v1.6.x'.
Fetching commits from PR #9365...
Found 2 commit(s) to cherry-pick
Fetching PR ref to make fork commits available locally...
From https://github.com/tektoncd/pipeline
 * branch                refs/pull/9365/head -> FETCH_HEAD
Cherry-picking commit 1/2: fbf4493863d01de186bd93388b782655ae8e8e2f...
Auto-merging pkg/remoteresolution/resolver/framework/cache/cache.go
CONFLICT (content): Merge conflict in pkg/remoteresolution/resolver/framework/cache/cache.go
Auto-merging pkg/remoteresolution/resolver/framework/cache/cache_test.go
CONFLICT (content): Merge conflict in pkg/remoteresolution/resolver/framework/cache/cache_test.go
Auto-merging pkg/remoteresolution/resolver/git/resolver.go
Auto-merging test/README.md
CONFLICT (content): Merge conflict in test/README.md
Auto-merging test/init_test.go
CONFLICT (content): Merge conflict in test/init_test.go
Auto-merging test/registry_test.go
Auto-merging test/resolver_cache_test.go
CONFLICT (content): Merge conflict in test/resolver_cache_test.go
error: could not apply fbf449386... fix: resolve cache race with singleflight
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
❌ ERROR: Cherry-pick failed for commit fbf4493863d01de186bd93388b782655ae8e8e2f due to conflicts or other errors

Next steps:

  • Check the action logs for complete details
  • If the PR is not merged, merge it first and try again
  • If there are conflicts, you'll need to manually cherry-pick this PR

@tekton-robot
Collaborator

Cherry-pick to release-v1.10.x successful!

A new pull request has been created to cherry-pick this change to release-v1.10.x.

PR: #9662

Please review and merge the cherry-pick PR.

@tekton-robot
Collaborator

Cherry-pick to release-v1.9.x successful!

A new pull request has been created to cherry-pick this change to release-v1.9.x.

PR: #9663

Please review and merge the cherry-pick PR.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lgtm: Indicates that a PR is ready to be merged.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

No open projects
Status: Todo

Development

Successfully merging this pull request may close these issues.

Resolver cache race condition causes duplicate upstream pulls

7 participants