Bugfix: deduplicate concurrent resolver cache requests with singleflight. #9365

Open
twoGiants wants to merge 1 commit into tektoncd:main from twoGiants:issue-9364-resolver-cache-race-condition

@twoGiants
Member

@twoGiants twoGiants commented Feb 7, 2026

Changes

Fix race condition in resolver cache where concurrent requests for the same resource bypass the cache and independently resolve upstream (cache stampede / thundering herd).

What changed

Before this change, cache lookup (Get) and storage (Add) were separate operations exposed on resolverCache, and the orchestration lived in GetFromCacheOrResolve in operations.go.

Now the entire check-or-resolve flow is a single GetCachedOrResolveFromRemote method on resolverCache that wraps the resolve callback with golang.org/x/sync/singleflight. Only one in-flight resolution proceeds per cache key; all concurrent callers share its result. The separate Get, Add, Remove, and GetFromCacheOrResolve public methods are removed — the cache surface is now just GetCachedOrResolveFromRemote and Clear.

Other changes

  • Use strings.Builder in generateCacheKey instead of repeated string concatenation
  • Remove unused resolverType parameter from ShouldUse
  • Rework e2e tests to verify caching by counting actual registry GET requests (via container logs and registry /metrics endpoint) instead of checking resolver log messages
  • Add multi-replica resolver e2e test (4 replicas, 80 TaskRuns) to validate singleflight works per-replica
  • Allow -run flag to bypass category filtering in TestMain so individual tests can run in isolation
  • Minor markdown lint fixes in test/README.md
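
For the strings.Builder item in the list above, a rough illustration of the technique. `buildCacheKey`, its parameters, and the key layout are assumptions made for this example; the real generateCacheKey may take different inputs and produce a different format.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildCacheKey is a hypothetical key builder. It writes each part into a
// strings.Builder instead of growing a string with repeated concatenation.
func buildCacheKey(resolverType string, params map[string]string) string {
	// Sort parameter names so the key does not depend on map iteration order.
	names := make([]string, 0, len(params))
	for name := range params {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	b.WriteString(resolverType)
	for _, name := range names {
		b.WriteByte(':')
		b.WriteString(name)
		b.WriteByte('=')
		b.WriteString(params[name])
	}
	return b.String()
}

func main() {
	key := buildCacheKey("bundles", map[string]string{
		"name":   "my-task",
		"bundle": "registry.example/tasks:1.0",
	})
	fmt.Println(key)
}
```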

@vdemeester also found and fixed a small bug in the e2e test annotations that had kept me from running tests in isolation. With that fix in place, it works 🐱.

Fixes #9364

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Fix resolver cache race condition causing duplicate upstream pulls under concurrent load.

@twoGiants twoGiants added the kind/bug Categorizes issue or PR as related to a bug. label Feb 7, 2026
@tekton-robot tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 7, 2026
@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from twogiants after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 7, 2026
@twoGiants
Member Author

@waveywaves check out the fix! 😺

Member

@aThorp96 aThorp96 left a comment

Haven't gone through the tests yet but the fix looks good to me!

@waveywaves waveywaves self-assigned this Feb 12, 2026
@twoGiants twoGiants marked this pull request as draft February 14, 2026 18:37
@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 14, 2026
@twoGiants twoGiants force-pushed the issue-9364-resolver-cache-race-condition branch from 85f2292 to ffba4f2 Compare February 15, 2026 18:17
@twoGiants twoGiants marked this pull request as ready for review February 15, 2026 18:18
@tekton-robot tekton-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Feb 15, 2026
@twoGiants twoGiants force-pushed the issue-9364-resolver-cache-race-condition branch 2 times, most recently from feea6ff to db4118b Compare February 15, 2026 18:27
@twoGiants twoGiants requested a review from aThorp96 February 15, 2026 18:27
@twoGiants
Member Author

/retest

Member

@vdemeester vdemeester left a comment

Two comments

@waveywaves
Member

/retest

@twoGiants twoGiants force-pushed the issue-9364-resolver-cache-race-condition branch 2 times, most recently from 618f491 to 5fefa6d Compare February 28, 2026 10:37
Before this change, cache Get and Add were separate operations,
creating a TOCTOU race: concurrent requests for the same resource
could all miss the cache and each trigger a remote resolution.

Merge Get/Add/GetFromCacheOrResolve into a single
GetCachedOrResolveFromRemote method on resolverCache that wraps
the resolve callback with golang.org/x/sync/singleflight. Only
one in-flight resolution per cache key proceeds; all concurrent
callers share its result.

Other changes in this commit:
- Use strings.Builder in generateCacheKey instead of
  string concatenation
- Remove unused resolverType param from ShouldUse
- Rework e2e tests to verify caching by counting actual
  registry GET requests (via logs and metrics) instead of
  checking resolver log messages
- Add multi-replica resolver test (4 replicas, 200 TaskRuns)
- Allow -run flag to bypass category filtering in TestMain

Issue tektoncd#9364

Signed-off-by: Stanislav Jakuschevskij <stas@two-giants.com>
@twoGiants twoGiants force-pushed the issue-9364-resolver-cache-race-condition branch from 5fefa6d to 809b23e Compare February 28, 2026 10:49
@twoGiants twoGiants requested a review from vdemeester February 28, 2026 13:57
@twoGiants
Member Author

We're good, take another look: @vdemeester @aThorp96

I updated the code, answered and marked the comments as resolved.

Labels

kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

No open projects
Status: Todo

Development

Successfully merging this pull request may close these issues.

Resolver cache race condition causes duplicate upstream pulls

5 participants