-
Notifications
You must be signed in to change notification settings - Fork 174
Fix race condition when re-triggering GitHub Actions #563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix race condition when re-triggering GitHub Actions #563
Conversation
✅ Deploy Preview for k8s-prow ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: saschagrunert The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
d901412 to
09ac330
Compare
09ac330 to
9c0286a
Compare
9c0286a to
5552e4b
Compare
When a GitHub Action is re-triggered, GitHub temporarily removes the old check status before the new run starts. This causes a race where Tide may merge the PR during the brief window when the check is missing. This fix tracks previously seen contexts per PR (not per commit, so it works across force pushes). When a required context disappears, it's treated as PENDING to prevent premature merging. The context history is automatically pruned each sync to prevent memory leaks, and duplicates are avoided when a context is both missing and disappeared. Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
5552e4b to
75d9fb6
Compare
- Document issue summary and timeline - Identify suspected race condition in tide status checking - Note PR kubernetes-sigs#563 is working on a fix - Define next steps for investigation
Completed initial validation subcommand analysis: - Issue is a legitimate bug in Tide component (pkg/tide/status.go) - Describes race condition when re-triggering GitHub Actions - Provides comprehensive information: example PR, code reference, screenshots - Component verified to exist in this repository - Fix already in progress via PR kubernetes-sigs#563 - Recommendation: Keep open and continue triage
Completed comprehensive code investigation: - Analyzed Tide status checking implementation and architecture - Identified root cause: GitHub removes old CheckRun before creating new one during re-trigger - Documented key code paths: isPassingTests, unsuccessfulContexts, headContexts, checkRunToContext - Reviewed test coverage and identified gaps - Analyzed PR kubernetes-sigs#563's solution approach - Proposed 3 architectural solutions with trade-offs - Recommended Approach 1: Track previously seen contexts (PR kubernetes-sigs#563's approach) - Provided detailed implementation considerations and testing requirements
Completed comprehensive effort assessment: - Assessed Level 3 (Large - Requires Expertise) - Analyzed 8 factors: scope (moderate), complexity (high), expertise (deep), clarity (well-defined), testing (complex), backwards compat (fully compatible), architecture (good fit), external deps (well-supported) - Recommended labels: area/tide, kind/bug, priority/important-soon - Explicitly NOT recommended: good-first-issue, help-needed - Provided detailed guidance for Level 3 contributors - Explained why Level 3 (not 2 or 4): race condition complexity + critical merge path - Noted PR kubernetes-sigs#563 already has implementation
- No title change needed (already clear and specific) - Added missing context: root cause explanation and technical mechanism - Explains GitHub's CheckRun removal behavior during re-trigger - References PR kubernetes-sigs#563 which implements the fix - Labels: /priority important-soon (can cause incorrect merges) - No difficulty label (Level 3 issue, requires expertise) - Recommendation: Post this comment (adds valuable context)
- Document issue summary and timeline - Identify suspected race condition in tide status checking - Note PR kubernetes-sigs#563 is working on a fix - Define next steps for investigation
Completed initial validation subcommand analysis: - Issue is a legitimate bug in Tide component (pkg/tide/status.go) - Describes race condition when re-triggering GitHub Actions - Provides comprehensive information: example PR, code reference, screenshots - Component verified to exist in this repository - Fix already in progress via PR kubernetes-sigs#563 - Recommendation: Keep open and continue triage
Completed comprehensive code investigation: - Analyzed Tide status checking implementation and architecture - Identified root cause: GitHub removes old CheckRun before creating new one during re-trigger - Documented key code paths: isPassingTests, unsuccessfulContexts, headContexts, checkRunToContext - Reviewed test coverage and identified gaps - Analyzed PR kubernetes-sigs#563's solution approach - Proposed 3 architectural solutions with trade-offs - Recommended Approach 1: Track previously seen contexts (PR kubernetes-sigs#563's approach) - Provided detailed implementation considerations and testing requirements
Completed comprehensive effort assessment: - Assessed Level 3 (Large - Requires Expertise) - Analyzed 8 factors: scope (moderate), complexity (high), expertise (deep), clarity (well-defined), testing (complex), backwards compat (fully compatible), architecture (good fit), external deps (well-supported) - Recommended labels: area/tide, kind/bug, priority/important-soon - Explicitly NOT recommended: good-first-issue, help-needed - Provided detailed guidance for Level 3 contributors - Explained why Level 3 (not 2 or 4): race condition complexity + critical merge path - Noted PR kubernetes-sigs#563 already has implementation
- No title change needed (already clear and specific) - Added missing context: root cause explanation and technical mechanism - Explains GitHub's CheckRun removal behavior during re-trigger - References PR kubernetes-sigs#563 which implements the fix - Labels: /priority important-soon (can cause incorrect merges) - No difficulty label (Level 3 issue, requires expertise) - Recommendation: Post this comment (adds valuable context)
|
Wow that GH spam is annoying, need to fix that automation to not link to PRs / issues this way. I have a bit of a hard time wrapping my head around this fix:
Maybe I understand wrong and the behavior fixed here is actually Tide not knowing anything about specific GH checks, but doing some kind of "unknown non-Prow results must PASS when present" behavior. Tide would not be configured to require a certain non-Prow check, but would be configured to require non-Prow checks to be passing, and the temporary absence of a certain non-Prow result would pass that criteria leading to a merge. Can you confirm? I have not yet looked at the code closely, but I wonder if we can have a situation where a check goes away legitimately (when a job is removed between when it last runs on a commit and when we evaluate later). Can that happen? Skimming the change it seems we only expect checks that were present on the same commit in the past. At least status contexts cannot be removed from a commit so this looks good but I'm not sure about all the Check behaviors. |
|
@petr-muller it happens pretty often when a retrigger a flaking GH action, for example in: kubernetes-sigs/cri-tools#1973 most recently:
This leaves the PR in a state where the CI is still running but it got already merged, see: |

When a GitHub Action is re-triggered, GitHub temporarily removes the old check status before the new run starts. This causes a race where Tide may merge the PR during the brief window when the check is missing.
This fix tracks previously seen contexts per PR/commit. When a required context disappears, it's treated as PENDING to prevent premature merging.
Fixes #337