@saschagrunert
Member

When a GitHub Action is re-triggered, GitHub temporarily removes the old check status before the new run starts. This causes a race where Tide may merge the PR during the brief window when the check is missing.

This fix tracks previously seen contexts per PR/commit. When a required context disappears, it's treated as PENDING to prevent premature merging.
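
As a rough illustration of the idea (a minimal sketch, not the code in this PR; `contextHistory`, `disappeared`, and the PR key format are hypothetical names chosen for the example), tracking previously seen contexts could look like this:

```go
package main

import (
	"fmt"
	"sort"
)

// contextHistory remembers, per PR, which status contexts have been observed.
// Hypothetical type for illustration; not the actual Tide implementation.
type contextHistory struct {
	seen map[string]map[string]struct{} // PR key -> set of context names
}

func newContextHistory() *contextHistory {
	return &contextHistory{seen: map[string]map[string]struct{}{}}
}

// disappeared returns the required contexts that were seen on an earlier
// sync but are absent from the current one, then records the current
// contexts. A caller would treat the returned contexts as PENDING.
func (h *contextHistory) disappeared(prKey string, required, current []string) []string {
	present := make(map[string]struct{}, len(current))
	for _, c := range current {
		present[c] = struct{}{}
	}

	prev := h.seen[prKey] // nil on first sight; reading a nil map is safe
	var gone []string
	for _, r := range required {
		_, wasSeen := prev[r]
		_, isPresent := present[r]
		if wasSeen && !isPresent {
			gone = append(gone, r)
		}
	}
	sort.Strings(gone)

	if prev == nil {
		prev = map[string]struct{}{}
		h.seen[prKey] = prev
	}
	for c := range present {
		prev[c] = struct{}{}
	}
	return gone
}

func main() {
	h := newContextHistory()
	required := []string{"ci/build"}

	// First sync: the check is present, so nothing has disappeared.
	fmt.Println(h.disappeared("org/repo#1", required, []string{"ci/build"}))
	// During a re-trigger the check is briefly missing and gets reported.
	fmt.Println(h.disappeared("org/repo#1", required, nil))
}
```

Keying the history by PR rather than by commit is an assumption of this sketch; it means a context seen before a force push is still remembered afterwards.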

Fixes #337

@netlify

netlify bot commented Dec 4, 2025

Deploy Preview for k8s-prow ready!

Name Link
🔨 Latest commit 75d9fb6
🔍 Latest deploy log https://app.netlify.com/projects/k8s-prow/deploys/6931568dda24f300084d3284
😎 Deploy Preview https://deploy-preview-563--k8s-prow.netlify.app

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: saschagrunert
Once this PR has been reviewed and has the lgtm label, please assign cjwagner for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from cjwagner December 4, 2025 09:21
@k8s-ci-robot k8s-ci-robot added the area/tide Issues or PRs related to prow's tide component label Dec 4, 2025
@k8s-ci-robot k8s-ci-robot requested a review from jmguzik December 4, 2025 09:21
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 4, 2025
@saschagrunert saschagrunert changed the title Fix race condition when re-triggering GitHub Actions WIP: Fix race condition when re-triggering GitHub Actions Dec 4, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 4, 2025
@saschagrunert saschagrunert force-pushed the fix-tide-retest-race-337 branch from d901412 to 09ac330 Compare December 4, 2025 09:24
@saschagrunert saschagrunert force-pushed the fix-tide-retest-race-337 branch from 09ac330 to 9c0286a Compare December 4, 2025 09:28
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Dec 4, 2025
@saschagrunert saschagrunert force-pushed the fix-tide-retest-race-337 branch from 9c0286a to 5552e4b Compare December 4, 2025 09:34
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Dec 4, 2025
When a GitHub Action is re-triggered, GitHub temporarily removes the old
check status before the new run starts. This causes a race where Tide may
merge the PR during the brief window when the check is missing.

This fix tracks previously seen contexts per PR (not per commit, so it
works across force pushes). When a required context disappears, it's
treated as PENDING to prevent premature merging.

The context history is automatically pruned each sync to prevent memory
leaks, and duplicates are avoided when a context is both missing and
disappeared.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
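
The pruning and de-duplication described in the commit message above can be sketched as follows. This is an illustrative example under stated assumptions, not the code from the commit: `prune`, `mergeUnique`, and the map layout are hypothetical.

```go
package main

import "fmt"

// prune drops history entries for PRs that are not part of the current
// sync, so the history cannot grow without bound. Hypothetical helper.
func prune(history map[string]map[string]struct{}, activePRs []string) {
	active := make(map[string]struct{}, len(activePRs))
	for _, k := range activePRs {
		active[k] = struct{}{}
	}
	for k := range history {
		if _, ok := active[k]; !ok {
			delete(history, k) // deleting while ranging over a map is safe in Go
		}
	}
}

// mergeUnique concatenates base and extra, keeping the first occurrence of
// each name, so a context that is both "missing" and "disappeared" is
// reported only once. Hypothetical helper.
func mergeUnique(base, extra []string) []string {
	seen := make(map[string]struct{}, len(base)+len(extra))
	out := make([]string, 0, len(base)+len(extra))
	for _, list := range [][]string{base, extra} {
		for _, c := range list {
			if _, dup := seen[c]; dup {
				continue
			}
			seen[c] = struct{}{}
			out = append(out, c)
		}
	}
	return out
}

func main() {
	history := map[string]map[string]struct{}{
		"org/repo#1": {"ci/build": {}},
		"org/repo#2": {"ci/lint": {}},
	}
	prune(history, []string{"org/repo#1"})
	fmt.Println(len(history))

	missing := []string{"ci/build"}
	disappeared := []string{"ci/build", "ci/lint"}
	fmt.Println(mergeUnique(missing, disappeared))
}
```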
@saschagrunert saschagrunert force-pushed the fix-tide-retest-race-337 branch from 5552e4b to 75d9fb6 Compare December 4, 2025 09:38
@saschagrunert saschagrunert changed the title WIP: Fix race condition when re-triggering GitHub Actions Fix race condition when re-triggering GitHub Actions Dec 4, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 4, 2025
petr-muller added a commit to petr-muller/prow that referenced this pull request Dec 23, 2025
- Document issue summary and timeline
- Identify suspected race condition in tide status checking
- Note PR kubernetes-sigs#563 is working on a fix
- Define next steps for investigation
petr-muller added a commit to petr-muller/prow that referenced this pull request Dec 23, 2025
Completed initial validation subcommand analysis:
- Issue is a legitimate bug in Tide component (pkg/tide/status.go)
- Describes race condition when re-triggering GitHub Actions
- Provides comprehensive information: example PR, code reference, screenshots
- Component verified to exist in this repository
- Fix already in progress via PR kubernetes-sigs#563
- Recommendation: Keep open and continue triage
petr-muller added a commit to petr-muller/prow that referenced this pull request Dec 23, 2025
Completed comprehensive code investigation:
- Analyzed Tide status checking implementation and architecture
- Identified root cause: GitHub removes old CheckRun before creating new one during re-trigger
- Documented key code paths: isPassingTests, unsuccessfulContexts, headContexts, checkRunToContext
- Reviewed test coverage and identified gaps
- Analyzed PR kubernetes-sigs#563's solution approach
- Proposed 3 architectural solutions with trade-offs
- Recommended Approach 1: Track previously seen contexts (PR kubernetes-sigs#563's approach)
- Provided detailed implementation considerations and testing requirements
petr-muller added a commit to petr-muller/prow that referenced this pull request Dec 23, 2025
Completed comprehensive effort assessment:
- Assessed Level 3 (Large - Requires Expertise)
- Analyzed 8 factors: scope (moderate), complexity (high), expertise (deep), clarity (well-defined), testing (complex), backwards compat (fully compatible), architecture (good fit), external deps (well-supported)
- Recommended labels: area/tide, kind/bug, priority/important-soon
- Explicitly NOT recommended: good-first-issue, help-needed
- Provided detailed guidance for Level 3 contributors
- Explained why Level 3 (not 2 or 4): race condition complexity + critical merge path
- Noted PR kubernetes-sigs#563 already has implementation
petr-muller added a commit to petr-muller/prow that referenced this pull request Dec 23, 2025
- No title change needed (already clear and specific)
- Added missing context: root cause explanation and technical mechanism
- Explains GitHub's CheckRun removal behavior during re-trigger
- References PR kubernetes-sigs#563 which implements the fix
- Labels: /priority important-soon (can cause incorrect merges)
- No difficulty label (Level 3 issue, requires expertise)
- Recommendation: Post this comment (adds valuable context)
@petr-muller
Contributor

Wow that GH spam is annoying, need to fix that automation to not link to PRs / issues this way.

I have a bit of a hard time wrapping my head around this fix:

  1. If a known required (by Tide) GH check is not passing (by not being momentarily present on a PR), how does it lead to a spurious merge?
  2. If for some reason ^^^ has some reasonable explanation, why do we need to track previously seen checks instead of relying on configuration-time knowledge that a certain check is required?

Maybe I misunderstand, and the behavior fixed here is actually Tide not knowing anything about specific GH checks but applying some kind of "unknown non-Prow results must PASS when present" rule. In that case Tide would not be configured to require a specific non-Prow check, only to require that non-Prow checks pass, and the temporary absence of a non-Prow result would satisfy that criterion and lead to a merge. Can you confirm?

I have not yet looked at the code closely, but I wonder whether a check can go away legitimately (for example, when a job is removed between its last run on a commit and our later evaluation). Can that happen? Skimming the change, it seems we only expect checks that were present on the same commit in the past. At least status contexts cannot be removed from a commit, so this looks good, but I'm not sure about all the Check behaviors.

@saschagrunert
Member Author

saschagrunert commented Jan 13, 2026

@petr-muller it happens pretty often when I retrigger a flaky GH action, most recently for example in kubernetes-sigs/cri-tools#1973:

  • The action failed, so Tide is complaining and won't merge.
  • I retrigger the GH action.
  • Tide merges during the window in which the context is not visible to it.

This leaves the PR in a state where CI is still running but the PR has already been merged, see:
screenshot

Labels

area/tide Issues or PRs related to prow's tide component cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tide merges PR when retesting GitHub action

3 participants