-
Notifications
You must be signed in to change notification settings - Fork 4k
sql: add safety gates for pausable portal cleanup and flow invalidation #155669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yuzefovich reviewed 4 of 4 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ZhouXing19)
-- commits
line 23 at r1:
nit: I'd probably omit a release note here given that we cannot explain well to users which conditions were necessary for the bug to happen.
pkg/sql/conn_executor_exec.go
line 1201 at r1 (raw file):
updateRetErrAndPayload(retErr, retPayload) portal.pauseInfo.resumableFlow.cleanup.run(ctx) portal.pauseInfo.resumableFlow.flow = nil
nit: rather than unsetting flow
after each time cleanup
runs, we should modify the cleanup function itself, this will be more bullet-proof.
8c70293
to
7abf219
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I'd probably omit a release note here given that we cannot explain well to users which conditions were necessary for the bug to happen.
Done.
pkg/sql/conn_executor_exec.go
line 1201 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: rather than unsetting
flow
after each timecleanup
runs, we should modify the cleanup function itself, this will be more bullet-proof.
Good idea! Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I think it's worth backporting to 25.3 and 25.4, so I added the labels. Also make sure to update the PR description.
@yuzefovich reviewed 3 of 3 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @ZhouXing19)
Epic: None Informs: cockroachlabs/support#3463 When a pausable portal encounters an error during execution, two issues can lead to panics on subsequent resume attempts: 1. The underlying FlowBase gets reset to nil during cleanup, but the portal's flow reference remains non-nil, causing hasFlowForPausablePortal() to incorrectly return true. 2. Errored portals are not removed from the portal map because deletion only occurs when execStmt() returns a non-nil fsm.Event. This change adds two defensive checks: - Nil the whole flow object hanging off the portalInfo while cleaning up the flow. - Ensure errored portals are properly cleaned up regardless of event state These gates prevent nil pointer dereferences when resuming portals that have been partially cleaned up due to errors. Release note: None
7abf219
to
9a13f8e
Compare
Informs: https://github.com/cockroachlabs/support/issues/3463
When a pausable portal encounters an error during execution, two issues can lead to panics on subsequent resume attempts:
The underlying FlowBase gets reset to nil during cleanup, but the portal's flow reference remains non-nil, causing hasFlowForPausablePortal() to incorrectly return true.
Errored portals are not removed from the portal map because deletion only occurs when execStmt() returns a non-nil fsm.Event.
This change adds two defensive checks:
portalInfo
while cleaning up the flow.These gates prevent nil pointer dereferences when resuming portals that have been partially cleaned up due to errors.
Release note: None