Collaborator

@ZhouXing19 ZhouXing19 commented Oct 18, 2025

Informs: https://github.com/cockroachlabs/support/issues/3463

When a pausable portal encounters an error during execution, two issues can lead to panics on subsequent resume attempts:

  1. The underlying FlowBase gets reset to nil during cleanup, but the portal's flow reference remains non-nil, causing hasFlowForPausablePortal() to incorrectly return true.

  2. Errored portals are not removed from the portal map because deletion only occurs when execStmt() returns a non-nil fsm.Event.

This change adds two defensive checks:

  • Set the whole flow object hanging off the portalInfo to nil while cleaning up the flow.
  • Ensure errored portals are properly cleaned up regardless of event state.

These gates prevent nil pointer dereferences when resuming portals that have been partially cleaned up due to errors.

Release note: None

blathers-crl bot commented Oct 18, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Member

@yuzefovich yuzefovich left a comment

@yuzefovich reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ZhouXing19)


-- commits line 23 at r1:
nit: I'd probably omit a release note here given that we cannot explain well to users which conditions were necessary for the bug to happen.


pkg/sql/conn_executor_exec.go line 1201 at r1 (raw file):

			updateRetErrAndPayload(retErr, retPayload)
			portal.pauseInfo.resumableFlow.cleanup.run(ctx)
			portal.pauseInfo.resumableFlow.flow = nil

nit: rather than unsetting flow after each time cleanup runs, we should modify the cleanup function itself, this will be more bullet-proof.
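
One way to read this suggestion, as a hedged sketch only: the cleanup step registered for the flow clears the reference itself, so no call site has to remember to do it after `run`. The types `flowHandle` and `cleanupStack` and the constructor `newResumableFlow` below are hypothetical and do not reflect the real definitions in `pkg/sql`.

```go
package main

import "fmt"

// flowHandle is a hypothetical stand-in for the real flow type.
type flowHandle struct{ released bool }

// cleanupStack collects functions to run when the portal is torn down.
type cleanupStack struct{ fns []func() }

func (c *cleanupStack) push(f func()) { c.fns = append(c.fns, f) }

// run executes the registered functions in order and then empties the
// stack, so that a second call is a no-op.
func (c *cleanupStack) run() {
	for _, f := range c.fns {
		f()
	}
	c.fns = nil
}

// resumableFlow pairs the flow reference with its cleanup stack.
type resumableFlow struct {
	flow    *flowHandle
	cleanup cleanupStack
}

// newResumableFlow registers a cleanup step that releases the flow and also
// clears the reference, so callers need not unset it after each run().
func newResumableFlow() *resumableFlow {
	rf := &resumableFlow{flow: &flowHandle{}}
	rf.cleanup.push(func() {
		rf.flow.released = true
		rf.flow = nil
	})
	return rf
}

func main() {
	rf := newResumableFlow()
	rf.cleanup.run()
	fmt.Println(rf.flow == nil) // prints "true"
	rf.cleanup.run()            // safe: the stack is already empty
}
```

Keeping the reset inside the cleanup step means every path that runs cleanup gets the nil reference for free, which is the "bullet-proof" property the comment asks for.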

@ZhouXing19 ZhouXing19 force-pushed the portal-gate branch 2 times, most recently from 8c70293 to 7abf219 Compare October 21, 2025 21:16
Collaborator Author

@ZhouXing19 ZhouXing19 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


-- commits line 23 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I'd probably omit a release note here given that we cannot explain well to users which conditions were necessary for the bug to happen.

Done.


pkg/sql/conn_executor_exec.go line 1201 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: rather than unsetting flow after each time cleanup runs, we should modify the cleanup function itself, this will be more bullet-proof.

Good idea! Done.

@yuzefovich yuzefovich added backport-25.3.x Flags PRs that need to be backported to 25.3 backport-25.4.x Flags PRs that need to be backported to 25.4 labels Oct 22, 2025
Member

@yuzefovich yuzefovich left a comment

Nice! :lgtm: I think it's worth backporting to 25.3 and 25.4, so I added the labels. Also make sure to update the PR description.

@yuzefovich reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ZhouXing19)

Epic: None
Informs: cockroachlabs/support#3463

When a pausable portal encounters an error during execution, two issues
can lead to panics on subsequent resume attempts:

1. The underlying FlowBase gets reset to nil during cleanup, but the
   portal's flow reference remains non-nil, causing hasFlowForPausablePortal()
   to incorrectly return true.

2. Errored portals are not removed from the portal map because deletion
   only occurs when execStmt() returns a non-nil fsm.Event.

This change adds two defensive checks:
- Set the whole flow object hanging off the portalInfo to nil while cleaning up the flow.
- Ensure errored portals are properly cleaned up regardless of event state.

These gates prevent nil pointer dereferences when resuming portals that
have been partially cleaned up due to errors.

Release note: None
@ZhouXing19 ZhouXing19 marked this pull request as ready for review October 22, 2025 14:15
@ZhouXing19 ZhouXing19 requested a review from a team as a code owner October 22, 2025 14:15
@ZhouXing19 ZhouXing19 requested review from mgartner and removed request for a team and mgartner October 22, 2025 14:15