proof of concept: stop sequentially, fixing racy shutdown#1464
Draft
Groxx wants to merge 4 commits intocadence-workflow:masterfrom
Draft
proof of concept: stop sequentially, fixing racy shutdown#1464Groxx wants to merge 4 commits intocadence-workflow:masterfrom
Groxx wants to merge 4 commits intocadence-workflow:masterfrom
Conversation
Thanks to some internal tooling to look for deadlocked goroutines, this test: > appears to go.uber.org/cadence/internal.(*WorkflowTestSuiteUnitTest).Test_ContextMisuse appears to trigger a goroutine leak. Since it's not doing anything *impossible* for users to do (it's just using the wrong context arg), it seems like this implies we've got incorrect goroutine shutdown code. I'm not yet fully confident that this is The Fix and that it does not cause other issues, but it seems pretty likely so far. I don't know if this would explain the in-production leaks we've seen in the past but not reproduced (those seem to be query-triggered somehow), but it *does* seem quite likely that this has leaked somewhere. Just hopefully also caused errors that pointed to bad code which led to a fix.
this relies on changes merged in from maybe-bad-yield, as otherwise some tests deadlock. Signed-off-by: Steven L <imgroxx@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Probable fix for our racy shutdown issue: just wait for the coroutine's goroutine to close, before stopping the next.
Since this is fairly likely to lead to getting stuck in incorrect code, it also reports failures to stop "quickly" (1 second), which is quite permissive as ~all correct code should finish on the order of a millisecond or less. Extreme system load could still cause this to be exceeded, but it retains correct behavior even then so I think we can address those log/metric complaints if/when they occur.
This also includes an earlier proof-of-concept to fix what might be a source of our goroutine leaks: using the wrong context writes to the wrong aboutToBlock channel, causing more writes to occur on it than reads, causing a deadlock.
The good(?) news is that fixing our shutdown logic reveals this deadlock in the
ContextMisusetest - tests get stuck, and actrl-\shows its stacks. Merging in the wrong-chan-write fix resolves that, and tests run quickly/normally again.