-
Notifications
You must be signed in to change notification settings - Fork 2.2k
discovery: fix potential infinite loop bug re context cancel error handling in gossip syncer #10330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
discovery: fix potential infinite loop bug re context cancel error handling in gossip syncer #10330
Conversation
Summary of ChangesHello @Roasbeef, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical stability issue in the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
d33971d to
87b5c3d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively addresses a potential infinite loop in the gossip syncer by ensuring that fatal errors, such as context cancellation, are properly propagated up from state handlers to terminate the main loop. The changes to handleSyncingChans and synchronizeChanIDs to return errors, and the handling of these errors in channelGraphSyncer, are correct and well-implemented. The addition of comprehensive tests, including an errorInjector for controlled error simulation and specific test cases for different failure modes, is excellent and provides strong confidence in the fix. I have one minor suggestion to improve the precision of an assertion in a new test.
discovery/syncer_test.go
Outdated
| // queued (first was processed, rest are stuck). | ||
| channelLen := len(syncer.gossipMsgs) | ||
| require.GreaterOrEqual( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic of this test suggests that after the first malformed reply is processed and returns an error, the channelGraphSyncer goroutine should exit. The subsequent 9 replies sent in the loop should therefore remain buffered in the gossipMsgs channel. This means the final length of syncer.gossipMsgs should be exactly 9. The current assertion require.GreaterOrEqual(t, channelLen, 7, ...) is a bit loose. Consider tightening it to require.Equal(t, 9, channelLen, ...) for a more precise test.
| // queued (first was processed, rest are stuck). | |
| channelLen := len(syncer.gossipMsgs) | |
| require.GreaterOrEqual( | |
| require.Equal(t, 9, channelLen, | |
| "expected 9 messages queued (goroutine exited), but found %d - "+ | |
| "goroutine may still be processing", channelLen) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a different idea, but either way works i guess!
| // understand, as we'll as responding to any other queries by | ||
| // them. | ||
| case syncingChans: | ||
| g.handleSyncingChans(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could instead (or in addition) just add a quick context check at the start of the for loop. I think that's an ok pattern to do in a for loop if it doesnt already have a select
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that was the alternative here: #10329 but I would say introducing the errors is the better way at I like the design better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this was my alternative, as IMO it got at the root issue in that we weren't checking errors for these calls to decide to exit the state machine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was also a bit easier to write consistent unit tests for as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM-pending CI - I think that approach works.
I am a bit hestiant regarding all the tests and their potential flakiness because the are very timing dependant, ,hope they do not introduce flakes into the CI.
| g.cg.Quit() | ||
| }) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while we are at it, could you also stop the iterator in this function here:
if !g.isSendingBacklog.CompareAndSwap(false, true) {
returnSema()
log.Debugf("GossipSyncer(%x): another goroutine already "+
"sending backlog, skipping", g.cfg.peerPub[:])
return nil
}
basically adding a stop() at the beginning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stop what iterator? It isn't in scope here.
| // Check how many send attempts were made. This verifies | ||
| // that the state handler doesn't loop endlessly. | ||
| attemptCount := errInj.getAttemptCount() | ||
| require.GreaterOrEqual( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this style of testing can introduce flakes when the timing and the counts are a bit off, probably not worth adding these kinda tests, the changes are very easy to understand which were made ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added them so we can make sure that the tests actually do something. I had a test that passed, but then turns out it didn't actually do anything, as the test survived some trivial mutations in the area that we had fixed.
I can drop the last commit with the additional tests if we want, those were some extra mutations I found with an automated tool I made.
8ef28f2 to
928b32b
Compare
|
Linter still complaining |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - Still Linter to fix tho
This commit fixes a critical bug where the channelGraphSyncer goroutine would enter an endless loop when context cancellation or peer disconnect errors occurred during the syncingChans or queryNewChannels states. The root cause was that state handler functions (handleSyncingChans and synchronizeChanIDs) did not return errors to the main goroutine loop. When these functions encountered fatal errors like context cancellation, they would log the error and return early without changing the syncer's state. This caused the main loop to immediately re-enter the same state handler, encounter the same error, and loop indefinitely while spamming error logs. The fix makes error handling explicit by having state handlers return errors. The main channelGraphSyncer loop now checks these errors and exits cleanly when fatal errors occur. We return any error (not just context cancellation) because fatal errors can manifest in multiple forms: context.Canceled, ErrGossipSyncerExiting from the rate limiter, lnpeer.ErrPeerExiting from Brontide, or network errors like connection closed. This approach matches the error handling pattern already used in other goroutines like replyHandler.
Add comprehensive test coverage to verify that state handler errors cause the channelGraphSyncer goroutine to exit cleanly without entering endless retry loops. These tests use mutation testing principles to ensure they would fail if the fixes were removed. TestGossipSyncerStateHandlerErrors is a table-driven test covering four scenarios: context cancellation and peer disconnect during syncingChans state, and context cancellation and network errors during queryNewChannels state. Each test case verifies both attempt count (no endless loop) and clean shutdown (no deadlock). TestGossipSyncerProcessChanRangeReplyError verifies that errors from processChanRangeReply in the waitingQueryRangeReply state cause clean exit. This test sends multiple malformed messages and checks that only the first is processed before the goroutine exits, using channel queue depth to detect if the goroutine is still running. All tests are race-detector clean and use mutation testing validation: removing any of the error return statements causes the corresponding tests to fail, confirming the tests properly verify the fixes.
928b32b to
adaa893
Compare
In this PR, we fix a potential infinite loop in the gossip syncer. The issue is that we don't check the errors that can be returned from methods like
handleSyncingChans. If we get a context cancelled or similar error, then the main loop will stay alive, rather than being torn down.We fix this by adding error checks, along with tests that demonstrate that without these the main loop will continue to live.
An alternative to #10329