discovery: fix potential infinite loop bug re context cancel error handling in gossip syncer #10330

Roasbeef · 2025-10-29T19:59:07Z

In this PR, we fix a potential infinite loop in the gossip syncer. The issue is that we don't check the errors that can be returned from methods like handleSyncingChans. If we get a context cancelled or similar error, then the main loop will stay alive, rather than being torn down.

We fix this by adding error checks, along with tests that demonstrate that without these the main loop will continue to live.
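To make the failure mode concrete, here is a minimal sketch of a loop that drops a handler's error versus one that checks it. It is not the lnd code: `sendQuery`, `runLoop`, `demo`, and the `maxCalls` cap are hypothetical names used only for illustration, and the cap exists purely so the buggy variant terminates.

```go
package main

import (
	"context"
	"fmt"
)

// sendQuery is a hypothetical stand-in for a state handler such as
// handleSyncingChans: it fails once the context has been cancelled.
func sendQuery(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("unable to send query: %w", err)
	}
	return nil
}

// runLoop drives a one-state loop. With checkErr set, a handler error tears
// the loop down. Without it, the error is dropped and the same state is
// re-entered until maxCalls is reached (the cap is only there so that this
// sketch terminates).
func runLoop(ctx context.Context, checkErr bool, maxCalls int) (int, error) {
	calls := 0
	for calls < maxCalls {
		calls++

		err := sendQuery(ctx)
		if err == nil {
			// Success would normally advance the state machine.
			return calls, nil
		}
		if checkErr {
			return calls, err
		}

		// Buggy variant: error ignored, state unchanged, loop spins.
	}

	return calls, nil
}

// demo runs both variants against an already-cancelled context.
func demo() (fixedCalls, buggyCalls int, fixedErr error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	fixedCalls, fixedErr = runLoop(ctx, true, 100)
	buggyCalls, _ = runLoop(ctx, false, 100)

	return fixedCalls, buggyCalls, fixedErr
}

func main() {
	fixed, buggy, err := demo()
	fmt.Printf("fixed: %d call(s), err=%v\n", fixed, err)
	fmt.Printf("buggy: %d call(s) before the cap\n", buggy)
}
```

With the check in place the loop exits after a single handler call. Without it, the handler is called until the artificial cap is reached.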

An alternative to #10329

gemini-code-assist · 2025-10-29T19:59:22Z

Summary of Changes

Hello @Roasbeef, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical stability issue in the GossipSyncer where the main synchronization goroutine could enter an infinite loop if underlying operations failed due to context cancellation or other network-related errors. The changes introduce robust error propagation and handling mechanisms in key state-transition functions, ensuring that the syncer goroutine terminates gracefully when encountering unrecoverable errors. This significantly improves the reliability and resource management of the gossip discovery process.

Highlights

Infinite Loop Fix: Addressed a potential infinite loop bug within the GossipSyncer component by ensuring that context cancellation or other fatal errors are properly checked and handled, leading to a graceful goroutine exit.
Error Handling Improvements: Modified handleSyncingChans and synchronizeChanIDs functions to return errors, allowing their callers to detect and react to fatal conditions, preventing the main syncer loop from persisting indefinitely.
Comprehensive Testing: Introduced new testing utilities, including an errorInjector and assertSyncerExitsCleanly helper, along with new test cases (TestGossipSyncerStateHandlerErrors, TestGossipSyncerProcessChanRangeReplyError) to rigorously verify that the GossipSyncer goroutine exits cleanly under various error scenarios.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request effectively addresses a potential infinite loop in the gossip syncer by ensuring that fatal errors, such as context cancellation, are properly propagated up from state handlers to terminate the main loop. The changes to handleSyncingChans and synchronizeChanIDs to return errors, and the handling of these errors in channelGraphSyncer, are correct and well-implemented. The addition of comprehensive tests, including an errorInjector for controlled error simulation and specific test cases for different failure modes, is excellent and provides strong confidence in the fix. I have one minor suggestion to improve the precision of an assertion in a new test.

gemini-code-assist · 2025-10-29T20:01:17Z

discovery/syncer_test.go

+	// queued (first was processed, rest are stuck).
+	channelLen := len(syncer.gossipMsgs)
+	require.GreaterOrEqual(


The logic of this test suggests that after the first malformed reply is processed and returns an error, the channelGraphSyncer goroutine should exit. The subsequent 9 replies sent in the loop should therefore remain buffered in the gossipMsgs channel. This means the final length of syncer.gossipMsgs should be exactly 9. The current assertion require.GreaterOrEqual(t, channelLen, 7, ...) is a bit loose. Consider tightening it to require.Equal(t, 9, channelLen, ...) for a more precise test.

Suggested change

 	// queued (first was processed, rest are stuck).
 	channelLen := len(syncer.gossipMsgs)
-	require.GreaterOrEqual(
+	require.Equal(t, 9, channelLen,
+		"expected 9 messages queued (goroutine exited), but found %d - "+
+		"goroutine may still be processing", channelLen)
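The reasoning behind the exact count can be sketched in isolation. This is a rough illustration, not the test from the PR: `consume`, `validate`, `queuedAfterExit`, and `errMalformed` are hypothetical names. The sketch waits for the consumer to exit before reading the channel length, which is what makes an exact assertion safe. Reading the length earlier could still observe all ten messages.

```go
package main

import (
	"errors"
	"fmt"
)

// errMalformed is a hypothetical error for a reply that fails validation.
var errMalformed = errors.New("malformed reply")

// validate rejects negative values, standing in for a malformed reply.
func validate(reply int) error {
	if reply < 0 {
		return errMalformed
	}
	return nil
}

// consume reads replies until one fails validation, then reports the error
// on done and exits, leaving any remaining replies buffered.
func consume(replies <-chan int, done chan<- error) {
	for reply := range replies {
		if err := validate(reply); err != nil {
			done <- err
			return
		}
	}
	done <- nil
}

// queuedAfterExit sends n malformed replies (n must be at least 1) into a
// buffered channel, waits for the consumer to exit, and returns how many
// replies are still queued.
func queuedAfterExit(n int) int {
	replies := make(chan int, n)
	done := make(chan error, 1)
	go consume(replies, done)

	// The buffer holds n items, so these sends never block.
	for i := 0; i < n; i++ {
		replies <- -1
	}

	// Wait for the consumer to exit before inspecting the queue.
	<-done

	return len(replies)
}

func main() {
	fmt.Println(queuedAfterExit(10)) // prints 9
}
```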

ellemouton

Left a different idea, but either way works I guess!

ellemouton · 2025-10-31T05:37:31Z

discovery/syncer.go

 		// understand, as we'll as responding to any other queries by
 		// them.
 		case syncingChans:
-			g.handleSyncingChans(ctx)


Could instead (or in addition) just add a quick context check at the start of the for loop. I think that's an OK pattern in a for loop if it doesn't already have a select.

Yeah, that was the alternative here: #10329, but I would say introducing the errors is the better way, as I like the design better.

Yeah this was my alternative, as IMO it got at the root issue in that we weren't checking errors for these calls to decide to exit the state machine.

This was also a bit easier to write consistent unit tests for as well.
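For comparison, the context-check alternative discussed in this thread might look roughly like the sketch below. It is an illustration only, not the code from #10329: `loopWithCtxCheck`, `demoCtxCheck`, and the `maxIters` cap are hypothetical. Note that this pattern only notices context cancellation. A handler that fails for another reason is not caught by it.

```go
package main

import (
	"context"
	"fmt"
)

// loopWithCtxCheck calls step repeatedly, checking the context at the top of
// every iteration. It returns the number of completed steps and the context
// error, if any. The maxIters cap only bounds this sketch.
func loopWithCtxCheck(ctx context.Context, step func(),
	maxIters int) (int, error) {

	for i := 0; i < maxIters; i++ {
		// Non-blocking check, useful when the loop body has no
		// select on ctx.Done().
		if err := ctx.Err(); err != nil {
			return i, err
		}

		step()
	}

	return maxIters, nil
}

// demoCtxCheck cancels the context during the third step and reports when
// the loop noticed.
func demoCtxCheck() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	steps := 0
	return loopWithCtxCheck(ctx, func() {
		steps++
		if steps == 3 {
			cancel()
		}
	}, 100)
}

func main() {
	n, err := demoCtxCheck()
	fmt.Printf("completed %d steps, err=%v\n", n, err)
}
```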

ziggie1984

LGTM-pending CI - I think that approach works.

I am a bit hesitant regarding all the tests and their potential flakiness because they are very timing dependent. Hope they do not introduce flakes into the CI.

ziggie1984 · 2025-10-31T11:28:57Z

discovery/syncer.go

 		g.cg.Quit()
 	})
 }



while we are at it, could you also stop the iterator in this function here:

	if !g.isSendingBacklog.CompareAndSwap(false, true) {
		returnSema()
		log.Debugf("GossipSyncer(%x): another goroutine already "+
			"sending backlog, skipping", g.cfg.peerPub[:])
		return nil
	}

basically adding a stop() at the beginning

Stop what iterator? It isn't in scope here.

ziggie1984 · 2025-10-31T14:06:07Z

discovery/syncer_test.go

+			// Check how many send attempts were made. This verifies
+			// that the state handler doesn't loop endlessly.
+			attemptCount := errInj.getAttemptCount()
+			require.GreaterOrEqual(


I wonder if this style of testing can introduce flakes when the timing and the counts are a bit off. It's probably not worth adding these kinds of tests, since the changes that were made are very easy to understand?

I added them so we can make sure that the tests actually do something. I had a test that passed, but it turned out it didn't actually do anything, as it survived some trivial mutations in the area that we had fixed.

I can drop the last commit with the additional tests if we want, those were some extra mutations I found with an automated tool I made.

ziggie1984 · 2025-10-31T21:44:13Z

Linter still complaining

ziggie1984

LGTM - Still Linter to fix tho

This commit fixes a critical bug where the channelGraphSyncer goroutine would enter an endless loop when context cancellation or peer disconnect errors occurred during the syncingChans or queryNewChannels states.

The root cause was that state handler functions (handleSyncingChans and synchronizeChanIDs) did not return errors to the main goroutine loop. When these functions encountered fatal errors like context cancellation, they would log the error and return early without changing the syncer's state. This caused the main loop to immediately re-enter the same state handler, encounter the same error, and loop indefinitely while spamming error logs.

The fix makes error handling explicit by having state handlers return errors. The main channelGraphSyncer loop now checks these errors and exits cleanly when fatal errors occur. We return any error (not just context cancellation) because fatal errors can manifest in multiple forms: context.Canceled, ErrGossipSyncerExiting from the rate limiter, lnpeer.ErrPeerExiting from Brontide, or network errors like connection closed. This approach matches the error handling pattern already used in other goroutines like replyHandler.
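The choice to exit on any error, rather than only on context cancellation, can be illustrated with a small sketch. The error values below are hypothetical stand-ins declared locally, not the real lnd errors, and `spins`, `onlyCancel`, `anyError`, and `demoPolicies` are names made up for this example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the non-context errors named above.
var (
	errSyncerExiting = errors.New("gossip syncer exiting")
	errPeerExiting   = errors.New("peer exiting")
)

// onlyCancel exits solely on context cancellation.
func onlyCancel(err error) bool {
	return errors.Is(err, context.Canceled)
}

// anyError exits on every non-nil error.
func anyError(err error) bool {
	return err != nil
}

// spins counts how often a handler that keeps failing with err is called
// before the loop exits under the given policy, capped at limit so that the
// sketch terminates.
func spins(err error, shouldExit func(error) bool, limit int) int {
	calls := 0
	for calls < limit {
		calls++
		if shouldExit(err) {
			break
		}
	}

	return calls
}

// demoPolicies compares both policies for a wrapped peer-exit error.
func demoPolicies() (narrow, broad int) {
	err := fmt.Errorf("send failed: %w", errPeerExiting)
	return spins(err, onlyCancel, 5), spins(err, anyError, 5)
}

func main() {
	errs := []error{context.Canceled, errSyncerExiting, errPeerExiting}
	for _, err := range errs {
		fmt.Printf("%v: cancel-only=%d any-error=%d\n", err,
			spins(err, onlyCancel, 5), spins(err, anyError, 5))
	}
}
```

Under the cancel-only policy, the peer-exit error keeps the loop spinning until the cap, while the any-error policy exits on the first call.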

Add comprehensive test coverage to verify that state handler errors cause the channelGraphSyncer goroutine to exit cleanly without entering endless retry loops. These tests use mutation testing principles to ensure they would fail if the fixes were removed.

TestGossipSyncerStateHandlerErrors is a table-driven test covering four scenarios: context cancellation and peer disconnect during syncingChans state, and context cancellation and network errors during queryNewChannels state. Each test case verifies both attempt count (no endless loop) and clean shutdown (no deadlock).

TestGossipSyncerProcessChanRangeReplyError verifies that errors from processChanRangeReply in the waitingQueryRangeReply state cause clean exit. This test sends multiple malformed messages and checks that only the first is processed before the goroutine exits, using channel queue depth to detect if the goroutine is still running.

All tests are race-detector clean and use mutation testing validation: removing any of the error return statements causes the corresponding tests to fail, confirming the tests properly verify the fixes.
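The two checks described above (attempt count and clean shutdown) could be sketched as follows. The names errorInjector and getAttemptCount come from the PR, but the bodies here are assumptions rather than the actual test helpers, and `runUntilError`, `exitsWithin`, and `demoInjector` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// errorInjector fails every send with a fixed error and counts attempts.
// The shape of this type is assumed, not copied from the PR.
type errorInjector struct {
	err      error
	attempts atomic.Int32
}

// send records one attempt and returns the injected error.
func (e *errorInjector) send() error {
	e.attempts.Add(1)
	return e.err
}

// getAttemptCount reports how many sends were attempted so far.
func (e *errorInjector) getAttemptCount() int {
	return int(e.attempts.Load())
}

// runUntilError is a hypothetical goroutine body that exits on the first
// send error and signals its exit by closing done.
func runUntilError(send func() error, done chan<- struct{}) {
	defer close(done)
	for {
		if err := send(); err != nil {
			return
		}
	}
}

// exitsWithin reports whether done was closed before the timeout elapsed.
func exitsWithin(done <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

// demoInjector runs the goroutine against a failing injector and returns
// whether it exited, plus the number of send attempts it made.
func demoInjector() (exited bool, attempts int) {
	inj := &errorInjector{err: errors.New("injected failure")}
	done := make(chan struct{})
	go runUntilError(inj.send, done)

	exited = exitsWithin(done, 2*time.Second)

	return exited, inj.getAttemptCount()
}

func main() {
	exited, attempts := demoInjector()
	fmt.Printf("exited=%v attempts=%d\n", exited, attempts)
}
```

Reading the attempt count only after the goroutine has signalled its exit keeps the count deterministic in this sketch, which is one way to limit the timing sensitivity raised during review.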

Roasbeef requested a review from ziggie1984 October 29, 2025 19:59

Roasbeef added the bug fix label Oct 29, 2025

Roasbeef force-pushed the fix-error-handling-gossiper branch from d33971d to 87b5c3d Compare October 29, 2025 20:00

gemini-code-assist bot reviewed Oct 29, 2025

saubyk assigned Roasbeef Oct 29, 2025

saubyk requested a review from ellemouton October 30, 2025 14:18

ellemouton approved these changes Oct 31, 2025

saubyk added this to lnd v0.20 Oct 31, 2025

saubyk moved this to In review in lnd v0.20 Oct 31, 2025

saubyk added this to the v0.20.0 milestone Oct 31, 2025

ziggie1984 reviewed Oct 31, 2025

Roasbeef force-pushed the fix-error-handling-gossiper branch 2 times, most recently from 8ef28f2 to 928b32b Compare October 31, 2025 19:08

Roasbeef requested a review from ziggie1984 October 31, 2025 19:08

ziggie1984 mentioned this pull request Nov 1, 2025

Gossiper - Bugfix context check #10329

Closed

ziggie1984 approved these changes Nov 3, 2025

Roasbeef added 3 commits November 3, 2025 10:44

docs: add release notes entry

adaa893

Roasbeef force-pushed the fix-error-handling-gossiper branch from 928b32b to adaa893 Compare November 3, 2025 18:44

Roasbeef enabled auto-merge November 3, 2025 18:46

Roasbeef merged commit 4131a8e into lightningnetwork:master Nov 3, 2025
34 checks passed

github-project-automation bot moved this from In review to Done in lnd v0.20 Nov 3, 2025
