Skip to content

Conversation

@Roasbeef
Copy link
Member

In this PR, we fix a potential infinite loop in the gossip syncer. The issue is that we don't check the errors that can be returned from methods like handleSyncingChans. If we get a context cancelled or similar error, then the main loop will stay alive, rather than being torn down.

We fix this by adding error checks, along with tests that demonstrate that without these the main loop will continue to live.

An alternative to #10329

@gemini-code-assist
Copy link

Summary of Changes

Hello @Roasbeef, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical stability issue in the GossipSyncer where the main synchronization goroutine could enter an infinite loop if underlying operations failed due to context cancellation or other network-related errors. The changes introduce robust error propagation and handling mechanisms in key state-transition functions, ensuring that the syncer goroutine terminates gracefully when encountering unrecoverable errors. This significantly improves the reliability and resource management of the gossip discovery process.

Highlights

  • Infinite Loop Fix: Addressed a potential infinite loop bug within the GossipSyncer component by ensuring that context cancellation or other fatal errors are properly checked and handled, leading to a graceful goroutine exit.
  • Error Handling Improvements: Modified handleSyncingChans and synchronizeChanIDs functions to return errors, allowing their callers to detect and react to fatal conditions, preventing the main syncer loop from persisting indefinitely.
  • Comprehensive Testing: Introduced new testing utilities, including an errorInjector and assertSyncerExitsCleanly helper, along with new test cases (TestGossipSyncerStateHandlerErrors, TestGossipSyncerProcessChanRangeReplyError) to rigorously verify that the GossipSyncer goroutine exits cleanly under various error scenarios.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Roasbeef Roasbeef force-pushed the fix-error-handling-gossiper branch from d33971d to 87b5c3d Compare October 29, 2025 20:00
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a potential infinite loop in the gossip syncer by ensuring that fatal errors, such as context cancellation, are properly propagated up from state handlers to terminate the main loop. The changes to handleSyncingChans and synchronizeChanIDs to return errors, and the handling of these errors in channelGraphSyncer, are correct and well-implemented. The addition of comprehensive tests, including an errorInjector for controlled error simulation and specific test cases for different failure modes, is excellent and provides strong confidence in the fix. I have one minor suggestion to improve the precision of an assertion in a new test.

Comment on lines 2704 to 2706
// queued (first was processed, rest are stuck).
channelLen := len(syncer.gossipMsgs)
require.GreaterOrEqual(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic of this test suggests that after the first malformed reply is processed and returns an error, the channelGraphSyncer goroutine should exit. The subsequent 9 replies sent in the loop should therefore remain buffered in the gossipMsgs channel. This means the final length of syncer.gossipMsgs should be exactly 9. The current assertion require.GreaterOrEqual(t, channelLen, 7, ...) is a bit loose. Consider tightening it to require.Equal(t, 9, channelLen, ...) for a more precise test.

Suggested change
// queued (first was processed, rest are stuck).
channelLen := len(syncer.gossipMsgs)
require.GreaterOrEqual(
require.Equal(t, 9, channelLen,
"expected 9 messages queued (goroutine exited), but found %d - "+
"goroutine may still be processing", channelLen)

Copy link
Collaborator

@ellemouton ellemouton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a different idea, but either way works i guess!

// understand, as we'll as responding to any other queries by
// them.
case syncingChans:
g.handleSyncingChans(ctx)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could instead (or in addition) just add a quick context check at the start of the for loop. I think that's an ok pattern to do in a for loop if it doesnt already have a select

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that was the alternative here: #10329 but I would say introducing the errors is the better way at I like the design better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this was my alternative, as IMO it got at the root issue in that we weren't checking errors for these calls to decide to exit the state machine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was also a bit easier to write consistent unit tests for as well.

@saubyk saubyk added this to lnd v0.20 Oct 31, 2025
@saubyk saubyk moved this to In review in lnd v0.20 Oct 31, 2025
@saubyk saubyk added this to the v0.20.0 milestone Oct 31, 2025
Copy link
Collaborator

@ziggie1984 ziggie1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM-pending CI - I think that approach works.

I am a bit hestiant regarding all the tests and their potential flakiness because the are very timing dependant, ,hope they do not introduce flakes into the CI.

g.cg.Quit()
})
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while we are at it, could you also stop the iterator in this function here:

if !g.isSendingBacklog.CompareAndSwap(false, true) {
		returnSema()
		log.Debugf("GossipSyncer(%x): another goroutine already "+
			"sending backlog, skipping", g.cfg.peerPub[:])

		return nil
	}

basically adding a stop() at the beginning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stop what iterator? It isn't in scope here.

// Check how many send attempts were made. This verifies
// that the state handler doesn't loop endlessly.
attemptCount := errInj.getAttemptCount()
require.GreaterOrEqual(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this style of testing can introduce flakes when the timing and the counts are a bit off, probably not worth adding these kinda tests, the changes are very easy to understand which were made ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added them so we can make sure that the tests actually do something. I had a test that passed, but then turns out it didn't actually do anything, as the test survived some trivial mutations in the area that we had fixed.

I can drop the last commit with the additional tests if we want, those were some extra mutations I found with an automated tool I made.

@Roasbeef Roasbeef force-pushed the fix-error-handling-gossiper branch 2 times, most recently from 8ef28f2 to 928b32b Compare October 31, 2025 19:08
@Roasbeef Roasbeef requested a review from ziggie1984 October 31, 2025 19:08
@ziggie1984
Copy link
Collaborator

Linter still complaining

Copy link
Collaborator

@ziggie1984 ziggie1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Still Linter to fix tho

This commit fixes a critical bug where the channelGraphSyncer goroutine
would enter an endless loop when context cancellation or peer disconnect
errors occurred during the syncingChans or queryNewChannels states.

The root cause was that state handler functions (handleSyncingChans and
synchronizeChanIDs) did not return errors to the main goroutine loop.
When these functions encountered fatal errors like context cancellation,
they would log the error and return early without changing the syncer's
state. This caused the main loop to immediately re-enter the same state
handler, encounter the same error, and loop indefinitely while spamming
error logs.

The fix makes error handling explicit by having state handlers return
errors. The main channelGraphSyncer loop now checks these errors and
exits cleanly when fatal errors occur. We return any error (not just
context cancellation) because fatal errors can manifest in multiple
forms: context.Canceled, ErrGossipSyncerExiting from the rate limiter,
lnpeer.ErrPeerExiting from Brontide, or network errors like connection
closed. This approach matches the error handling pattern already used in
other goroutines like replyHandler.
Add comprehensive test coverage to verify that state handler errors cause
the channelGraphSyncer goroutine to exit cleanly without entering endless
retry loops. These tests use mutation testing principles to ensure they
would fail if the fixes were removed.

TestGossipSyncerStateHandlerErrors is a table-driven test covering four
scenarios: context cancellation and peer disconnect during syncingChans
state, and context cancellation and network errors during queryNewChannels
state. Each test case verifies both attempt count (no endless loop) and
clean shutdown (no deadlock).

TestGossipSyncerProcessChanRangeReplyError verifies that errors from
processChanRangeReply in the waitingQueryRangeReply state cause clean
exit. This test sends multiple malformed messages and checks that only
the first is processed before the goroutine exits, using channel queue
depth to detect if the goroutine is still running.

All tests are race-detector clean and use mutation testing validation:
removing any of the error return statements causes the corresponding
tests to fail, confirming the tests properly verify the fixes.
@Roasbeef Roasbeef force-pushed the fix-error-handling-gossiper branch from 928b32b to adaa893 Compare November 3, 2025 18:44
@Roasbeef Roasbeef enabled auto-merge November 3, 2025 18:46
@Roasbeef Roasbeef merged commit 4131a8e into lightningnetwork:master Nov 3, 2025
34 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in lnd v0.20 Nov 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants