Spurious disconnect loop when a channel is stuck #3695

Open
yellowred opened this issue Mar 31, 2025 · 3 comments

Comments

@yellowred
Contributor

We have a case where a local LDK node disconnects a remote peer (LND) on an RAA timeout in order to restore channel operation and send an alert upstream. The issue is that, in our case, the disconnect does not achieve its main goal of restoring the channel, and the disconnect, reconnect, re-establish cycle continues indefinitely.

The root cause was a failure in the local node's remote signer that caused one channel to get stuck. The remote signer came back online almost immediately and continued to provide signatures for CS/RAA messages, but LDK was unable to recover from the failed state of the stuck channel and did not request any new signatures. And because the node kept disconnecting, the balance fluctuated, causing other services down the stack to be unreasonably busy.

LDK logs (newest first):

(peer_id = 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa, channel_id = a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb): Disconnecting peer 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa due to not making any progress on channel a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb

... a minute before

(peer_id = 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa, channel_id = a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb): Handling channel resumption for channel a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb with no RAA, no commitment update, 0 pending forwards, 0 pending update_add_htlcs, not broadcasting funding, without channel ready, without announcement, without tx_signatures

(peer_id = 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa, channel_id = a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb): Generating channel update for channel a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb

(peer_id = 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa, channel_id = a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb): Attempting to generate channel update for channel a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb

(peer_id = 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa, channel_id = a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb): Reconnected channel a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb with lost outbound RAA and lost remote commitment tx, but unable to send due to resend order, waiting on signer for commitment update

Error in the remote node (LND):

ChannelLink(c4918671944d25f41b8cc7d4181d6c7b6011dda819daecbfc81ac14a37235bbb:1): received warning message from peer: chan_id=a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869aaa, err=Disconnecting due to timeout awaiting response
ChannelPoint(c4918671944d25f41b8cc7d4181d6c7b6011dda819daecbfc81ac14a37235bbb:1): pending remote commitment: (*lnwallet.commitment)(0x400bb1e480)({

Channel re-establishment works when done manually, so maybe we should be more proactive in querying the remote signer instead of just waiting on the signer for a commitment update.
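
To make the "more proactive" idea concrete, here is a minimal sketch of a retry pass over channels that are waiting on the signer. Everything in it (`StuckChannel`, `RemoteSigner`, `FlakySigner`, `retry_pending_signatures`, `retry_delay`) is a hypothetical stand-in written for illustration, not LDK's API, and the backoff values are assumptions.

```rust
use std::time::Duration;

/// Hypothetical view of a channel that may be blocked on a remote signer.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StuckChannel {
    pub channel_id: String,
    pub waiting_on_signer: bool,
}

/// Hypothetical signer interface (not LDK's signer trait).
pub trait RemoteSigner {
    /// Returns `Some(signature)` once the signer can serve the request.
    fn sign_commitment(&mut self, channel_id: &str) -> Option<Vec<u8>>;
}

/// Test double: fails a fixed number of times, then starts signing.
pub struct FlakySigner {
    pub failures_left: u32,
}

impl RemoteSigner for FlakySigner {
    fn sign_commitment(&mut self, _channel_id: &str) -> Option<Vec<u8>> {
        if self.failures_left > 0 {
            self.failures_left -= 1;
            None
        } else {
            Some(vec![0u8; 64]) // placeholder bytes, not a real signature
        }
    }
}

/// Re-request signatures for every channel still waiting on the signer.
/// Returns the ids of the channels that were unblocked by this pass.
pub fn retry_pending_signatures<S: RemoteSigner>(
    signer: &mut S,
    channels: &mut [StuckChannel],
) -> Vec<String> {
    let mut unblocked = Vec::new();
    for chan in channels.iter_mut().filter(|c| c.waiting_on_signer) {
        if signer.sign_commitment(&chan.channel_id).is_some() {
            chan.waiting_on_signer = false;
            unblocked.push(chan.channel_id.clone());
        }
    }
    unblocked
}

/// Exponential backoff between retry passes: 500 ms doubling up to 32 s.
/// The base and cap are assumed values.
pub fn retry_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    Duration::from_millis(base_ms << attempt.min(6))
}
```

A caller would run `retry_pending_signatures` on a timer, sleeping `retry_delay(attempt)` between passes, so a signer that recovers quickly is picked up without needing a reconnect to trigger the request.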

@yellowred changed the title from "Spurious disconnects when can not make progress on a channel" to "Spurious disconnect loop when a channel is stuck" on Mar 31, 2025

@alecchendev
Contributor

This may be something we can fix on our side, still figuring it out.

@TheBlueMatt
Collaborator

It does seem like something that should be fixed as a part of the async signing logic - if we're stuck waiting for an async signing operation we shouldn't "blame the peer" and disconnect, we should just keep going and maybe log an issue.
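
A rough sketch of that decision, assuming the timeout check can tell whether the channel is blocked on the peer or on the local signer. The names here (`Blocker`, `TimeoutAction`, `MAX_TICKS_AWAITING_RESPONSE`, `on_timer_tick`) are hypothetical and the tick limit is an assumed value; this is not how LDK structures it internally.

```rust
/// Hypothetical: what the channel is currently waiting on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Blocker {
    /// We sent a message and the peer has not responded.
    Peer,
    /// Our own async signer has not returned a signature yet.
    LocalSigner,
}

/// Hypothetical: what the timer handler should do for this channel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeoutAction {
    KeepWaiting,
    DisconnectPeer,
    LogSignerStall,
}

/// Assumed number of timer ticks to wait before acting.
pub const MAX_TICKS_AWAITING_RESPONSE: u32 = 2;

/// Only "blame the peer" when the peer is actually what we are waiting on.
pub fn on_timer_tick(blocker: Blocker, ticks_waiting: u32) -> TimeoutAction {
    if ticks_waiting < MAX_TICKS_AWAITING_RESPONSE {
        return TimeoutAction::KeepWaiting;
    }
    match blocker {
        Blocker::Peer => TimeoutAction::DisconnectPeer,
        // Disconnecting cannot help here: the stall is on our side.
        Blocker::LocalSigner => TimeoutAction::LogSignerStall,
    }
}
```

With this split, a stalled signer produces a log entry or alert rather than a reconnect cycle, while a genuinely unresponsive peer is still disconnected after the same number of ticks.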

@TheBlueMatt
Collaborator

@wpaulino pointed out that even if we don't disconnect in this case, our peer should. Maybe they won't, so maybe we can still fix it, but in general they should, so really this needs to be fixed by not having the async signer stall.