We have a case where a local LDK node disconnects a remote peer (LND) on RAA timeout in order to restore channel operation and send an alert upstream. The issue is that, in our case, the disconnect does not achieve its main goal of restoring the channel, and the disconnect, reconnect, re-establish cycle continues indefinitely.
The root cause was a failure in the remote signer for the local node that caused one channel to get stuck. The remote signer came back online almost immediately and continued to provide signatures for CS/RAA messages, but LDK was unable to recover from the failed state of the stuck channel and did not request any new signatures. And because the node kept disconnecting, the balance was fluctuating, causing other services down the stack to be unreasonably busy.
LDK logs (most recent first):
(peer_id = 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa, channel_id = a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb): Disconnecting peer 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa due to not making any progress on channel a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb
... a minute before
(peer_id = 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa, channel_id = a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb): Handling channel resumption for channel a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb with no RAA, no commitment update, 0 pending forwards, 0 pending update_add_htlcs, not broadcasting funding, without channel ready, without announcement, without tx_signatures
(peer_id = 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa, channel_id = a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb): Generating channel update for channel a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb
(peer_id = 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa, channel_id = a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb): Attempting to generate channel update for channel a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb
(peer_id = 039174f846626c6053ba80f5443d0db33da384f1dde135bf7080ba1eec46501aaa, channel_id = a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb): Reconnected channel a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869bbb with lost outbound RAA and lost remote commitment tx, but unable to send due to resend order, waiting on signer for commitment update
Error in the remote node (LND):
ChannelLink(c4918671944d25f41b8cc7d4181d6c7b6011dda819daecbfc81ac14a37235bbb:1): received warning message from peer: chan_id=a65623374ac11ac8bfecda19a8dd11607b6c1d18d4c78c1bf4254d9471869aaa, err=Disconnecting due to timeout awaiting response
ChannelPoint(c4918671944d25f41b8cc7d4181d6c7b6011dda819daecbfc81ac14a37235bbb:1): pending remote commitment: (*lnwallet.commitment)(0x400bb1e480)({
The channel re-establishment works manually, so maybe we should be more proactive in querying the remote signer, instead of just "waiting on signer for commitment update".
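To make the "more proactive" idea concrete, here is a minimal sketch of retrying the signer for every channel that is still blocked on a signature. It is not LDK code: `RemoteSigner`, `SignerUnavailable`, `BlockedChannel`, `retry_blocked` and `FlakySigner` are all hypothetical names invented for illustration, and the returned bytes are placeholders rather than real signatures.

```rust
/// Hypothetical error: the remote signer could not be reached.
#[derive(Debug, PartialEq, Eq)]
pub struct SignerUnavailable;

/// Hypothetical stand-in for a remote signer (not an LDK trait).
pub trait RemoteSigner {
    fn sign_commitment(&mut self, commitment_number: u64) -> Result<Vec<u8>, SignerUnavailable>;
}

/// Hypothetical record of a channel waiting on a commitment signature.
#[derive(Debug)]
pub struct BlockedChannel {
    pub commitment_number: u64,
    pub signature: Option<Vec<u8>>,
}

/// Re-query the signer for every channel that still lacks a signature.
/// Returns how many channels were unblocked by this pass.
pub fn retry_blocked<S: RemoteSigner>(signer: &mut S, channels: &mut [BlockedChannel]) -> usize {
    let mut unblocked = 0;
    for chan in channels.iter_mut().filter(|c| c.signature.is_none()) {
        if let Ok(sig) = signer.sign_commitment(chan.commitment_number) {
            chan.signature = Some(sig);
            unblocked += 1;
        }
    }
    unblocked
}

/// Test double: fails a fixed number of times, then recovers.
pub struct FlakySigner {
    pub failures_left: u32,
}

impl RemoteSigner for FlakySigner {
    fn sign_commitment(&mut self, commitment_number: u64) -> Result<Vec<u8>, SignerUnavailable> {
        if self.failures_left > 0 {
            self.failures_left -= 1;
            Err(SignerUnavailable)
        } else {
            // Placeholder bytes, not a real signature.
            Ok(vec![commitment_number as u8])
        }
    }
}
```

Calling something like `retry_blocked` on a timer or on reconnect, instead of only waiting for the signer to report back, would let a channel recover once the signer is reachable again.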
yellowred changed the title from "Spurious disconnects when can not make progress on a channel" to "Spurious disconnect loop when a channel is stuck" on Mar 31, 2025
It does seem like something that should be fixed as a part of the async signing logic - if we're stuck waiting for an async signing operation we shouldn't "blame the peer" and disconnect, we should just keep going and maybe log an issue.
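As a rough illustration of not "blaming the peer", the timeout check could distinguish why the channel is stalled before deciding to disconnect. This is only a sketch under assumed names: `StallCause`, `TimeoutAction`, `on_timer_tick` and the `max_ticks` parameter are hypothetical and do not reflect LDK's actual internals.

```rust
/// Hypothetical reason a channel has made no progress.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StallCause {
    /// We sent a message and the peer has not answered yet.
    AwaitingPeerResponse,
    /// Our own async signer has not produced a signature yet.
    AwaitingLocalSigner,
}

/// Hypothetical action to take on a timer tick.
#[derive(Debug, PartialEq, Eq)]
pub enum TimeoutAction {
    KeepWaiting,
    LogAndKeepWaiting,
    DisconnectPeer,
}

/// Decide what to do for a channel on a timer tick.
/// `cause` is `None` when the channel is not stalled at all.
pub fn on_timer_tick(cause: Option<StallCause>, ticks_stalled: u32, max_ticks: u32) -> TimeoutAction {
    match cause {
        None => TimeoutAction::KeepWaiting,
        // Still within the grace period, whatever the cause.
        Some(_) if ticks_stalled < max_ticks => TimeoutAction::KeepWaiting,
        // The stall is on our side: disconnecting the peer cannot help.
        Some(StallCause::AwaitingLocalSigner) => TimeoutAction::LogAndKeepWaiting,
        // The peer really is unresponsive: disconnect as before.
        Some(StallCause::AwaitingPeerResponse) => TimeoutAction::DisconnectPeer,
    }
}
```

With this split, a stall caused by the local signer would produce a log entry instead of a disconnect, while an unresponsive peer is still disconnected after `max_ticks`.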
@wpaulino pointed out that even if we don't disconnect in this case, our peer should. Maybe they won't, so we can still fix it on our side, but in general they should, so really this needs to be fixed by not having the async signer stall.