[Fix] Harden the router's resolver #3540

Draft · wants to merge 6 commits into base: staging
Conversation

@ljedrz (Collaborator) commented Mar 14, 2025

While investigating a potential issue with some trusted peers being periodically dropped, I've noticed a lot of instances of "Unable to resolve the (...) address" in the log extracts from different networks. I believe most of them are triggered unnecessarily, but we need to be certain, and this PR aims to address that.

The proposed changes are as follows:

  • 1e8dc49 - changes the dual-lock setup of the resolver to a single-lock one in order to rule out any possibility of a mismatch between the address maps; it should also slightly improve performance (see the sketch after this list)
  • 260e84b - the inbound method is fed from a lower-level queue that has no awareness of the address resolver, so entries that fail to resolve there are essentially guaranteed to be post-disconnect "stragglers" and may be ignored (instead of triggering potentially many redundant disconnect attempts, which result in further resolver-related warnings)
  • 6bb8a74 - swaps the order of disconnect-related operations, altering the resolver only after a peer is no longer marked as connected; this avoids situations where an outbound message is greenlit to be sent to a peer (still marked as connected) only to fail at address resolution right afterwards, triggering a bogus warning (also covered in the sketch below)
  • bccf29a - a loosely related drive-by: we should clear any peer-related cache entries before marking the peer as a candidate for connections, in order to avoid a (highly unlikely) scenario where the peer is reconnected to while still having outdated cache entries, or even has new and applicable cache entries cleared
  • 7ee66b4 - when a peer sends us a Message::Disconnect, we shouldn't report it as a protocol violation; this is mostly a cleanup of one or two misleading logs
  • b673d7b - since I've seen some instances of the heartbeat process reporting lingering inactive peers, we should have a fallback cleanup of high-level connection artifacts in case the resolver can't find the physically connected address
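
As a rough illustration of the first and third items (1e8dc49 and 6bb8a74), here is a minimal self-contained sketch; every name except resolve_to_listener is hypothetical, and std::sync::RwLock stands in for whatever lock type the codebase actually uses:

    use std::{
        collections::{HashMap, HashSet},
        net::SocketAddr,
        sync::RwLock,
    };

    #[derive(Default)]
    struct ResolverMaps {
        to_listener: HashMap<SocketAddr, SocketAddr>,  // connected addr -> listener addr
        to_connected: HashMap<SocketAddr, SocketAddr>, // listener addr -> connected addr
    }

    #[derive(Default)]
    struct Router {
        // A single lock guards both directions of the mapping, so they can
        // never be observed in a mismatched state (the 1e8dc49 idea).
        resolver: RwLock<ResolverMaps>,
        connected_peers: RwLock<HashSet<SocketAddr>>,
    }

    impl Router {
        fn insert_peer(&self, listener: SocketAddr, connected: SocketAddr) {
            // One write lock updates both maps atomically.
            let mut maps = self.resolver.write().unwrap();
            maps.to_listener.insert(connected, listener);
            maps.to_connected.insert(listener, connected);
            self.connected_peers.write().unwrap().insert(listener);
        }

        fn resolve_to_listener(&self, connected: &SocketAddr) -> Option<SocketAddr> {
            self.resolver.read().unwrap().to_listener.get(connected).copied()
        }

        fn disconnect_peer(&self, listener: SocketAddr) {
            // Unmark the peer as connected *first*, so no outbound message is
            // greenlit for it past this point (the 6bb8a74 idea)...
            self.connected_peers.write().unwrap().remove(&listener);
            // ...and only then drop the resolver entries, so a greenlit message
            // can no longer fail address resolution and log a bogus warning.
            let mut maps = self.resolver.write().unwrap();
            if let Some(connected) = maps.to_connected.remove(&listener) {
                maps.to_listener.remove(&connected);
            }
        }
    }

    fn main() {
        let router = Router::default();
        let listener = "127.0.0.1:4130".parse().unwrap();
        let connected = "127.0.0.1:55555".parse().unwrap();
        router.insert_peer(listener, connected);
        assert_eq!(router.resolve_to_listener(&connected), Some(listener));
        router.disconnect_peer(listener);
        assert_eq!(router.resolve_to_listener(&connected), None);
    }

The point is that both maps sit behind one guard, and that the connected-peers set is updated before the resolver on disconnect.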

Filing as a draft for now, as I'm still looking for potential related issues in the logs.

Cc @zkxuerb

@niklaslong (Collaborator) left a comment

A nice tightening up of peer tracking! Did a first pass and the current changeset looks good 👍

Comment on lines +68 to +71:

    None => {
        // No longer connected to the peer.
        return Ok(());
    }
@howardwu (Member) commented Mar 26, 2025
@ljedrz @raychu86 can you double check this? Ok(()) maintains the connection

A collaborator replied:
The bail leads back to calling self.router().resolve_to_listener(&peer_addr), so this logic doesn't actually change any real behavior (except for skipping a log message).

A contributor replied:

I agree with the analysis: because resolve_to_listener(&peer_addr) just returned None, returning Ok(()) instead of bail!(..) from this point to either of the callers (validator::router::process_message_inner or client::router::process_message_inner) means that a warning and an extra failing call to resolve_to_listener(&peer_addr) are skipped. In the case of prover::router::process_message there was no warning, so only the extra failing call to resolve_to_listener(&peer_addr) is skipped. The end result is otherwise identical; in either case the message is not handled (see the sketch below).
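
To make the skipped path concrete, here is a simplified, self-contained sketch; the types and bodies are stand-ins rather than the real process_message_inner, and only the shape of the warning plus the extra failing resolver call is the point:

    use std::net::SocketAddr;

    struct Node;

    impl Node {
        // Stand-in for the resolver lookup; post-disconnect the entry is gone.
        fn resolve_to_listener(&self, _peer_addr: &SocketAddr) -> Option<SocketAddr> {
            None
        }

        // Stand-in for inbound(); pre-change, a stale peer produced an Err here.
        fn inbound(&self, _peer_addr: SocketAddr) -> Result<(), String> {
            Err("unable to resolve the address".into())
        }

        fn process_message_inner(&self, peer_addr: SocketAddr) {
            if let Err(error) = self.inbound(peer_addr) {
                // The warning that returning Ok(()) now skips.
                eprintln!("Failed to process a message from '{peer_addr}': {error}");
                // The "extra failing call": the resolver entry is already gone,
                // so this lookup fails again and only produces more noise.
                if self.resolve_to_listener(&peer_addr).is_none() {
                    eprintln!("Unable to resolve the listener address of '{peer_addr}'");
                }
            }
        }
    }

    fn main() {
        Node.process_message_inner("127.0.0.1:4130".parse().unwrap());
    }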

However, the confusing part is what it means for inbound() to return Ok(()). Normally it means "this inbound message is fine; we handled it; don't try to disconnect". In this particular new case it means "the peer is not fine and we didn't handle the message, but don't log a warning or try to disconnect, because the peer is no longer connected and probably already got a warning".

I think the code would be cleaner if the meaning of returning Ok(()) were not overloaded.
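
One possible way to avoid the overloading, sketched with hypothetical names (none of these types exist in the PR), would be to make the outcome explicit instead of folding the stale-peer case into Ok(()):

    // Hypothetical sketch, for illustration only.
    enum InboundOutcome {
        // The message was handled normally.
        Handled,
        // The sender is no longer connected; the message is a post-disconnect
        // straggler and can be dropped silently.
        StalePeer,
    }

    fn handle(outcome: Result<InboundOutcome, String>) {
        match outcome {
            Ok(InboundOutcome::Handled) => {}
            Ok(InboundOutcome::StalePeer) => { /* ignore quietly, no warning */ }
            // Only genuine failures warrant a warning and a disconnect attempt.
            Err(reason) => eprintln!("protocol violation: {reason}"),
        }
    }

    fn main() {
        handle(Ok(InboundOutcome::Handled));
        handle(Ok(InboundOutcome::StalePeer));
        handle(Err("bad message".into()));
    }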

Comment on lines +148 to +151:

    // The peer informs us that they had disconnected. Disconnect from them too.
    debug!("Peer '{peer_ip}' decided to disconnect due to '{:?}'", message.reason);
    self.router().disconnect(peer_ip);
    Ok(())
A member commented:
@ljedrz @raychu86 can you double check this? Ok(()) maintains the connection

A collaborator replied:
This new logic skips the call to Outbound::send(self, peer_ip, Message::Disconnect(DisconnectReason::ProtocolViolation.into())). I believe this is fine because, in theory, the peer has already disconnected from you, since they are sending a Disconnect message.

A contributor replied:
I agree it should work, but like the other change in inbound it complicates the interface. In this case the self.router().disconnect(peer_ip) call has to be copied from process_message{_internal} into inbound; it is the only call to disconnect() in inbound.

Comment on lines +230 to +235:

    // FIXME (ljedrz): this shouldn't be necessary; it's a double-check
    // that the higher-level collection is consistent with the resolver.
    if router.is_connected(&peer_ip) {
        warn!("Fallback connection artifact cleanup (report this to @ljedrz)");
        router.remove_connected_peer(peer_ip);
    }
A member commented:
@ljedrz this approach seems like a hack. Can you confirm you've empirically seen ghost IPs/peers left over in the router?

@ljedrz (Collaborator, Author) replied:
Yes, I've seen them in the Canary network, hence this adjustment; the last time I checked the logs there were only a handful at most, but it warrants closer inspection and justifies this fallback.

@howardwu (Member) commented:

@ljedrz Do we need to apply the same changes from Router to Gateway?

@ljedrz (Collaborator, Author) commented Mar 26, 2025

@howardwu not necessarily; my recommendation would be to first introduce these changes, and then perform a new analysis of the logs, looking for protocol-violation false positives and potential connection-stability issues. These changes will make the picture a lot clearer.

@joske (Contributor) commented Apr 10, 2025

@ljedrz @niklaslong Which logs are you talking about? Were you able to reproduce the issue yourself?

@ljedrz (Collaborator, Author) commented Apr 10, 2025

@joske I was analyzing the logs of one of the Canarynet clients before and after these changes.

@joske (Contributor) commented Apr 10, 2025

Could you share those logs?

@vicsn (Collaborator) commented Apr 11, 2025

> Could you share those logs?

I recall very often seeing the errors Lukasz mentioned. I suggest just running your own local Canary client, as he suggests; if you don't pass any peers, you should connect to the bootstrap nodes, which will connect you to others.
