[Fix] Harden the router's resolver #3540

Draft · wants to merge 6 commits into base: staging
Conversation

@ljedrz (Collaborator) commented Mar 14, 2025

While investigating a potential issue with some trusted peers being periodically dropped, I've noticed a lot of instances of "Unable to resolve the (...) address" in the log extracts from different networks. I believe most of them are triggered unnecessarily, but we need to be certain, and this PR aims to address that.

The proposed changes are as follows:

  • 1e8dc49 - changes the dual-lock setup of the resolver to a single-lock one in order to rule out any possibility of a mismatch between the address maps; it should also slightly improve performance (see the sketch after this list)
  • 260e84b - the inbound method is fed from a lower-level queue that has no awareness of the address resolver, so entries that fail to resolve there are essentially guaranteed to be post-disconnect "stragglers" and may be ignored (instead of triggering potentially many redundant disconnect attempts, which result in further resolver-related warnings)
  • 6bb8a74 - swaps the order of disconnect-related operations, altering the resolver only after a peer is no longer marked as connected; this avoids situations where an outbound message is greenlit to be sent to a peer (still marked as connected) only to fail at address resolution right afterwards, triggering a bogus warning (also covered in the sketch below)
  • bccf29a - a loosely related drive-by: we should clear any peer-related cache entries before marking the peer as a candidate for connections, in order to avoid a (highly unlikely) scenario where the peer is reconnected to while still having outdated cache entries, or even has new and applicable cache entries cleared
  • 7ee66b4 - when a peer sends us a Message::Disconnect, we shouldn't report it as a protocol violation; this is mostly a cleanup of one or two misleading logs
  • b673d7b - since I've seen some instances of the heartbeat process reporting lingering inactive peers, we should have a fallback cleanup of high-level connection artifacts in case the resolver can't find the physically connected address
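
As a rough illustration of the first and third items (1e8dc49 and 6bb8a74), here is a minimal self-contained sketch; every name except resolve_to_listener is hypothetical, and std::sync::RwLock stands in for whatever lock type the codebase actually uses:

    use std::{
        collections::{HashMap, HashSet},
        net::SocketAddr,
        sync::RwLock,
    };

    #[derive(Default)]
    struct ResolverMaps {
        to_listener: HashMap<SocketAddr, SocketAddr>,  // connected addr -> listener addr
        to_connected: HashMap<SocketAddr, SocketAddr>, // listener addr -> connected addr
    }

    #[derive(Default)]
    struct Router {
        // A single lock guards both directions of the mapping, so they can
        // never be observed in a mismatched state (the 1e8dc49 idea).
        resolver: RwLock<ResolverMaps>,
        connected_peers: RwLock<HashSet<SocketAddr>>,
    }

    impl Router {
        fn insert_peer(&self, listener: SocketAddr, connected: SocketAddr) {
            // One write lock updates both maps atomically.
            let mut maps = self.resolver.write().unwrap();
            maps.to_listener.insert(connected, listener);
            maps.to_connected.insert(listener, connected);
            self.connected_peers.write().unwrap().insert(listener);
        }

        fn resolve_to_listener(&self, connected: &SocketAddr) -> Option<SocketAddr> {
            self.resolver.read().unwrap().to_listener.get(connected).copied()
        }

        fn disconnect_peer(&self, listener: SocketAddr) {
            // Unmark the peer as connected *first*, so no outbound message is
            // greenlit for it past this point (the 6bb8a74 idea)...
            self.connected_peers.write().unwrap().remove(&listener);
            // ...and only then drop the resolver entries, so a greenlit message
            // can no longer fail address resolution and log a bogus warning.
            let mut maps = self.resolver.write().unwrap();
            if let Some(connected) = maps.to_connected.remove(&listener) {
                maps.to_listener.remove(&connected);
            }
        }
    }

    fn main() {
        let router = Router::default();
        let listener = "127.0.0.1:4130".parse().unwrap();
        let connected = "127.0.0.1:55555".parse().unwrap();
        router.insert_peer(listener, connected);
        assert_eq!(router.resolve_to_listener(&connected), Some(listener));
        router.disconnect_peer(listener);
        assert_eq!(router.resolve_to_listener(&connected), None);
    }

The point is that both maps sit behind one guard, and that the connected-peers set is updated before the resolver on disconnect.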

Filing as a draft for now, as I'm still looking for potential related issues in the logs.

Cc @zkxuerb

@niklaslong (Collaborator) left a comment

A nice tightening up of peer tracking! Did a first pass and the current changeset looks good 👍

Comment on lines +68 to +71:

    None => {
        // No longer connected to the peer.
        return Ok(());
    }
@howardwu (Member) commented Mar 26, 2025
@ljedrz @raychu86 can you double check this? Ok(()) maintains the connection

A collaborator replied:
The bail leads back to calling self.router().resolve_to_listener(&peer_addr), so this logic doesn't actually change any real behavior (except for skipping a log message).

A contributor replied:

I agree with the analysis: because resolve_to_listener(&peer_addr) just returned None, returning Ok(()) instead of bail!(..) from this point to either of the callers (validator::router::process_message_inner or client::router::process_message_inner) means that a warning and an extra failing call to resolve_to_listener(&peer_addr) are skipped. In the case of prover::router::process_message there was no warning, so only the extra failing call to resolve_to_listener(&peer_addr) is skipped. The end result is otherwise identical; in either case the message is not handled (see the sketch below).
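
To make the skipped path concrete, here is a simplified, self-contained sketch; the types and bodies are stand-ins rather than the real process_message_inner, and only the shape of the warning plus the extra failing resolver call is the point:

    use std::net::SocketAddr;

    struct Node;

    impl Node {
        // Stand-in for the resolver lookup; post-disconnect the entry is gone.
        fn resolve_to_listener(&self, _peer_addr: &SocketAddr) -> Option<SocketAddr> {
            None
        }

        // Stand-in for inbound(); pre-change, a stale peer produced an Err here.
        fn inbound(&self, _peer_addr: SocketAddr) -> Result<(), String> {
            Err("unable to resolve the address".into())
        }

        fn process_message_inner(&self, peer_addr: SocketAddr) {
            if let Err(error) = self.inbound(peer_addr) {
                // The warning that returning Ok(()) now skips.
                eprintln!("Failed to process a message from '{peer_addr}': {error}");
                // The "extra failing call": the resolver entry is already gone,
                // so this lookup fails again and only produces more noise.
                if self.resolve_to_listener(&peer_addr).is_none() {
                    eprintln!("Unable to resolve the listener address of '{peer_addr}'");
                }
            }
        }
    }

    fn main() {
        Node.process_message_inner("127.0.0.1:4130".parse().unwrap());
    }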

However, the confusing part is what it means for inbound() to return Ok(()). Normally it means "this inbound message is fine; we handled it; don't try to disconnect". In this particular new case it means "the peer is not fine and we didn't handle the message, but don't log a warning or try to disconnect, because the peer is no longer connected and probably already got a warning".

I think the code would be cleaner if the meaning of returning Ok(()) were not overloaded.
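
One possible way to avoid the overloading, sketched with hypothetical names (none of these types exist in the PR), would be to make the outcome explicit instead of folding the stale-peer case into Ok(()):

    // Hypothetical sketch, for illustration only.
    enum InboundOutcome {
        // The message was handled normally.
        Handled,
        // The sender is no longer connected; the message is a post-disconnect
        // straggler and can be dropped silently.
        StalePeer,
    }

    fn handle(outcome: Result<InboundOutcome, String>) {
        match outcome {
            Ok(InboundOutcome::Handled) => {}
            Ok(InboundOutcome::StalePeer) => { /* ignore quietly, no warning */ }
            // Only genuine failures warrant a warning and a disconnect attempt.
            Err(reason) => eprintln!("protocol violation: {reason}"),
        }
    }

    fn main() {
        handle(Ok(InboundOutcome::Handled));
        handle(Ok(InboundOutcome::StalePeer));
        handle(Err("bad message".into()));
    }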

Comment on lines +148 to +151:

    // The peer informs us that they had disconnected. Disconnect from them too.
    debug!("Peer '{peer_ip}' decided to disconnect due to '{:?}'", message.reason);
    self.router().disconnect(peer_ip);
    Ok(())
A member commented:
@ljedrz @raychu86 can you double check this? Ok(()) maintains the connection

A collaborator replied:
This new logic skips the call to Outbound::send(self, peer_ip, Message::Disconnect(DisconnectReason::ProtocolViolation.into())). I believe this is fine because, in theory, the peer has already disconnected from you, since they are sending a Disconnect message.

A contributor replied:
I agree it should work, but like the other change in inbound it complicates the interface. In this case the self.router().disconnect(peer_ip) call has to be copied from process_message{_internal} into inbound; it is the only call to disconnect() in inbound.

Comment on lines +230 to +235:

    // FIXME (ljedrz): this shouldn't be necessary; it's a double-check
    // that the higher-level collection is consistent with the resolver.
    if router.is_connected(&peer_ip) {
        warn!("Fallback connection artifact cleanup (report this to @ljedrz)");
        router.remove_connected_peer(peer_ip);
    }
A member commented:
@ljedrz this approach seems like a hack. Can you confirm you've empirically seen ghost IPs/peers left over in the router?

@ljedrz (Collaborator, Author) replied:
Yes, I've seen them in the Canary network, hence this adjustment; the last time I checked the logs there were only a handful at most, but it warrants closer inspection and justifies this fallback.

@howardwu (Member) commented:

@ljedrz Do we need to apply the same changes from Router to Gateway?

@ljedrz (Collaborator, Author) commented Mar 26, 2025

@howardwu not necessarily; my recommendation would be to first introduce these changes, and then perform a new analysis of the logs, looking for protocol-violation false positives and potential connection-stability issues. These changes will make the picture a lot clearer.

@joske (Contributor) commented Apr 10, 2025

@ljedrz @niklaslong Which logs are you talking about? Were you able to reproduce the issue yourself?

@ljedrz (Collaborator, Author) commented Apr 10, 2025

@joske I was analyzing the logs of one of the Canarynet clients before and after these changes.

@joske (Contributor) commented Apr 10, 2025

Could you share those logs?

@vicsn (Collaborator) commented Apr 11, 2025

> Could you share those logs?

I recall very often seeing the errors Lukasz mentioned. I suggest just running your own local Canary client, as he suggests; if you don't pass any peers, you should connect to the bootstrap nodes, which will connect you to others.
