-
Notifications
You must be signed in to change notification settings - Fork 2.7k
[Fix] Harden the router's resolver #3540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: staging
Are you sure you want to change the base?
Conversation
Signed-off-by: ljedrz <[email protected]>
…o a peer Signed-off-by: ljedrz <[email protected]>
Signed-off-by: ljedrz <[email protected]>
… peer Signed-off-by: ljedrz <[email protected]>
…connect Signed-off-by: ljedrz <[email protected]>
Signed-off-by: ljedrz <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nice tightening up of peer tracking! Did a first pass and the current changeset looks good 👍
None => { | ||
// No longer connected to the peer. | ||
return Ok(()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bail
leads back to calling self.router().resolve_to_listener(&peer_addr)
, so this logic doesn't actually change any real behavior (except for skipping a log message).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the analysis:
Because resolve_to_listener(&peer_addr)
just returned None
,
returning Ok(())
instead of bail!(..)
from this point to either of the callers
validator::router::process_message_inner
or client::router::process_message_inner
means a warning and an extra failing call to resolve_to_listener(&peer_addr)
are skipped.
In the case of prover::router::process_message
there was no warning, so
just the extra failing call to resolve_to_listener(&peer_addr)
is skipped,
The end result is otherwise identical; in either case the message is not handled.
However:
The confusing part is, what does it mean for inbound()
to return Ok(())
?
Normally it means "This inbound message is fine; we handled it; don't try to disconnect.".
In this new particular case it means "Although the peer is not fine and we didn't handle the message,
don't give a warning or try to disconnect because it is no longer connected
and probably already got a warning."
I think the code would be cleaner if the meaning of returning Ok(())
were not overloaded.
// The peer informs us that they had disconnected. Disconnect from them too. | ||
debug!("Peer '{peer_ip}' decided to disconnect due to '{:?}'", message.reason); | ||
self.router().disconnect(peer_ip); | ||
Ok(()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new logic skips the call to
Outbound::send(self, peer_ip, Message::Disconnect(DisconnectReason::ProtocolViolation.into()));
. I believe this is fine because in theory the peer has already disconnected you since they are sending a Disconnect message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it should work but like the other change in inbound
it complicates the interface. In this case the self.router().disconnect(peer_ip)
has to get copied from process_message{_internal}
to inbound
. This is the only call to disconnect()
in inbound
.
// FIXME (ljedrz): this shouldn't be necessary; it's a double-check | ||
// that the higher-level collection is consistent with the resolver. | ||
if router.is_connected(&peer_ip) { | ||
warn!("Fallback connection artifact cleanup (report this to @ljedrz)"); | ||
router.remove_connected_peer(peer_ip); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ljedrz this approach seems like a hack. Can you confirm you've seen empirically ghost IPs/peers left over in the router?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I've seen them in the Canary network, hence this adjustment; last time I checked the logs there were only a handful or less, but it warrants a closer inspection, and justifies this fallback
@ljedrz Do we need to apply the same changes from |
@howardwu not necessarily; my recommendation would be to first introduce these changes, and then perform a new analysis of the logs, looking for protocol violation false positives and potential connection stability issues. These changes will make the picture a lot more clear. |
@ljedrz @niklaslong Which logs are you talking about? Were you able to reproduce the issue yourself? |
@joske I was analyzing the logs of one of the Canarynet clients before and after these changes. |
Could you share those logs? |
I recall very often seeing the errors Lukasz mentioned. I suggest you can just run your own local canary client as he suggests, if you don't pass any peers you should connect to bootstrap nodes who will connect you to others. |
While investigating a potential issue with some trusted peers being periodically dropped, I've noticed a lot of instances of
Unable to resolve the (...) address
in the log extracts from different networks. I believe most of them are triggered unnecessarily, but we need to be sure, and this PR aims to address this.The proposed changes are as follows:
inbound
method is "fed" from a lower-level queue which doesn't have an awareness of the address resolver, so the entries that fail to resolve there are basically guaranteed to be post-disconnect "stragglers" and may be ignored (instead of triggering potentially many redundant disconnect attempts, which result in further resolver-related warnings)Message::Disconnect
, we shouldn't report it as a protocol violation; this is mostly a cleanup of one or two misleading logsFiling as a draft for now, as I'm still looking for potential related issues in the logs.
Cc @zkxuerb