-
Notifications
You must be signed in to change notification settings - Fork 45
Nexus must hang onto qorb resolvers used for MGS updates #8466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// DNS resolver used by MgsUpdateDriver for MGS | ||
// We don't need to do anything with this, but we can't let it be dropped | ||
// while Nexus is running. | ||
#[allow(dead_code)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally these resolvers are handed off to the pool, so their lifetime is coupled with the pool itself.
Do you think qorb could have done anything more clear to identify that the resolver object needs to be kept alive for resolution to keep happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Maybe update the Resolver
docs to say something like:
Resolvers generally do the bulk of their work (e.g., DNS resolution) in a separate tokio task. Dropping the resolver aborts that task. Any
Receiver
s previously returned bymonitor()
will contain the last-updated set of backends indefinitely.
This last thing isn't qorb's fault but I found it to be a surprising footgun (with watch channels, I guess) and so worth calling out. I thought I'd have seen a RecvError because the other end of the channel got dropped. But my watch consumer only ever uses borrow()
so it didn't notice the channel was closed.
I'd also update the docs (and maybe name?) for monitor()
. The name sounds like it's going to take some action and the docs say "Start running a resolver". But that's not right. The resolver is already running before you call it. Maybe call it subscribe()
and just drop the "Start running a resolver" sentence? I understand though if it's not worth making a breaking change for this.
I also think it's worth mentioning the thing above under monitor()
, something like:
Note that if the
Resolver
gets dropped, thenReceiver
s previously returned by this method will stop getting updated, but they will contain to report the last known set of backends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated docs in oxidecomputer/qorb#112 for Resolver::monitor
See #8291 for testing notes. |
The failure mode here is that when using Nexus to update SPs, all the updates fail with messages like this:
It says "found no MGS backends in DNS" but there are MGS backends in DNS. The problem is that we dropped the resolvers and so they're not doing any DNS resolution.