Nexus must hang onto qorb resolvers used for MGS updates #8466

Merged
davepacheco merged 22 commits into main from dap/mgs-qorb-fix
Jul 1, 2025

Conversation

davepacheco
Collaborator

The failure mode here is that when using Nexus to update SPs, all the updates fail with messages like this:

23:40:39.213Z INFO bea4cc1e-0758-4643-a4b4-669f67689c6c (ServerContext): update attempt done
    artifact_hash = 100352eb10b8e657bfa168dddd6a3e1cff3f1a3f5600feacfa5bf6b8983ec145
    artifact_version = 1.0.39
    elapsed_millis = 0
    error = found no MGS backends in DNS
    expected_active_version = 1.0.38
    expected_inactive_version = Version(ArtifactVersion("1.0.38"))
    file = nexus/mgs-updates/src/driver.rs:422
    part_number = 913-0000006
    serial_number = BRM23230002
    sp_slot = 0
    sp_type = Switch
    update_id = e17b53c0-f1d6-4f1c-a77d-d63fc08587e5

It says "found no MGS backends in DNS" but there are MGS backends in DNS. The problem is that we dropped the resolvers and so they're not doing any DNS resolution.
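
The failure mode can be reproduced in miniature with only the standard library. This is a rough sketch, not qorb's API: `ToyResolver`, `Msg`, `backends_seen`, and the `mgs-0.example` name are all made up for illustration, and a plain thread stands in for the tokio task a real resolver would run.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Messages handled by the toy resolver's background worker (hypothetical).
enum Msg {
    /// A fresh "DNS answer": the current list of backends.
    Dns(Vec<String>),
    /// Sent when the resolver is dropped, to stop the worker.
    Stop,
}

/// Hypothetical stand-in for a resolver: it owns a background worker that
/// publishes the latest backend list into shared state.
struct ToyResolver {
    tx: Sender<Msg>,
    worker: Option<JoinHandle<()>>,
    backends: Arc<Mutex<Vec<String>>>,
}

impl ToyResolver {
    /// Starts the worker. Also returns a sender that plays the role of DNS.
    fn spawn() -> (ToyResolver, Sender<Msg>) {
        let (tx, rx) = mpsc::channel::<Msg>();
        let backends: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));
        let shared = Arc::clone(&backends);
        let worker = thread::spawn(move || {
            for msg in rx {
                match msg {
                    Msg::Dns(list) => *shared.lock().unwrap() = list,
                    Msg::Stop => break,
                }
            }
        });
        let dns_feed = tx.clone();
        (ToyResolver { tx, worker: Some(worker), backends }, dns_feed)
    }

    /// Hands out a view of the backend list. The view outlives the resolver.
    fn monitor(&self) -> Arc<Mutex<Vec<String>>> {
        Arc::clone(&self.backends)
    }
}

impl Drop for ToyResolver {
    fn drop(&mut self) {
        // Dropping the resolver stops the worker, so resolution stops too.
        let _ = self.tx.send(Msg::Stop);
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

/// Returns what a consumer of `monitor()` sees after one DNS answer arrives.
fn backends_seen(drop_resolver_early: bool) -> Vec<String> {
    let (resolver, dns) = ToyResolver::spawn();
    let view = resolver.monitor();
    let answer = vec!["mgs-0.example".to_string()];
    if drop_resolver_early {
        // The bug: the resolver is gone before any answer arrives.
        drop(resolver);
        let _ = dns.send(Msg::Dns(answer)); // nobody is listening any more
    } else {
        // The fix: the resolver is still alive when the answer arrives.
        dns.send(Msg::Dns(answer)).expect("worker is still running");
        drop(resolver); // stops the worker only after the answer was handled
    }
    let seen = view.lock().unwrap().clone();
    seen
}
```

In this sketch `backends_seen(true)` comes back empty, which is the toy equivalent of "found no MGS backends in DNS", while `backends_seen(false)` returns the one backend.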

@davepacheco davepacheco self-assigned this Jun 27, 2025
@davepacheco davepacheco requested a review from smklein June 27, 2025 00:37
/// DNS resolver used by MgsUpdateDriver for MGS
// We don't need to do anything with this, but we can't let it be dropped
// while Nexus is running.
#[allow(dead_code)]
Collaborator

Normally these resolvers are handed off to the pool, so their lifetime is coupled with the pool itself.

Do you think qorb could have done anything more clear to identify that the resolver object needs to be kept alive for resolution to keep happening?

Collaborator Author

Good question. Maybe update the Resolver docs to say something like:

Resolvers generally do the bulk of their work (e.g., DNS resolution) in a separate tokio task. Dropping the resolver aborts that task. Any Receivers previously returned by monitor() will contain the last-updated set of backends indefinitely.

This last thing isn't qorb's fault but I found it to be a surprising footgun (with watch channels, I guess) and so worth calling out. I thought I'd have seen a RecvError because the other end of the channel got dropped. But my watch consumer only ever uses borrow() so it didn't notice the channel was closed.
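
To make the footgun concrete, here is a toy model of a watch-style channel using only the standard library. It is not tokio's implementation, and `toy_watch`, `ToySender`, `ToyReceiver`, `latest`, `check_open`, and `demo` are names invented for this sketch. In tokio itself, `borrow()` plays the role of `latest()` below, and `changed()` is the call that reports a dropped sender.

```rust
use std::sync::{Arc, Mutex};

/// State shared by both halves of the toy channel (hypothetical).
struct Shared<T> {
    value: T,
    closed: bool,
}

struct ToySender<T> {
    shared: Arc<Mutex<Shared<T>>>,
}

struct ToyReceiver<T> {
    shared: Arc<Mutex<Shared<T>>>,
}

/// Error reported once the sending half has been dropped.
#[derive(Debug, PartialEq)]
struct Closed;

fn toy_watch<T>(initial: T) -> (ToySender<T>, ToyReceiver<T>) {
    let shared = Arc::new(Mutex::new(Shared { value: initial, closed: false }));
    (
        ToySender { shared: Arc::clone(&shared) },
        ToyReceiver { shared },
    )
}

impl<T> ToySender<T> {
    /// Replaces the stored value.
    fn send(&self, value: T) {
        self.shared.lock().unwrap().value = value;
    }
}

impl<T> Drop for ToySender<T> {
    fn drop(&mut self) {
        // Dropping the sender marks the channel closed but keeps the value.
        self.shared.lock().unwrap().closed = true;
    }
}

impl<T: Clone> ToyReceiver<T> {
    /// Peeks at the last value. Never fails, even after the sender is gone.
    fn latest(&self) -> T {
        self.shared.lock().unwrap().value.clone()
    }

    /// The only call in this model that reveals the sender was dropped.
    fn check_open(&self) -> Result<(), Closed> {
        if self.shared.lock().unwrap().closed {
            Err(Closed)
        } else {
            Ok(())
        }
    }
}

/// Sends one update, drops the sender, then reads from the receiver.
fn demo() -> (Vec<String>, bool) {
    let (tx, rx) = toy_watch(Vec::new());
    tx.send(vec!["mgs-0.example".to_string()]);
    drop(tx);
    (rx.latest(), rx.check_open().is_err())
}
```

A consumer that only ever calls `latest()` keeps getting the last value and has no way to learn that updates have stopped. Only a consumer that also calls `check_open()` sees the closure.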

I'd also update the docs (and maybe name?) for monitor(). The name sounds like it's going to take some action and the docs say "Start running a resolver". But that's not right. The resolver is already running before you call it. Maybe call it subscribe() and just drop the "Start running a resolver" sentence? I understand though if it's not worth making a breaking change for this.

I also think it's worth mentioning the thing above under monitor(), something like:

Note that if the Resolver gets dropped, then Receivers previously returned by this method will stop getting updated, but they will continue to report the last known set of backends.

Collaborator

Updated docs in oxidecomputer/qorb#112 for Resolver::monitor

@davepacheco
Collaborator Author

See #8291 for testing notes.

Base automatically changed from dap/mgs-update-db to main June 30, 2025 23:58
@davepacheco davepacheco enabled auto-merge (squash) July 1, 2025 00:12
@davepacheco davepacheco disabled auto-merge July 1, 2025 00:13
@davepacheco davepacheco merged commit bb0f689 into main Jul 1, 2025
16 checks passed
@davepacheco davepacheco deleted the dap/mgs-qorb-fix branch July 1, 2025 02:19
3 participants