Skip to content

[nexus] Create a longer-lived connection from Nexus -> Sled Agent #8149

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 16, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 27 additions & 7 deletions nexus/networking/src/sled_client.rs
Original file line number Diff line number Diff line change
Expand Up @@ -25,29 +25,49 @@ pub fn sled_lookup<'a>(
Ok(sled)
}

pub fn default_reqwest_client_builder() -> reqwest::ClientBuilder {
let dur = std::time::Duration::from_secs(60);
reqwest::ClientBuilder::new().connect_timeout(dur).timeout(dur)
}

pub async fn sled_client(
datastore: &DataStore,
lookup_opctx: &OpContext,
sled_id: Uuid,
log: &Logger,
) -> Result<SledAgentClient, Error> {
let client = default_reqwest_client_builder().build().unwrap();
sled_client_ext(datastore, lookup_opctx, sled_id, log, client).await
}

pub async fn sled_client_ext(
datastore: &DataStore,
lookup_opctx: &OpContext,
sled_id: Uuid,
log: &Logger,
client: reqwest::Client,
) -> Result<SledAgentClient, Error> {
let (.., sled) =
sled_lookup(datastore, lookup_opctx, sled_id)?.fetch().await?;

Ok(sled_client_from_address(sled_id, sled.address(), log))
Ok(sled_client_from_address_ext(sled_id, sled.address(), log, client))
}

pub fn sled_client_from_address(
sled_id: Uuid,
address: SocketAddrV6,
log: &Logger,
) -> SledAgentClient {
let client = default_reqwest_client_builder().build().unwrap();
sled_client_from_address_ext(sled_id, address, log, client)
}

pub fn sled_client_from_address_ext(
sled_id: Uuid,
address: SocketAddrV6,
log: &Logger,
client: reqwest::Client,
) -> SledAgentClient {
let log = log.new(o!("SledAgent" => sled_id.to_string()));
let dur = std::time::Duration::from_secs(60);
let client = reqwest::ClientBuilder::new()
.connect_timeout(dur)
.timeout(dur)
.build()
.unwrap();
SledAgentClient::new_with_client(&format!("http://{address}"), client, log)
}
13 changes: 12 additions & 1 deletion nexus/src/app/sled.rs
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,16 @@ impl super::Nexus {
pub async fn sled_client(
&self,
id: &SledUuid,
) -> Result<Arc<SledAgentClient>, Error> {
let client =
nexus_networking::default_reqwest_client_builder().build().unwrap();
self.sled_client_ext(id, client).await
}

pub async fn sled_client_ext(
&self,
id: &SledUuid,
client: reqwest::Client,
) -> Result<Arc<SledAgentClient>, Error> {
// TODO: We should consider injecting connection pooling here,
// but for now, connections to sled agents are constructed
Expand All @@ -155,11 +165,12 @@ impl super::Nexus {
// Frankly, returning an "Arc" here without a connection pool is a
// little silly; it's not actually used if each client connection exists
// as a one-shot.
let client = nexus_networking::sled_client(
let client = nexus_networking::sled_client_ext(
&self.db_datastore,
&self.opctx_alloc,
id.into_untyped_uuid(),
&self.log,
client,
)
.await?;
Ok(Arc::new(client))
Expand Down
15 changes: 14 additions & 1 deletion nexus/src/app/support_bundles.rs
Original file line number Diff line number Diff line change
Expand Up @@ -90,7 +90,20 @@ impl super::Nexus {
.db_datastore
.zpool_get_sled_if_in_service(&opctx, bundle.zpool_id.into())
.await?;
let client = self.sled_client(&sled_id).await?;

let short_timeout = std::time::Duration::from_secs(60);
let long_timeout = std::time::Duration::from_secs(3600);
Comment on lines +94 to +95
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be in the config file or something? It seems like it could, potentially, be useful to change them in the field if we see the long timeout hit...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to this, but I only wonder: is plumbing them into nexus config enough to make this modifiable? Seems like callers would still need to change the config among all nexuses, which would either require storing this value in the DB or deploying a new configuration via reconfigurator, right?

let client = nexus_networking::default_reqwest_client_builder()
// Continuing to read from the sled agent should happen relatively
// quickly.
.read_timeout(short_timeout)
// However, the bundle itself may be large. As long as we're
// continuing to make progress (see: read_timeout) we should be
// willing to keep transferring the bundle for a while longer.
.timeout(long_timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may still be too short for a truly large support bundle, say 50 GiB. At a download speed of 10 MiB/s, that would take roughly 85 minutes to complete.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear you - I think range requests is how I want to solve this in the limit; I'm not sure I really want to allow connections to stay open in Nexus for longer than an hour.

WDYT - should we still try to do this PR as-is? Do you think we should aim for a higher number? Or should we accept this limitation until range requests are fully plumbed through?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial concern was a customer being completely blocked by this timeout, but we could extract individual files via the API pretty easily, so that's less of an issue. An hour seems like a reasonable limit until we can get range requests in.

.build()
.expect("Failed to build reqwest Client");
let client = self.sled_client_ext(&sled_id, client).await?;

// TODO: Use "range"?

Expand Down
Loading