[support bundle] Collect sled-specific bundle information concurrently #8106

Merged — 67 commits from concurrent-sb-collection into main on May 15, 2025

Conversation

@smklein (Collaborator) commented May 6, 2025

In the interest of improving support bundle collection times, avoid bottlenecking on each sled individually.

Instead, collect from up to 16 sleds at once.

smklein added 30 commits April 11, 2025 10:33
@smklein changed the title from "[support bundle] Collect bundles concurrently" to "[support bundle] Collect sled-specific bundle information concurrently" on May 6, 2025
@hawkw (Member) left a comment

sorry for the obnoxious drive-by :)

Comment on lines 572 to 575
```rust
let mut sled_stream = futures::stream::iter(all_sleds)
    .map(|sled| async move {
        self.collect_data_from_sled(&log, &sled, dir).await
    })
```
Member

hm, I note that while the collect_data_from_sled calls will execute concurrently, the async fns will not actually run in parallel, since they're not spawned in separate tasks. so there will only ever be one thread executing one of those futures at a time, although we will be running them concurrently and executing the next one when one yields. in this case, i think that's mostly okay, since it looks like most of the runtime of collect_data_from_sled is probably waiting on IO (waiting for the HTTP requests to come back and the tokio::fs operations, which run on the blocking threadpool, to complete). but, i might nonetheless consider spawning tasks for each sled, instead. up to you.

Member

(if you decide these should run in parallel i might consider using tokio::task::JoinSet rather than buffer_unordered, since it uses the task allocation to store the task's membership in the joinset and therefore uses much less heap. may not actually matter though.)
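
(For reference, a rough sketch of the spawn-per-sled variant being suggested, using a `JoinSet`; the types and function names here are illustrative stand-ins, not the PR's actual code. Spawned tasks get their own slots on the multi-threaded runtime, so they can run in parallel rather than merely concurrently.)

```rust
use tokio::task::JoinSet;

// Illustrative stand-ins for the real sled type and collection routine.
struct Sled;
async fn collect_data_from_sled(_sled: Sled) { /* HTTP requests, file writes, ... */ }

async fn collect_all(all_sleds: Vec<Sled>) {
    let mut tasks = JoinSet::new();
    for sled in all_sleds {
        // Each spawned future must be 'static, so the sled is moved in by value.
        tasks.spawn(collect_data_from_sled(sled));
    }
    // The JoinSet stores membership inside the task allocation itself, which
    // is the lower-heap-usage property mentioned above.
    while let Some(result) = tasks.join_next().await {
        if let Err(err) = result {
            eprintln!("sled collection task failed: {err}");
        }
    }
}
```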

Contributor

That's super interesting, thanks for sharing that bit of info on buffer_unordered vs JoinSet.

@smklein (Collaborator, Author) commented May 7, 2025

I can make this refactor -- I'm aware that it was concurrent, not parallel -- but the 'static bound on the spawn-ed tasks means I probably can't operate on &self, and will need to refactor this into a free function.

Collaborator (Author)

Made these changes in 59e8750 -- I'm using some manual counting of tasks to respect my previous "maximum bound on concurrent activity".
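
(A sketch of what bounding in-flight tasks by manual counting can look like; the names and the limit of 16 are illustrative, and 59e8750 is the place to look for the real implementation.)

```rust
use tokio::task::JoinSet;

// Illustrative stand-ins; the real code lives in the bundle collector.
struct Sled;
async fn collect_data_from_sled(_sled: Sled) { /* ... */ }

const MAX_CONCURRENT_SLEDS: usize = 16;

async fn collect_all(all_sleds: Vec<Sled>) {
    let mut sleds_iter = all_sleds.into_iter().peekable();
    let mut tasks = JoinSet::new();

    // While we have incoming work to send to tasks (sleds_iter)
    // or a task operating on that data (tasks)...
    while sleds_iter.peek().is_some() || !tasks.is_empty() {
        // Top up to the concurrency limit before waiting on completions.
        while tasks.len() < MAX_CONCURRENT_SLEDS {
            let Some(sled) = sleds_iter.next() else { break };
            tasks.spawn(collect_data_from_sled(sled));
        }
        // Each completion frees a slot for the next sled.
        if let Some(Err(err)) = tasks.join_next().await {
            eprintln!("sled collection task failed: {err}");
        }
    }
}
```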

Comment on lines +700 to +704
```rust
// Currently we execute up to 10 commands concurrently which
// might be doing their own concurrent work, for example
// collecting `pstack` output of every Oxide process that is
// found on a sled.
.buffer_unordered(10);
```
Member

again, concurrent but not parallel. whether that matters depends on how much CPU-bound work these functions do...

Contributor

Above you mentioned fs operations being spawned onto the blocking thread pool -- does that mean futures_unordered can still perform those tasks concurrently, since the await point is typically waiting for the result from the blocking thread? I am curious because the log collection is a synchronous task which we use spawn_blocking for ourselves.
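
(For context, a sketch of the pattern in question, with illustrative names: awaiting the JoinHandle returned by spawn_blocking suspends that future, so the executor can keep polling the other futures in the buffer_unordered set while the synchronous work runs on the blocking threadpool.)

```rust
use futures::stream::{self, StreamExt};

// Illustrative synchronous step, standing in for something like log collection.
fn collect_logs_sync(sled_id: usize) -> String {
    format!("logs for sled {sled_id}")
}

async fn collect_logs(sled_ids: Vec<usize>) -> Vec<String> {
    stream::iter(sled_ids)
        .map(|id| async move {
            // The closure runs on tokio's blocking threadpool; this await
            // point only waits for its result, so the single task driving
            // buffer_unordered is free to poll the other buffered futures.
            tokio::task::spawn_blocking(move || collect_logs_sync(id))
                .await
                .expect("blocking task panicked")
        })
        .buffer_unordered(10)
        .collect()
        .await
}
```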

Base automatically changed from sb-health-check to main May 14, 2025 21:26
```rust
continue;
// While we have incoming work to send to tasks (sleds_iter)
// or a task operating on that data (tasks)...
while sleds_iter.peek().is_some() || !tasks.is_empty() {
```
Contributor

nit: You could probably swap this to use a semaphore but I am not pressed about it.
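
(A sketch of the semaphore-based alternative, assuming a tokio::sync::Semaphore with one permit per in-flight sled; the names and the limit are illustrative, not the PR's code.)

```rust
use std::sync::Arc;
use tokio::{sync::Semaphore, task::JoinSet};

// Illustrative stand-ins for the real types.
struct Sled;
async fn collect_data_from_sled(_sled: Sled) { /* ... */ }

async fn collect_all(all_sleds: Vec<Sled>) {
    let semaphore = Arc::new(Semaphore::new(16));
    let mut tasks = JoinSet::new();

    for sled in all_sleds {
        // Wait here until a permit is free, so at most 16 collections run
        // at once; the permit is released when the spawned task drops it.
        // unwrap() is fine because the semaphore is never closed.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        tasks.spawn(async move {
            let _permit = permit;
            collect_data_from_sled(sled).await
        });
    }
    while tasks.join_next().await.is_some() {}
}
```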

Collaborator (Author)

I'll look into refactoring this behavior out as a follow-up

@smklein merged commit 5c37744 into main on May 15, 2025
16 checks passed
@smklein deleted the concurrent-sb-collection branch on May 15, 2025 at 17:37