
Ensure NVMe restore task finishes before worker runs #1321


Open · wants to merge 2 commits into main

Conversation

yupavlen-ms (Contributor):
Running real-life scenarios shows failures that can only be explained by race conditions.
It appears that the original intent, to wait for the restore task to finish before the worker runs, has a gap.
The fix is to add a proper wait for the futures to complete.

yupavlen-ms requested a review from a team as a code owner · May 8, 2025 18:39
```rust
        saved_state.as_ref().unwrap(),
    )];
    // TODO: Move join_all into NvmeManagerWorker::restore instead?
    join_all(state_vec)
```
Member:

Is this what you want? Who do you want to block: the caller of `new`, or the user of this task? Or do we just need to restore everything before we do the run loop below?

yupavlen-ms (Contributor, Author) · May 8, 2025:

We need to restore everything before `worker.run()` is started and begins accepting `GetNamespace` requests. The restore task is async because of vfio uevent waiting, so we can get conflicting namespace init/restore.
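
For illustration, a minimal sketch of the invariant being discussed, assuming the code shape in the diff (the surrounding types and the `.expect` handling are placeholders, not the project's actual error policy): the restore future must run to completion inside the spawned task before `worker.run()` starts serving requests.

```rust
// Sketch only: names follow the diff; error handling is a placeholder.
let task = driver.spawn("nvme-manager", async move {
    if let Some(state) = saved_state.as_ref() {
        // Drive restore to completion, including its internal vfio
        // uevent waits, before the worker loop can see any request.
        NvmeManager::restore(&mut worker, state)
            .await
            .expect("failed to restore nvme manager state");
    }
    worker.run().await;
});
```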

```rust
        &mut worker,
        saved_state.as_ref().unwrap(),
    )];
    // TODO: Move join_all into NvmeManagerWorker::restore instead?
```
yupavlen-ms (Contributor, Author):

I think this TODO is not going to happen, but I put it here anyway in case someone can say whether that would be as safe as the currently proposed solution.

```diff
@@ -109,7 +109,12 @@ impl NvmeManager {
     let task = driver.spawn("nvme-manager", async move {
         // Restore saved data (if present) before async worker thread runs.
         if saved_state.is_some() {
-            let _ = NvmeManager::restore(&mut worker, saved_state.as_ref().unwrap())
+            let state_vec = vec![NvmeManager::restore(
```
Contributor:

Why make a vec and use join_all when we only have one function being called? Why not just join it directly?

yupavlen-ms (Contributor, Author):

That was my concern too, but I couldn't find a `join` variant that accepts a single future; they all require at least two. The vectorized `join_all` just works. Or was I looking in the wrong places? Please correct me.

Contributor:

Wouldn't that just be .await?

Contributor:

That's what we were doing before actually. So what's changing here?

yupavlen-ms (Contributor, Author):

Will it guarantee sequential execution? If you look inside `NvmeManager::restore`, it has its own `.await`, but the Kusto results suggest parallel execution of both `NvmeManager::restore` and `worker.run`.

Does bringing the `.await` one level up solve this problem?

Contributor:

`await` does guarantee sequential execution, yes. I don't think changing the level of the awaits would matter.

However, the task created by `driver.spawn` can run in parallel with other code. Is it possible that that's what you're observing?
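
As a standalone toy (not project code), the distinction being drawn here looks like this: `.await` fully sequences steps within one task, and only spawning a separate task introduces concurrency.

```rust
// Toy futures standing in for restore and the worker loop.
async fn restore() { /* may suspend internally on I/O */ }
async fn run() { /* request loop */ }

async fn manager_task() {
    restore().await; // completes entirely, suspensions included,
                     // before the next line executes
    run().await;     // cannot observe a half-restored state
}
// Concurrency only arises from spawning manager_task alongside other
// tasks via an executor; within the task the ordering above holds.
```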

yupavlen-ms (Contributor, Author):

Other code is fine, we should not block it; the problem I saw is that the run task can respond to `Request::GetNamespace` before the restore task finishes. So instead of restoring the driver state, we can initialize it again and then try to restore it.

This change prevents the run task from starting early (i.e. the spawned task should not reach the `worker.run()` line before everything is restored).

Adding `join_all` made some visible changes in the logging when tested in the lab; it now looks more serialized, although we don't have scaled test results yet due to recent circumstances.

Contributor:

Are you sure that's what you're seeing? Is it possible that restore is hitting an error and returning early? Currently you're ignoring that error; should you instead be unwrapping it or something?

smalis-msft (Contributor) · May 16, 2025:

The reason there's no `join` that takes a single future is that that's what `await` does. If you want to run multiple futures concurrently, you need something like `join_all`.
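
In other words, a generic sketch using the `futures` crate (not the project's exact code):

```rust
use futures::future::join_all;
use std::future::Future;

// One future: plain `.await` already is the "join" of one.
async fn one<F: Future<Output = ()>>(fut: F) {
    fut.await;
}

// Several futures: join_all drives them concurrently to completion.
async fn many<F: Future<Output = ()>>(futs: Vec<F>) {
    join_all(futs).await;
}
```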

yupavlen-ms (Contributor, Author):

Agreed, if there was an error in restore, that could be another explanation. Let me add better error handling then.
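
A hypothetical shape for that follow-up (exact logging and handling are up to the author; `tracing::error!` is an assumption about the project's logging): surface the restore error instead of discarding it with `let _ =`.

```rust
// Hypothetical: log (or propagate) a restore failure rather than
// silently dropping it, so an early error return becomes visible.
if let Some(state) = saved_state.as_ref() {
    if let Err(err) = NvmeManager::restore(&mut worker, state).await {
        tracing::error!(error = ?err, "nvme manager restore failed");
    }
}
```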
