Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Race condition in next_when_notified #280

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ollie-etl
Copy link
Contributor

I believe the following fixes a race condition in FixedBufPool. We were observing a situation where, without resorting to timeouts to force retries, multiple concurrent calls to next would eventually all stall.

The fix (setting enable) is take directly from https://docs.rs/tokio/latest/tokio/sync/futures/struct.Notified.html#method.enable

The call to enable is important because otherwise if you have two calls to recv and two calls to send in parallel, the following could happen:

  1. Both calls to try_recv return None.
  2. Both new elements are added to the vector.
  3. The notify_one method is called twice, adding only a single permit to the Notify.
  4. Both calls to recv reach the Notified future. One of them consumes the permit, and the other sleeps forever.

By adding the Notified futures to the list by calling enable before try_recv, the notify_one calls in step three would remove the futures from the list and mark them notified instead of adding a permit to the Notify. This ensures that both futures are woken.

I believe this is happening in the presence of only concurrency, not parallelism (single threaded)

@ollie-etl
Copy link
Contributor Author

@mzabaluev you're probably best placed for an opinion

Copy link
Contributor

@mzabaluev mzabaluev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! My original code overlooked this possibility.

I have suggestions to improve the comments, otherwise the fix is good.

Comment on lines 276 to +279
// In the single-threaded case, no buffers could get checked in
// between us calling `try_next` and here, so we can't miss a wakeup.
notified.as_mut().await;
// between us calling `try_next` and here. However, we may still miss a wake-up,
// as multiple check-ins can occur before any waking tasks are scheduled,
// which would result in the loss of a permit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is right, thanks for the clarification. In fact, the preamble of this comment can be completely replaced because it's irrelevant: multiple buffers can still be checked in while awaiting on notified before another iteration of the loop.

@@ -284,6 +286,9 @@ impl<T: IoBufMut> FixedBufPool<T> {
return buf;
}

// Await notify_one
notified.as_mut().await;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment below clarifies why the whole branch below the if ... { ... return buf; } can be taken, so part of it should be placed above this.

OneOfOne added a commit to ooo-rs/tokio-uring-ooo that referenced this pull request Apr 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants