Moar threadsafe moar better #101
I tracked the test failures down to JuliaLang/julia#53326. I think this should be fixed in Base, so I made a PR here: JuliaLang/julia#54571
Force-pushed from 9a8469d to ec8bce0.
Codecov Report
Coverage diff (master vs. #101):
Coverage: 79.18% to 79.30% (+0.11%)
Files: 10 to 10
Lines: 1898 to 1918 (+20)
Hits: 1503 to 1521 (+18)
Misses: 395 to 397 (+2)
Alrighty, after a force push to trigger CI with the latest nightly we are back in the green 🥳 I think this is ready to be merged now.
I can confirm that the original code snippet from #73 (comment) is working now. 🚀
Will this be compatible with the next Julia LTS, or backported to it if necessary?
src/cluster.jl
Outdated
  @async manage(w.manager, w.id, w.config, :register)
  # wait for rr_ntfy_join with timeout
  timedout = false
- @async (sleep($timeout); timedout = true; put!(rr_ntfy_join, 1))
+ @async begin
+     sleep($timeout)
+     timedout = true
+     put!(rr_ntfy_join, 1)
+ end
Do these tasks need an `errormonitor`?
Yeah I think that makes sense, added in 03d7384.
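For readers following the thread, here is a minimal sketch of what wrapping such a timeout task in `errormonitor` can look like. The function name `wait_with_timeout` and the use of a `Ref` for the flag are assumptions made for illustration; this is not the code from commit 03d7384.

```julia
# Hypothetical helper, loosely modelled on the diff above; not the actual
# Distributed.jl implementation.
function wait_with_timeout(timeout)
    rr_ntfy_join = Channel{Int}(1)   # stand-in for the real notification channel
    timedout = Ref(false)

    # errormonitor (Julia 1.7+) logs an error if the task fails, so an
    # exception inside the @async block does not go unnoticed.
    t = errormonitor(@async begin
        sleep(timeout)
        timedout[] = true
        put!(rr_ntfy_join, 1)
    end)

    take!(rr_ntfy_join)   # blocks until the timeout task fires
    wait(t)
    return timedout[]
end

did_time_out = wait_with_timeout(0.05)
```

Without `errormonitor`, a failure in the unobserved task would only surface if something later called `wait` or `fetch` on it.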
test/threads.jl
Outdated
ws = ts = product(1:2, 1:2)
@testset "from worker $w1 to $w2 via 1" for (w1, w2) in ws
    @testset "from thread $w1.$t1 to $w2.$t2" for (t1, t2) in ts
        # We want (the default) lazyness, so that we wait for `Worker.c_state`!
- # We want (the default) lazyness, so that we wait for `Worker.c_state`!
+ # We want (the default) laziness, so that we wait for `Worker.c_state`!
Fixed in f4576aa.
test/threads.jl
Outdated
end

# Wait on the spawned tasks on the owner
@sync begin
IMO, this sync point should fail fast, if necessary:
- @sync begin
+ Base.Experimental.@sync begin
Sure, that makes sense. I refactored the code to use `timedwait()` in f4576aa.
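As a rough illustration of the `timedwait()` approach mentioned here, the sketch below polls a task for completion and gives up after a timeout instead of blocking indefinitely. The task bodies and variable names are made up for the example and are not taken from `test/threads.jl`.

```julia
# A task that finishes well within the timeout.
quick = Threads.@spawn begin
    sleep(0.2)
    42
end
# timedwait polls the callback until it returns true or the timeout (seconds) expires.
quick_status = timedwait(() -> istaskdone(quick), 5)

# A task that would block forever: nothing is ever put on this channel.
never_filled = Channel{Int}()
stuck = Threads.@spawn take!(never_filled)
stuck_status = timedwait(() -> istaskdone(stuck), 0.5)
```

Here `timedwait` returns `:ok` once the callback returns true and `:timed_out` otherwise, so a test can assert on the status rather than hang the way a plain `@sync` would if a task never completes.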
I'll leave that for someone more qualified to properly answer, but FWIW if 1.11 is chosen as the next LTS then it'll be possible to upgrade Distributed now that it's an excised stdlib 🐙
One other thing I noticed is that this should probably be using …
Are there any updates on this?
I believe @JBlaschke was going to have a look at it. In the meantime I see the branch is out of date, so I'll rebase it.
Force-pushed from e3205d8 to fa9d645.
@JamesWrigley @jonas-schulze I'll be working on this this week. Just getting back up to speed after long travel...
Are there any updates on this? @JamesWrigley @JBlaschke
Not from me, still need someone to review it. I believe the hesitation to merge comes from Distributed being used to run the Julia tests, so it's quite critical that this works properly. But in the meantime you can …
Force-pushed from fa9d645 to b140754.
c_state::Condition # wait for state changes
ct_time::Float64 # creation time
conn_func::Any # used to setup connections lazily
@atomic state::WorkerState
If `state` is always read and written from inside a lock, it doesn't need to be atomic, as the lock should provide the correct barriers.
I don't think that's guaranteed? From a cursory grep through `cluster.jl` I see plenty of reads outside of a lock.
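To make the point about unlocked reads concrete, here is a small sketch of an `@atomic` field (Julia 1.7+) being written from one task and read from another without any lock. The `DemoState` enum and `DemoWorker` struct are hypothetical stand-ins, not the real `WorkerState` and `Worker` types from `cluster.jl`.

```julia
# Hypothetical stand-ins for WorkerState / Worker.
@enum DemoState CREATED CONNECTED

mutable struct DemoWorker
    id::Int
    @atomic state::DemoState   # every access must go through @atomic
end

w = DemoWorker(1, CREATED)

# Write from another task, with no lock held.
writer = Threads.@spawn begin
    @atomic w.state = CONNECTED
end
wait(writer)

# Read without a lock; the atomic load is what makes this safe.
observed = @atomic w.state
```

If every access were guaranteed to happen under the same lock, a plain field would be enough; the atomic annotation matters precisely because some reads happen outside one.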
Since we are making things threadsafe I would look at all …
Ok, I replaced all uses of …
After thinking about it for a bit, I can't come up with a decent replacement short of basically reimplementing … I'd suggest keeping it in for now, we can add support for unwrapping exceptions to …
Force-pushed from 7cac33f to 01c33e9.
test/threads.jl
Outdated
# Wait on the spawned tasks on the owner. Note that we use
# timedwait() instead of @sync to avoid deadlocks.
t1 = Threads.@spawn fetch_from_owner(wait, recv)
t2 = Threads.@spawn fetch_from_owner(wait, send)
@test timedwait(() -> istaskdone(t1), 5) == :ok
@test timedwait(() -> istaskdone(t2), 5) == :ok
Sorry to chime in so late after #101 (comment); I noticed because GitHub unfolded all my previous comments.
I like the `timedwait`, which is what I used in JuliaLang/julia#37905. However, the `timedwait` was (I think) the main reason why my first PR was reverted (JuliaLang/julia#38112). The second attempt (https://github.com/JuliaLang/julia/pull/38134/files) didn't use `timedwait`. I remain in favor of `timedwait` but wanted to refresh the information, as it has been a while.
I see, thanks. My view is that if this fails we're going to end up with a timeout somewhere no matter what, either in CI or the tests themselves. And my preference would be to have the timeout in the tests so we can have some control over it. I bumped it to 60s in abafb79 but I'm happy to increase that if people think it's too low. @vchuravy, @vtjnash, does that sound ok?
Wee progress update for those following this, after some discussion with @vchuravy on Slack I decided to:
Which I'll get to at... some point.
This is a rebased version of #4, it should be ready to merge. Fixes #73.
(made after discussing with @jpsamaroo)
CC @vchuravy, @vtjnash