Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Such threadsafe, much wow #4

Merged
merged 9 commits into from
Nov 14, 2024
Merged

Conversation

JamesWrigley
Copy link
Collaborator

Relanded from: JuliaLang/Distributed.jl#101

As discussed offline this skips the last couple of commits to replace @async with Threads.@spawn, though I did keep f98c4ca since I figured it couldn't hurt.

@JamesWrigley JamesWrigley self-assigned this Oct 29, 2024
@codecov-commenter
Copy link

codecov-commenter commented Oct 29, 2024

Codecov Report

Attention: Patch coverage is 92.72727% with 4 lines in your changes missing coverage. Please review.

Project coverage is 86.63%. Comparing base (76df474) to head (c1a3be8).
Report is 38 commits behind head on master.

Files with missing lines Patch % Lines
src/cluster.jl 92.30% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master       #4      +/-   ##
==========================================
+ Coverage   86.20%   86.63%   +0.42%     
==========================================
  Files          10       10              
  Lines        1965     1983      +18     
==========================================
+ Hits         1694     1718      +24     
+ Misses        271      265       -6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jpsamaroo jpsamaroo mentioned this pull request Oct 29, 2024
8 tasks
src/cluster.jl Outdated
Comment on lines 663 to 667
if jw.state === W_CREATED
lock(jw.c_state) do
wait(jw.c_state)
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems potentially incorrect; shouldn't the equality comparison be within the lock block?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes I think you're right, especially since the docstring for c_state says:

    c_state::Threads.Condition # wait for state changes, lock for state

Fixed in 9283e6f. But I see that Worker.state is an atomic, should we also be doing atomic reads of that variable everywhere else too? AFAICT most reads are currently non-atomic.

Copy link
Member

@jpsamaroo jpsamaroo Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably switch to acquire-release, or keep everything seq-cst for now (reads without @atomic are :monotonic, so we should probably do reads as @atomic :sequentially_consistent jw.state).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acquire-release is probably more than strong enough for this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alrighty, fixed in 8e35ba4. I stuck to seq-cst just because that's the default for @atomic.

@@ -0,0 +1,64 @@
using Test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to also enable JULIA_NUM_THREADS in CI.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have to because this test sets up its own workers with multiple threads in the exeflags variable. But I also don't have any objection to starting every worker in the tests with multiple threads anyway, maybe it'd be easier to shake out bugs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have to because this test sets up its own workers with multiple threads in the exeflags variable.

Ahh yes, I missed that!

But I also don't have any objection to starting every worker in the tests with multiple threads anyway, maybe it'd be easier to shake out bugs?

It's up to you - I think it could be valuable to try, but maybe we wait for the @async -> Threads.@spawn change, to be able to see what happens when tasks get assigned to other threads.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we might as well go for it now, added in 8e35ba4.

@JamesWrigley
Copy link
Collaborator Author

JamesWrigley commented Oct 31, 2024

BTW please don't merge this with the fixup commits, once it's ready to merge I'll rebase and squash them.

@JamesWrigley
Copy link
Collaborator Author

Definitely some fishy failures in the latest CI run.

  • 1.11 on OSX outright failed during what looks like the SSH tests. Not sure whether that's a bug in LibSSH.jl or DistributedNext.
  • Even though all the others passed, on some there's messages like this:
    ┌ Warning: Failed to gracefully kill worker 3, sending SIGQUIT
    └ @ DistributedNext ~/work/DistributedNext.jl/DistributedNext.jl/src/managers.jl:770
  • And I didn't notice before, but there's some tests that are explicitly disabled when Julia is run with multiple threads.

@JamesWrigley
Copy link
Collaborator Author

Can confirm that LibSSH is pretty thread-unsafe, so I put the SSH tests into their own file to be executed with a single thread in a2e7b7a. And I enabled all the other thread-unsafe tests in 02fe4d5.

@JamesWrigley
Copy link
Collaborator Author

Now all the SSH tests pass, but:

  • We still see the Failed to gracefully kill worker messages on nightly.
  • Windows x86 had a worker failure (which somehow didn't seem to affect the tests):
         Testing Running tests...
          From worker 4:	Unhandled Task ERROR: EOFError: read end of file
          From worker 4:	Stacktrace:
          From worker 4:	 [1] wait
          From worker 4:	   @ .\asyncevent.jl:162 [inlined]
          From worker 4:	 [2] sleep
          From worker 4:	   @ .\asyncevent.jl:272 [inlined]
          From worker 4:	 [3] (::DistributedNext.var"#33#36"{DistributedNext.Worker, Float64})()
          From worker 4:	   @ DistributedNext D:\a\DistributedNext.jl\DistributedNext.jl\src\cluster.jl:198
    https://github.com/JuliaParallel/DistributedNext.jl/actions/runs/11630388640/job/32389220599?pr=4#step:6:120

I'm going to refactor the tests into @testsets so it's easier to see exactly where these are coming from.

@JamesWrigley
Copy link
Collaborator Author

Put all the tests in @testset's in 8d932fe. That uncovered a couple warnings, one from the topology tests where it was repeatedly overwriting a function with @everywhere, and another in Julia's vendored OffsetArrays code which should be fixed by JuliaLang/julia#56414.

@JamesWrigley JamesWrigley force-pushed the jps/threadsafe_workerstate branch 3 times, most recently from b5dcd9d to 68482e6 Compare November 2, 2024 20:57
@JamesWrigley JamesWrigley force-pushed the jps/threadsafe_workerstate branch from 2ea7f38 to 880eebe Compare November 14, 2024 20:26
vchuravy and others added 8 commits November 14, 2024 21:37
Necessary because LibSSH is not thread-safe.
This makes it much easier to see where errors/warnings are coming from. The
tests have been preserved in the exact order they were written, with no changes
other than the necessary ones to put them in `@testset`'s (e.g. creating modules
in global scope).
@JamesWrigley
Copy link
Collaborator Author

No luck with the rmprocs() timeout on nightly yet (#6), but as discussed offline I'll merge this now and revisit it if it's still a problem on 1.12. I think everything else is in a decent state.

@JamesWrigley JamesWrigley merged commit b9a8000 into master Nov 14, 2024
11 checks passed
@JamesWrigley JamesWrigley deleted the jps/threadsafe_workerstate branch November 14, 2024 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants