-
Notifications
You must be signed in to change notification settings - Fork 5
Dual test for NumPy and CuPy in tests #165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This looks like our usual problem with CuPy+MPI... the infinity norms use pylops-mpi/pylops_mpi/DistributedArray.py Line 698 in b5a4c54
recv_buf from the _allreduce_subcomm calls, but this leads to deadlock also for Numpy+MPI (which is probably the reason the code was written like this in first place)... the issue with using recv_buf is that the _allreduce_subcomm method uses self.sub_comm.Allreduce that we know does not play well with CuPy arrays for now as we are not doing any syncronization.
Found issue in PyLops - fixed here PyLops/pylops#689. For now (until the next PyLops release) we can easily fix the test by using |
Also I think for all tests we need to add some logic like
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tharittk good job!
I think this is nearly ready to go. I would maybe add some targets in the Makefile like we have in pylops https://github.com/PyLops/pylops/blob/a94ea8eae3b9c06bf39637b2e29f6a45a0e7766f/Makefile#L54 and in the contributing part of the documentation (see again what we have in pylops and maybe also add something about the NCCL tests and examples which I just realized is missing)
@hongyx11 I think this is pretty much ready and a great addition to our test suite as we move forwards trying to change MPI methods from objects to buffers… do you think we can put this into a self-hosted runner like we did for Pylops… I think a single node with even just 2 GPUs would be enough as if will guarantee that we can do some checks on any change we make in the communication bits of our library 😀 |
it's doable, let me give it a try, we need to use srun |
Some bugs were found on