Scalable Linear System Solver on Multiple GPUs #1159
Comments
Hi @wang-xianghao, sorry for the delay in getting back to you. At the moment you would need to build cuPyNumeric from source using Once cuSolverMp packages are uploaded to |
cuSolverMp packages have become available on conda-forge, and CAL support is available in the Legate nightly builds since 25.01.00.dev71 (and will be officially included in the 25.01 release packages). For now you can use a nightly build of Legate to try out a cuSolverMp-enabled cupynumeric build. Here are the instructions that worked for me:
|
Thanks for your solution. I have successfully built cupynumeric with cuSolverMp by following the instructions you provided. However, when I tested the following code with 2 A100 GPUs, an error occurred. Many thanks.
Code
import cupynumeric as np
from legate.timing import time
size = 10000
A = np.random.randn(size, size)
x = np.random.randn(size, 1)
b = A @ x
tstart = time()
x_solved = np.linalg.solve(A, b)
tend = time()
telapsed = (tend - tstart) / 1e6
print(f'elapsed time: {telapsed} s')
Run
legate --profile --cpus 16 \
--gpus 2 --sysmem 256000 \
--fbmem 30000 \
--eager-alloc-percentage 10 \
solve_test.py
Error
|
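For anyone reproducing the script above: it times `np.linalg.solve` but never checks the answer, so once the run succeeds a residual check helps confirm that the multi-GPU path returns a correct solution. Below is a minimal sketch written against plain NumPy so it runs anywhere. It assumes that swapping the import for `import cupynumeric as np` works unchanged on a cuSolverMp-enabled build, which I have not verified. The helper `relative_residual` and the small `size` are illustrative and not part of the original script.

```python
# Assumption: cupynumeric mirrors these NumPy calls, so the import below can be
# replaced with `import cupynumeric as np` on a cuSolverMp-enabled build.
import numpy as np


def relative_residual(A, x, b):
    """Return ||A @ x - b|| / ||b||, a scale-free measure of solve accuracy."""
    return float(np.linalg.norm(A @ x - b) / np.linalg.norm(b))


# Small illustrative problem; the original script uses size = 10000.
size = 200
np.random.seed(0)  # fixed seed so repeated runs are comparable
A = np.random.randn(size, size)
x = np.random.randn(size, 1)
b = A @ x

x_solved = np.linalg.solve(A, b)
residual = relative_residual(A, x_solved, b)
print(f'relative residual: {residual:.3e}')
```

A residual near double-precision round-off suggests the solve is correct. A large value would point at the solver or its setup rather than at the timing code.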
I was wrong in my version recommendation above. It appears that the 25.01.00.dev71 package didn't actually have CAL support. I was able to get past the issue above by using version 25.01.00.dev98:
I am still not quite able to run your code. On a single-process run:
I see errors at CAL initialization:
and a multi-process run:
gets stuck at MPI allgather happening during CAL initialization:
You may or may not face the same issues on your machine. I suspect it has something to do with the OpenMPI on conda-forge. @mfoerste4 can you reproduce? Any idea what the issue could be here? |
@manopapad Hello, is there any news about the release with cusolvermp support? |
@wang-xianghao I am working with our packaging team to get a cusolvermp package out that does not have a CUDA >= 12.6 requirement. I filed an issue to track this: conda-forge/libcusolvermp-feedstock#2. I am also in internal communication with the team working on it. Hopefully we can get that released soon(ish). |
Hello. I'm building software that needs a scalable linear equation solver on a cluster with multiple GPUs. The documentation for linalg.solve (https://docs.nvidia.com/cupynumeric/latest/api/generated/cupynumeric.linalg.solve.html) says multi-GPU usage is only available when compiled with cuSolverMp. Are there instructions for building cuNumeric with cuSolverMp? Thanks.