Scalable Linear System Solver on Multiple GPUs #1159

Open
wang-xianghao opened this issue Nov 23, 2024 · 6 comments

Comments

@wang-xianghao

Hello. I'm building software that needs a scalable linear equation solver on a cluster with multiple GPUs. The documentation for linalg.solve (https://docs.nvidia.com/cupynumeric/latest/api/generated/cupynumeric.linalg.solve.html) says that multi-GPU usage is only available when compiled with cuSolverMp. May I ask whether there are instructions for building cuPyNumeric with cuSolverMp? Thanks.

@manopapad
Contributor

Hi @wang-xianghao, sorry for the delay in getting back to you.

At the moment you would need to build cuPyNumeric from source using --with-cusolvermp. You would need an installation of cuSolverMp from https://developer.nvidia.com/cusolvermp-downloads. Before you can do this build, you will also need a build of Legate that includes support for the CAL communicator library. We have ongoing work to produce these packages, and we expect nightly builds to start shipping with this support within the next week.

Once cuSolverMp packages are uploaded to conda-forge, we will start building cuPyNumeric packages with this support enabled by default, so the pre-built packages you get directly from us would have multi-GPU np.linalg.solve support enabled by default.
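At that point the install should reduce to something like the sketch below (the channels and the two-GPU invocation are just an illustration of the intended end state, not a final command line):

# hypothetical end state: pre-built packages with cuSolverMp support enabled by default
conda create -y -n multigpu -c conda-forge -c legate cupynumeric
conda activate multigpu
# np.linalg.solve then dispatches to cuSolverMp when the script is run on multiple GPUs
legate --gpus 2 your_script.py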

@manopapad
Contributor

cuSolverMp packages have become available on conda-forge, and CAL support is available in the legate nightly builds since 25.01.00.dev71 (and will be officially included in the 25.01 release packages). For now you can use a nightly build of Legate to try out a cuSolverMp-enabled cupynumeric build. Here are instructions that worked for me:

# asking explicitly for numpy 1.26 because cupynumeric requires numpy<2
# if left unspecified, conda would pull numpy 2+, then the cupynumeric installation process would have to downgrade it
# opt_einsum is missing from cupynumeric's setup.py (this is a bug); for now, pull it in manually
conda create -y -n test16 -c legate/label/experimental -c conda-forge legate=25.01.00.dev71 numpy=1.26 opt_einsum cusolvermp
conda activate test16
git clone https://github.com/nv-legate/cupynumeric.git
cd cupynumeric
# have to do this because cupynumeric source from 24.11 looks for legate 24.11, not 25.01
sed -i 's/24.11.01/25.01.0/' ./cmake/thirdparty/get_legate.cmake
./install.py --verbose --with-cusolvermp $CONDA_PREFIX
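As a quick smoke test of the resulting build (just a sketch; the matrix size and memory settings are arbitrary), a small solve run across two GPUs should now go through the cuSolverMp code path:

cat > check_solve.py << 'EOF'
import cupynumeric as np
# build a small well-posed system and solve it; the residual should be near zero
A = np.random.randn(2000, 2000)
b = A @ np.random.randn(2000, 1)
x = np.linalg.solve(A, b)
print("max residual:", float(np.max(np.abs(A @ x - b))))
EOF
legate --gpus 2 --fbmem 20000 check_solve.py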

@wang-xianghao
Author

Thanks for your solution. I have successfully built cupynumeric with cusolvermp following the instructions you provided. However, when I tested the following code on 2 A100 GPUs, the error below occurred. Many thanks.

Code

import cupynumeric as np
from legate.timing import time

size = 10000
A = np.random.randn(size, size)
x = np.random.randn(size, 1)
b = A @ x

tstart = time()
x_solved = np.linalg.solve(A, b)
tend = time()

telapsed = (tend - tstart) / 1e6  # legate.timing.time() reports microseconds
print(f'elapsed time: {telapsed} s')

Run

legate --profile --cpus 16 \
    --gpus 2 --sysmem 256000 \
    --fbmem 30000 \
    --eager-alloc-percentage 10 \
    solve_test.py

Error

[0 - 7fec0d5f5740]    0.000000 {4}{openmp}: not enough cores in NUMA domain 0 (32 < 54)
[0 - 7fec0d5f5740]    0.000000 {4}{openmp}: not enough cores in NUMA domain 1 (32 < 54)
[0 - 7fec0d5f5740]    0.000000 {4}{openmp}: not enough cores in NUMA domain 2 (32 < 54)
[0 - 7fec0d5f5740]    0.000000 {4}{openmp}: not enough cores in NUMA domain 3 (32 < 54)
[0 - 7fec0d5f5740]    0.000000 {4}{openmp}: not enough cores in NUMA domain 4 (32 < 54)
[0 - 7fec0d5f5740]    0.000000 {4}{openmp}: not enough cores in NUMA domain 5 (32 < 54)
[0 - 7fec0d5f5740]    0.000000 {4}{openmp}: not enough cores in NUMA domain 6 (32 < 54)
[0 - 7fec0d5f5740]    0.000000 {4}{openmp}: not enough cores in NUMA domain 7 (31 < 54)
[0 - 7fec0d5f5740]    0.000269 {5}{numa}: can't read '/sys/devices/system/node/node-1/distance': No such file or directory
[0 - 7fec0d5f5740]    0.005616 {4}{threads}: reservation ('GPU proc 1d00000000000014') cannot be satisfied
Traceback (most recent call last):
  File "/home/jovyan/workspace/demos/solve_test.py", line 10, in <module>
    x_solved = np.linalg.solve(A, b)
               ^^^^^^^^^^^^^^^^^^^^^
  File "runtime.pyx", line 1204, in legate.core._lib.runtime.runtime.track_provenance.decorator.wrapper
  File "runtime.pyx", line 1205, in legate.core._lib.runtime.runtime.track_provenance.decorator.wrapper
  File "/opt/saturncloud/envs/test0/lib/python3.12/site-packages/cupynumeric/_utils/coverage.py", line 105, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/saturncloud/envs/test0/lib/python3.12/site-packages/cupynumeric/_array/util.py", line 110, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/saturncloud/envs/test0/lib/python3.12/site-packages/cupynumeric/linalg/linalg.py", line 217, in solve
    return _thunk_solve(a, b, out)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/saturncloud/envs/test0/lib/python3.12/site-packages/cupynumeric/linalg/linalg.py", line 805, in _thunk_solve
    out._thunk.solve(a._thunk, b._thunk)
  File "/opt/saturncloud/envs/test0/lib/python3.12/site-packages/cupynumeric/_thunk/deferred.py", line 154, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/saturncloud/envs/test0/lib/python3.12/site-packages/cupynumeric/_thunk/deferred.py", line 3470, in solve
    solve_deferred(self, a, b)
  File "/opt/saturncloud/envs/test0/lib/python3.12/site-packages/cupynumeric/linalg/_solve.py", line 92, in solve_deferred
    mp_solve(
  File "/opt/saturncloud/envs/test0/lib/python3.12/site-packages/cupynumeric/linalg/_solve.py", line 74, in mp_solve
    task.add_cal_communicator()
  File "task.pyx", line 486, in legate.core._lib.operation.task.AutoTask.add_cal_communicator
  File "task.pyx", line 490, in legate.core._lib.operation.task.AutoTask.add_cal_communicator
  File "task.pyx", line 385, in legate.core._lib.operation.task.AutoTask.add_communicator
RuntimeError: LEGATE ERROR: ================================================================================
LEGATE ERROR: std::runtime_error: No factory available for communicator 'cal'
LEGATE ERROR: System: Linux, 5.15.0-210.163.7.el8uek.x86_64, w-xiang-test4-fa17650841054e0b9db7918bd8f75673-9bff84c89-j29nj, #2 SMP Tue Sep 10 18:31:09 PDT 2024, x86_64
LEGATE ERROR: Legate version: 25.1.0 (a802055a3b91003bc0f284e9eed01b8c7b8ddf79)
LEGATE ERROR: Legion version: 25.1.0 (cc6a40a3177d03762d6006549f020348e656724e)
LEGATE ERROR: Configure options: --LEGATE_ARCH=arch-conda '--CUDAFLAGS=-isystem /opt/saturncloud/envs/test0/include -L/opt/saturncloud/envs/test0/lib' --with-python --with-cc=/tmp/conda-croot/legate/_build_env/bin/x86_64-conda-linux-gnu-cc --with-cxx=/tmp/conda-croot/legate/_build_env/bin/x86_64-conda-linux-gnu-c++ --build-march=x86-64 --with-openmp --with-tests --with-benchmarks --with-cuda --with-cal --build-type=release --with-ucx
LEGATE ERROR: Stack trace (most recent call first, top-most exception only):
LEGATE ERROR: #0  0x00007fe7c8c74bdd at /opt/saturncloud/envs/test0/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../../liblegate.so.25.01.00
LEGATE ERROR: #1  0x00007fe7c8c2c003 at /opt/saturncloud/envs/test0/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../../liblegate.so.25.01.00
LEGATE ERROR: #2  0x00007fe7c8d3d797 at /opt/saturncloud/envs/test0/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../../liblegate.so.25.01.00
LEGATE ERROR: #3  0x00007fe7b23192bb at /opt/saturncloud/envs/test0/lib/python3.12/site-packages/legate/core/_lib/operation/task.cpython-312-x86_64-linux-gnu.so
LEGATE ERROR: #4  0x00007fe7b231828f at /opt/saturncloud/envs/test0/lib/python3.12/site-packages/legate/core/_lib/operation/task.cpython-312-x86_64-linux-gnu.so
LEGATE ERROR: #5  0x00007fe7b231853a at /opt/saturncloud/envs/test0/lib/python3.12/site-packages/legate/core/_lib/operation/task.cpython-312-x86_64-linux-gnu.so
LEGATE ERROR: #6  (inlined)          in _PyObject_VectorcallTstate at /usr/local/src/conda/python-3.12.8/Include/internal/pycore_call.h:92:11
LEGATE ERROR: #7  0x0000564beabff1ad in PyObject_Vectorcall at /usr/local/src/conda/python-3.12.8/Objects/call.c:325:12
LEGATE ERROR: #8  0x0000564beaaf86a0 in _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1733407224341/work/build-static/Python/bytecodes.c:2715:19
LEGATE ERROR: #9  0x00007fe7b225a74d at /opt/saturncloud/envs/test0/lib/python3.12/site-packages/legate/core/_lib/runtime/runtime.cpython-312-x86_64-linux-gnu.so
LEGATE ERROR: #10 0x00007fe7b2269228 at /opt/saturncloud/envs/test0/lib/python3.12/site-packages/legate/core/_lib/runtime/runtime.cpython-312-x86_64-linux-gnu.so
LEGATE ERROR: #11 0x0000564beabea75a in _PyObject_MakeTpCall at /usr/local/src/conda/python-3.12.8/Objects/call.c:240:18
LEGATE ERROR: #12 0x0000564beaaf86a0 in _PyEval_EvalFrameDefault at /home/conda/feedstock_root/build_artifacts/python-split_1733407224341/work/build-static/Python/bytecodes.c:2715:19
LEGATE ERROR: #13 0x0000564beaca0740 in PyEval_EvalCode at /usr/local/src/conda/python-3.12.8/Python/ceval.c:578:21
LEGATE ERROR: #14 0x0000564beacc4f19 in run_eval_code_obj at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:1722:0
LEGATE ERROR: #15 0x0000564beacbfd34 in run_mod at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:1743:0
LEGATE ERROR: #16 0x0000564beacd877f in pyrun_file at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:1643:0
LEGATE ERROR: #17 0x0000564beacd7dfd in _PyRun_SimpleFileObject at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:433:0
LEGATE ERROR: #18 0x0000564beacd7ac3 in _PyRun_AnyFileObject at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:78:0
LEGATE ERROR: #19 (inlined)          in pymain_run_file_obj at /usr/local/src/conda/python-3.12.8/Modules/main.c:360:0
LEGATE ERROR: #20 (inlined)          in pymain_run_file at /usr/local/src/conda/python-3.12.8/Modules/main.c:379:0
LEGATE ERROR: #21 (inlined)          in pymain_run_python at /usr/local/src/conda/python-3.12.8/Modules/main.c:633:0
LEGATE ERROR: #22 0x0000564beacd0dfd in Py_RunMain at /usr/local/src/conda/python-3.12.8/Modules/main.c:713:0
LEGATE ERROR: #23 0x0000564beac8b0c6 in Py_BytesMain at /usr/local/src/conda/python-3.12.8/Modules/main.c:767:12
LEGATE ERROR: #24 0x00007fec0d623d8f at /lib/x86_64-linux-gnu/libc.so.6
LEGATE ERROR: #25 0x00007fec0d623e3f at /lib/x86_64-linux-gnu/libc.so.6
LEGATE ERROR: #26 0x0000564beac8af70 at /opt/saturncloud/envs/test0/bin/python3.12
LEGATE ERROR: ================================================================================

@manopapad
Contributor

I was wrong in my version recommendation above. It appears that the 25.01.00.dev71 package didn't actually have CAL support. I was able to get past the issue above by using version 25.01.00.dev98:

# note: added cutensor
# I actually had to use mamba, because conda threw some internal error, but that's probably just my version of conda being too old
conda create -y -n test19 -c legate/label/experimental -c conda-forge legate=25.01.00.dev98 numpy=1.26 opt_einsum cusolvermp cutensor
conda activate test19
git clone https://github.com/nv-legate/cupynumeric.git
cd cupynumeric
sed -i 's/24.11.01/25.01.0/' ./cmake/thirdparty/get_legate.cmake
./install.py --with-cusolvermp $CONDA_PREFIX

I am still not quite able to run your code. On a single-process run:

(test19) iblis:~/temp/cupynumeric> LEGATE_AUTO_CONFIG=0 legate --gpus 2 --fbmem 20000 a.py

I see errors at CAL initialization:

[1737685179.598877] [iblis:33297:0]         ucc_lib.c:163  UCC  ERROR lib_init failed: no CL libs were opened
Internal CAL failure with error 6 (Error in UCC call) in file /tmp/conda-croot/legate/work/src/cpp/legate/comm/detail/comm_cal.cc at line 128
[1737685300.758692] [iblis:35783:0]   tl_cuda_cache.c:231  UCC  ERROR ipc-cache: failed to open ipc mem handle. addr:0x7c2330c00000 len:16777216 err:201
Internal CUDA failure with error invalid device context (cudaErrorDeviceUninitialized) in file /home/mpapadakis/temp/cupynumeric/src/cupynumeric/utilities/repartition.cu at line 506
[1737685317.877588] [iblis:37283:1]  cuda_ipc_iface.c:93   UCX  ERROR cuDeviceGet(&cu_device, 0) failed: unrecognized error code 4
[1737685317.880880] [iblis:37283:1]         address.c:978  UCX  ERROR failed to unpack address, invalid bandwidth 0.00
[1737685317.881528] [iblis:37283:1] cuda_copy_iface.c:524  UCX  ERROR cuCtxGetCurrent(&cuda_context) failed: unrecognized error code 4
[1737685317.881552] [iblis:37283:1]  tl_ucp_context.c:208  TL_UCP ERROR failed to create ucp worker, Invalid parameter

and a multi-process run:

(test19) iblis:~/temp/cupynumeric> LEGATE_AUTO_CONFIG=0 legate --ranks-per-node 2 --launcher mpirun --gpus 1 --gpu-bind 0/1 --fbmem 20000 a.py

gets stuck at an MPI allgather that happens during CAL initialization:

(gdb) bt
#0  0x000072b48d333ff1 in opal_progress () from /home/mpapadakis/mambaforge/envs/test19/lib/././libopen-pal.so.40
#1  0x000072b48d33a57e in ompi_sync_wait_mt () from /home/mpapadakis/mambaforge/envs/test19/lib/././libopen-pal.so.40
#2  0x000072b48c8d384e in mca_pml_ob1_recv () from /home/mpapadakis/mambaforge/envs/test19/lib/openmpi/mca_pml_ob1.so
#3  0x000072b48dd5f505 in PMPI_Recv () from /home/mpapadakis/mambaforge/envs/test19/lib/./libmpi.so.40
#4  0x000072b488cc2419 in legate_mpi_recv () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../.././liblegate_mpi_wrapper.so
#5  0x000072b46d4ccae8 in legate::detail::comm::coll::MPINetwork::gather_(void const*, void*, int, legate::comm::coll::CollDataType, int, legate::comm::coll::Coll_Comm*) () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../../liblegate.so.25.01.00
#6  0x000072b46d4cd012 in legate::detail::comm::coll::MPINetwork::all_gather(void const*, void*, int, legate::comm::coll::CollDataType, legate::comm::coll::Coll_Comm*) () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../../liblegate.so.25.01.00
#7  0x000072b46d514dce in ?? () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../../liblegate.so.25.01.00
#8  0x000072b49361ffa1 in ucc_core_addr_exchange (context=context@entry=0x72a9af1f8a70, oob=oob@entry=0x72a9af1f8a88, addr_storage=addr_storage@entry=0x72a9af1f8b80) at core/ucc_context.c:519
#9  0x000072b493620aec in ucc_context_create_proc_info (lib=0x72a9af1f8270, params=<optimized out>, config=0x72a9b8013fb0, context=0x72a9af1f89f0, proc_info=0x72b49364b2c0 <ucc_local_proc>) at core/ucc_context.c:717
#10 0x000072b45ca8c4bc in ucc::context_wrapper::context_wrapper(cal_comm_create_params&) () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../.././libcal.so.0
#11 0x000072b45ca88b97 in cal_comm::cal_comm(cal_comm_create_params&) () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../.././libcal.so.0
#12 0x000072b45ca6115a in cal_comm_create () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../.././libcal.so.0
#13 0x000072b46d5154ba in void legate::detail::LegionTask<legate::detail::comm::cal::Init>::task_wrapper_<cal_comm*, &legate::detail::comm::cal::Init::gpu_variant, (legate::VariantCode)2>(void const*, unsigned long, void const*, unsigned long, Realm::Processor) () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../../liblegate.so.25.01.00
#14 0x000072b4583ee7f1 in ?? () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../.././librealm.so.1
#15 0x000072b4583ee866 in ?? () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../.././librealm.so.1
#16 0x000072b4583ecdca in ?? () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../.././librealm.so.1
#17 0x000072b4583f0be2 in ?? () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../.././librealm.so.1
#18 0x000072b4583f4787 in ?? () from /home/mpapadakis/mambaforge/envs/test19/lib/python3.12/site-packages/legate/core/_lib/mapping/../../../../../.././librealm.so.1
#19 0x000072b698894ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#20 0x000072b698926850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

You may or may not face the same issues on your machine. I suspect it has something to do with the OpenMPI build on conda-forge. @mfoerste4 can you reproduce? Any idea what the issue could be here?
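For reference, the relevant pieces of the communication stack in the environment can be listed like this (a diagnostic sketch, not a fix):

# which openmpi / ucc / ucx / cal / cusolvermp builds conda pulled in
conda list | grep -E 'openmpi|ucc|ucx|cal|cusolvermp'
# whether the Open MPI build was compiled with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value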

@wang-xianghao
Author

@manopapad Hello, is there any news about the release with cusolvermp support?

@marcinz
Collaborator

marcinz commented Feb 5, 2025

@wang-xianghao I am working with our packaging team to get a cusolvermp package out that does not have a CUDA >= 12.6 requirement. I filed an issue to track this: conda-forge/libcusolvermp-feedstock#2. I am also in internal communication with the team that is working on that. Hopefully we can get that released soon(ish).
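In the meantime, the CUDA pin on the currently published package can be inspected directly (a sketch; the package name follows the feedstock linked above):

# show the dependency pins of the published conda-forge package
conda search -c conda-forge libcusolvermp --info
# the header of nvidia-smi shows the maximum CUDA version your driver supports
nvidia-smi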
