Dynamic MPI library built with OpenACC flags results in a crash at the end of the simulation #674
Description
Describe the issue
When dynamic MPI support is enabled, we build the libcorenrnmpi_*.so library. If this library is built with OpenACC flags (e.g. -acc), the program crashes in an exit handler:
salloc --account=proj16 --partition=prod_p2 --time=08:00:00 --nodes=1 --constraint=v100 --gres=gpu:4 -n 40 --mem 0 --exclusive
module purge
module load unstable nvhpc/21.2 hpe-mpi cuda cmake
git clone --depth 1 [email protected]:neuronsimulator/nrn.git
git clone --depth 1 [email protected]:BlueBrain/CoreNeuron.git
cd CoreNeuron && mkdir BUILD && cd BUILD
cmake -DCORENRN_ENABLE_DYNAMIC_MPI=ON -DCMAKE_CXX_FLAGS="-acc" -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvcc ..
./bin/nrnivmodl-core ../../nrn/test/coreneuron/mod/
srun -n 1 ./x86_64/special-core --mpi -d ../coreneuron/tests/integration/ring
.....
....
Solver Time : 0.0748029
Simulation Statistics
Number of cells: 5
Number of compartments: 115
Number of presyns: 28
Number of input presyns: 0
Number of synapses: 15
Number of point processes: 38
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 9
Number of spikes with non negative gid-s: 9
CoreNEURON run
.....
...
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
Process ID: 33265, Host: ldir01u09.bbp.epfl.ch, Program: /gpfs/bbp.cscs.ch/home/kumbhar/tmp/x86_64/special.nrn
MPT Version: HPE HMPT 2.22 03/31/20 16:17:35
MPT: --------stack traceback-------
MPT: Attaching to program: /proc/33265/exe, process 33265
MPT: [New LWP 33310]
MPT: [New LWP 33309]
MPT: [New LWP 33283]
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (no debugging symbols found)...done.
....
MPT: done.
MPT: 0x00002aaaad9961d9 in waitpid () from /lib64/libpthread.so.0
MPT: Missing separate debuginfos, use: debuginfo-install bbp-nvidia-driver-470.57.02-2.x86_64 glibc-2.17-324.el7_9.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libibverbs-54mlnx1-1.54103.x86_64 libnl3-3.2.28-4.el7.x86_64 libselinux-2.5-15.el7.x86_64 nss-softokn-freebl-3.53.1-6.el7_9.x86_64 openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64
MPT: (gdb) #0 0x00002aaaad9961d9 in waitpid () from /lib64/libpthread.so.0
MPT: #1 0x00002aaab216a3e6 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7fffffff67d0 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 33265, Host: ldir01u09.bbp.epfl.ch, Program: /gpfs/bbp.cscs.ch/home/kumbhar/tmp/x86_64/special.nrn\n\tMPT Version: HPE HMPT 2.22 03/31/2"...) at sig.c:340
MPT: #3 0x00002aaab216a5d8 in first_arriver_handler (signo=signo@entry=11,
MPT: stack_trace_sem=stack_trace_sem@entry=0x2aaab33e0080) at sig.c:489
MPT: #4 0x00002aaab216a8b3 in slave_sig_handler (signo=11,
MPT: siginfo=<optimized out>, extra=<optimized out>) at sig.c:565
MPT: #5 <signal handler called>
MPT: #6 0x00002aaaabcc2cd2 in ?? ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #7 0x00002aaaabcc6614 in ?? ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #8 0x00002aaaabcb61bc in ?? ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #9 0x00002aaaabcb7cdb in ?? ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #10 0x00002aaaab984da7 in __pgi_uacc_cuda_unregister_fat_binary (
MPT: pgi_cuda_loc=0x2aaaaacb5a40 <__PGI_CUDA_LOC>) at ../../src/cuda_init.c:649
MPT: #11 0x00002aaaab984d46 in __pgi_uacc_cuda_unregister_fat_binaries ()
MPT: at ../../src/cuda_init.c:635
MPT: #12 0x00002aaaae553ce9 in __run_exit_handlers () from /lib64/libc.so.6
MPT: #13 0x00002aaaae553d37 in exit () from /lib64/libc.so.6
MPT: #14 0x00002aaaab15b264 in hoc_quit () at /root/nrn/src/oc/hoc.cpp:1177
MPT: #15 0x00002aaaab1425f4 in hoc_call () at /root/nrn/src/oc/code.cpp:1389
MPT: #16 0x00002aaab3f7747e in _INTERNAL_37__root_nrn_src_nrnpython_nrnpy_hoc_cpp_629d835d::fcall () at /root/nrn/src/nrnpython/nrnpy_hoc.cpp:692
MPT: #17 0x00002aaaab0ddf35 in OcJump::fpycall ()
MPT: at /root/nrn/src/nrniv/../ivoc/ocjump.cpp:222
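Reading the backtrace bottom-up: hoc_quit() calls exit() (frames #14 and #13), the libc exit handlers run (frame #12), and the segfault happens when the OpenACC runtime tries to unregister the fat binaries of the OpenACC-compiled code via libcudart (frames #6-#11). As a hedged sketch of the pattern involved (hypothetical names below, not the real CoreNEURON symbols):

// sketch.cpp -- illustration only
#include <dlfcn.h>
#include <cstdlib>

int main() {
    // dynamic MPI support: the MPI flavour is picked at runtime and the
    // matching libcorenrnmpi_<flavour>.so (built with -acc in this issue)
    // is loaded with dlopen()
    void* handle = dlopen("libcorenrnmpi_mpt.so", RTLD_NOW | RTLD_GLOBAL);
    (void) handle;
    // ... resolve the MPI wrappers with dlsym() and run the simulation ...

    // hoc_quit() ends up here; the OpenACC atexit handler registered for the
    // -acc compiled code then runs and crashes inside libcudart, matching
    // frames #6-#11 of the backtrace above
    std::exit(0);
}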
To Reproduce
See the instructions above
Expected behavior
With or without the -acc flag, the shared library should work fine.
System (please complete the following information)
- System/OS: BB5
- Compiler: NVHPC 21.2
- Version: master, with the -acc flag also added to the MPI library build
- Backend: GPU
Additional context
We should provide a small reproducer on the NVIDIA developer forum.
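Something along the following lines might be enough to post there (an untested sketch; the library name, the flags, and whether the driver also needs -acc are assumptions to verify): build a trivial shared library with -acc, dlopen() it from a small driver, and terminate via exit() as hoc_quit() does.

// --- acc_lib.cpp : trivial OpenACC library ------------------------------
// build: nvc++ -acc -fPIC -shared acc_lib.cpp -o libacc_repro.so
extern "C" void saxpy(int n, float a, float* x, float* y) {
#pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// --- main.cpp : loads the library at runtime, then exits ----------------
// build: nvc++ main.cpp -o repro -ldl   (optionally also with -acc)
#include <dlfcn.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    void* h = dlopen("./libacc_repro.so", RTLD_NOW | RTLD_GLOBAL);
    if (!h) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    using saxpy_t = void (*)(int, float, float*, float*);
    auto saxpy = reinterpret_cast<saxpy_t>(dlsym(h, "saxpy"));
    if (!saxpy) {
        std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
        return 1;
    }
    std::vector<float> x(100, 1.0f), y(100, 2.0f);
    saxpy(100, 3.0f, x.data(), y.data());
    std::printf("y[0] = %f\n", y[0]);
    // exit() instead of returning from main, mirroring hoc_quit(); the
    // question is whether the OpenACC atexit handler for the dlopen()ed
    // library segfaults here as in the backtrace above
    std::exit(0);
}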