NEURON integrated tests failing with CUDA Unified Memory enabled #594
Describe the issue
Some of the NEURON tests fail on GPU when CUDA Unified Memory is enabled in CoreNEURON.
More precisely, the failing tests are:
```
The following tests FAILED:
	 18 - coreneuron_modtests::direct_py (Failed)
	 19 - coreneuron_modtests::direct_hoc (Failed)
	 20 - coreneuron_modtests::spikes_py (Failed)
	 21 - coreneuron_modtests::spikes_file_mode_py (Failed)
	 22 - coreneuron_modtests::datareturn_py (Failed)
	 25 - coreneuron_modtests::spikes_mpi_py (Failed)
	 26 - coreneuron_modtests::spikes_mpi_file_mode_py (Failed)
	 41 - testcorenrn_patstim::coreneuron_gpu_offline (Failed)
	 45 - testcorenrn_patstim::compare_results (Failed)
	 99 - testcorenrn_netstimdirect::direct (Failed)
	100 - testcorenrn_netstimdirect::compare_results (Failed)
```
To Reproduce
Steps to reproduce the behavior:
```sh
git clone git@github.com:neuronsimulator/nrn.git
cd nrn
mkdir build_unified && cd build_unified
cmake .. -DCMAKE_INSTALL_PREFIX=./install -DNRN_ENABLE_INTERVIEWS=OFF -DNRN_ENABLE_RX3D=OFF \
    -DNRN_ENABLE_CORENEURON=ON -DNRN_ENABLE_TESTS=ON -DCORENRN_ENABLE_GPU=ON \
    -DCORENRN_ENABLE_CUDA_UNIFIED_MEMORY=ON -DCORENRN_ENABLE_OPENMP=OFF
make -j16
ctest --output-on-failure
```
Expected behavior
GPU tests should pass with Unified Memory enabled as well.
Logs
An example of a failing test (`coreneuron_modtests::direct_py`), when run with `cuda-memcheck`, has the following output:
```
========= Invalid __global__ read of size 8
=========     at 0x00000730 in /gpfs/bbp.cscs.ch/data/scratch/proj16/magkanar/psolve-direct/nrn_gpu/build_unified/test/nrnivmodl/8e220c327f2b8882adcf04884baa4209f37d0bbcef5677f046766f546d969ffd/x86_64/corenrn/mod2c/stim.cpp:410:coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x05121860 is out of bounds
=========     Device Frame:/gpfs/bbp.cscs.ch/data/scratch/proj16/magkanar/psolve-direct/nrn_gpu/build_unified/test/nrnivmodl/8e220c327f2b8882adcf04884baa4209f37d0bbcef5677f046766f546d969ffd/x86_64/corenrn/mod2c/stim.cpp:410:coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int) (coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int) : 0x730)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libcuda.so (cuLaunchKernel + 0x34e) [0x2efa6e]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so (__pgi_uacc_cuda_launch3 + 0x1d94) [0x1ca43]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so [0x1d7a5]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so (__pgi_uacc_cuda_launch + 0x13d) [0x1d8e4]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libacchost.so (__pgi_uacc_launch + 0x1f7) [0x463c0]
=========     Host Frame:./x86_64/special (_ZN10coreneuron16_nrn_cur__IClampEPNS_9NrnThreadEPNS_9Memb_listEi + 0x89b) [0x5702b]
=========     Host Frame:./x86_64/special [0x17f3bb]
=========     Host Frame:./x86_64/special (_ZN10coreneuron25setup_tree_matrix_minimalEPNS_9NrnThreadE + 0xe) [0x1814ae]
Failing in Thread:1
call to cuLaunchKernel returned error 719: Launch failed (often invalid pointer dereference)
```
The corresponding code that fails in `stim.cpp`:
```cpp
409:    #pragma acc atomic update
410:    _nt->nrn_fast_imem->nrn_sav_rhs[_nd_idx] += _rhs;
411:    #pragma acc atomic update
412:    _nt->nrn_fast_imem->nrn_sav_d[_nd_idx] -= _g;
```
System (please complete the following information)
- OS: RedHat
- Compiler: NVHPC 21.2
- Version: master branch
- Backend: GPU