Skip to content
This repository was archived by the owner on Mar 20, 2023. It is now read-only.

NEURON integrated tests failing with CUDA Unified Memory enabled #594

Open
@iomaganaris

Description

@iomaganaris

Describe the issue
Some of the NEURON test are failing on GPU when CUDA Unified Memory is enabled in CoreNEURON.
More precisely the tests that fail are:

The following tests FAILED:
         18 - coreneuron_modtests::direct_py (Failed)
         19 - coreneuron_modtests::direct_hoc (Failed)
         20 - coreneuron_modtests::spikes_py (Failed)
         21 - coreneuron_modtests::spikes_file_mode_py (Failed)
         22 - coreneuron_modtests::datareturn_py (Failed)
         25 - coreneuron_modtests::spikes_mpi_py (Failed)
         26 - coreneuron_modtests::spikes_mpi_file_mode_py (Failed)
         41 - testcorenrn_patstim::coreneuron_gpu_offline (Failed)
         45 - testcorenrn_patstim::compare_results (Failed)
         99 - testcorenrn_netstimdirect::direct (Failed)
        100 - testcorenrn_netstimdirect::compare_results (Failed)

To Reproduce
Steps to reproduce the behavior:

git clone [email protected]:neuronsimulator/nrn.git
cd nrn
mkdir build_unified && cd build_unified
cmake .. -DCMAKE_INSTALL_PREFIX=./install -DNRN_ENABLE_INTERVIEWS=OFF -DNRN_ENABLE_RX3D=OFF -DNRN_ENABLE_CORENEURON=ON -DNRN_ENABLE_TESTS=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_CU
DA_UNIFIED_MEMORY=ON -DCORENRN_ENABLE_OPENMP=OFF
make -j16
ctest --output-on-failure

Expected behavior
GPU tests should be passing with Unified Memory as well.

Logs
An example of a failing test (coreneuron_modtests::direct_py) when run with cuda-memcheck has the following output:

========= Invalid __global__ read of size 8
=========     at 0x00000730 in /gpfs/bbp.cscs.ch/data/scratch/proj16/magkanar/psolve-direct/nrn_gpu/build_unified/test/nrnivmodl/8e220c327f2b8882adcf04884baa4209f37d0bbcef5677f046766f546d969ffd/x86_64/corenrn/mod2c/stim.cpp:410:coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x05121860 is out of bounds
=========     Device Frame:/gpfs/bbp.cscs.ch/data/scratch/proj16/magkanar/psolve-direct/nrn_gpu/build_unified/test/nrnivmodl/8e220c327f2b8882adcf04884baa4209f37d0bbcef5677f046766f546d969ffd/x86_64/corenrn/mod2c/stim.cpp:410:coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int) (coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int) : 0x730)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libcuda.so (cuLaunchKernel + 0x34e) [0x2efa6e]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so (__pgi_uacc_cuda_launch3 + 0x1d94) [0x1ca43]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so [0x1d7a5]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so (__pgi_uacc_cuda_launch + 0x13d) [0x1d8e4]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libacchost.so (__pgi_uacc_launch + 0x1f7) [0x463c0]
=========     Host Frame:./x86_64/special (_ZN10coreneuron16_nrn_cur__IClampEPNS_9NrnThreadEPNS_9Memb_listEi + 0x89b) [0x5702b]
=========     Host Frame:./x86_64/special [0x17f3bb]
=========     Host Frame:./x86_64/special (_ZN10coreneuron25setup_tree_matrix_minimalEPNS_9NrnThreadE + 0xe) [0x1814ae]
Failing in Thread:1
call to cuLaunchKernel returned error 719: Launch failed (often invalid pointer dereference)

The corresponding line that fails in the stim.cpp:

409:      #pragma acc atomic update
410:      _nt->nrn_fast_imem->nrn_sav_rhs[_nd_idx] += _rhs;
411:      #pragma acc atomic update
412:      _nt->nrn_fast_imem->nrn_sav_d[_nd_idx] -= _g;

System (please complete the following information)

  • OS: RedHat
  • Compiler: NVHPC 21.2
  • Version: master branch
  • Backend: GPU

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions