Improve OpenMP offload implementation #729
Overview
In #713 we added support for GPU offload using OpenMP. This is a good first step, but there are several areas where we hope to improve the implementation. This issue tracks the planned improvements.
Asynchronous execution
In #713 we did not use any of OpenMP's asynchronous execution features (`nowait`, `depend`, `taskwait`) for accelerator offload. This was partly for simplicity, and partly because support for them in the compiler we were using at the time (NVHPC 21.9) was rather limited.
Work has already started on this, see:
- Support async execution in OpenMP wherever it's supported #725
- Changes for async execution of OpenACC and OpenMP nmodl#788
- Changes to support async execution on GPU with OpenACC mod2c#75
Initially we should aim to recover the performance attained with (asynchronous) OpenACC.
After that, we could look at launching more mechanism kernels in parallel within a single NrnThread.
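As a rough illustration of what asynchronous offload could look like, here is a minimal sketch using `nowait`, `depend` and `taskwait`. The function name `run_async_kernels`, the arrays and the arithmetic are hypothetical and are not taken from CoreNEURON; the sketch only assumes a compiler that accepts these clauses on `target` constructs. Without OpenMP enabled, the pragmas are ignored and the loops run sequentially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example (not CoreNEURON code): two independent kernels are
// launched asynchronously, followed by a third kernel that depends on both.
std::vector<double> run_async_kernels(std::size_t n) {
    std::vector<double> a(n, 1.0);
    std::vector<double> b(n, 2.0);
    double* pa = a.data();
    double* pb = b.data();

    // Kernel 1: deferred target task that writes pa.
    #pragma omp target teams distribute parallel for \
        map(tofrom: pa[0:n]) nowait depend(out: pa[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        pa[i] *= 2.0;  // a becomes 2.0
    }

    // Kernel 2: independent of kernel 1, so the two may overlap.
    #pragma omp target teams distribute parallel for \
        map(tofrom: pb[0:n]) nowait depend(out: pb[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        pb[i] += 1.0;  // b becomes 3.0
    }

    // Kernel 3: the depend clauses order it after kernels 1 and 2.
    #pragma omp target teams distribute parallel for \
        map(tofrom: pa[0:n]) map(to: pb[0:n]) nowait \
        depend(inout: pa[0:n]) depend(in: pb[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        pa[i] += pb[i];  // a becomes 5.0
    }

    // Block the host until all deferred target tasks have completed,
    // before the local buffers go out of scope.
    #pragma omp taskwait
    return a;
}
```

Note that the `taskwait` sits in the same scope as the buffers used by the deferred kernels. Whether kernels 1 and 2 actually overlap depends on the compiler and runtime.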
Present clauses
With OpenACC we had `present(...)` clauses that allowed us to assert that data were already present on the device and should not be copied. The current OpenMP implementation has no equivalent, but in practice it preserves the same data transfer pattern as OpenACC because we ensure that the data are already present.
In principle, a bug in the model transfer code (leading to some relevant data not being transferred to the device during initialisation) would cause a runtime error with OpenACC (✅) and implicit data transfers with OpenMP (⛔). Given that we already know how to generate `present()` clauses, it seems desirable to add the OpenMP equivalent (`map(present, alloc: ...)`) once it is widely supported.
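For illustration, a minimal sketch of how a `present` assertion might be introduced while compilers catch up. The function name `increment_on_device` and the use of `_OPENMP >= 202011` as the feature test are assumptions made for this example, not part of the existing implementation; a real build would probably need a more precise check, since some compilers report an older `_OPENMP` value while already supporting the modifier.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example (not CoreNEURON code): add `delta` to each element of
// `values` on the device, asserting presence where the compiler allows it.
std::vector<double> increment_on_device(std::vector<double> values, double delta) {
    double* p = values.data();
    const std::size_t n = values.size();

    // Stands in for the explicit transfer done during model initialisation.
    #pragma omp target enter data map(to: p[0:n])

#if defined(_OPENMP) && _OPENMP >= 202011
    // OpenMP 5.1: runtime error if p[0:n] is not already on the device.
    #pragma omp target teams distribute parallel for map(present, alloc: p[0:n])
#else
    // Fallback: no copy if the data are already mapped, but a missing
    // mapping would silently trigger an implicit transfer.
    #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
#endif
    for (std::size_t i = 0; i < n; ++i) {
        p[i] += delta;
    }

    // Copy the result back explicitly, then release the device copy.
    #pragma omp target update from(p[0:n])
    #pragma omp target exit data map(delete: p[0:n])
    return values;
}
```

If the `target enter data` line were removed, the first branch would be expected to fail at runtime on a real device, while the fallback branch would quietly copy the data in both directions.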
(original issue: neuronsimulator/gpuhackathon#5)