Initial GPU port based on CUDA. #22
Closed
sfantao wants to merge 1 commit into code-saturne:master from
Conversation
Contributor
Although this patch was never directly merged into code_saturne, as some bugs remained and it had become obsolete by the time the required contributor licence agreement (for dual licencing) was "almost" finalized, it served in 2018-2020 as a test bed for further work on code_saturne GPU support, as a proof of concept for GPU support, and as a reference example for EDF's own GPU work. So this work was very useful, and this merge request can be closed.
This patch introduces acceleration code in Code_Saturne for NVIDIA GPUs. This is a partial port in the sense that only a limited set of testcases is supported.
This has been tested on OpenPOWER platforms, but it should work on other platforms that support CUDA as well. We tested on both Power8 + P100 and Power9 + V100 machines. On the former you should expect over 2x speedup at scale if there are more than 100k cells per GPU; on the latter the speedup rises to at least 3x while providing better strong scaling - we ran the code successfully on the Summit supercomputer at Oak Ridge National Laboratory on up to 512 nodes.
The overall idea is to reduce the effect of latencies in the code for the different vector and matrix-vector operations. We employ a template-packing technique to statically bundle multiple operations into the same CUDA kernel. We also create data environments to keep data on the GPU for longer.
The GPU acceleration port is implemented in `/src/cuda`, and its entry points are invoked from all around the code. The code is prepared to be launched with the NVIDIA Multi-Process Service (MPS) so that multiple ranks can use the same GPU; I tested this successfully with up to 5 ranks per GPU. For this to work, CUDA GPU visibility has to be set so that each rank only sees the GPU it is meant to use.
The patch introduces a way to determine the number of local ranks, which expects an OpenMPI-compatible environment - e.g. IBM Spectrum MPI.
The patch also introduces changes in the build system so that the code can be easily built with GPU support. Building without GPU support is equivalent to running Code_Saturne in its current version, with CPU-only support.
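As an illustration, a GPU-enabled build might be configured as follows (compiler and prefix choices are assumptions; only the `--enable-cuda-offload` flag comes from this patch):

```shell
# Build sketch: CC/CXX and the install prefix are assumptions; only the
# --enable-cuda-offload flag is introduced by this patch.
./configure CC=gcc CXX=g++ \
    --prefix="$HOME/code_saturne-gpu" \
    --enable-cuda-offload
make -j 8
make install
```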
To build the code, use a C/C++ compiler that supports C++11, as the CUDA code requires it; GPU support is enabled at configure time with the `--enable-cuda-offload` flag. There are multiple ways to run with MPS support; we used both IBM Spectrum LSF and LSF+CSM. Here is an example of an LSF script to submit a job:
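An LSF submission script might look like the following sketch (job name, rank counts, and the GPU resource string are assumptions, and LSF GPU syntax varies between versions; `cs_solver_gpu` is the proxy script described below):

```shell
#!/bin/bash
# Hypothetical LSF job script; resource requests are illustrative only.
#BSUB -J cs_gpu
#BSUB -o cs_gpu.%J.out
#BSUB -n 24                    # total MPI ranks
#BSUB -gpu "num=4:mode=shared" # node GPUs in shared mode, for MPS

# Launch through the proxy script so each rank is pinned to one GPU
# behind an MPS server (up to 5 ranks per GPU worked in our tests).
mpirun -n 24 ../../cs_solver_gpu
```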
Here, `../../cs_solver_gpu` is a proxy script that starts the MPS servers (one per GPU) and launches the `cs_solver` application. One MPS server per GPU may be overkill; 2 per GPU is in most cases sufficient.
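A minimal sketch of such a proxy script, assuming an OpenMPI-compatible environment (the directory names and GPUs-per-node count are assumptions, not the original script's contents):

```shell
#!/bin/sh
# Sketch of a cs_solver_gpu-style wrapper: map this rank to one GPU and
# point it at that GPU's MPS server. Paths and NGPUS are assumed values.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}  # OpenMPI-compatible local rank
NGPUS=${NGPUS:-4}                            # GPUs per node (assumed)
GPU_ID=$((LOCAL_RANK % NGPUS))

# Each rank only sees the GPU it is meant to use, and talks to the MPS
# server dedicated to that GPU through a per-GPU pipe directory.
export CUDA_VISIBLE_DEVICES=$GPU_ID
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_pipe_$GPU_ID
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$GPU_ID

# The first rank mapped to each GPU starts that GPU's MPS control daemon.
if [ "$((LOCAL_RANK / NGPUS))" -eq 0 ] && command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
  mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
  nvidia-cuda-mps-control -d
fi

# exec ./cs_solver "$@"   # hand over to the real solver (commented in this sketch)
echo "rank $LOCAL_RANK -> GPU $GPU_ID"
```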
We tested the code with a cavity flow case. Here is an example using a 13M-cell mesh:
https://ibm.box.com/s/2rhbavxqgxhvrfi4ws98w36h74i7aqat
To run it, download the testcase from this link and then launch the job from `cs_test/SRC` as in the LSF script above.