Initial GPU port based on CUDA. #22
Closed
sfantao wants to merge 1 commit into code-saturne:master from
Conversation
Contributor
Although this patch was never directly merged into code_saturne, as some bugs remained and it had become obsolete by the time the required contributor licence agreement (for dual licencing) was "almost" finalized, it served in 2018-2020 as a test bed for further work on code_saturne GPU support, as a proof of concept for GPU support, and as a reference example for EDF's own GPU work. So this work was very useful, and this merge request can be closed.
This patch introduces acceleration code in Code_Saturne for NVIDIA GPUs. This is a partial port in the sense that only a limited set of testcases is supported.
This has been tested on OpenPOWER platforms, but it should work on other platforms that support CUDA as well. We tested on both Power8 + P100 and Power9 + V100 machines. On the former you should expect over 2x speedup at scale if there are more than 100k cells per GPU; on the latter the speedup rises to at least 3x while providing better strong scaling - we ran the code successfully on the Summit supercomputer at Oak Ridge National Laboratory on up to 512 nodes.
The overall idea is to reduce the effect of latencies in the code for the different vector and matrix-vector operations. We employ a template-packing technique to statically bundle multiple operations into the same CUDA kernel. We also create data environments to keep data on the GPU for longer.
The GPU acceleration port is implemented in `/src/cuda`, and its entry points are invoked from all around the code. The code is prepared to be launched with the NVIDIA Multi-Process Service (MPS) so that multiple ranks can use the same GPU; I tested this successfully with up to 5 ranks per GPU. For this to work, CUDA GPU visibility has to be set so that each rank only sees the GPU it is meant to use.
The patch introduces a way to determine the number of local ranks, which expects an OpenMPI-compatible environment - e.g. IBM Spectrum MPI.
The patch also introduces changes in the build system so that the code can be easily built with GPU support. Building without GPU support is equivalent to running Code_Saturne in its current version, with CPU-only support.
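As an illustration, a GPU-enabled build might be configured as follows (compiler and prefix choices are assumptions; only the `--enable-cuda-offload` flag comes from this patch):

```shell
# Build sketch: CC/CXX and the install prefix are assumptions; only the
# --enable-cuda-offload flag is introduced by this patch.
./configure CC=gcc CXX=g++ \
    --prefix="$HOME/code_saturne-gpu" \
    --enable-cuda-offload
make -j 8
make install
```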
To build the code, use a C/C++ compiler that supports C++11, as the CUDA code requires it; GPU support is enabled at configure time with the `--enable-cuda-offload` flag. There are multiple ways to run with MPS support; we used both IBM Spectrum LSF and LSF+CSM. Here is an example of an LSF script to submit a job:
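An LSF submission script might look like the following sketch (job name, rank counts, and the GPU resource string are assumptions, and LSF GPU syntax varies between versions; `cs_solver_gpu` is the proxy script described below):

```shell
#!/bin/bash
# Hypothetical LSF job script; resource requests are illustrative only.
#BSUB -J cs_gpu
#BSUB -o cs_gpu.%J.out
#BSUB -n 24                    # total MPI ranks
#BSUB -gpu "num=4:mode=shared" # node GPUs in shared mode, for MPS

# Launch through the proxy script so each rank is pinned to one GPU
# behind an MPS server (up to 5 ranks per GPU worked in our tests).
mpirun -n 24 ../../cs_solver_gpu
```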
Here, `../../cs_solver_gpu` is a proxy script that starts the MPS servers (one per GPU) and launches the `cs_solver` application. One MPS server per GPU may be overkill; 2 per GPU is in most cases sufficient.
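A minimal sketch of such a proxy script, assuming an OpenMPI-compatible environment (the directory names and GPUs-per-node count are assumptions, not the original script's contents):

```shell
#!/bin/sh
# Sketch of a cs_solver_gpu-style wrapper: map this rank to one GPU and
# point it at that GPU's MPS server. Paths and NGPUS are assumed values.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}  # OpenMPI-compatible local rank
NGPUS=${NGPUS:-4}                            # GPUs per node (assumed)
GPU_ID=$((LOCAL_RANK % NGPUS))

# Each rank only sees the GPU it is meant to use, and talks to the MPS
# server dedicated to that GPU through a per-GPU pipe directory.
export CUDA_VISIBLE_DEVICES=$GPU_ID
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_pipe_$GPU_ID
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$GPU_ID

# The first rank mapped to each GPU starts that GPU's MPS control daemon.
if [ "$((LOCAL_RANK / NGPUS))" -eq 0 ] && command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
  mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
  nvidia-cuda-mps-control -d
fi

# exec ./cs_solver "$@"   # hand over to the real solver (commented in this sketch)
echo "rank $LOCAL_RANK -> GPU $GPU_ID"
```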
We tested the code with a cavity flow case. Here is an example using a 13M-cell mesh:
https://ibm.box.com/s/2rhbavxqgxhvrfi4ws98w36h74i7aqat
To run it, download the testcase from this link and then launch the job from `cs_test/SRC` as in the LSF script above.