---
title: "Kokkos with GPUs"
teaching: 15
exercises: 15
questions:
objectives:
keypoints:
---
In this episode, we shall learn how to use GPU acceleration through the Kokkos
package in LAMMPS. In a previous episode, we
learnt the basic syntax of the package
command that
is used to invoke the Kokkos package in a LAMMPS run. The main arguments and the
corresponding keywords were discussed briefly in that chapter. In this episode, we shall
do practical exercises to gain more hands-on experience with those commands.

Before proceeding further, let's break down the key syntax for invoking GPU acceleration through the Kokkos package:
~~~
srun lmp -in in.lj -k on g Ngpu -sf kk -pk kokkos <arguments>
~~~
{: .bash}
To run the Kokkos package, the following three command-line switches are very important:

`-k on`
: This enables Kokkos at runtime.

`-sf kk`
: This appends the "/kk" suffix to Kokkos-supported LAMMPS styles.

`-pk kokkos`
: This is used to modify the default Kokkos package options.
To invoke the GPU(s) with Kokkos, we need an additional command-line switch immediately following
the `-k on` switch, as shown below:

`-k on g Ngpu`
: Using this switch you can specify the number of GPU devices, `Ngpu`, that you want to use per node.
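For instance, a run that uses two GPUs per node with otherwise default package options might look like this (a sketch following the template above; `in.lj` is the input file):

~~~
srun lmp -in in.lj -k on g 2 -sf kk
~~~
{: .bash}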
- Know your host: get the number of physical cores per node available to you.
- Know your device: find out how many GPUs are available on your system and how to request them from your resource manager (SLURM, etc.); see the batch script sketch after this callout.
- CUDA-aware MPI: check whether you can use a CUDA-aware MPI runtime with your LAMMPS executable. If not, you need to add `cuda/aware no` to your `<arguments>`.
{: .callout}
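As an illustration, a batch script for a SLURM system might look like the sketch below. The scheduler options, GPU request syntax, and module name are assumptions; check your own site's documentation.

~~~
#!/bin/bash
#SBATCH --nodes=1
# One MPI task per GPU (see the guidelines below)
#SBATCH --ntasks-per-node=4
# Ask the resource manager for 4 GPUs (syntax is site-specific)
#SBATCH --gres=gpu:4
#SBATCH --time=00:30:00

# Hypothetical module name; load whatever provides a Kokkos-enabled LAMMPS
module load lammps

srun lmp -in in.lj -k on g 4 -sf kk -pk kokkos cuda/aware no
~~~
{: .bash}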
Derive a command line to submit a LAMMPS job for the LJ system that you studied for the GPU package, such that it invokes Kokkos/GPU to accelerate the job using 2 nodes with 24 cores each and 4 devices per node. Assign all the MPI ranks available on a node across the devices. Use the default package options.
~~~
lmp -k on g 4 -sf kk -pk kokkos -in in.lj
~~~
{: .bash}
{: .solution}
{: .challenge}
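For reference, on a SLURM system the full submission line might look something like the sketch below (the `srun` flags are assumptions that depend on your scheduler set-up). With 24 MPI ranks per node sharing 4 devices, 6 ranks are assigned to each GPU:

~~~
srun --nodes=2 --ntasks-per-node=24 lmp -k on g 4 -sf kk -pk kokkos -in in.lj
~~~
{: .bash}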
The following information is collected from the LAMMPS website:
- Hardware compatibility: for better performance, you must use Kepler or later generations of GPUs.
- MPI tasks per GPU: you should use one MPI task per GPU because Kokkos tries to run everything on the GPU, including the integrator and other fixes/computes. You may get better performance by assigning multiple MPI tasks per GPU if some styles used in the input script have not yet been Kokkos-enabled.
- CUDA-aware MPI library: using this can provide a significant performance gain. If it is not available, switch it off using the `-pk kokkos cuda/aware no` switch.
- `neigh` and `newton`: for Kokkos/GPU, the default is `neigh = full` and `newton = off`. For Maxwell and Kepler generations of GPUs, the default settings are typically the best. For Pascal generations, setting `neigh = half` and `newton = on` might produce faster runs (see the example after this callout).
- `binsize`: for many pair styles, setting the value of `binsize` to twice that used for the CPU styles could offer a speedup (and this is the default for the Kokkos/GPU style).
- Avoid mixing Kokkos and non-Kokkos styles: if your LAMMPS input file uses styles that have not been ported to Kokkos, you may experience a significant loss in performance, because the data has to be copied back to the CPU repeatedly.
{: .callout}
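For example, on a Pascal-generation GPU you could try overriding the Kokkos/GPU defaults like this (a sketch; whether it is actually faster should be verified by benchmarking on your own system):

~~~
srun lmp -in in.lj -k on g 4 -sf kk -pk kokkos neigh half newton on
~~~
{: .bash}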
In the following discussion, we'll work through a few exercises to familiarize ourselves with some of these aspects.
First, let us take the input for the LJ system from episode 5. Run this input using all the visible devices in a node available to you, with Kokkos/GPU as the accelerator package and the following settings (CHECK ME!):

- 4 GPUs
- Kokkos on
- `newton off`
- `neigh full`
- `comm device`
- `cuda/aware off`

Use a number of MPI tasks equal to the number of devices; one way to assemble this command line is sketched below. Measure the performance of this run in `timesteps/s`.
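A sketch of how these settings combine into one command line (the `srun` flags are assumptions for a SLURM system; adapt them to your own resource manager):

~~~
srun --ntasks-per-node=4 lmp -in in.lj -k on g 4 -sf kk -pk kokkos newton off neigh full comm device cuda/aware off
~~~
{: .bash}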
Next, modify the LJ input file and append the following lines to the end of the file:
~~~
...
...
...
neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no
compute         1 all coord/atom cutoff 2.5
compute         2 all reduce sum c_1
variable        acn equal c_2/atoms
fix             1 all nve
thermo          50
thermo_style    custom step time temp press pe ke etotal density v_acn
run             500
~~~
{: .source}
Rename the input file and run it using the same Kokkos settings as before, with the same number of GPUs and MPI tasks as previously. Measure the performance of this run in `timesteps/s`, compare the performance of the two runs, and comment on your observations.

Taking an example from an HPC system with 2x12 cores per node and 2 GPUs (4 visible devices per node), using 1 MPI task per GPU, the following was observed.
First, we ran with Input 1. Second, we modified this input as mentioned above (to become Input 2), and the performance of both runs was measured in units of `timesteps/s`. We can get this information from the log/screen output files. The comparison of performance is given in this table:

| Input | Performance (timesteps/s) | Performance loss by a factor of |
|-------|---------------------------|---------------------------------|
| Input 1 (all Kokkos-enabled styles used) | 8.097 | |
| Input 2 (non-Kokkos style used: `compute coord/atom`) | 3.022 | 2.68 |

In Input 2 we have used a style that is not yet ported to Kokkos. We can check this from the log/screen output files:
~~~
(1) pair lj/cut/kk, perpetual
    attributes: full, newton off, kokkos_device
    pair build: full/bin/kk/device
    stencil: full/bin/3d
    bin: kk/device
(2) compute coord/atom, occasional
    attributes: full, newton off
    pair build: full/bin/atomonly
    stencil: full/bin/3d
    bin: standard
~~~
{: .output}
In this case, the pair style is Kokkos-enabled (`pair lj/cut/kk`) while the compute style `compute coord/atom` is not. Whenever you mix Kokkos and non-Kokkos styles in the input of a Kokkos run, it costs you dearly, since the data has to be copied back to the host, incurring a performance penalty.
{: .solution}
{: .challenge}
We have already discussed that the primary aim of the Kokkos package is to allow a single C++ code to run on both devices (like GPUs) and hosts (CPUs), with or without multi-threading, without losing functionality or performance: performance portability is the core objective of Kokkos.
Let us now see how the current Kokkos/GPU implementation within LAMMPS (version `3Mar20`) achieves this goal by comparing its performance with the CPU and GPU packages.

For this, we shall repeat the same set of tasks as described in episode 5. Take a LJ system with ~11 million atoms by choosing `x = y = z = 140` and `t = 500`. Use the optimum number of GPU devices and MPI tasks to run the jobs with Kokkos/GPU on 1 node, then on any of 2, 3, 4, 5 nodes (2 sets: one with the GPU package enabled, and the other regular MPI-based runs without any accelerator package). For a better comparison of points, choose a different multi-node number to that of your neighbour.

Kokkos/GPU is also specially designed to run everything on the GPUs. We shall offload the entire force computation and neighbour-list building to the GPUs using:
~~~
-k on g 4 -sf kk -pk kokkos newton off neigh full comm device
~~~
{: .bash}
or
~~~
-k on g 4 -sf kk -pk kokkos newton off neigh full comm device cuda/aware off
~~~
{: .bash}
(if CUDA-aware MPI is not available to you).
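Putting this together, a 2-node Kokkos/GPU run on a SLURM system might be launched as follows (a sketch; the node, task, and GPU counts are assumptions to adapt to your machine):

~~~
srun --nodes=2 --ntasks-per-node=4 lmp -in in.lj -k on g 4 -sf kk -pk kokkos newton off neigh full comm device
~~~
{: .bash}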
Extract the performance data from the log/screen output files from each of these runs. You can do this using the command:

~~~
grep "Performance:" log.lammps
~~~
{: .bash}

Note down the performance value in units of `timesteps/s`; a small shell loop for collecting these values across runs is sketched below.
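For instance, if the log files from the different runs were saved under hypothetical names such as `log.lammps.1node`, `log.lammps.2nodes`, and so on, a loop like this sketch could tabulate the rates:

~~~
# Loop over hypothetical per-run log files and pull out the rate
# reported in timesteps/s from the "Performance:" line.
for log in log.lammps.*; do
  # The Performance: line lists several comma-separated rates;
  # keep only the one given in timesteps/s.
  rate=$(grep "Performance:" "$log" | tr ',' '\n' | grep "timesteps/s" | awk '{print $1}')
  echo "$log: $rate timesteps/s"
done
~~~
{: .bash}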
Make a plot to compare the performance of the Kokkos/GPU runs with the CPU runs (i.e. without any accelerator package) and the GPU runs (i.e. with the GPU package enabled) against the number of nodes.
Plot the speed-up factor (= GPU performance/CPU performance) versus the number of nodes.
Discuss the main observations from these plots.
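If gnuplot is available, a minimal plotting sketch could look like this. It assumes the rates were collected into a hypothetical file `perf.dat` with four columns: nodes, CPU rate, GPU-package rate, and Kokkos/GPU rate:

~~~
# Plot performance (timesteps/s) versus number of nodes for the three runs
gnuplot <<'EOF'
set terminal png
set output "performance.png"
set xlabel "Number of nodes"
set ylabel "Performance (timesteps/s)"
plot "perf.dat" using 1:2 with linespoints title "CPU (no accelerator)", \
     "perf.dat" using 1:3 with linespoints title "GPU package", \
     "perf.dat" using 1:4 with linespoints title "Kokkos/GPU"
EOF
~~~
{: .bash}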
FIXME
{: .solution}
{: .challenge}