---
title: "Kokkos with GPUs"
teaching: 15
exercises: 15
questions:
objectives:
keypoints:
---
In this episode, we shall learn how to use GPU acceleration through the Kokkos
package in LAMMPS. In a previous episode, we
learnt the basic syntax of the package
command that
is used to invoke the Kokkos package in a LAMMPS run. The main arguments and the
corresponding keywords were discussed briefly in that chapter. In this episode, we shall
do practical exercises to gain more hands-on experience with those commands.

Before proceeding further, let's break down the key syntax for invoking GPU acceleration through the Kokkos package:
~~~
srun lmp -in in.lj -k on g Ngpu -sf kk -pk kokkos <arguments>
~~~
{: .bash}
To run the Kokkos package, the following three command-line switches are very important:

`-k on`
: This enables Kokkos at runtime.

`-sf kk`
: This appends the "/kk" suffix to Kokkos-supported LAMMPS styles.

`-pk kokkos`
: This is used to modify the default Kokkos package options.
To invoke the GPU(s) with Kokkos, we need an additional command-line switch immediately following
the `-k on` switch, as shown below:

`-k on g Ngpu`
: Using this switch you can specify the number of GPU devices, `Ngpu`, that you want to use per node.
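For instance, a run that uses two GPUs per node with otherwise default package options might look like this (a sketch following the template above; `in.lj` is the input file):

~~~
srun lmp -in in.lj -k on g 2 -sf kk
~~~
{: .bash}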
- Know your host: get the number of physical cores per node available to you.
- Know your device: find out how many GPUs are available on your system and how to request them from your resource manager (SLURM, etc.); see the batch script sketch after this callout.
- CUDA-aware MPI: check whether you can use a CUDA-aware MPI runtime with your LAMMPS executable. If not, you need to add `cuda/aware no` to your `<arguments>`.
{: .callout}
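As an illustration, a batch script for a SLURM system might look like the sketch below. The scheduler options, GPU request syntax, and module name are assumptions; check your own site's documentation.

~~~
#!/bin/bash
#SBATCH --nodes=1
# One MPI task per GPU (see the guidelines below)
#SBATCH --ntasks-per-node=4
# Ask the resource manager for 4 GPUs (syntax is site-specific)
#SBATCH --gres=gpu:4
#SBATCH --time=00:30:00

# Hypothetical module name; load whatever provides a Kokkos-enabled LAMMPS
module load lammps

srun lmp -in in.lj -k on g 4 -sf kk -pk kokkos cuda/aware no
~~~
{: .bash}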
Derive a command line to submit a LAMMPS job for the LJ system that you studied for the GPU package, such that it invokes Kokkos/GPU to accelerate the job using 2 nodes with 24 cores each and 4 devices per node. Assign all the MPI ranks available on a node across the devices. Use the default package options.
~~~
lmp -k on g 4 -sf kk -pk kokkos -in in.lj
~~~
{: .bash}
{: .solution}
{: .challenge}
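For reference, on a SLURM system the full submission line might look something like the sketch below (the `srun` flags are assumptions that depend on your scheduler set-up). With 24 MPI ranks per node sharing 4 devices, 6 ranks are assigned to each GPU:

~~~
srun --nodes=2 --ntasks-per-node=24 lmp -k on g 4 -sf kk -pk kokkos -in in.lj
~~~
{: .bash}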
The following information is collected from the LAMMPS website:
- Hardware compatibility: for better performance, you must use Kepler or later generations of GPUs.
- MPI tasks per GPU: you should use one MPI task per GPU because Kokkos tries to run everything on the GPU, including the integrator and other fixes/computes. You may get better performance by assigning multiple MPI tasks per GPU if some styles used in the input script have not yet been Kokkos-enabled.
- CUDA-aware MPI library: using this can provide a significant performance gain. If it is not available, switch it off using the `-pk kokkos cuda/aware no` switch.
- `neigh` and `newton`: for Kokkos/GPU, the default is `neigh = full` and `newton = off`. For Maxwell and Kepler generations of GPUs, the default settings are typically the best. For Pascal generations, setting `neigh = half` and `newton = on` might produce faster runs (see the example after this callout).
- `binsize`: for many pair styles, setting the value of `binsize` to twice that used for the CPU styles could offer a speedup (and this is the default for the Kokkos/GPU style).
- Avoid mixing Kokkos and non-Kokkos styles: if your LAMMPS input file uses styles that have not been ported to Kokkos, you may experience a significant loss in performance, because the data has to be copied back to the CPU repeatedly.
{: .callout}
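For example, on a Pascal-generation GPU you could try overriding the Kokkos/GPU defaults like this (a sketch; whether it is actually faster should be verified by benchmarking on your own system):

~~~
srun lmp -in in.lj -k on g 4 -sf kk -pk kokkos neigh half newton on
~~~
{: .bash}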
In the following discussion, we'll work through a few exercises to familiarize ourselves with some of these aspects.
First, let us take the input for the LJ system from episode 5. Run this input using all the visible devices in a node available to you, with Kokkos/GPU as the accelerator package and the following settings (CHECK ME!):

- 4 GPUs
- Kokkos on
- `newton off`
- `neigh full`
- `comm device`
- `cuda/aware off`

Use a number of MPI tasks equal to the number of devices; one way to assemble this command line is sketched below. Measure the performance of this run in `timesteps/s`.
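A sketch of how these settings combine into one command line (the `srun` flags are assumptions for a SLURM system; adapt them to your own resource manager):

~~~
srun --ntasks-per-node=4 lmp -in in.lj -k on g 4 -sf kk -pk kokkos newton off neigh full comm device cuda/aware off
~~~
{: .bash}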
Next, modify the LJ input file and append the following lines to the end of the file:
~~~
...
...
...
neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no
compute         1 all coord/atom cutoff 2.5
compute         2 all reduce sum c_1
variable        acn equal c_2/atoms
fix             1 all nve
thermo          50
thermo_style    custom step time temp press pe ke etotal density v_acn
run             500
~~~
{: .source}
Rename the input file and run it using the same Kokkos settings as before, with the same number of GPUs and MPI tasks as previously. Measure the performance of this run in `timesteps/s`, compare the performance of the two runs, and comment on your observations.

Taking an example from an HPC system with 2x12 cores per node and 2 GPUs (4 visible devices per node), using 1 MPI task per GPU, the following was observed.
First, we ran with Input 1. Second, we modified this input as mentioned above (to become Input 2), and the performance of both runs was measured in units of `timesteps/s`. We can get this information from the log/screen output files. The comparison of performance is given in this table:

| Input | Performance (timesteps/s) | Performance loss by a factor of |
|-------|---------------------------|---------------------------------|
| Input 1 (all Kokkos-enabled styles used) | 8.097 | |
| Input 2 (non-Kokkos style used: `compute coord/atom`) | 3.022 | 2.68 |

In Input 2 we have used a style that is not yet ported to Kokkos. We can check this from the log/screen output files:
~~~
(1) pair lj/cut/kk, perpetual
    attributes: full, newton off, kokkos_device
    pair build: full/bin/kk/device
    stencil: full/bin/3d
    bin: kk/device
(2) compute coord/atom, occasional
    attributes: full, newton off
    pair build: full/bin/atomonly
    stencil: full/bin/3d
    bin: standard
~~~
{: .output}
In this case, the pair style is Kokkos-enabled (`pair lj/cut/kk`) while the compute style `compute coord/atom` is not. Whenever you mix Kokkos and non-Kokkos styles in the input of a Kokkos run, it costs you dearly, since the data has to be copied back to the host, incurring a performance penalty.
{: .solution}
{: .challenge}
We have already discussed that the primary aim of the Kokkos package is to allow a single C++ code to run on both devices (like GPUs) and hosts (CPUs), with or without multi-threading, without losing functionality or performance: performance portability is the core objective of Kokkos.
Let us now see how the current Kokkos/GPU implementation within LAMMPS (version `3Mar20`) achieves this goal by comparing its performance with the CPU and GPU packages.

For this, we shall repeat the same set of tasks as described in episode 5. Take a LJ system with ~11 million atoms by choosing `x = y = z = 140` and `t = 500`. Use the optimum number of GPU devices and MPI tasks to run the jobs with Kokkos/GPU on 1 node, then on any of 2, 3, 4, 5 nodes (2 sets: one with the GPU package enabled, and the other regular MPI-based runs without any accelerator package). For a better comparison of points, choose a different multi-node number to that of your neighbour.

Kokkos/GPU is also specially designed to run everything on the GPUs. We shall offload the entire force computation and neighbour-list building to the GPUs using:
~~~
-k on g 4 -sf kk -pk kokkos newton off neigh full comm device
~~~
{: .bash}
or
~~~
-k on g 4 -sf kk -pk kokkos newton off neigh full comm device cuda/aware off
~~~
{: .bash}
(if CUDA-aware MPI is not available to you).
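Putting this together, a 2-node Kokkos/GPU run on a SLURM system might be launched as follows (a sketch; the node, task, and GPU counts are assumptions to adapt to your machine):

~~~
srun --nodes=2 --ntasks-per-node=4 lmp -in in.lj -k on g 4 -sf kk -pk kokkos newton off neigh full comm device
~~~
{: .bash}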
Extract the performance data from the log/screen output files from each of these runs. You can do this using the command:

~~~
grep "Performance:" log.lammps
~~~
{: .bash}

Note down the performance value in units of `timesteps/s`; a small shell loop for collecting these values across runs is sketched below.
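For instance, if the log files from the different runs were saved under hypothetical names such as `log.lammps.1node`, `log.lammps.2nodes`, and so on, a loop like this sketch could tabulate the rates:

~~~
# Loop over hypothetical per-run log files and pull out the rate
# reported in timesteps/s from the "Performance:" line.
for log in log.lammps.*; do
  # The Performance: line lists several comma-separated rates;
  # keep only the one given in timesteps/s.
  rate=$(grep "Performance:" "$log" | tr ',' '\n' | grep "timesteps/s" | awk '{print $1}')
  echo "$log: $rate timesteps/s"
done
~~~
{: .bash}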
Make a plot to compare the performance of the Kokkos/GPU runs with the CPU runs (i.e. without any accelerator package) and the GPU runs (i.e. with the GPU package enabled) against the number of nodes.
Plot the speed-up factor (= GPU performance/CPU performance) versus the number of nodes.
Discuss the main observations from these plots.
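If gnuplot is available, a minimal plotting sketch could look like this. It assumes the rates were collected into a hypothetical file `perf.dat` with four columns: nodes, CPU rate, GPU-package rate, and Kokkos/GPU rate:

~~~
# Plot performance (timesteps/s) versus number of nodes for the three runs
gnuplot <<'EOF'
set terminal png
set output "performance.png"
set xlabel "Number of nodes"
set ylabel "Performance (timesteps/s)"
plot "perf.dat" using 1:2 with linespoints title "CPU (no accelerator)", \
     "perf.dat" using 1:3 with linespoints title "GPU package", \
     "perf.dat" using 1:4 with linespoints title "Kokkos/GPU"
EOF
~~~
{: .bash}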
FIXME
{: .solution}
{: .challenge}