---
title: "Kokkos with OpenMP"
teaching: 30
exercises: 15
questions:
objectives:
keypoints:
---
In this episode, we'll learn to use the Kokkos package with OpenMP execution for multi-core CPUs. First we'll get familiar with the command-line options needed to run a Kokkos OpenMP job in LAMMPS. This will be followed by a case study to gain some hands-on experience with the package. For the hands-on part, we'll take the same rhodopsin system that we studied in previous episodes. We shall use the same input file (ADD LINK to `in.rhodo`) and repeat similar scalability studies for mixed MPI/OpenMP settings, as we did for the USER-OMP package.
- Know your hardware: get the number of physical cores per node available to you. Take care that
  `(number of MPI tasks) * (OpenMP threads per task) <= (total number of physical cores per node)`
- Check for hyperthreading: sometimes a CPU splits each of its physical cores into multiple virtual cores. Intel's term for this is hyperthreading (HT). When hyperthreading is enabled, each physical core appears as (usually) two logical CPU units to the OS, allowing these logical cores to share the physical execution resources. This may result in a slight performance gain. So, a node with 24 physical cores appears as 48 logical cores to the OS if HT is enabled. In this case,
  `(number of MPI tasks) * (OpenMP threads per task) <= (total number of virtual cores per node)`
- CPU affinity: CPU affinity decides whether a thread running on a particular core is allowed to migrate to another core (if the operating system thinks that is a good idea). You can set CPU affinity masks to limit the set of cores that a thread can migrate to. For example, you usually do not want your threads to migrate to another socket, since that could place them far away from the data they need to process and introduce a lot of delay in fetching and writing data.
- Set OpenMP environment variables: `OMP_NUM_THREADS`, `OMP_PROC_BIND` and `OMP_PLACES` are the ones we will touch here.
{: .callout}
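If you are unsure what your node looks like, you can query it directly on a compute node. Below is a minimal sketch using standard Linux tools; the exact labels in the output vary between systems:

```
# Summarise the CPU topology: sockets, cores per socket and threads per core.
# "Thread(s) per core: 2" means hyperthreading is enabled.
lscpu | grep -E "Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core"

# nproc reports the number of *logical* CPUs visible to the OS
nproc
```
{: .bash}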
To run the Kokkos package with OpenMP for multi-core CPUs, the following three command-line switches are very important:

1. `-k on`: This enables Kokkos at runtime
2. `-sf kk`: This appends the "/kk" suffix to Kokkos-supported LAMMPS styles
3. `-pk kokkos`: This is used to modify the default Kokkos package options
To invoke the OpenMP execution mode with Kokkos, we need an additional command-line switch just following the `-k on` switch, as shown below:

4. `-k on t Nt`: Using this switch you can specify the number of OpenMP threads, `Nt`, that you want to use per node. You should also set a proper value for the corresponding OpenMP environment variable. You can do this with

```
export OMP_NUM_THREADS=4
```
{: .bash}

if you would like to use 4 threads per node (i.e. `Nt` is 4). You should also set some other environment variables to help with thread placement. For best performance with OpenMP 4.0 or later, set:
```
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
```
{: .bash}
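As a quick sanity check, you can ask the OpenMP runtime to report the settings it actually picked up. The sketch below is a minimal single-node example; the 2 ranks x 4 threads decomposition is just an assumed layout for an 8-core node and should be adapted to your own hardware:

```
# ask the OpenMP runtime to print its settings (including binding) at startup
export OMP_DISPLAY_ENV=verbose

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

# 2 MPI ranks x 4 OpenMP threads, i.e. 8 cores on a single (hypothetical) node
mpirun -np 2 lmp -k on t $OMP_NUM_THREADS -sf kk -i in.rhodo
```
{: .bash}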
Derive a command line to submit a LAMMPS job for the rhodopsin system such that it invokes Kokkos OpenMP threading to accelerate the job, using 2 nodes having 40 cores each, with 4 MPI ranks per node and 10 OpenMP threads per rank, and the default package options.
```
export OMP_NUM_THREADS=10
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
mpirun -np 8 -ppn 4 --bind-to socket --map-by socket lmp -k on t $OMP_NUM_THREADS -sf kk -i in.rhodo
```
{: .bash}
This solution includes affinity settings using the Open MPI runtime binding mechanism `--bind-to socket --map-by socket`, which ensures that OpenMP threads cannot move between sockets (how to set this depends on the MPI runtime used). `OMP_PROC_BIND` and `OMP_PLACES` influence what happens to the OpenMP threads on each socket.
{: .solution}
{: .challenge}
There is some more work to do before we can jump into a thorough scalability study. When we use OpenMP in Kokkos, it comes with a few extra `package` arguments and corresponding keywords (see the previous episode for a list of all options) compared to those offered by the USER-OMP package. These are `neigh`, `newton`, `comm` and `binsize`. The first thing that we need to do here is to find what values of these keywords offer the fastest runs. Once we know the optimum settings, we can use them for all the runs needed to perform the scalability studies.
In the above, we showed a command-line example to submit a LAMMPS job with the default `package` settings for a Kokkos OpenMP run. But often the default `package` settings may not provide the fastest runs. Before jumping to production runs, we need to check for the optimum values of these keywords to avoid wasting our time and valuable computing resources. In the next section, we'll show how to do this with the rhodopsin example. Before that, here is an example command line which shows how these `package`-related keywords can be invoked in your LAMMPS runs using the command-line switches. The default `package` settings are overridden here using `-pk kokkos neigh half newton on comm no`.

```
export OMP_NUM_THREADS=10
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
mpirun -np 8 -ppn 4 --bind-to socket --map-by socket lmp -k on t $OMP_NUM_THREADS -sf kk -pk kokkos neigh half newton on comm no -i in.rhodo
```
{: .bash}
{: .callout}
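On most clusters you would place such a command in a batch script rather than typing it interactively. Below is a sketch for a Slurm scheduler; the scheduler, module name and resource limits are assumptions and will differ on your system:

```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=10
#SBATCH --time=00:30:00

# load your site's LAMMPS installation (module name is an assumption)
module load lammps

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

srun lmp -k on t $OMP_NUM_THREADS -sf kk -pk kokkos neigh half newton on comm no -i in.rhodo
```
{: .bash}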
Take the rhodopsin input files (`in.rhodo` and `data.rhodo` SHOULD BE LINKED), and run LAMMPS jobs with 40 MPI ranks and 1 OpenMP thread on 1 node, using the `package` command for the following two sets of parameters (a sketch of how these runs could be launched follows the list):

1. `neigh full newton off comm no`
2. `neigh half newton on comm host`
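A minimal sketch of how these two runs could be scripted is shown below; the `mpirun` invocation and the log-file naming are assumptions and should be adapted to your system:

```
for pk in "neigh full newton off comm no" "neigh half newton on comm host"; do
  # 40 MPI ranks x 1 OpenMP thread on one 40-core node
  mpirun -np 40 lmp -k on t 1 -sf kk -pk kokkos $pk \
         -log log.$(echo $pk | tr ' ' '_') -i in.rhodo
done
```
{: .bash}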
1. What is the influence of `comm`? What is implied in the output file?
2. What difference does switching the values of `neigh` and `newton` have? Why?

For an HPC setup which has 40 cores per node, the runtimes for all the MPI/OpenMP combinations and keyword combinations are given below:
| neigh | newton | comm | binsize | 1MPI/40t | 2MPI/20t | 4MPI/10t | 5MPI/8t | 8MPI/5t | 10MPI/4t | 20MPI/2t | 40MPI/1t |
|-------|--------|------|---------|----------|----------|----------|---------|---------|----------|----------|----------|
| full  | off    | no   | default | 172 | 139 | 123 | 125 | 120 | 117 | 116 | 118 |
| full  | off    | host | default | 172 | 139 | 123 | 125 | 120 | 117 | 116 | 118 |
| full  | off    | dev  | default | 172 | 139 | 123 | 125 | 120 | 117 | 116 | 119 |
| full  | on     | no   | default | 176 | 145 | 125 | 128 | 120 | 119 | 116 | 118 |
| half  | on     | no   | default | 190 | 135 | 112 | 119 | 103 | 102 | 97  | 94  |
The influence of `comm` can be seen in the output file, as it prints the following:

```
WARNING: Fixes cannot yet send data in Kokkos communication, switching to classic communication (src/KOKKOS/comm_kokkos.cpp:493)
```
{: .output}
This means that the fixes we are using in this calculation are not yet supported in Kokkos communication, and hence using different values of the `comm` keyword makes no difference.

Switching `newton` on and using the `half` neighbour list makes the runs faster for most of the MPI/OpenMP settings. When the `half` neighbour list and OpenMP are used together in Kokkos, data duplication is employed to make the kernels thread-safe. With a relatively small number of threads (8 or fewer) this can be fastest, but with more threads it becomes memory-bound (since there are more copies of the same data filling up RAM) and suffers from poor scalability as the thread count increases. If you look at the data in the above table carefully, you will notice that using 40 OpenMP threads with `neigh` = `half` and `newton` = `on` makes the run slower. On the other hand, when you use only 1 OpenMP thread per MPI rank, no data duplication or atomic operations are required, and hence it produces the fastest run.

So, we'll be using `neigh half newton on comm host` for all the runs in the scalability studies below.
{: .solution}
{: .challenge}
As before, doing a full scalability study would be a time-consuming undertaking, so let's take an example on nodes with 2x20 cores, as we did in an exercise a few episodes ago, where a total of 80 calculations would be needed for the 10 nodes.
The results from this study can be found in the csv file (INCLUDE LINK). Using parallel_eff.py (INCLUDE LINK), make a plot of parallel efficiency versus the number of nodes. The script will calculate the parallel efficiency (typically the single-node runtime divided by N times the N-node runtime) for you.
Compare this plot with the plot you generated in a previous exercise. Write down your observations and comment on any performance enhancement when you compare these results with the pure MPI runs.
Consider your own HPC system. How would a similar study look on your own system?
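If you repeat the study on your own system, the timings first have to be collected from the LAMMPS log files. A minimal sketch, assuming the log files are named `log.<N>nodes` for each node count used:

```
# extract the "Loop time" reported by LAMMPS for each node count into a CSV
echo "nodes,loop_time" > timings.csv
for n in 1 2 4 6 8 10; do   # node counts used in the study (adjust to yours)
  t=$(grep "Loop time" log.${n}nodes | awk '{print $4}')
  echo "${n},${t}" >> timings.csv
done
```
{: .bash}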
FIX SCALES
Consider this plot of a full scalability study, and compare it with the one seen in a previous exercise; you can make the following observations.
Data for the pure MPI-based run is plotted with the thick blue line. Strikingly, none of the Kokkos-based mixed MPI/OpenMP runs show parallel performance comparable with the pure MPI-based approach. The difference in parallel efficiency is more pronounced at lower node counts, and this gap in performance narrows slowly as we increase the number of nodes used to run the job. This indicates that, to see performance comparable with the pure MPI-based runs, we would need to increase the number of nodes far beyond what is used in the current study.
If we now compare the performance of Kokkos OpenMP with the threading implemented in the USER-OMP package, there is quite a bit of difference.
This difference could be due to vectorization. Currently (version `7Aug19` or `3Mar20`) the Kokkos package in LAMMPS doesn't vectorize as well as the vectorization implemented in the USER-OMP package. USER-INTEL should be even better than USER-OMP at vectorizing, if the styles are supported in that package. The 'deceleration' is probably due to the Kokkos and OpenMP overheads needed to make the kernels thread-safe.
If we just compare the performance among the Kokkos OpenMP runs, we see that the parallel efficiency values converge even for higher thread counts (1 to 20) as we increase the number of nodes. This indicates that Kokkos OpenMP scales better with increasing thread counts compared to the USER-OMP package.
{: .solution}
{: .challenge}