
Commit 69f9c2e

ax3l authored and dpgrote committed
Docs: New OLCF Machine (BLAST-WarpX#3228)
1 parent d11c93d commit 69f9c2e

File tree: 4 files changed, +202 −0 lines

Docs/source/install/hpc.rst

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@ HPC Systems
   hpc/summit
   hpc/spock
   hpc/crusher
+  hpc/frontier
   hpc/juwels
   hpc/lassen
   hpc/quartz
Docs/source/install/hpc/frontier.rst

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
.. _building-frontier:

Frontier (OLCF)
===============

The `Frontier cluster (see: Crusher) <https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html>`_ is located at OLCF.
Each node contains 4 AMD MI250X GPUs, each with 2 Graphics Compute Dies (GCDs), for a total of 8 GCDs per node.
You can think of the 8 GCDs as 8 separate GPUs, each having 64 GB of high-bandwidth memory (HBM2E).
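
A quick, optional check (a sketch, not part of the official instructions): on a compute node, the standard ROCm tool ``rocm-smi`` (available once the ``rocm`` module from the profile below is loaded) should list the 8 GCDs as 8 separate GPU devices.

.. code-block:: bash

   # on a compute node: list the visible GPU devices (expect 8 GCDs)
   rocm-smi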

If you are new to this system, please see the following resources:

* `Crusher user guide <https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html>`_
* Batch system: `Slurm <https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#running-jobs>`_
* `Production directories <https://docs.olcf.ornl.gov/data/storage_overview.html>`_:

  * ``$PROJWORK/$proj/``: shared with all members of a project (recommended)
  * ``$MEMBERWORK/$proj/``: single user (usually smaller quota)
  * ``$WORLDWORK/$proj/``: shared with all users

* Note that the ``$HOME`` directory is mounted as read-only on compute nodes.
  That means you cannot run in your ``$HOME`` (see the sketch below).
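
A minimal sketch for preparing a run directory on the recommended ``$PROJWORK`` file system (the sub-directory name is only an illustration; ``$proj`` is the project account set in the profile below):

.. code-block:: bash

   # hypothetical run directory on the shared project work file system
   mkdir -p $PROJWORK/$proj/$USER/warpx_runs
   cd $PROJWORK/$proj/$USER/warpx_runs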


Installation
------------

Use the following commands to download the WarpX source code and switch to the correct branch.
**You have to do this on Summit/OLCF Home/etc. since Frontier cannot connect directly to the internet**:

.. code-block:: bash

   git clone https://github.com/ECP-WarpX/WarpX.git $HOME/src/warpx
   git clone https://github.com/AMReX-Codes/amrex.git $HOME/src/amrex
   git clone https://github.com/ECP-WarpX/picsar.git $HOME/src/picsar
   git clone -b 0.14.5 https://github.com/openPMD/openPMD-api.git $HOME/src/openPMD-api

To enable HDF5, work around the broken (empty) ``HDF5_VERSION`` variable in the Cray PE by commenting out the following lines in ``$HOME/src/openPMD-api/CMakeLists.txt``:
https://github.com/openPMD/openPMD-api/blob/0.14.5/CMakeLists.txt#L216-L220
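
One possible way to do this is the sketch below; it assumes the line range from the link above still applies, so double-check the file before editing:

.. code-block:: bash

   # comment out lines 216-220 of the openPMD-api CMakeLists.txt (line range taken from the link above)
   sed -i '216,220 s/^/#/' $HOME/src/openPMD-api/CMakeLists.txt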

We use the following modules and environments on the system (``$HOME/frontier_warpx.profile``).

.. literalinclude:: ../../../../Tools/machines/frontier-olcf/frontier_warpx.profile.example
   :language: bash
   :caption: You can copy this file from ``Tools/machines/frontier-olcf/frontier_warpx.profile.example``.

We recommend storing the above lines in a file, such as ``$HOME/frontier_warpx.profile``, and loading it into your shell after a login:

.. code-block:: bash

   source $HOME/frontier_warpx.profile


Then, ``cd`` into the directory ``$HOME/src/warpx`` and use the following commands to compile:

.. code-block:: bash

   cd $HOME/src/warpx
   rm -rf build

   cmake -S . -B build \
     -DWarpX_COMPUTE=HIP \
     -DWarpX_amrex_src=$HOME/src/amrex \
     -DWarpX_picsar_src=$HOME/src/picsar \
     -DWarpX_openpmd_src=$HOME/src/openPMD-api
   cmake --build build -j 32

The general :ref:`cmake compile-time options <building-cmake>` apply as usual.
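
The profile above also loads ``ninja`` and ``ccache``; the sketch below is an optional variant of the configure step that uses them for faster rebuilds. It is not part of the official instructions and relies only on standard CMake mechanisms (``-G Ninja``, ``CMAKE_CXX_COMPILER_LAUNCHER``):

.. code-block:: bash

   # optional: generate Ninja build files and cache compilations with ccache
   cmake -S . -B build -G Ninja \
     -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
     -DWarpX_COMPUTE=HIP \
     -DWarpX_amrex_src=$HOME/src/amrex \
     -DWarpX_picsar_src=$HOME/src/picsar \
     -DWarpX_openpmd_src=$HOME/src/openPMD-api
   cmake --build build -j 32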


.. _running-cpp-frontier:

Running
-------

.. _running-cpp-frontier-MI250X-GPUs:

MI250X GPUs (2x64 GB)
^^^^^^^^^^^^^^^^^^^^^

After requesting an interactive node with the ``getNode`` alias above, run a simulation like this, here using 8 MPI ranks and a single node:

.. code-block:: bash

   runNode ./warpx inputs
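
For reference, the same run spelled out as an explicit ``srun`` call, using the options with which the ``runNode`` alias is defined in the profile above:

.. code-block:: bash

   # same as the runNode alias: 8 MPI ranks on one node, one GCD per rank
   srun -A $proj -J warpx -t 00:30:00 -p batch -N 1 \
        --ntasks-per-node=8 --gpus-per-task=1 --gpu-bind=closest \
        ./warpx inputs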

Or in non-interactive runs:

.. literalinclude:: ../../../../Tools/machines/frontier-olcf/submit.sh
   :language: bash
   :caption: You can copy this file from ``Tools/machines/frontier-olcf/submit.sh``.
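
Once the ``<project id>`` placeholder in the script is filled in, submission follows standard Slurm usage, e.g.:

.. code-block:: bash

   sbatch submit.sh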


.. _post-processing-frontier:

Post-Processing
---------------

For post-processing, most users use Python via OLCF's `Jupyter service <https://jupyter.olcf.ornl.gov>`__ (`Docs <https://docs.olcf.ornl.gov/services_and_applications/jupyter/index.html>`__).

Please follow the same guidance as for :ref:`OLCF Summit post-processing <post-processing-summit>`.

.. _known-frontier-issues:

Known System Issues
-------------------

.. warning::

   May 16th, 2022 (OLCFHELP-6888):
   There is a caching bug in Libfabric that causes WarpX simulations to occasionally hang on Crusher on more than 1 node.

   As a work-around, please export the following environment variable in your job scripts until the issue is fixed:

   .. code-block:: bash

      export FI_MR_CACHE_MAX_COUNT=0  # libfabric: disable caching
Tools/machines/frontier-olcf/frontier_warpx.profile.example

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# please set your project account
#export proj=APH114-frontier

# required dependencies
module load cmake/3.22.2
module load craype-accel-amd-gfx90a
module load rocm/5.1.0
module load cray-mpich
module load cce/14.0.1  # must be loaded after rocm

# optional: faster builds
module load ccache
module load ninja

# optional: just an additional text editor
module load nano

# optional: for PSATD in RZ geometry support (not yet available)
#module load cray-libsci_acc/22.06.1.2
#module load blaspp
#module load lapackpp

# optional: for QED lookup table generation support
module load boost/1.79.0-cxx17

# optional: for openPMD support
#module load adios2/2.7.1
module load cray-hdf5-parallel/1.12.1.1

# optional: for Python bindings or libEnsemble
module load cray-python/3.9.12.1

# fix system defaults: do not escape $ with a \ on tab completion
shopt -s direxpand

# make output group-readable by default
umask 0027

# an alias to request an interactive batch node for one hour
# for parallel execution, start on the batch node: srun <command>
alias getNode="salloc -A $proj -J warpx -t 01:00:00 -p batch -N 1 --ntasks-per-node=8 --gpus-per-task=1 --gpu-bind=closest"
# an alias to run a command on a batch node for up to 30min
# usage: runNode <command>
alias runNode="srun -A $proj -J warpx -t 00:30:00 -p batch -N 1 --ntasks-per-node=8 --gpus-per-task=1 --gpu-bind=closest"

# GPU-aware MPI
export MPICH_GPU_SUPPORT_ENABLED=1

# optimize ROCm/HIP compilation for MI250X
export AMREX_AMD_ARCH=gfx90a

# compiler environment hints
export CC=$(which cc)
export CXX=$(which CC)
export FC=$(which ftn)
export CFLAGS="-I${ROCM_PATH}/include"
export CXXFLAGS="-I${ROCM_PATH}/include -Wno-pass-failed"
export LDFLAGS="-L${ROCM_PATH}/lib -lamdhip64"
Tools/machines/frontier-olcf/submit.sh

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
#!/usr/bin/env bash

#SBATCH -A <project id>
#SBATCH -J warpx
#SBATCH -o %x-%j.out
#SBATCH -t 00:10:00
#SBATCH -p batch
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
#SBATCH -N 1

# From the documentation:
# Each Frontier compute node consists of [1x] 64-core AMD EPYC 7A53
# "Optimized 3rd Gen EPYC" CPU (with 2 hardware threads per physical core) with
# access to 512 GB of DDR4 memory.
# Each node also contains [4x] AMD MI250X, each with 2 Graphics Compute Dies
# (GCDs) for a total of 8 GCDs per node. The programmer can think of the 8 GCDs
# as 8 separate GPUs, each having 64 GB of high-bandwidth memory (HBM2E).

# note (5-16-22 and 7-12-22)
# this environment setting is currently needed on Frontier to work around a
# known issue with Libfabric (both in the May and June PE)
export FI_MR_CACHE_MAX_COUNT=0

export OMP_NUM_THREADS=8
srun ./warpx inputs > output.txt
