Code for testing the use of `dcgmi` to collect GPU statistics on a Slurm cluster
We can use the Slurm accounting database (via `sacct`) to collect the elapsed time of a job and the number of GPUs used (i.e. the `--format` field `alloctres`). With a little massaging, we can get the allocated GPU time (i.e. number of GPUs * elapsed time).
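For example, here is a minimal sketch of pulling those fields out of `sacct` and converting them to GPU-hours; the job id is a placeholder and the awk conversion assumes an `HH:MM:SS` elapsed time (no days field):

# Elapsed time and allocated TRES (AllocTRES contains gres/gpu=N) for a finished job
$ sacct -j 123456 --format=JobID,Elapsed,AllocTRES -P

# Rough GPU-hours for the allocation line (first row of output)
$ sacct -j 123456 --format=Elapsed,AllocTRES -P -n | head -1 | \
    awk -F'|' '{ gpus=0; if (match($2, /gres\/gpu=[0-9]+/)) gpus=substr($2, RSTART+9, RLENGTH-9);
                 split($1, t, ":"); print gpus * (t[1] + t[2]/60 + t[3]/3600), "GPU-hours" }'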
If you want to know how efficiently the GPUs were used, that is a much harder prospect. To do this, we will use `dcgmi` to collect GPU memory and SM utilization information for the tasks run on each GPU. We follow the instructions given in Job Statistics with Nvidia Data Center GPU Manager and Slurm and the DCGMI documentation (see the references below).
Caveats:
- `dcgmi` only reports the efficiency of processes that actually run on the GPU. So if, for instance, a GPU is allocated for 60 minutes but only has a process on it for 1 minute, `dcgmi` might still report that the process used the streaming multiprocessors at 95%.
- If you follow the above instructions and use `CUDA_VISIBLE_DEVICES`, you won't always collect stats from the correct GPUs, because `CUDA_VISIBLE_DEVICES` always starts indexing from 0, regardless of the index listed by `nvidia-smi`. Instead you'd want to use `SLURM_JOB_GPUS` (see the sketch after this list).
- This hasn't been tested with more complicated jobs with multiple steps.
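A quick way to see the difference from inside a batch script; the echoed values and the group name `my_job_gpus` are illustrative:

# On a node where (say) GPUs 2 and 3 were allocated to the job:
echo "CUDA_VISIBLE_DEVICES = ${CUDA_VISIBLE_DEVICES}"   # 0,1  (renumbered from zero)
echo "SLURM_JOB_GPUS       = ${SLURM_JOB_GPUS}"         # 2,3  (matches nvidia-smi)

# So the dcgmi group should be built from SLURM_JOB_GPUS
dcgmi group -c my_job_gpus -a "${SLURM_JOB_GPUS}"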
You'll want to download the NVIDIA HPC SDK. Be sure you pick a version compatible with the installed version of CUDA on your compute nodes, i.e. check the version of CUDA:
$ nvidia-smi | grep CUDA
You'll also need to ensure `dcgmi` is installed, i.e. check that you have it:
$ which dcgmi
Then load the HPC SDK module and build the code:
$ ml load nvhpc-hpcx-cuda12/23.9
$ make
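As a quick sanity check that the SDK module you loaded matches the driver's CUDA version reported by `nvidia-smi` (a sketch, not part of the Makefile):

$ nvcc --version | grep release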
The recommended way of running this on a Slurm cluster is to use `srun` to start copies of the parallel process. When I tried calling `mpiexec` directly, I sometimes wound up with two processes on the same GPU. I got the desired behavior using `srun` (a sketch of such a submit script follows the commands below).
# One node exp. (non-interactive)
$ sbatch submit_2task_1node.sh
# Two node exp
$ sbatch submit_2task_2node.sh
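For reference, here is a minimal sketch of what such a submit script could look like; the resource flags and the binary name `./mpi_gpu_test` are assumptions, not the actual contents of `submit_2task_1node.sh`:

#!/bin/bash
#SBATCH --job-name=dcgmi_test
#SBATCH --nodes=1
#SBATCH --ntasks=2            # one MPI rank per GPU
#SBATCH --gpus-per-task=1
#SBATCH --time=00:10:00

ml load nvhpc-hpcx-cuda12/23.9

# srun (not mpiexec) so that Slurm binds each rank to its own GPU
srun ./mpi_gpu_test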
Below is the process for collecting stats from the GPUs (a sketch tying these steps together follows the list):

- Create a `dcgmi` group to follow the particular GPUs allocated for your job
  # i,j,k are GPU indices enumerated by nvidia-smi
  $ dcgmi group -c some_name -a i,j,k
  # Be sure to store the dcgmi group id
- Enable process watches
  $ dcgmi stats -e
- Start collecting info on running processes
  $ dcgmi stats -g dcgmi_group_id -s some_label
- Start your GPU process
- After the GPU process has ended, stop the collection of stats
  $ dcgmi stats -x some_label
- Query the statistics
  $ dcgmi stats -v -j some_label
- Delete the group
  $ dcgmi group -d dcgmi_group_id
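Putting it together, here is a sketch of a wrapper that runs these steps around a job step from inside the batch script. The group name, label, group-id parsing, and the binary `./mpi_gpu_test` are assumptions, and the exact `dcgmi` output format may differ between DCGM versions:

#!/bin/bash
# Sketch: wrap an srun step with dcgmi job-stats collection

LABEL="job_${SLURM_JOB_ID}"

# 1. Group only this job's GPUs, using the indices Slurm reports (not CUDA_VISIBLE_DEVICES),
#    and pull the numeric group id out of dcgmi's confirmation message
GROUP_ID=$(dcgmi group -c "grp_${SLURM_JOB_ID}" -a "${SLURM_JOB_GPUS}" | grep -oE '[0-9]+' | tail -1)

# 2. Enable process watches and start recording under our label
dcgmi stats -e
dcgmi stats -g "${GROUP_ID}" -s "${LABEL}"

# 3. Run the GPU workload
srun ./mpi_gpu_test

# 4. Stop recording, print the verbose report, and clean up the group
dcgmi stats -x "${LABEL}"
dcgmi stats -v -j "${LABEL}"
dcgmi group -d "${GROUP_ID}"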
Observations:
- After starting `dcgmi stats`, even if I inject a `sleep 300` command, the Average SM Utilization does not reflect the time the GPU was unused.
- I am compiling the code (via `make`) using `nvcc` on a DGX system. When I call `MPI_Finalize()`, warnings like the following appear:
  [1741636331.473893] [somehost:123] mpool.c:55 UCX WARN object 0x55555b12a040 was not returned to mpool CUDA EVENT objects
  UCX seems to be middleware running on the InfiniBand (IB) network card. I need to do a deeper dive to understand what is going on here. I don't think I observed this behavior before with multi-GPU MPI jobs, so I suspect this is very hardware specific.
- Because I'm calling my MPI code via `srun` rather than `mpirun`, I found it challenging to adjust the runtime behavior using MPI MCA modules. According to the OpenMPI documentation, you can control the MCA runtime modules used by `mpirun` by modifying environment variables, e.g.
  export OMPI_MCA_mpi_common_cuda_verbose=10
  export OMPI_MCA_pml_ucx_verbose=3
- As shown above, you can control the verbose diagnostic output of both UCX and CUDA (see the sketch after this list for how I set these when submitting via Slurm).
- At some point I was getting segmentation faults:
  a) They all occur near or during `MPI_Finalize()`.
  b) I think some of them might have been related to having conflicting nvhpc modules loaded.
  c) Some of them were due to me trying to free CUDA managed memory.
  d) I think these may be related to the UCX WARN above.
Open questions:
- Q: For `dcgmi stats -e`, do I need to specify a group?
- Q: Can a normal user stop `dcgmi stats` on group `-g 0`?
TODO:
- Work on understanding the UCX warnings and mitigating them.
- Create a Python script to parse the verbose `dcgmi stats` output into more concise GPU details.
- Work on understanding how MPI, CUDA, NVLink, and InfiniBand interact; it seems quite hardware dependent. Useful resources:
  c) OpenMPI - Modular Component Architecture
  d) OpenMPI - Infiniband / RoCE Support
References:
- Job Statistics with Nvidia Data Center GPU Manager and Slurm
- cuda-gdb with MPI
- BlueField and DOCA Programming Guides
- OpenMPI 4.1 README
- Open UCX with MPI
- CUDA Unified Memory
- OSU Benchmarks (use for inspiration)
- DCGMI Documentation
- OpenMPI MCA Documentation
- DCGM Documentation
- Lawrence Livermore's Awesome MPI Documentation