Code for testing the use of `dcgmi` to collect GPU statistics on a Slurm cluster
We can use the Slurm accounting database (via `sacct`) to collect the elapsed time of a job and the number of GPUs used (i.e. the `--format` field `alloctres`). With a little massaging, we can get the allocated GPU time (i.e. number of GPUs * elapsed time).
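For example, here is a minimal sketch of pulling those fields out of `sacct` and converting them to GPU-hours; the job id is a placeholder and the awk conversion assumes an `HH:MM:SS` elapsed time (no days field):

# Elapsed time and allocated TRES (AllocTRES contains gres/gpu=N) for a finished job
$ sacct -j 123456 --format=JobID,Elapsed,AllocTRES -P

# Rough GPU-hours for the allocation line (first row of output)
$ sacct -j 123456 --format=Elapsed,AllocTRES -P -n | head -1 | \
    awk -F'|' '{ gpus=0; if (match($2, /gres\/gpu=[0-9]+/)) gpus=substr($2, RSTART+9, RLENGTH-9);
                 split($1, t, ":"); print gpus * (t[1] + t[2]/60 + t[3]/3600), "GPU-hours" }'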
If you want to know how efficiently the GPUs were used, that is a much harder prospect. To do this, we will use `dcgmi` to collect GPU memory and SM utilization information for the tasks run on each GPU. We follow the instructions given in Job Statistics with Nvidia Data Center GPU Manager and Slurm and the DCGMI documentation (see the references below).
Caveats:
- `dcgmi` only reports the efficiency of processes that actually run on the GPU. So if, for instance, a GPU is allocated for 60 minutes but only has a process on it for 1 minute, `dcgmi` might still report that the process used the streaming multiprocessors at 95%.
- If you follow the above instructions and use `CUDA_VISIBLE_DEVICES`, you won't always collect stats from the correct GPUs, because `CUDA_VISIBLE_DEVICES` always starts indexing from 0, regardless of the index listed by `nvidia-smi`. Instead you'd want to use `SLURM_JOB_GPUS` (see the sketch after this list).
- This hasn't been tested with more complicated jobs with multiple steps.
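A quick way to see the difference from inside a batch script; the echoed values and the group name `my_job_gpus` are illustrative:

# On a node where (say) GPUs 2 and 3 were allocated to the job:
echo "CUDA_VISIBLE_DEVICES = ${CUDA_VISIBLE_DEVICES}"   # 0,1  (renumbered from zero)
echo "SLURM_JOB_GPUS       = ${SLURM_JOB_GPUS}"         # 2,3  (matches nvidia-smi)

# So the dcgmi group should be built from SLURM_JOB_GPUS
dcgmi group -c my_job_gpus -a "${SLURM_JOB_GPUS}"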
You'll want to download the NVIDIA HPC SDK. Be sure you pick a version compatible with the installed version of CUDA on your compute nodes, i.e. check the version of CUDA:
$ nvidia-smi | grep CUDA
You'll also need to ensure `dcgmi` is installed, i.e. check that you have it:
$ which dcgmi
Then load the HPC SDK module and build the code:
$ ml load nvhpc-hpcx-cuda12/23.9
$ make
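As a quick sanity check that the SDK module you loaded matches the driver's CUDA version reported by `nvidia-smi` (a sketch, not part of the Makefile):

$ nvcc --version | grep release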
The recommended way of running this on a Slurm cluster is to use `srun` to start copies of the parallel process. When I tried calling `mpiexec` directly, I sometimes wound up with two processes on the same GPU. I got the desired behavior using `srun` (a sketch of such a submit script follows the commands below).
# One node exp. (non-interactive)
$ sbatch submit_2task_1node.sh
# Two node exp
$ sbatch submit_2task_2node.sh
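For reference, here is a minimal sketch of what such a submit script could look like; the resource flags and the binary name `./mpi_gpu_test` are assumptions, not the actual contents of `submit_2task_1node.sh`:

#!/bin/bash
#SBATCH --job-name=dcgmi_test
#SBATCH --nodes=1
#SBATCH --ntasks=2            # one MPI rank per GPU
#SBATCH --gpus-per-task=1
#SBATCH --time=00:10:00

ml load nvhpc-hpcx-cuda12/23.9

# srun (not mpiexec) so that Slurm binds each rank to its own GPU
srun ./mpi_gpu_test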
Below is the process for collecting stats from the GPUs (a sketch tying these steps together follows the list):

- Create a `dcgmi` group to follow the particular GPUs allocated for your job
  # i,j,k are GPU indices enumerated by nvidia-smi
  $ dcgmi group -c some_name -a i,j,k
  # Be sure to store the dcgmi group id
- Enable process watches
  $ dcgmi stats -e
- Start collecting info on running processes
  $ dcgmi stats -g dcgmi_group_id -s some_label
- Start your GPU process
- After the GPU process has ended, stop the collection of stats
  $ dcgmi stats -x some_label
- Query the statistics
  $ dcgmi stats -v -j some_label
- Delete the group
  $ dcgmi group -d dcgmi_group_id
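Putting it together, here is a sketch of a wrapper that runs these steps around a job step from inside the batch script. The group name, label, group-id parsing, and the binary `./mpi_gpu_test` are assumptions, and the exact `dcgmi` output format may differ between DCGM versions:

#!/bin/bash
# Sketch: wrap an srun step with dcgmi job-stats collection

LABEL="job_${SLURM_JOB_ID}"

# 1. Group only this job's GPUs, using the indices Slurm reports (not CUDA_VISIBLE_DEVICES),
#    and pull the numeric group id out of dcgmi's confirmation message
GROUP_ID=$(dcgmi group -c "grp_${SLURM_JOB_ID}" -a "${SLURM_JOB_GPUS}" | grep -oE '[0-9]+' | tail -1)

# 2. Enable process watches and start recording under our label
dcgmi stats -e
dcgmi stats -g "${GROUP_ID}" -s "${LABEL}"

# 3. Run the GPU workload
srun ./mpi_gpu_test

# 4. Stop recording, print the verbose report, and clean up the group
dcgmi stats -x "${LABEL}"
dcgmi stats -v -j "${LABEL}"
dcgmi group -d "${GROUP_ID}"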
Observations:
- After starting `dcgmi stats`, even if I inject a `sleep 300` command, the Average SM Utilization does not reflect the time the GPU was unused.
- I am compiling the code (via `make`) using `nvcc` on a DGX system. When I call `MPI_Finalize()`, warnings like the following appear:
  [1741636331.473893] [somehost:123] mpool.c:55 UCX WARN object 0x55555b12a040 was not returned to mpool CUDA EVENT objects
  UCX seems to be middleware running on the InfiniBand (IB) network card. I need to do a deeper dive to understand what is going on here. I don't think I observed this behavior before with multi-GPU MPI jobs, so I suspect this is very hardware specific.
- Because I'm calling my MPI code via `srun` rather than `mpirun`, I found it challenging to adjust the runtime behavior using MPI MCA modules. According to the OpenMPI documentation, you can control the MCA runtime modules used by `mpirun` by modifying environment variables, e.g.
  export OMPI_MCA_mpi_common_cuda_verbose=10
  export OMPI_MCA_pml_ucx_verbose=3
- As shown above, you can control the verbose diagnostic output of both UCX and CUDA (see the sketch after this list for how I set these when submitting via Slurm).
- At some point I was getting segmentation faults:
  a) They all occur near or during `MPI_Finalize()`.
  b) I think some of them might have been related to having conflicting nvhpc modules loaded.
  c) Some of them were due to me trying to free CUDA managed memory.
  d) I think these may be related to the UCX WARN above.
Open questions:
- Q: For `dcgmi stats -e`, do I need to specify a group?
- Q: Can a normal user stop `dcgmi stats` on group `-g 0`?
TODO:
- Work on understanding the UCX warnings and mitigating them.
- Create a Python script to parse the verbose `dcgmi stats` output into more concise GPU details.
- Work on understanding how MPI, CUDA, NVLink, and InfiniBand interact; it seems quite hardware dependent. Useful resources:
  c) OpenMPI - Modular Component Architecture
  d) OpenMPI - Infiniband / RoCE Support
References:
- Job Statistics with Nvidia Data Center GPU Manager and Slurm
- cuda-gdb with MPI
- BlueField and DOCA Programming Guides
- OpenMPI 4.1 README
- Open UCX with MPI
- CUDA Unified Memory
- OSU Benchmarks (use for inspiration)
- DCGMI Documentation
- OpenMPI MCA Documentation
- DCGM Documentation
- Lawrence Livermore's Awesome MPI Documentation