dcgmi-slurm-stats

A code to test using dcgmi to collect GPU statistics in a Slurm cluster

Introduction

We can use the Slurm accounting database (via sacct) to collect the elapsed time of a job and the number of GPUs allocated to it (i.e. the AllocTRES field via --format). With a little massaging, we can get the allocated GPU time (i.e. number of GPUs * elapsed time).
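
For example, the relevant fields can be pulled with something like the following (the job ID here is hypothetical; AllocTRES contains a gres/gpu=N entry when GPUs were allocated):

$ sacct -j 1234567 -X --format=JobID,Elapsed,AllocTRES
# Allocated GPU time = N (from gres/gpu=N) * Elapsed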

Determining how efficiently the GPUs were actually used is a much harder prospect. To do this, we will use dcgmi to collect GPU memory and SM (streaming multiprocessor) utilization for the tasks run on each GPU.

We follow the instructions given by Job Statistics with Nvidia Data Center GPU Manager and Slurm and the DCGMI User Guide.

Caveats :

  1. dcgmi only reports the efficiency of processes that actually run on the GPU. So if, for instance, a GPU is allocated for 60 min but only has a process on it for 1 min, dcgmi might still report that the process used the streaming multiprocessors at 95%.
  2. If you follow the above instructions and use CUDA_VISIBLE_DEVICES, you won't always collect stats from the correct GPUs, because CUDA_VISIBLE_DEVICES always starts indexing from 0, regardless of the index listed by nvidia-smi. Instead you'd want to use SLURM_JOB_GPUS (see the sketch after this list).
  3. This hasn't been tested with more complicated jobs with multiple steps.
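
For caveat 2, a minimal sketch, assuming SLURM_JOB_GPUS holds the nvidia-smi indices of the GPUs allocated to the job (the group name is arbitrary):

# e.g. prints "2,3", the indices as enumerated by nvidia-smi
$ echo $SLURM_JOB_GPUS
$ dcgmi group -c job_group -a $SLURM_JOB_GPUS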

Installation :

You'll want to download the NVIDIA HPC SDK. Be sure you pick a version compatible with the installed version of CUDA on your compute nodes.

Check the version of CUDA:

$ nvidia-smi | grep CUDA

You'll also need to ensure dcgmi is installed, i.e. check that you have it:

$ which dcgmi

Compiling

$ ml load nvhpc-hpcx-cuda12/23.9
$ make

Running

The recommended way of running this on a Slurm cluster is to use srun to start copies of the parallel process. When I tried calling mpiexec directly, I sometimes wound up with two processes on the same GPU; using srun gave the desired one-process-per-GPU behavior.

# One-node experiment (non-interactive)
$ sbatch submit_2task_1node.sh

# Two-node experiment
$ sbatch submit_2task_2node.sh
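
For reference, a rough sketch of what such a submit script might contain (the #SBATCH values and binary name below are illustrative assumptions, not the repo's exact scripts):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gres=gpu:2

ml load nvhpc-hpcx-cuda12/23.9
# srun starts the MPI tasks; the goal is one process per allocated GPU
srun ./mpi_cuda_test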

DCGMI Notes :

Below is the process for collecting stats from the GPUs (a consolidated sketch follows the list):

  1. Create a dcgmi group to track the particular GPUs allocated for your job

    # i,j,k are GPU indices enumerated by nvidia-smi 
    $ dcgmi group -c some_name -a i,j,k
    # Be sure to store the dcgmi group id
    
  2. Enable process watches

    $ dcgmi stats -e         
    
  3. Start collecting info on running processes

    $ dcgmi stats -g dcgmi_group_id -s some_label
    
  4. Start your GPU process

  5. After GPU process has ended, stop collection of stats

    $ dcgmi stats -x some_label
    
  6. Query the statistics

    $ dcgmi stats -v -j some_label
    
  7. Delete the group

    $ dcgmi group -d dcgmi_group_id
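
Put together inside a job script, the sequence above might look roughly like this (group name and label are arbitrary, the binary name is a placeholder, and the group id must be read from the output of dcgmi group -c):

dcgmi group -c job_${SLURM_JOB_ID} -a $SLURM_JOB_GPUS
GROUPID=...   # fill in with the group id printed by the command above
dcgmi stats -e
dcgmi stats -g $GROUPID -s job_${SLURM_JOB_ID}
srun ./mpi_cuda_test
dcgmi stats -x job_${SLURM_JOB_ID}
dcgmi stats -v -j job_${SLURM_JOB_ID}
dcgmi group -d $GROUPID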
    

Observations :

  1. After starting dcgmi stats, even if I inject a sleep 300 command (so the GPU sits idle), the reported Average SM Utilization does not reflect the time the GPU was unused (see the sketch below).
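
    A sketch of the kind of test meant here (group id, label, and binary name are placeholders):

    dcgmi stats -g $GROUPID -s idle_test
    sleep 300                  # GPU sits idle for 5 minutes
    srun ./mpi_cuda_test       # short GPU run
    dcgmi stats -x idle_test
    # The Average SM Utilization reported by "dcgmi stats -v -j idle_test"
    # still only reflects the time a process was actually on the GPU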

MPI / CUDA Notes and Observations :

  1. I am compiling the code (via make) using nvcc on a DGX system. When I call MPI_Finalize(), I get warnings like:

    [1741636331.473893] [somehost:123] mpool.c:55   UCX  WARN  object 0x55555b12a040 was not returned to mpool CUDA EVENT objects
    

    UCX seems to be communication middleware running on top of the IB network card. I think I need to do a deeper dive on this to understand what is going on here. I don't think I observed this behavior before with multi-GPU MPI jobs, so I suspect it is very hardware specific.

  2. Because I'm calling my MPI code via srun rather than mpirun, I found it challenging to adjust the runtime behavior using MPI MCA modules. According to the OpenMPI documentation, you can control the MCA runtime modules used by mpirun by setting environment variables, e.g.

    export OMPI_MCA_mpi_common_cuda_verbose=10
    export OMPI_MCA_pml_ucx_verbose=3
    
  3. As shown above, these environment variables control the verbosity of the diagnostics output by both UCX and the CUDA components; exporting them in the submit script before the srun call should be enough for the MPI ranks to pick them up.

  4. At some point I was getting segmentation faults:

    a) They all occurred near or during MPI_Finalize().

    b) I think some of them might have been related to having conflicting nvhpc modules loaded.

    c) Some of them were due to me trying to free CUDA managed memory.

    d) I think these may be related to the UCX WARN above.

Questions

  1. Q : When running dcgmi stats -e, do I need to specify a group?
  2. Q : Can a normal user stop dcgmi stats on group -g 0?

TO DO

  1. Work on understanding the UCX warnings and mitigating them

  2. Create a Python script to parse the verbose dcgmi stats output into more concise GPU details.

  3. Work on understanding how MPI, CUDA, NVLink and InfiniBand interact. It seems quite hardware dependent. Useful resources :

    a) UCX Programming Guide

    b) OpenUCX Read The Docs

    c) OpenMPI - Modular Component Architecture

    d) OpenMPI - Infiniband / RoCE Support

    e) CUDA Unified Memory

    f) Mixing MPI and CUDA

    g) OpenMPI - CUDA

    h) OSU : MPI over IB examples

    i) CUDA C++ Best Practices Guide

    j) Lawrence Livermore's Awesome MPI Documentation

References :

  1. Job Statistics with Nvidia Data Center GPU Manager and Slurm
  2. cuda-gdb with mpi
  3. Bluefield and DOCA Programming Guides
  4. OpenMPI 4.1 README
  5. Open UCX with MPI
  6. CUDA Unified Memory
  7. OSU Benchmarks - use for inspiration
  8. DCGMI Documentation
  9. OpenMPI MCA documentation
  10. DCGM Documentation
  11. Lawrence Livermore's Awesome MPI Documentation
