
Unable to run nsys or CUPTI profiling on K8 cluster with gpu-operator #1158

Open
manepallirajesh opened this issue Dec 9, 2024 · 4 comments

Comments

@manepallirajesh

I could not run GPU profiling on a K8s cluster with gpu-operator.

I tested the following:

  1. Node without K8s, Docker-only flow (same HW + NVIDIA drivers + containers): runs fine
  2. Node with K8s + gpu-operator: hits this issue
    Image

Steps to reproduce (a rough shell transcript follows the list):

  1. Deploy an NGC-compatible container
  2. Open a terminal inside the container
  3. cd /usr/local/cuda/extras/CUPTI/sample/callback_profiling
  4. make
  5. Run ./callback_profiling or nsys profile ./callback_profiling
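A rough transcript of those steps, assuming the container runs as a pod (the pod name tf-profiling is a placeholder, not from the issue):

$ kubectl exec -it tf-profiling -- bash
$ cd /usr/local/cuda/extras/CUPTI/sample/callback_profiling
$ make
$ ./callback_profiling                     # run the CUPTI sample directly
$ nsys profile ./callback_profiling        # or under Nsight Systems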
@avnf

avnf commented Dec 30, 2024

Hello, can you please check the output of this command
cat /proc/driver/nvidia/params | grep RmProfilingAdminOnly
both on the node without K8s and inside K8s + gpu-operator? Which container did you use?
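For reference, the output should look roughly like the line below; the value shown is illustrative, not taken from the issue (1 means profiling is restricted to admin users, 0 means unrestricted):

$ cat /proc/driver/nvidia/params | grep RmProfilingAdminOnly
RmProfilingAdminOnly: 1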

@manepallirajesh
Author

Thanks for the response.

Where it is not working (K8s cluster + gpu-operator):
Image

Where it is working (bare node with driver):
Image

I am using nvcr.io/nvidia/tensorflow:24.11-tf2-py3

@avnf

avnf commented Dec 30, 2024

This is likely the cause of the issue. The kernel module parameter NVreg_RestrictProfilingToAdminUsers is documented as defaulting to "do not restrict" in the kernel driver: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/550/kernel-open/nvidia/nv-reg.h#L526

Option: RestrictProfilingToAdminUsers
Description:

When this option is enabled, the NVIDIA kernel module will prevent users
without administrative access (i.e., the CAP_SYS_ADMIN capability) from
using GPU performance counters.

Possible Values:

0: Do not restrict GPU counters (default)
1: Restrict GPU counters to system administrators only

I can't find any mention of this parameter in gpu-operator, neither for "RestrictProfilingToAdminUsers" (https://github.com/search?q=org%3ANVIDIA+RestrictProfiling&type=code) nor for "RmProfilingAdminOnly" (https://github.com/search?q=org%3ANVIDIA+ProfilingAdmin&type=code).

Update: there is a document, https://download.nvidia.com/XFree86/Linux-x86_64/550.67/README/knownissues.html, which states that by default access is limited to the root user:

By default, access to the GPU performance counters is restricted to root, and other users with the CAP_SYS_ADMIN capability, for security reasons. If developers require access to the NVIDIA Developer Tools, a system administrator can accept the security risk and allow access to users without the CAP_SYS_ADMIN capability.
Wider access to GPU performance counters can be granted by setting the kernel module parameter "NVreg_RestrictProfilingToAdminUsers=0" in the nvidia.ko kernel module.
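On a bare node (outside gpu-operator), the usual way to apply that parameter is a modprobe configuration file followed by a driver reload; a minimal sketch, assuming the stock nvidia.ko module and an arbitrary file name nvidia-profiling.conf:

$ echo "options nvidia NVreg_RestrictProfilingToAdminUsers=0" | sudo tee /etc/modprobe.d/nvidia-profiling.conf
$ sudo update-initramfs -u    # Debian/Ubuntu; use dracut -f on RHEL-family distros
$ sudo reboot                 # reload nvidia.ko with the new parameter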

@avnf

avnf commented Dec 30, 2024

gpu-operator supports custom kernel module parameters via the kernelModuleConfig ConfigMap. Could you please follow these instructions
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/custom-driver-params.html
and set "NVreg_RestrictProfilingToAdminUsers=0":

$ cat nvidia.conf
NVreg_RestrictProfilingToAdminUsers=0
$ kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia.conf=./nvidia.conf
$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --version=v24.9.1 \
     --set driver.kernelModuleConfig.name="kernel-module-params"
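Once the driver pod has redeployed with the new parameter, you should be able to confirm it from inside your container; a sketch of the expected check, assuming the setting was picked up:

$ cat /proc/driver/nvidia/params | grep RmProfilingAdminOnly
RmProfilingAdminOnly: 0
$ ./callback_profiling    # the CUPTI sample should now run without the permission error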
