[BUG] Segmentation Fault when one GPU lost from PCIe bus #139

Closed
3 tasks done
Junyi-99 opened this issue Nov 19, 2024 · 1 comment · Fixed by #146
Labels: api (something related to the core APIs), bug (something isn't working), pynvml (something related to the `nvidia-ml-py` package)

@Junyi-99

Required prerequisites

  • I have read the documentation https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.3.2

Operating system and version

Ubuntu 22.04

NVIDIA driver version

560.35.03

NVIDIA-SMI

$ nvidia-smi -i 0
Tue Nov 19 14:59:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:1D:00.0 Off |                  N/A |
| 30%   28C    P8             31W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Python environment

$ python3 -m pip freeze | python3 -c 'import sys; print(sys.version, sys.platform); print("".join(filter(lambda s: any(word in s.lower() for word in ("nvi", "cuda", "nvml", "gpu")), sys.stdin)))'
3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] linux
gpustat==1.1.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.535.108
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
nvitop==1.3.2

Problem description

nvitop exits with a segmentation fault when one of the GPUs is lost from the PCIe bus.

First of all, this is not a problem with nvitop itself.

I encountered this issue and would like to suggest that nvitop should still be able to display other GPUs even when one GPU is faulty, instead of resulting in a segmentation fault.

It would be nice if nvitop could skip the faulty GPU, as gpustat does.
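Because the crash happens inside the native NVML library rather than in Python code, a `try`/`except` block cannot intercept it. One possible way to find out which GPU index triggers the crash, sketched here under assumptions, is to touch each GPU from a child process and inspect how that child exits. The `PROBE` script and the `probe` helper are hypothetical names for illustration and are not part of nvitop; the pynvml calls inside `PROBE` only run in the child interpreter.

```python
import signal
import subprocess
import sys

# Hypothetical probe script: touches one GPU through pynvml and exits 0 on success.
PROBE = """
import sys, pynvml
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(int(sys.argv[1]))
pynvml.nvmlDeviceGetMemoryInfo(handle)
pynvml.nvmlShutdown()
"""


def probe(index, script=PROBE, timeout=10.0):
    """Run `script` in a child interpreter and classify how it ended."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script, str(index)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        return "crashed: " + signal.Signals(-proc.returncode).name
    # Any other non-zero exit, e.g. an uncaught Python exception in the child.
    return "error"
```

Calling `probe(i)` for each index keeps a crash in the driver library from taking down the parent process; an index that reports `crashed: SIGSEGV` is a candidate for being skipped.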

Steps to Reproduce

  1. Cause one GPU to drop off the PCIe bus. (I don't know how to trigger this deliberately.)
  2. Run nvitop.

Traceback

nvitop[2398212]: segfault at 0 ip 00007f2113c7128b sp 00007ffc6e223820 error 4 in libnvidia-ml.so.560.35.03[7f2113c00000+1d3000]
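For readers unfamiliar with this kernel log format, the following sketch decodes the line above. It assumes the usual x86 Linux layout (fault address, instruction pointer, stack pointer, page-fault error code in hex, then the mapped object with its base address and size). The `decode` helper and its field names are made up for this example.

```python
import re

LINE = (
    "nvitop[2398212]: segfault at 0 ip 00007f2113c7128b sp 00007ffc6e223820 "
    "error 4 in libnvidia-ml.so.560.35.03[7f2113c00000+1d3000]"
)

PATTERN = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode(line):
    """Split a kernel segfault line into named fields (assumed x86 format)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognized segfault line")
    err = int(m["err"], 16)
    return {
        "process": m["comm"],
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        "library": m["lib"],
        # Offset of the faulting instruction relative to the library mapping.
        "offset_in_library": int(m["ip"], 16) - int(m["base"], 16),
        # x86 page-fault error code bits.
        "protection_fault": bool(err & 1),  # False: page was not present
        "write_access": bool(err & 2),      # False: the access was a read
        "user_mode": bool(err & 4),         # True: fault raised in user space
    }


info = decode(LINE)
```

Read this way, the line describes a user-mode read of address 0 (a null pointer dereference) at offset `0x7128b` inside `libnvidia-ml.so.560.35.03`, which is consistent with the crash originating in the driver library rather than in nvitop's Python code.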

Logs

No response

Expected behavior

It would be nice if nvitop could skip the faulty GPU.

For example, gpustat can show the faulty GPU:

[image: gpustat output]
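The requested "skip the faulty GPU" behavior could look roughly like the sketch below. This is not nvitop's or gpustat's actual implementation; it only covers the case where NVML reports an error for the broken device instead of crashing. The `list_devices` helper is a hypothetical name, and its `nvml` argument is assumed to be any object exposing the pynvml-style functions `nvmlDeviceGetCount`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetName`, and the `NVMLError` exception class.

```python
def list_devices(nvml):
    """Return one (index, name, error) tuple per GPU, never raising for one bad GPU.

    `nvml` is assumed to be the `pynvml` module (already initialized) or any
    object with the same functions; a failing device yields (index, None, message).
    """
    rows = []
    for index in range(nvml.nvmlDeviceGetCount()):
        try:
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            name = nvml.nvmlDeviceGetName(handle)
        except nvml.NVMLError as exc:
            # Record the failure and keep going with the remaining GPUs.
            rows.append((index, None, str(exc)))
            continue
        rows.append((index, name, None))
    return rows
```

With the real library this would be called as `list_devices(pynvml)` between `pynvml.nvmlInit()` and `pynvml.nvmlShutdown()`. Passing the module in as a parameter is a design choice that lets the loop be exercised with a stand-in object on a machine without GPUs.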

Additional context

No response

@XuehaiPan (Owner)

Sorry for the late response. You can try:

pipx run --spec git+https://github.com/XuehaiPan/nvitop.git@fix-invalid-device-handle nvitop

@XuehaiPan added the pynvml and api labels on Jan 13, 2025.