[BUG] A GPU "Unknown Error" causes as_snapshot() to trigger a segmentation fault #145

Closed
3 tasks done
jue-jue-zi opened this issue Jan 13, 2025 · 3 comments · Fixed by #146
Labels: api (Something related to the core APIs), bug (Something isn't working), pynvml (Something related to the `nvidia-ml-py` package)

jue-jue-zi commented Jan 13, 2025

Required prerequisites

  • I have read the documentation https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.4.0

Operating system and version

Ubuntu 24.04 LTS

NVIDIA driver version

565.57.01

NVIDIA-SMI

nvidia-smi
Unable to determine the device handle for GPU0000:1B:00.0: Unknown Error

nvidia-smi -i 0,1,2
Mon Jan 13 16:25:54 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:03:00.0 Off |                  N/A |
|  0%   33C    P8              9W /  250W |       3MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:0B:00.0 Off |                  N/A |
|  0%   34C    P8             15W /  250W |       3MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:0C:00.0 Off |                  N/A |
|  0%   29C    P8             10W /  250W |       3MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Python environment

3.12.3 (main, Nov 6 2024, 18:32:19) [GCC 13.2.0] linux
nvidia-ml-py==12.535.161
nvitop==1.4.0

Problem description

When a GPU is in the "Unknown Error" state, calling as_snapshot() on that device triggers a segmentation fault.

Steps to Reproduce

root@vm:/usr/local/lib/python3.12/dist-packages/nvitop# python
Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nvitop.gui import Device
>>> devices = Device.from_indices([3,])
>>> devices[0]
Device(index=3, name='ERROR: Unknown', total_memory=N/A)
>>> devices[0].as_snapshot()
Segmentation fault

A simpler way to reproduce:

root@vm:/usr/local/lib/python3.12/dist-packages/nvitop# python
Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nvitop.api import Device
>>> device = Device(0)
>>> device.as_snapshot()
PhysicalDeviceSnapshot(...)
>>> device._handle = None
>>> device.as_snapshot()
Segmentation fault
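
The second reproduction suggests that the crash happens when a device whose handle is `None` is still queried through native code. Below is a minimal sketch of one possible guard, not the actual fix merged in #146: check the handle before calling into native code and return a placeholder otherwise. `safe_query`, `NA`, and `fake_native_memory_total` are hypothetical names used only for illustration; none of them are nvitop or nvidia-ml-py APIs.

```python
from typing import Any, Callable

NA = 'N/A'  # hypothetical placeholder value for unavailable data


def safe_query(func: Callable[..., Any], handle: Any, *args: Any, default: Any = NA) -> Any:
    """Call `func(handle, *args)` only if `handle` is usable; otherwise return `default`."""
    if handle is None:
        # Never hand a missing handle to native code: a C library may
        # dereference it and crash the whole interpreter.
        return default
    return func(handle, *args)


def fake_native_memory_total(handle: Any) -> int:
    """Stand-in for a native query (hypothetical); a real one could segfault on a bad handle."""
    if handle is None:
        raise AssertionError('native code reached with an invalid handle')
    return 11264  # MiB, matching the cards in the nvidia-smi output above


print(safe_query(fake_native_memory_total, None))      # placeholder, native code not reached
print(safe_query(fake_native_memory_total, object()))  # value from the stand-in query
```

The point of the design is that the check happens in Python, before the foreign-function call, so a broken device degrades to a placeholder value instead of taking the process down.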

Traceback

No response
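
A segmentation fault kills the interpreter before Python can print a traceback, which may be why this field is empty. The standard-library `faulthandler` module can dump the Python stack at the moment of the crash. The sketch below runs a snippet in a child interpreter with `-X faulthandler` so the parent process survives; `run_with_faulthandler` and `CRASH_SNIPPET` are hypothetical names, and the crash itself is a stand-in (reading address 0 via `ctypes`, as in the `faulthandler` documentation), not the nvitop code path.

```python
import subprocess
import sys


def run_with_faulthandler(snippet: str) -> subprocess.CompletedProcess:
    """Run `snippet` in a child interpreter with faulthandler enabled."""
    return subprocess.run(
        [sys.executable, '-X', 'faulthandler', '-c', snippet],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,
        check=False,          # a crash is expected, so do not raise on failure
    )


# Stand-in crash: dereference a NULL pointer from Python via ctypes.
CRASH_SNIPPET = 'import ctypes; ctypes.string_at(0)'

if __name__ == '__main__':
    result = run_with_faulthandler(CRASH_SNIPPET)
    print('return code:', result.returncode)
    print(result.stderr)  # contains the "Fatal Python error" report and the Python stack
```

To apply this to the report, the stand-in snippet could be swapped for the reproduction lines from this issue (`from nvitop.api import Device`, and so on); the resulting stack dump would show which Python frame made the crashing native call.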

Logs

LOGLEVEL=DEBUG nvitop >nvitop.log 2>&1
Segmentation fault

cat nvitop.log
[DEBUG] 2025-01-13 16:28:33,371 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-01-13 16:28:33,371 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-01-13 16:28:33,371 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.

Expected behavior

No response

Additional context

No response

jue-jue-zi added the bug label Jan 13, 2025
XuehaiPan (Owner) commented

Thanks for filing this issue. I will fix this bug as soon as possible.

XuehaiPan (Owner) commented

You can try:

pipx run --spec git+https://github.com/XuehaiPan/nvitop.git@fix-invalid-device-handle nvitop

XuehaiPan added the pynvml and api labels Jan 13, 2025
jue-jue-zi (Author) commented, quoting XuehaiPan:

You can try:

pipx run --spec git+https://github.com/XuehaiPan/nvitop.git@fix-invalid-device-handle nvitop

It works now 👍

(image attached)
