
Starting worker on machine with nvidia GPU but without nvidia-smi/CUDA #685

Open
linusseelinger opened this issue Mar 5, 2024 · 4 comments

@linusseelinger

I am trying to start an HQ worker directly on my system (Fedora). I have an NVIDIA GPU and NVIDIA's proprietary driver installed, but not the CUDA package; the latter seems to include nvidia-smi, which I therefore don't have.

Launching a worker produces a corresponding error, which seems to repeat indefinitely with a 1-second delay:

 ./hq worker start
2024-03-05T14:13:34Z INFO Detected 1 GPUs from procs
2024-03-05T14:13:34Z INFO Detected 33304358912B of memory (31.02 GiB)
2024-03-05T14:13:34Z INFO Starting hyperqueue worker nightly-2024-02-28-d42cc6563708f799c921b3d05678adc5fcef2744
2024-03-05T14:13:34Z INFO Connecting to: xps-9530:33635
2024-03-05T14:13:34Z INFO Listening on port 36431
2024-03-05T14:13:34Z INFO Connecting to server (candidate addresses = [[fe80::5fcd:941f:68f6:5efc%2]:33635, [2a00:1398:200:202:9d65:e4b1:e28b:b0e0]:33635, 172.23.213.13:33635])
+-------------------+----------------------------------+
| Worker ID         | 2                                |
| Hostname          | xps-9530                         |
| Started           | "2024-03-05T14:13:34.162287491Z" |
| Data provider     | xps-9530:36431                   |
| Working directory | /tmp/hq-worker.lJaUBMB2LjvD/work |
| Logging directory | /tmp/hq-worker.lJaUBMB2LjvD/logs |
| Heartbeat         | 8s                               |
| Idle timeout      | None                             |
| Resources         | cpus: 20                         |
|                   | gpus/nvidia: 1                   |
|                   | mem: 31.02 GiB                   |
| Time Limit        | None                             |
| Process pid       | 150177                           |
| Group             | default                          |
| Manager           | None                             |
| Manager Job ID    | N/A                              |
+-------------------+----------------------------------+
2024-03-05T14:13:35Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:36Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:37Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:38Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:39Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
...

Could the worker be modified to run under that condition, e.g. just disable GPU support if nvidia-smi is not available?

@Kobzol
Collaborator

Kobzol commented Mar 5, 2024

Hi, workers automatically scan the usage of their node (including GPUs) every second by default; this data is sent regularly to the server. Unless you use the dashboard, this information currently isn't used for anything, so you can disable the scanning if you want:

$ hq worker start --overview-interval 0s

Does this remove the error from the log for you?

@linusseelinger
Author

Thanks a lot for your quick reply! Indeed, that option removes the error messages from the logs.

It turns out I had a bug of my own blocking the UM-Bridge code we are building on top of HyperQueue, so I thought the worker just never became responsive...

It might still be useful to rate-limit logging for this kind of error, or to turn this particular one into a warning if it doesn't interfere with regular operation?

@Kobzol
Collaborator

Kobzol commented Mar 5, 2024

Yeah, that could be worth doing. I'm not sure how to reliably tell whether the error is transient or whether nvidia-smi just doesn't exist at all, though. I'll try to add better detection for this.
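For illustration, a minimal Rust sketch (not HyperQueue's actual code; GpuProbe and probe_nvidia_smi are hypothetical names) of one way to separate the two cases: if spawning nvidia-smi fails with io::ErrorKind::NotFound, the binary is missing from PATH and GPU polling could be disabled for good, while any other failure can be treated as transient.

```rust
use std::io;
use std::process::Command;

/// Outcome of one attempt to read the GPU state via nvidia-smi (hypothetical type).
enum GpuProbe {
    /// nvidia-smi ran successfully; stdout holds its output.
    Ok(String),
    /// nvidia-smi is not installed at all; polling can be disabled permanently.
    Missing,
    /// Some other (possibly transient) failure; keep polling and log a warning.
    Transient(String),
}

fn probe_nvidia_smi() -> GpuProbe {
    match Command::new("nvidia-smi").arg("--list-gpus").output() {
        Ok(out) if out.status.success() => {
            GpuProbe::Ok(String::from_utf8_lossy(&out.stdout).into_owned())
        }
        Ok(out) => GpuProbe::Transient(format!("nvidia-smi exited with {}", out.status)),
        // The binary does not exist on PATH: GPU overview collection can stop entirely.
        Err(e) if e.kind() == io::ErrorKind::NotFound => GpuProbe::Missing,
        Err(e) => GpuProbe::Transient(format!("cannot execute nvidia-smi: {e}")),
    }
}

fn main() {
    match probe_nvidia_smi() {
        GpuProbe::Ok(out) => println!("GPU state:\n{out}"),
        GpuProbe::Missing => println!("nvidia-smi not found; disabling GPU overview"),
        GpuProbe::Transient(msg) => eprintln!("warning: {msg}"),
    }
}
```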

@Kobzol Kobzol self-assigned this Mar 5, 2024
@spirali
Collaborator

spirali commented Mar 5, 2024

> Yeah, that could be worth doing. I'm not sure how to reliably tell whether the error is transient or whether nvidia-smi just doesn't exist at all, though. I'll try to add better detection for this.

I would guess that nvidia-smi normally does not fail, so a quick fix would be: when the first error occurs, stop calling it in all subsequent data-collection iterations.
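A rough sketch of that quick fix, again with hypothetical names (GpuCollector, collect) rather than the actual HyperQueue implementation: remember the first failure in a flag and skip nvidia-smi in all later overview iterations, so the error is logged only once.

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

struct GpuCollector {
    /// Set to true after the first failure; later iterations skip nvidia-smi.
    disabled: bool,
}

impl GpuCollector {
    fn new() -> Self {
        GpuCollector { disabled: false }
    }

    /// Called once per overview interval; returns GPU info if available.
    fn collect(&mut self) -> Option<String> {
        if self.disabled {
            return None;
        }
        match Command::new("nvidia-smi").arg("--list-gpus").output() {
            Ok(out) if out.status.success() => {
                Some(String::from_utf8_lossy(&out.stdout).into_owned())
            }
            _ => {
                // Log once and stop trying from now on.
                eprintln!("nvidia-smi failed; GPU overview collection disabled");
                self.disabled = true;
                None
            }
        }
    }
}

fn main() {
    let mut collector = GpuCollector::new();
    // Simulated overview loop; it only stands in for the worker's periodic
    // collection, whose interval is configurable via --overview-interval.
    for _ in 0..3 {
        if let Some(state) = collector.collect() {
            println!("{state}");
        }
        sleep(Duration::from_secs(1));
    }
}
```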
