
Starting worker on machine with nvidia GPU but without nvidia-smi/CUDA #685

Open
linusseelinger opened this issue Mar 5, 2024 · 4 comments

@linusseelinger

I am trying to start an HQ worker directly on my system (Fedora). I have an NVIDIA GPU and NVIDIA's proprietary driver installed, but not the CUDA package; the latter seems to include nvidia-smi, which I therefore don't have.

Launching a worker produces a corresponding error, which seems to repeat indefinitely with a 1-second delay:

 ./hq worker start
2024-03-05T14:13:34Z INFO Detected 1 GPUs from procs
2024-03-05T14:13:34Z INFO Detected 33304358912B of memory (31.02 GiB)
2024-03-05T14:13:34Z INFO Starting hyperqueue worker nightly-2024-02-28-d42cc6563708f799c921b3d05678adc5fcef2744
2024-03-05T14:13:34Z INFO Connecting to: xps-9530:33635
2024-03-05T14:13:34Z INFO Listening on port 36431
2024-03-05T14:13:34Z INFO Connecting to server (candidate addresses = [[fe80::5fcd:941f:68f6:5efc%2]:33635, [2a00:1398:200:202:9d65:e4b1:e28b:b0e0]:33635, 172.23.213.13:33635])
+-------------------+----------------------------------+
| Worker ID         | 2                                |
| Hostname          | xps-9530                         |
| Started           | "2024-03-05T14:13:34.162287491Z" |
| Data provider     | xps-9530:36431                   |
| Working directory | /tmp/hq-worker.lJaUBMB2LjvD/work |
| Logging directory | /tmp/hq-worker.lJaUBMB2LjvD/logs |
| Heartbeat         | 8s                               |
| Idle timeout      | None                             |
| Resources         | cpus: 20                         |
|                   | gpus/nvidia: 1                   |
|                   | mem: 31.02 GiB                   |
| Time Limit        | None                             |
| Process pid       | 150177                           |
| Group             | default                          |
| Manager           | None                             |
| Manager Job ID    | N/A                              |
+-------------------+----------------------------------+
2024-03-05T14:13:35Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:36Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:37Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:38Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
2024-03-05T14:13:39Z ERROR Failed to fetch NVIDIA GPU state: GenericError("Cannot execute nvidia-smi: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }")
...

Could the worker be modified to run under that condition, e.g. just disable GPU support if nvidia-smi is not available?

@Kobzol
Collaborator

Kobzol commented Mar 5, 2024

Hi, workers automatically scan the usage of their node (including GPUs) every second by default; this data is sent regularly to the server. Unless you use the dashboard, this information currently isn't used for anything, so you can disable the scanning if you want:

$ hq worker start --overview-interval 0s

Does this remove the error from the log for you?

@linusseelinger
Author

Thanks a lot for your quick reply! Indeed, that option removes the error messages from the logs.

It turns out I had a bug of my own blocking the UM-Bridge code we are building on top of HyperQueue, so I thought the worker just never became responsive...

It might still be useful to rate-limit logging for this kind of error, or to turn this particular one into a warning if it doesn't interfere with regular operation?

@Kobzol
Collaborator

Kobzol commented Mar 5, 2024

Yeah, that could be worth doing. I'm not sure how to reliably tell whether the error is transient or whether nvidia-smi just doesn't exist at all, though. I'll try to add better detection for this.
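For illustration, a minimal Rust sketch (not HyperQueue's actual code; GpuProbe and probe_nvidia_smi are hypothetical names) of one way to separate the two cases: if spawning nvidia-smi fails with io::ErrorKind::NotFound, the binary is missing from PATH and GPU polling could be disabled for good, while any other failure can be treated as transient.

```rust
use std::io;
use std::process::Command;

/// Outcome of one attempt to read the GPU state via nvidia-smi (hypothetical type).
enum GpuProbe {
    /// nvidia-smi ran successfully; stdout holds its output.
    Ok(String),
    /// nvidia-smi is not installed at all; polling can be disabled permanently.
    Missing,
    /// Some other (possibly transient) failure; keep polling and log a warning.
    Transient(String),
}

fn probe_nvidia_smi() -> GpuProbe {
    match Command::new("nvidia-smi").arg("--list-gpus").output() {
        Ok(out) if out.status.success() => {
            GpuProbe::Ok(String::from_utf8_lossy(&out.stdout).into_owned())
        }
        Ok(out) => GpuProbe::Transient(format!("nvidia-smi exited with {}", out.status)),
        // The binary does not exist on PATH: GPU overview collection can stop entirely.
        Err(e) if e.kind() == io::ErrorKind::NotFound => GpuProbe::Missing,
        Err(e) => GpuProbe::Transient(format!("cannot execute nvidia-smi: {e}")),
    }
}

fn main() {
    match probe_nvidia_smi() {
        GpuProbe::Ok(out) => println!("GPU state:\n{out}"),
        GpuProbe::Missing => println!("nvidia-smi not found; disabling GPU overview"),
        GpuProbe::Transient(msg) => eprintln!("warning: {msg}"),
    }
}
```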

@Kobzol Kobzol self-assigned this Mar 5, 2024
@spirali
Collaborator

spirali commented Mar 5, 2024

> Yeah, that could be worth doing. I'm not sure how to reliably tell whether the error is transient or whether nvidia-smi just doesn't exist at all, though. I'll try to add better detection for this.

I would guess that nvidia-smi normally does not fail, so a quick fix would be: when the first error occurs, stop calling it in all subsequent data-collection iterations.
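A rough sketch of that quick fix, again with hypothetical names (GpuCollector, collect) rather than the actual HyperQueue implementation: remember the first failure in a flag and skip nvidia-smi in all later overview iterations, so the error is logged only once.

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

struct GpuCollector {
    /// Set to true after the first failure; later iterations skip nvidia-smi.
    disabled: bool,
}

impl GpuCollector {
    fn new() -> Self {
        GpuCollector { disabled: false }
    }

    /// Called once per overview interval; returns GPU info if available.
    fn collect(&mut self) -> Option<String> {
        if self.disabled {
            return None;
        }
        match Command::new("nvidia-smi").arg("--list-gpus").output() {
            Ok(out) if out.status.success() => {
                Some(String::from_utf8_lossy(&out.stdout).into_owned())
            }
            _ => {
                // Log once and stop trying from now on.
                eprintln!("nvidia-smi failed; GPU overview collection disabled");
                self.disabled = true;
                None
            }
        }
    }
}

fn main() {
    let mut collector = GpuCollector::new();
    // Simulated overview loop; it only stands in for the worker's periodic
    // collection, whose interval is configurable via --overview-interval.
    for _ in 0..3 {
        if let Some(state) = collector.collect() {
            println!("{state}");
        }
        sleep(Duration::from_secs(1));
    }
}
```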
