Preview: GPU support #717

BMurri · 2024-06-04T20:26:19Z

This is an initial implementation of GPU support in TES. It currently has some limitations:

Assumes compute nodes have outbound public internet access.
Only NC, ND, and NVv3 SKUs are currently supported.
The backend_parameters in the task resources must include vm_size, set to a specific SKU from one of the VM families listed above.
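As a rough illustration of the last limitation, here is a minimal sketch of a task payload that pins a GPU SKU through backend_parameters. The helper name `make_gpu_task`, the task name, the SKU `Standard_NC6s_v3`, and the container image tag are placeholders chosen for the example and are not taken from this PR; only `resources.backend_parameters.vm_size` comes from the description above.

```python
import json


def make_gpu_task(vm_size: str, image: str, command: list[str]) -> dict:
    """Build a TES-style task body that requests a specific VM SKU.

    Hypothetical helper for illustration only.
    """
    if not vm_size:
        # The preview requires an explicit SKU; nothing is inferred.
        raise ValueError("vm_size must name a specific SKU")
    return {
        "name": "gpu-smoke-test",  # placeholder task name
        "executors": [{"image": image, "command": command}],
        "resources": {
            # backend_parameters is a string-to-string map in the TES schema.
            "backend_parameters": {"vm_size": vm_size},
        },
    }


# Placeholder SKU and image, assumed for the example.
task = make_gpu_task(
    "Standard_NC6s_v3",
    "nvidia/cuda:12.4.1-base-ubuntu22.04",
    ["nvidia-smi"],
)
print(json.dumps(task, indent=2))
```

Running `nvidia-smi` as the only command is a convenient smoke test, since it fails quickly when the container cannot see a GPU.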

This follows a combination of NVIDIA, Docker, and Azure documentation for enabling containers to use GPUs. Specifically, this PR:

Uses the Azure VM extension to ensure the CUDA drivers are installed.
Installs the NVIDIA Container Toolkit.
Implements a portion of what the docker CLI's --gpus all option does.
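For readers unfamiliar with what `--gpus all` amounts to at the API level, the sketch below builds the host-configuration fragment that, to my understanding, the docker CLI sends to the Docker Engine API when that flag is used: a device request with a count of -1 (meaning all devices) and the `gpu` capability. This is a hedged illustration, not the code in DockerExecutor.cs; the function name `gpu_host_config` is hypothetical.

```python
import json


def gpu_host_config(count: int = -1) -> dict:
    """Return a HostConfig fragment requesting GPUs for a container.

    count=-1 asks for every available GPU, which is what `--gpus all`
    is understood to request. Hypothetical helper for illustration.
    """
    return {
        "DeviceRequests": [
            {
                "Count": count,
                # One capability set; the device must offer "gpu".
                "Capabilities": [["gpu"]],
            }
        ]
    }


# Body fragment for a container-create request (image is a placeholder).
create_body = {"Image": "example/image", "HostConfig": gpu_host_config()}
print(json.dumps(create_body, indent=2))
```

The request only succeeds on a node where the NVIDIA Container Toolkit is installed and the driver is loaded, which is why the two preceding steps are needed.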

Things in the NVIDIA documentation that are NOT implemented (it's unclear which of these should be implemented for our use cases):

Selectable driver capabilities mounted inside docker containers.
Advanced configuration through environment variables (OCI spec).
Support for the Container Device Interface (CDI).

Note that the Azure documentation for the VM extension for GPU support on Linux points the reader to the following EULA: https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/

Addresses microsoft/CromwellOnAzure#356

BMurri · 2024-06-05T02:59:36Z

This has been tested and works as described above.

TODO in order to move from Preview to a fully supported option (some of these may be debated):

Check the V1 and V2 images we point to today, to determine if the drivers, tools and configuration already exist. If so, this can be mostly reversed (with the exception of the flags/runner) and we are done.
If not, or if partially done, we will need to stage appropriate binaries during deployment and adjust the scripts so that private (and Terra) can be accommodated.

I think this, as a preview, can possibly be shipped this week.

BMurri · 2024-08-20T23:08:37Z

Remaining work: Ensure that this functionality is either turned off or can be safely used if there is no public internet access (I don't know if the VM extensions can be reached from a private virtual network with no public IP access).

Bonus work: if we recognize that the image already has the drivers, just pass them through (don't try to reinstall them).
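One possible shape for that check, sketched under stated assumptions: treat the presence of `nvidia-smi` on the PATH as a signal that the image already ships the drivers, and skip the install step in that case. The heuristic, the function names, and the step labels are all hypothetical and are not taken from the start task script in this PR.

```python
import shutil
from typing import Callable, Optional


def drivers_already_installed(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Heuristic (an assumption): nvidia-smi on PATH implies drivers exist."""
    return which("nvidia-smi") is not None


def start_task_steps(drivers_present: bool) -> list[str]:
    """Return hypothetical step labels for the node start task."""
    steps = []
    if not drivers_present:
        # Only install when the image does not already provide drivers.
        steps.append("install-drivers")
    steps.append("pass-gpus-to-containers")
    return steps


print(start_task_steps(drivers_already_installed()))
```

The `which` parameter exists only so the lookup can be replaced in tests; a real check might also need to confirm that the driver is loaded, not merely installed.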

BMurri · 2024-09-12T21:05:24Z

For the future: the actions taken in the start task for this are not needed for certain VM images, so we can reduce the node startup times (faster tasks, lower costs).

Preview: GPU support

a908df2

BMurri requested review from giventocode, MattMcL4475 and jlester-msft June 4, 2024 20:26

BMurri added 2 commits June 4, 2024 13:28

Add additional container mounts

f090b71

add missing configuration

985750c

MattMcL4475 reviewed Jun 5, 2024

src/TesApi.Web/BatchScheduler.cs (outdated, resolved)

MattMcL4475 reviewed Jun 5, 2024

src/Tes.Runner/Docker/DockerExecutor.cs (outdated, resolved)

MattMcL4475 reviewed Jun 5, 2024

src/Tes.Runner.Test/Docker/DockerExecutorTests.cs (resolved)

MattMcL4475 reviewed Jun 5, 2024

src/TesApi.Web/scripts/config-n-gpu-apt.sh (resolved)

MattMcL4475 approved these changes Jun 5, 2024

BMurri added 2 commits June 5, 2024 06:35

Add link to NVIDIA docs

539a16c

Add link to VM extension doc

8682494

giventocode approved these changes Jun 5, 2024

src/Tes.Runner/Docker/DockerExecutor.cs (outdated, resolved)

src/CommonUtilities/Models/NodeTask.cs (outdated, resolved)

src/TesApi.Web/BatchScheduler.cs (resolved)

BMurri added 14 commits June 10, 2024 10:10

Correctly detect VMFamilies with spaces

035ce5a

Merge branch 'main' into bmurri/gpu

39eda2a

Ignore NVadsA10 v5-series due to initialization error: nvml error: driver not loaded: unknown

43a0e35

Ignore NGADSV620v1 due to lack of Linux drivers

303af6a

Address feedback

d2be67c

small refactor

857e9dc

Remove NVv4 and NVv5 series families

515c2d1

Cleanup

7cefebb

Merge branch 'main' into bmurri/gpu

415a15e

more cleanup

20b301d

Merge branch 'main' into bmurri/gpu

9c7b3ef

Merge branch 'main' into bmurri/gpu

09d8258

Merge branch 'main' into bmurri/gpu

e53ab84

Merge branch 'main' into bmurri/gpu

ce72b43

BMurri merged commit a2fb49f into main Sep 12, 2024
7 checks passed

BMurri deleted the bmurri/gpu branch September 12, 2024 21:05
