Preview: GPU support #717

Merged
merged 19 commits into from
Sep 12, 2024


BMurri
Collaborator

@BMurri BMurri commented Jun 4, 2024

This is an initial implementation of GPU support in TES. It has some current limitations:

  • Assumes compute nodes have outgoing public internet access.
  • Only NC, ND, and NVv3 SKUs are currently supported.
  • The backend_parameters in the task resources must include vm_size set to a specific SKU within the VM families listed above.
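
As a rough illustration of the last point, the sketch below builds a TES task body whose resources carry a vm_size backend parameter. The regular expressions are a hypothetical heuristic for the NC, ND, and NVv3 families and are not taken from this PR; the helper names, image, and command are assumptions for illustration only.

```python
import json
import re

# Hypothetical heuristic for the families named above; the actual check
# performed by TES is not shown in this PR description.
_GPU_SKU_PATTERNS = (
    re.compile(r"^Standard_NC\d+", re.IGNORECASE),            # NC family
    re.compile(r"^Standard_ND\d+", re.IGNORECASE),            # ND family
    re.compile(r"^Standard_NV\d+[a-z]*_v3$", re.IGNORECASE),  # NVv3 only
)


def is_supported_gpu_sku(vm_size: str) -> bool:
    """Return True if vm_size looks like an NC, ND, or NVv3 SKU."""
    return any(p.match(vm_size) for p in _GPU_SKU_PATTERNS)


def build_gpu_task(image: str, command: list, vm_size: str) -> dict:
    """Build a minimal TES task dict that requests a specific GPU SKU."""
    if not is_supported_gpu_sku(vm_size):
        raise ValueError(f"{vm_size} is not in the NC, ND, or NVv3 families")
    return {
        "name": "gpu-example",  # illustrative task name
        "executors": [{"image": image, "command": command}],
        "resources": {"backend_parameters": {"vm_size": vm_size}},
    }


if __name__ == "__main__":
    task = build_gpu_task("ubuntu:22.04", ["nvidia-smi"], "Standard_NC6s_v3")
    print(json.dumps(task, indent=2))
```

The check is deliberately done on the client side here so that an unsupported SKU fails before the task is submitted; whether TES itself rejects such tasks is not stated above.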

This follows a combination of NVIDIA, Docker, and Azure documentation for enabling containers to use GPUs. Specifically, this:

Things in the NVIDIA documentation that are NOT implemented (it's unclear which of these should be implemented for our use cases):

  • selectable driver capabilities mounted inside docker containers
  • environment variable advanced configuration (OCI spec)
  • support for Container Device Interface (CDI)
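
For context on the first two items, here is a minimal sketch of how a runner might assemble a docker command line with GPU access. This PR description does not show the flags the runner actually passes, so everything here is an assumption: `--gpus all` is Docker's documented flag for exposing GPUs, and `NVIDIA_DRIVER_CAPABILITIES` is the NVIDIA Container Toolkit environment variable for selecting driver capabilities, which per the list above is not configurable in this implementation. The function name is hypothetical.

```python
from typing import List, Optional


def docker_run_args(
    image: str,
    command: List[str],
    driver_capabilities: Optional[List[str]] = None,
) -> List[str]:
    """Assemble a `docker run` argument list that exposes all GPUs.

    driver_capabilities is optional and illustrates the selectable driver
    capabilities that this PR does NOT implement.
    """
    args = ["docker", "run", "--rm", "--gpus", "all"]
    if driver_capabilities:
        # e.g. ["compute", "utility"] -> NVIDIA_DRIVER_CAPABILITIES=compute,utility
        args += ["-e", "NVIDIA_DRIVER_CAPABILITIES=" + ",".join(driver_capabilities)]
    args.append(image)
    args.extend(command)
    return args


if __name__ == "__main__":
    print(" ".join(docker_run_args("ubuntu:22.04", ["nvidia-smi"])))
```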

Note that the Azure documentation for the VM extension for GPU support on Linux points the reader to the following EULA: https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/

Addresses microsoft/CromwellOnAzure#356

@BMurri
Collaborator Author

BMurri commented Jun 5, 2024

This has been tested and it works, as per above description.

TODO in order to move from Preview to fully supported option (some of these may be debated):

  • Check the V1 and V2 images we point to today, to determine if the drivers, tools and configuration already exist. If so, this can be mostly reversed (with the exception of the flags/runner) and we are done.
  • If not, or if partially done, we will need to stage appropriate binaries during deployment and adjust the scripts so that private (and Terra) can be accommodated.

I think this, as a preview, can possibly be shipped this week.

@BMurri
Collaborator Author

BMurri commented Aug 20, 2024

Remaining work: Ensure that this functionality is either turned off or can be safely used if there is no public internet access (I don't know if the VM extensions can be reached from a private virtual network with no public IP access).

Bonus work: if we recognize that the image already has the drivers, just pass them through (don't try to reinstall them).
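
One possible shape for that detection, sketched under the assumption that the presence of `nvidia-smi` on the PATH is an acceptable signal that drivers are already installed. The PR does not say how detection would work, and the function names and return values below are hypothetical.

```python
import shutil
from typing import Callable, Optional

# A lookup function with the same shape as shutil.which, injectable for testing.
Which = Callable[[str], Optional[str]]


def drivers_present(which: Which = shutil.which) -> bool:
    """Assume drivers are installed if nvidia-smi can be found on the PATH."""
    return which("nvidia-smi") is not None


def plan_start_task(which: Which = shutil.which) -> str:
    """Decide whether the node start task needs to install GPU drivers."""
    return "skip-install" if drivers_present(which) else "install-drivers"


if __name__ == "__main__":
    print(plan_start_task())
```

A stricter check could also run `nvidia-smi` and inspect its exit code, since the binary can exist without a working driver; that is left out here to keep the sketch free of side effects.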

@BMurri
Collaborator Author

BMurri commented Sep 12, 2024

For the future: the actions taken in the start task for this are not needed for certain VM images, so we can reduce the node startup times (faster tasks, lower costs).

@BMurri BMurri merged commit a2fb49f into main Sep 12, 2024
7 checks passed
@BMurri BMurri deleted the bmurri/gpu branch September 12, 2024 21:05
3 participants