
feat(nvidia): build pytorch to get older cuda compute capabilities and setup arm64 support #578

Merged
1 commit merged from the pytorch branch into aws:main on Feb 17, 2025

Conversation

@ndbaker1 ndbaker1 commented Feb 4, 2025

Issue #, if available:

Description of changes:

Running these containers on instances like the g5g family will not work because the compute capability of the NVIDIA T4 architecture (7.5) is older than what's provided in PyTorch (generally 8.0+ at this point).

This PR sets up arm64 compatibility and builds PyTorch from source in the images to gain support for older CUDA compute capabilities.
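
As a side note (not part of the original description): a quick way to check whether an installed PyTorch build actually ships kernels for a given GPU is to compare the device's compute capability with the architectures the build was compiled for. A minimal sketch, assuming PyTorch is installed and a GPU is visible:

import torch

# Compute capability of GPU 0, e.g. (7, 5) for the T4-class GPUs on g5g instances.
major, minor = torch.cuda.get_device_capability(0)
device_arch = f"sm_{major}{minor}"

# Architectures this PyTorch build was compiled for, e.g. ['sm_80', 'sm_86', ...].
arch_list = torch.cuda.get_arch_list()

if device_arch not in arch_list:
    print(f"{device_arch} is not in {arch_list}; CUDA kernels will fail on this GPU")
else:
    print(f"{device_arch} is covered by this build")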

Testing

go test -tags=e2e ./test/cases/nvidia-inference/... \
    --test.timeout=60m \
    --test.v \
    --test.run=TestBertInference \
    --bertInferenceImage=$IMAGE \
    --inferenceMode=latency \
    --gpuRequested=1

2025/02/04 01:32:53 [INFO] Applying NVIDIA device plugin.
2025/02/04 01:32:59 [INFO] NVIDIA device plugin is ready.
2025/02/04 01:32:59 [INFO] Validating cluster has at least 1 GPU(s).
2025/02/04 01:33:09 [INFO] Node ip-192-168-12-159.us-west-2.compute.internal (type: g5g.2xlarge) has no GPU capacity.
2025/02/04 01:33:09 [INFO] Node ip-192-168-28-71.us-west-2.compute.internal (type: g5g.2xlarge) meets the request of 1 GPU(s).
2025/02/04 01:33:09 [INFO] GPU capacity check passed.
=== RUN   TestBertInference
=== RUN   TestBertInference/bert-inference
2025/02/04 01:33:09 [INFO] Rendering BERT inference manifest...
2025/02/04 01:33:09 [INFO] Applying BERT inference manifest...
2025/02/04 01:33:09 [INFO] BERT inference manifest applied successfully.
=== RUN   TestBertInference/bert-inference/BERT_inference_Job_succeeds
2025/02/04 01:33:09 [INFO] Checking BERT inference job completion...
2025/02/04 01:33:29 [INFO] BERT inference job succeeded. Gathering logs...
2025/02/04 01:33:29 [INFO] BERT inference job completed in 20.020529287s
2025/02/04 01:33:29 [INFO] Pod bert-inference-xr8cv is running on node ip-192-168-28-71.us-west-2.compute.internal
2025/02/04 01:33:29 [INFO] Retrieving logs from pod bert-inference-xr8cv...
2025/02/04 01:33:29 [INFO] Logs from Pod bert-inference-xr8cv:
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
[2025-02-04 01:33:14,114] [INFO] [BERTInference] [INFO] Found 1 GPU(s). GPU is available.
[2025-02-04 01:33:14,114] [INFO] [BERTInference] [INFO] Running inference in latency mode with batch size 1.
/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2025-02-04 01:33:22,212] [INFO] [BERTInference] [BERT_INFERENCE_METRICS] mode=latency avg_time_per_batch=0.015120 throughput_samples_per_sec=66.135404
2025/02/04 01:33:29 [INFO] Completed log stream for pod bert-inference-xr8cv.
2025/02/04 01:33:29 [INFO] Cleaning up BERT inference job resources...
2025/02/04 01:33:29 [INFO] BERT inference job resources cleaned up.
--- PASS: TestBertInference (20.12s)
    --- PASS: TestBertInference/bert-inference (20.12s)
        --- PASS: TestBertInference/bert-inference/BERT_inference_Job_succeeds (20.07s)
PASS
2025/02/04 01:33:29 [INFO] Cleaning up NVIDIA device plugin.
2025/02/04 01:33:29 [INFO] Device plugin cleanup complete.
2025/02/04 01:33:29 [INFO] Test environment finished with exit code 0
ok  	github.com/aws/aws-k8s-tester/test/cases/nvidia-inference       35.908s

Additionally tested for backwards compatibility with x86_64 NVIDIA instance types.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ndbaker1 ndbaker1 force-pushed the pytorch branch 7 times, most recently from dfda945 to 01297b0 on February 6, 2025 03:44
Comment on lines +82 to +97
###############################################################################
# 4) Install Pytorch from Source
###############################################################################
# envs needed to make the path of NVCC known to the compilation
ENV CUDA_HOME=/usr/local/cuda
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
ENV PATH=$PATH:$CUDA_HOME/bin
# this list could be minimized based on the supported GPUs
ENV TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 8.7 8.9 9.0"

RUN pip3 install typing-extensions sympy
RUN git clone \
        --recursive https://github.com/pytorch/pytorch.git \
        --branch $PYTORCH_BRANCH \
    && cd pytorch && eval "$PYTORCH_BUILD_ENV python3 setup.py install" && cd .. \
    && rm -rf pytorch

Contributor:
Any idea how long this step takes? Just curious

Contributor Author:
the workflow took ~5 hr 30 min, the bulk of which is spent on this step 😅

- run: docker build --file test/images/nvidia-inference/Dockerfile test/images/nvidia-inference
- run: |
    docker build --file test/images/nvidia-inference/Dockerfile test/images/nvidia-inference \
      --build-arg PYTORCH_BUILD_ENV="MAX_JOBS=8 BUILD_TEST=0 USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 USE_DISTRIBUTED=0"
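
(For context, not stated in the thread: MAX_JOBS caps the number of parallel compile jobs, BUILD_TEST=0 skips building PyTorch's C++ test binaries, USE_FLASH_ATTENTION=0 and USE_MEM_EFF_ATTENTION=0 skip the large fused-attention CUDA kernels, and USE_DISTRIBUTED=0 drops the distributed backends. Together these trade optional features for a smaller, faster build.)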

Contributor:
Just curious... Any reason for choosing a value of 8 for MAX_JOBS?

Contributor Author:
yeah, this was manually tuned: MAX_JOBS values that were too high hit OOM errors, so I searched for a value that still passed within the default 6-hour GitHub Actions limit
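
For anyone tuning this later, a rough heuristic (mine, not from the PR): PyTorch compile jobs can each peak at a few GB of RAM, so it can help to cap MAX_JOBS by memory as well as CPU count. A small sketch assuming ~4 GB per job:

import os

def suggest_max_jobs(gb_per_job: int = 4) -> int:
    # Cap parallel compile jobs by both CPU count and total memory (reads Linux /proc/meminfo).
    with open("/proc/meminfo") as f:
        mem_kb = next(int(line.split()[1]) for line in f if line.startswith("MemTotal"))
    mem_jobs = max(1, (mem_kb // (1024 * 1024)) // gb_per_job)
    return min(os.cpu_count() or 1, mem_jobs)

print(suggest_max_jobs())  # e.g. 8 on a 32 GiB runner under the 4 GB/job assumption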

@ndbaker1 ndbaker1 marked this pull request as ready for review February 7, 2025 01:14

@mattcjo mattcjo left a comment
LGTM

@ndbaker1 ndbaker1 force-pushed the pytorch branch 9 times, most recently from 5a2805e to e867863 on February 12, 2025 23:14
@@ -110,8 +110,7 @@ def main():
     # Retrieve environment variables
     rank = int(os.getenv("OMPI_COMM_WORLD_RANK", "0"))
     world_size = int(os.getenv("OMPI_COMM_WORLD_SIZE", "1"))
-    num_gpus_per_node = int(os.getenv("NUM_GPUS_PER_NODE", "8"))
-    local_rank = rank % num_gpus_per_node
+    local_rank = int(os.getenv("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
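
Background note, not from the PR thread: Open MPI exports the node-local rank as OMPI_COMM_WORLD_LOCAL_RANK for processes launched via mpirun, so it can be read directly instead of being derived from the global rank and a hard-coded NUM_GPUS_PER_NODE. A minimal sketch of the resulting device selection, assuming one process per GPU:

import os
import torch

# Open MPI sets these for each process it launches.
rank = int(os.getenv("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.getenv("OMPI_COMM_WORLD_SIZE", "1"))
local_rank = int(os.getenv("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

# The node-local rank picks the GPU on this node.
torch.cuda.set_device(local_rank)
print(f"rank {rank}/{world_size} -> cuda:{local_rank}")
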
Comment on lines -32 to -34
test_04_bus_grind()
{
    assert_status_code 0 "$DEMU_SUITE_DIR/busGrind -a"

Contributor Author:
the busGrind suite isn't in the upstream cuda-samples repo that I moved to in place of the cuda-demo-suite package, which was only available for x86_64

@ndbaker1 ndbaker1 requested a review from mattcjo February 13, 2025 00:25

@mattcjo mattcjo left a comment
LGTM

@ndbaker1 ndbaker1 merged commit f6edf1d into aws:main Feb 17, 2025
9 checks passed
@ndbaker1 ndbaker1 deleted the pytorch branch February 17, 2025 23:19