
feat(nvidia): build pytorch to get older cuda compute capabilities and setup arm64 support #578

Merged
1 commit merged from the pytorch branch into aws:main on Feb 17, 2025

Conversation

@ndbaker1 ndbaker1 commented Feb 4, 2025

Issue #, if available:

Description of changes:

Running these containers on instances like the g5g family will not work because the compute capability of the NVIDIA T4 architecture (7.5) is older than what's provided in PyTorch (generally 8.0+ at this point).

This PR sets up arm64 compatibility and builds PyTorch from source in the images to gain support for older CUDA compute capabilities.
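
As a side note (not part of the original description): a quick way to check whether an installed PyTorch build actually ships kernels for a given GPU is to compare the device's compute capability with the architectures the build was compiled for. A minimal sketch, assuming PyTorch is installed and a GPU is visible:

import torch

# Compute capability of GPU 0, e.g. (7, 5) for the T4-class GPUs on g5g instances.
major, minor = torch.cuda.get_device_capability(0)
device_arch = f"sm_{major}{minor}"

# Architectures this PyTorch build was compiled for, e.g. ['sm_80', 'sm_86', ...].
arch_list = torch.cuda.get_arch_list()

if device_arch not in arch_list:
    print(f"{device_arch} is not in {arch_list}; CUDA kernels will fail on this GPU")
else:
    print(f"{device_arch} is covered by this build")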

Testing

go test -tags=e2e ./test/cases/nvidia-inference/... \
    --test.timeout=60m \
    --test.v \
    --test.run=TestBertInference \
    --bertInferenceImage=$IMAGE \
    --inferenceMode=latency \
    --gpuRequested=1

2025/02/04 01:32:53 [INFO] Applying NVIDIA device plugin.
2025/02/04 01:32:59 [INFO] NVIDIA device plugin is ready.
2025/02/04 01:32:59 [INFO] Validating cluster has at least 1 GPU(s).
2025/02/04 01:33:09 [INFO] Node ip-192-168-12-159.us-west-2.compute.internal (type: g5g.2xlarge) has no GPU capacity.
2025/02/04 01:33:09 [INFO] Node ip-192-168-28-71.us-west-2.compute.internal (type: g5g.2xlarge) meets the request of 1 GPU(s).
2025/02/04 01:33:09 [INFO] GPU capacity check passed.
=== RUN   TestBertInference
=== RUN   TestBertInference/bert-inference
2025/02/04 01:33:09 [INFO] Rendering BERT inference manifest...
2025/02/04 01:33:09 [INFO] Applying BERT inference manifest...
2025/02/04 01:33:09 [INFO] BERT inference manifest applied successfully.
=== RUN   TestBertInference/bert-inference/BERT_inference_Job_succeeds
2025/02/04 01:33:09 [INFO] Checking BERT inference job completion...
2025/02/04 01:33:29 [INFO] BERT inference job succeeded. Gathering logs...
2025/02/04 01:33:29 [INFO] BERT inference job completed in 20.020529287s
2025/02/04 01:33:29 [INFO] Pod bert-inference-xr8cv is running on node ip-192-168-28-71.us-west-2.compute.internal
2025/02/04 01:33:29 [INFO] Retrieving logs from pod bert-inference-xr8cv...
2025/02/04 01:33:29 [INFO] Logs from Pod bert-inference-xr8cv:
Error in cpuinfo: prctl(PR_SVE_GET_VL) failed
/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
[2025-02-04 01:33:14,114] [INFO] [BERTInference] [INFO] Found 1 GPU(s). GPU is available.
[2025-02-04 01:33:14,114] [INFO] [BERTInference] [INFO] Running inference in latency mode with batch size 1.
/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2025-02-04 01:33:22,212] [INFO] [BERTInference] [BERT_INFERENCE_METRICS] mode=latency avg_time_per_batch=0.015120 throughput_samples_per_sec=66.135404
2025/02/04 01:33:29 [INFO] Completed log stream for pod bert-inference-xr8cv.
2025/02/04 01:33:29 [INFO] Cleaning up BERT inference job resources...
2025/02/04 01:33:29 [INFO] BERT inference job resources cleaned up.
--- PASS: TestBertInference (20.12s)
    --- PASS: TestBertInference/bert-inference (20.12s)
        --- PASS: TestBertInference/bert-inference/BERT_inference_Job_succeeds (20.07s)
PASS
2025/02/04 01:33:29 [INFO] Cleaning up NVIDIA device plugin.
2025/02/04 01:33:29 [INFO] Device plugin cleanup complete.
2025/02/04 01:33:29 [INFO] Test environment finished with exit code 0
ok  	github.com/aws/aws-k8s-tester/test/cases/nvidia-inference       35.908s

Additionally tested for backwards compatibility with x86_64 NVIDIA instance types.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ndbaker1 ndbaker1 force-pushed the pytorch branch 7 times, most recently from dfda945 to 01297b0 on February 6, 2025 03:44
Comment on lines +82 to +97
###############################################################################
# 4) Install Pytorch from Source
###############################################################################
# envs needed to make the path of NVCC known to the compilation
ENV CUDA_HOME=/usr/local/cuda
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
ENV PATH=$PATH:$CUDA_HOME/bin
# this list could be minimized based on the supported GPUs
ENV TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 8.7 8.9 9.0"

RUN pip3 install typing-extensions sympy
RUN git clone \
        --recursive https://github.com/pytorch/pytorch.git \
        --branch $PYTORCH_BRANCH \
    && cd pytorch && eval "$PYTORCH_BUILD_ENV python3 setup.py install" && cd .. \
    && rm -rf pytorch

Contributor:
Any idea how long this step takes? Just curious

Contributor Author:
the workflow took ~5 hr 30 min, the bulk of which is spent on this step 😅

- run: docker build --file test/images/nvidia-inference/Dockerfile test/images/nvidia-inference
- run: |
    docker build --file test/images/nvidia-inference/Dockerfile test/images/nvidia-inference \
      --build-arg PYTORCH_BUILD_ENV="MAX_JOBS=8 BUILD_TEST=0 USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 USE_DISTRIBUTED=0"
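
(For context, not stated in the thread: MAX_JOBS caps the number of parallel compile jobs, BUILD_TEST=0 skips building PyTorch's C++ test binaries, USE_FLASH_ATTENTION=0 and USE_MEM_EFF_ATTENTION=0 skip the large fused-attention CUDA kernels, and USE_DISTRIBUTED=0 drops the distributed backends. Together these trade optional features for a smaller, faster build.)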

Contributor:
Just curious... Any reason for choosing a value of 8 for MAX_JOBS?

Contributor Author:
yeah, this was manually tuned: MAX_JOBS values that were too high hit OOM errors, so I searched for a value that still passed within the default 6-hour GitHub Actions limit
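
For anyone tuning this later, a rough heuristic (mine, not from the PR): PyTorch compile jobs can each peak at a few GB of RAM, so it can help to cap MAX_JOBS by memory as well as CPU count. A small sketch assuming ~4 GB per job:

import os

def suggest_max_jobs(gb_per_job: int = 4) -> int:
    # Cap parallel compile jobs by both CPU count and total memory (reads Linux /proc/meminfo).
    with open("/proc/meminfo") as f:
        mem_kb = next(int(line.split()[1]) for line in f if line.startswith("MemTotal"))
    mem_jobs = max(1, (mem_kb // (1024 * 1024)) // gb_per_job)
    return min(os.cpu_count() or 1, mem_jobs)

print(suggest_max_jobs())  # e.g. 8 on a 32 GiB runner under the 4 GB/job assumption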

@ndbaker1 ndbaker1 marked this pull request as ready for review February 7, 2025 01:14

@mattcjo mattcjo left a comment
LGTM

@ndbaker1 ndbaker1 force-pushed the pytorch branch 9 times, most recently from 5a2805e to e867863 on February 12, 2025 23:14
@@ -110,8 +110,7 @@ def main():
     # Retrieve environment variables
     rank = int(os.getenv("OMPI_COMM_WORLD_RANK", "0"))
     world_size = int(os.getenv("OMPI_COMM_WORLD_SIZE", "1"))
-    num_gpus_per_node = int(os.getenv("NUM_GPUS_PER_NODE", "8"))
-    local_rank = rank % num_gpus_per_node
+    local_rank = int(os.getenv("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
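
Background note, not from the PR thread: Open MPI exports the node-local rank as OMPI_COMM_WORLD_LOCAL_RANK for processes launched via mpirun, so it can be read directly instead of being derived from the global rank and a hard-coded NUM_GPUS_PER_NODE. A minimal sketch of the resulting device selection, assuming one process per GPU:

import os
import torch

# Open MPI sets these for each process it launches.
rank = int(os.getenv("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.getenv("OMPI_COMM_WORLD_SIZE", "1"))
local_rank = int(os.getenv("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

# The node-local rank picks the GPU on this node.
torch.cuda.set_device(local_rank)
print(f"rank {rank}/{world_size} -> cuda:{local_rank}")
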
Comment on lines -32 to -34
test_04_bus_grind()
{
    assert_status_code 0 "$DEMU_SUITE_DIR/busGrind -a"

Contributor Author:
the busGrind suite isn't in the upstream cuda-samples repo that I moved to in place of the cuda-demo-suite package, which was only available for x86_64

@ndbaker1 ndbaker1 requested a review from mattcjo February 13, 2025 00:25

@mattcjo mattcjo left a comment
LGTM

@ndbaker1 ndbaker1 merged commit f6edf1d into aws:main Feb 17, 2025
9 checks passed
@ndbaker1 ndbaker1 deleted the pytorch branch February 17, 2025 23:19