
How to mitigate the CUDA-error: Out-of-memory? #10

Closed

thucz opened this issue Dec 23, 2023 · 9 comments

Comments

@thucz

thucz commented Dec 23, 2023

Hello! Since I only have 4 A100 GPUs available right now, I reduced the chunk size from 16*64 to 256, but the out-of-memory error still appears. Do you have any idea how to fix it?
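
For context, the chunk size here is the number of rays pushed through the renderer per forward pass, so shrinking it mainly trades speed for peak memory. A minimal sketch of that pattern, using hypothetical names (`render_rays`, `chunk`) rather than this repo's actual API:

```python
import torch

def render_in_chunks(render_rays, rays, chunk=16 * 64):
    """Render `rays` (shape (N, ...)) a chunk at a time so only one chunk
    of intermediate activations is alive at once."""
    outputs = []
    for i in range(0, rays.shape[0], chunk):
        # Each iteration only materializes activations for `chunk` rays.
        outputs.append(render_rays(rays[i:i + chunk]))
    return torch.cat(outputs, dim=0)
```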

@thucz
Author

thucz commented Dec 23, 2023

I found that even when I used 8 A100 cards with your given parameters (chunk size 16*64), the error still happened.

@thucz thucz closed this as completed Dec 23, 2023
@thucz thucz reopened this Dec 23, 2023
@zubair-irshad
Owner

I am currently only able to test with 7 GPUs, and training runs fine. Can you share your GPU utilization? Mine is shown below; it uses around 40 GB of memory per GPU with chunk size = 16 * 64.

[Screenshot: nvidia-smi showing ~40 GB memory usage per GPU]

@zubair-irshad
Owner

Here is my training progression:

[Screenshot: training progress logs]

@thucz
Author

thucz commented Dec 24, 2023

[Screenshot: nvidia-smi GPU utilization]

I use watch -n1 nvidia-smi to observe the GPU utilization. It reached 40 GB and then crashed.

Even when I used chunk size 512 with 8 A100 GPUs, OOM still happened.

Do you have any advice for reducing GPU memory?
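
Alongside watch -n1 nvidia-smi, one way to see exactly where the memory goes is to log PyTorch's own counters inside the training loop. This is a generic snippet using standard torch.cuda calls, not code from this repo:

```python
import torch

def log_gpu_memory(step):
    # Print currently allocated and peak allocated memory for every visible GPU.
    for d in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(d) / 1024**3
        peak = torch.cuda.max_memory_allocated(d) / 1024**3
        print(f"step {step} | cuda:{d} | allocated {alloc:.1f} GiB | peak {peak:.1f} GiB")
```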

@thucz
Author

thucz commented Dec 24, 2023

I switched to a Docker image with CUDA 11.3 and the code now runs normally. Previously I was using a Docker image with CUDA 11.7. Sorry for bothering you.

@thucz thucz closed this as completed Dec 24, 2023
@thucz
Author

thucz commented Dec 24, 2023

I still wonder how to reduce GPU memory, though, because I want to run this on other cards like the V100 (32 GB).

@thucz
Author

thucz commented Dec 24, 2023

Just now I found that it uses about 58 GB per GPU on the 80 GB cards, which is strange.

[Screenshot: nvidia-smi showing ~58 GB memory usage per GPU]

@zubair-irshad
Owner

Great to know that you have the code working on your end on A100 GPUs. To further reduce memory, you can try the following:

  1. We randomly sample 500 rays from 20 destination views for rendering the target pixels. You could try reducing either of these numbers to save memory. Please note that 500 is already a very low number, so I would suggest playing with the other parameter, i.e. the number of destination views, first.

  2. Our data loader needs some refactoring. Currently, we load all annotations, i.e. NOCS maps and instance maps. Refactoring this might reduce some memory, but not by much.

  3. One could of course reduce the img_size for training and then fine-tune at a higher resolution.

  4. I tried to improve grid sampling, which is probably the part that requires the most memory in a single forward pass, and we have some batchifying code commented out here which was a WIP and never truly tested (a rough sketch of the idea is shown below). Please feel free to also give this a try, but note that we have not benchmarked our numbers with this batchification.

We have not tried any of the above locally on our end, so we haven't benchmarked the exact memory savings they would provide, but please feel free to give them a try and let us know how it goes. Hope it helps your research!
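
Regarding point 4, here is a rough sketch of what batchified grid sampling could look like; the commented-out code in the repo may differ, and the tensor shapes and chunk size below are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def grid_sample_chunked(feat, coords, chunk=65536):
    """feat: (B, C, H, W); coords: (B, N, 2) in [-1, 1]; returns (B, C, N).

    Sample query points `chunk` at a time so only one chunk of sampled
    features and its intermediates is resident in memory at once.
    """
    outs = []
    for i in range(0, coords.shape[1], chunk):
        grid = coords[:, i:i + chunk, :].unsqueeze(2)              # (B, n, 1, 2)
        sampled = F.grid_sample(feat, grid, align_corners=True)    # (B, C, n, 1)
        outs.append(sampled.squeeze(-1))
    return torch.cat(outs, dim=-1)
```

Note that under autograd this mainly trims intermediate buffers; the concatenated output (and its gradients) still has to fit in memory.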

@thucz
Author

thucz commented Dec 25, 2023

Thanks a lot! I will try your advice.
