
How to mitigate the CUDA-error: Out-of-memory? #10

Closed

thucz opened this issue Dec 23, 2023 · 9 comments

Comments

@thucz

thucz commented Dec 23, 2023

Hello! Since I only have 4 A100 GPUs available right now, I reduced the chunk size from 16*64 to 256, but the out-of-memory error still appears. Do you have any idea how to fix it?
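
For context, the chunk size here is the number of rays pushed through the renderer per forward pass, so shrinking it mainly trades speed for peak memory. A minimal sketch of that pattern, using hypothetical names (`render_rays`, `chunk`) rather than this repo's actual API:

```python
import torch

def render_in_chunks(render_rays, rays, chunk=16 * 64):
    """Render `rays` (shape (N, ...)) a chunk at a time so only one chunk
    of intermediate activations is alive at once."""
    outputs = []
    for i in range(0, rays.shape[0], chunk):
        # Each iteration only materializes activations for `chunk` rays.
        outputs.append(render_rays(rays[i:i + chunk]))
    return torch.cat(outputs, dim=0)
```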

@thucz
Author

thucz commented Dec 23, 2023

I found that even when I used 8 A100 cards with your given parameters (chunk size 16*64), the error still happened.

@thucz thucz closed this as completed Dec 23, 2023
@thucz thucz reopened this Dec 23, 2023
@zubair-irshad
Owner

I am currently only able to test with 7 GPUs, and training runs fine. Can you share your GPU utilization? Mine is shown below; it uses around 40 GB of memory per GPU with chunk size = 16 * 64.

[Screenshot: nvidia-smi showing ~40 GB memory usage per GPU]

@zubair-irshad
Owner

Here is my training progression:

[Screenshot: training progress logs]

@thucz
Author

thucz commented Dec 24, 2023

[Screenshot: nvidia-smi GPU utilization]

I use watch -n1 nvidia-smi to observe the GPU utilization. It reached 40 GB and then crashed.

Even when I used chunk size 512 with 8 A100 GPUs, OOM still happened.

Do you have any advice for reducing GPU memory?
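
Alongside watch -n1 nvidia-smi, one way to see exactly where the memory goes is to log PyTorch's own counters inside the training loop. This is a generic snippet using standard torch.cuda calls, not code from this repo:

```python
import torch

def log_gpu_memory(step):
    # Print currently allocated and peak allocated memory for every visible GPU.
    for d in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(d) / 1024**3
        peak = torch.cuda.max_memory_allocated(d) / 1024**3
        print(f"step {step} | cuda:{d} | allocated {alloc:.1f} GiB | peak {peak:.1f} GiB")
```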

@thucz
Author

thucz commented Dec 24, 2023

I switched to a Docker image with CUDA 11.3 and the code now runs normally. Previously I was using a Docker image with CUDA 11.7. Sorry for bothering you.

@thucz thucz closed this as completed Dec 24, 2023
@thucz
Author

thucz commented Dec 24, 2023

I still wonder how to reduce GPU memory, though, because I want to run this on other cards like the V100 (32 GB).

@thucz
Author

thucz commented Dec 24, 2023

Just now I found that it uses about 58 GB per GPU on the 80 GB cards, which is strange.

[Screenshot: nvidia-smi showing ~58 GB memory usage per GPU]

@zubair-irshad
Owner

Great to know that you have the code working on your end on A100 GPUs. To further reduce memory, you can try the following:

  1. We randomly sample 500 rays from 20 destination views for rendering the target pixels. You could try reducing either of these numbers to save memory. Please note that 500 is already a very low number, so I would suggest playing with the other parameter, i.e. the number of destination views, first.

  2. Our data loader needs some refactoring. Currently, we load all annotations, i.e. NOCS maps and instance maps. Refactoring this might reduce some memory, but not by much.

  3. One could of course reduce the img_size for training and then fine-tune at a higher resolution.

  4. I tried to improve grid sampling, which is probably the part that requires the most memory in a single forward pass, and we have some batchifying code commented out here which was a WIP and never truly tested (a rough sketch of the idea is shown below). Please feel free to also give this a try, but note that we have not benchmarked our numbers with this batchification.

We have not tried any of the above locally on our end, so we haven't benchmarked the exact memory savings they would provide, but please feel free to give them a try and let us know how it goes. Hope it helps your research!
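
Regarding point 4, here is a rough sketch of what batchified grid sampling could look like; the commented-out code in the repo may differ, and the tensor shapes and chunk size below are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def grid_sample_chunked(feat, coords, chunk=65536):
    """feat: (B, C, H, W); coords: (B, N, 2) in [-1, 1]; returns (B, C, N).

    Sample query points `chunk` at a time so only one chunk of sampled
    features and its intermediates is resident in memory at once.
    """
    outs = []
    for i in range(0, coords.shape[1], chunk):
        grid = coords[:, i:i + chunk, :].unsqueeze(2)              # (B, n, 1, 2)
        sampled = F.grid_sample(feat, grid, align_corners=True)    # (B, C, n, 1)
        outs.append(sampled.squeeze(-1))
    return torch.cat(outs, dim=-1)
```

Note that under autograd this mainly trims intermediate buffers; the concatenated output (and its gradients) still has to fit in memory.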

@thucz
Author

thucz commented Dec 25, 2023

Thanks a lot! I will try your advice.
