DDP not moving batch to device? #4987
Comments
Hi! Thanks for your contribution, great first issue!
Hi @cccntu, DDP moves the batch to the device internally, which is why this is missing from the accelerator code. The examples all work with DDP; have a look here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/simple_image_classifier.py If you're able to replicate the error using the bug report model, we'll be able to help you get to a solution! https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py
Hi @SeanNaren, thanks for the reply.
What do you mean by internally? Can you give me some pointers? I think there should be a section in the documentation for questions like "How does PL do this internally?", with explanations and links to the actual code. My current guess is that it's because I am using huggingface's `BatchEncoding`. Also, I think I found a bug about DDP:
That is strange! Internally we just use standard PyTorch DDP, which scatters inputs before passing them into the forward function: https://github.com/pytorch/pytorch/blob/v1.7.0/torch/nn/parallel/distributed.py#L617 This brings them onto the current GPU devices automatically before they reach the forward. Strange that this isn't happening automatically for your inputs.
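For context, here is a minimal sketch of the scatter behaviour described above (plain PyTorch, not Lightning code; it assumes a machine with at least one CUDA device):

```python
import torch
from torch.nn.parallel.scatter_gather import scatter

# A dict of tensors is a "standard" collection: scatter recurses into it,
# slices each tensor along the batch dimension, and moves the slices onto
# the target GPUs before they ever reach forward().
batch = {
    "input_ids": torch.zeros(8, 16, dtype=torch.long),
    "attention_mask": torch.ones(8, 16, dtype=torch.long),
}

scattered = scatter(batch, target_gpus=[0])
print(scattered[0]["input_ids"].device)  # cuda:0
```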
@SeanNaren the DDP built-in scatter only moves data to the devices for tensors and for tensors inside collections like lists, tuples, etc. In our accelerator base class we have a method that moves the batch to the device. It is currently called for the single-GPU and TPU accelerators and not for the distributed accelerators. If @cccntu runs with one of those accelerators, I am sure their batch would end up on the right device.
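To illustrate the limitation: with the same scatter utility, a custom Python object is not recognised, so it is simply replicated by reference and its tensors stay on the CPU. A sketch with a hypothetical stand-in class:

```python
import torch
from torch.nn.parallel.scatter_gather import scatter

class MyBatch:  # hypothetical stand-in for something like BatchEncoding
    def __init__(self):
        self.input_ids = torch.zeros(8, 16, dtype=torch.long)

scattered = scatter(MyBatch(), target_gpus=[0])
# The object is passed through untouched, so its tensors never leave the CPU.
print(scattered[0].input_ids.device)  # cpu
```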
This should be `accelerator=ddp`.
I am working on the minimum example, but I think @awaelchli is right. Still, using ...
You don't necessarily need to work on a reproducible example, since this is not a bug but rather a limitation of scattering. We are aware of this. But if you want to, it is certainly appreciated.
No, how could it? Custom Python objects like BatchEncoding don't have a batch size / batch dimension, so scattering is not defined for them. As proposed in #1206 and #2350, we need a way for the user to define scatter and gather for these objects.
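For illustration, here is roughly what a user-side override could look like once the batch-to-device hook is invoked for the accelerator in use; a minimal sketch, assuming Lightning's `transfer_batch_to_device` hook and a batch object that implements `.to()`, as HuggingFace's `BatchEncoding` does:

```python
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def transfer_batch_to_device(self, batch, device):
        # BatchEncoding-like objects implement .to(), so move them manually;
        # defer to Lightning's default handling for plain tensors/collections.
        if hasattr(batch, "to"):
            return batch.to(device)
        return super().transfer_batch_to_device(batch, device)
```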
@awaelchli Thanks for the explanation and links. Here is the reproducible example I wrote for future reference. https://gist.github.com/cccntu/967d9624d37024875e6cd094d2bf13ae
I just checked again using nvidia-smi; it seems the computation does run on GPUs 3 and 4, however it also occupies approximately the same amount of memory on GPUs 0 and 1.
The OOM may be related to #4705.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
This is fixed by #5195 for single device/single process (like DDP). |
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
This should be fixed via #5195! |
Hi, I am using 1.0.8. I encountered an error saying the input is not on the same device as the model. I printed the inputs and found out they are on the CPU. I noticed the code below does not move the inputs to the device.
https://github.com/PyTorchLightning/pytorch-lightning/blob/0979e2ce0f04ffa4facc13f08dc8d1612cdeae3e/pytorch_lightning/accelerators/ddp_accelerator.py#L153-L159
I added 2 lines of code and it seems to run.
But the loss doesn't update during the epoch and becomes NaN after one epoch.
Is there a simple ddp example I can run?
Thanks!
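For reference, the kind of manual device move described above could also be written with Lightning's own utility; a hypothetical sketch (not the actual two lines from the report), assuming a dict-style batch and a CUDA device:

```python
import torch
from pytorch_lightning.utilities.apply_func import move_data_to_device

def to_current_gpu(batch):
    # Move tensors (including tensors nested in dicts/lists/tuples) to this process's GPU.
    device = torch.device("cuda", torch.cuda.current_device())
    return move_data_to_device(batch, device)

batch = {"input_ids": torch.zeros(4, 8, dtype=torch.long)}
batch = to_current_gpu(batch)
print(batch["input_ids"].device)  # e.g. cuda:0
```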