cutorch.synchronize() does not work #9
Do you have a small repro? A known issue with NCCL is that it will hang if some other thread calls cudaFree while NCCL kernels are being scheduled. You can try running your code with the env var THC_CACHING_ALLOCATOR=1 and see if the problem persists.
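The suggestion above can be applied without touching the code, since cutorch reads the variable from the environment. A minimal sketch of the invocation, assuming a hypothetical training script `main.lua` run with the `th` launcher:

```shell
# THC_CACHING_ALLOCATOR=1 tells cutorch to cache device allocations and
# reuse them, instead of eagerly calling cudaFree -- which is the call
# that can collide with NCCL kernel scheduling and cause the hang.
THC_CACHING_ALLOCATOR=1 th main.lua
```

Setting it inline like this scopes the variable to a single run, so it is easy to compare behavior with and without the caching allocator.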
I am using this code, but I changed it to use NCCL.
No changes to this code are necessary to use NCCL; it should do so automatically. Take a look at https://github.com/facebook/fb.resnet.torch/, which also uses NCCL without deadlocks. If you are adding cutorch.synchronize() in such a way that it is called while NCCL kernels are being scheduled, NCCL will deadlock; you should make sure you are not doing that.
It doesn't seem that it uses NCCL by default.
I changed it to use NCCL: `local model_single = model`
Without that change it uses the default communication between GPUs, which is very slow.
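For context, the change being described usually amounts to passing `true` as the third constructor argument (`usenccl`) of `nn.DataParallelTable`, which is the pattern fb.resnet.torch uses. A minimal sketch, where `model_single` is the single-GPU network from the snippet above and `nGpu` is an assumed variable holding the GPU count:

```lua
require 'cunn'
require 'cutorch'

-- nn.DataParallelTable(dim, flattenParams, usenccl):
--   dim = 1           splits each minibatch along the batch dimension,
--   flattenParams     flattens parameters into contiguous storage,
--   usenccl = true    requests NCCL for inter-GPU communication.
local dpt = nn.DataParallelTable(1, true, true)
   :add(model_single, torch.range(1, nGpu):totable())

local model = dpt:cuda()
```

This is a sketch of the general pattern, not the exact code from the repository; if the `nccl.torch` bindings are not installed, DataParallelTable falls back to its slower default copy path.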
Also, I found that it does work for 2 GPUs but hangs for 4 GPUs.
I installed NCCL and am trying to use it.
Without NCCL, everything seems fine but slow.
But when I use NCCL, my code just stops at the call to cutorch.synchronize() and does nothing, without raising any error.
How can I find the root of the problem?
Thanks