[DO NOT MERGE] Initial NCCL Allreduce Backend Prototype #7298
Conversation
Force-pushed 4de713b to debe5b4
Looks interesting! I made some comments. It would be great to also see the integration point of the CCL collectives into the allreduce path for evaluation.
src/mpi/ccl/cclcomm.c (outdated)
```c
    MPIR_ERR_CHECK(mpi_errno);

    if (!comm_ptr->cclcomm) {
        mpi_errno = MPIR_CCLcomm_init(comm_ptr, comm_ptr->rank);
```
Setup on first use makes sense. The cleanup call to MPIR_CCLcomm_free is missing from this PR at the moment.
I added the free call to src/mpi/comm/commutil.c
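For illustration, such a free routine might look roughly like the following (a sketch only; the names follow the PR's MPIR_CCLcomm_ convention, the call site in commutil.c and the allocation scheme are assumptions, and the actual code in the PR may differ):

```c
/* Hypothetical sketch of a teardown counterpart to MPIR_CCLcomm_init,
 * called from the communicator destruction path in
 * src/mpi/comm/commutil.c. ncclCommDestroy and cudaStreamDestroy are
 * the real library calls; the surrounding structure is assumed. */
int MPIR_CCLcomm_free(MPIR_Comm * comm)
{
    int mpi_errno = MPI_SUCCESS;
    MPIR_CCLcomm *cclcomm = comm->cclcomm;

    if (cclcomm) {
        ncclCommDestroy(cclcomm->ncclcomm);     /* free the NCCL communicator */
        cudaStreamDestroy(cclcomm->stream);     /* release the cached stream */
        MPL_free(cclcomm);                      /* assumes plain allocation at init time */
        comm->cclcomm = NULL;
    }
    return mpi_errno;
}
```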
src/mpi/ccl/cclcomm.c (outdated)
```c
 * CCLcomm functions, currently tied to NCCL
 */

int MPIR_CCLcomm_init(MPIR_Comm * comm, int rank)
```
To be consistent with the other CCLcomm functions, you might want to move the NCCL specifics into an MPIR_NCCL_-prefixed function.
You can check how we managed PMI wrappers in src/util/mpir_pmi.c and src/util/mpir_pmi[x].inc.
All of the NCCL-specific code is now separated into its own file and activated by its own macro, ENABLE_NCCL.
src/include/mpir_cclcomm.h (outdated)
```c
#include <nccl.h>

typedef struct MPIR_CCLcomm {
    MPIR_OBJECT_HEADER;
    MPIR_Comm *comm;
    ncclUniqueId id;
    ncclComm_t ncclcomm;
    cudaStream_t stream;
} MPIR_CCLcomm;
```
Will need the right autoconf/automake defines to only include this stuff when it's available.
I agree. I added temporary macros to show where the ifdefs will go in the new files, but I am not confident about how to modify the autotools files to properly detect CCLs. I would appreciate your help with this!
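For illustration, the guard pattern being described might look like this (a sketch; it assumes configure ultimately defines the ENABLE_NCCL macro named above, so that builds without NCCL never see the NCCL/CUDA types):

```c
/* sketch: src/include/mpir_cclcomm.h guarded by the configure result */
#ifdef ENABLE_NCCL
#include <nccl.h>

typedef struct MPIR_CCLcomm {
    MPIR_OBJECT_HEADER;
    MPIR_Comm *comm;
    ncclUniqueId id;
    ncclComm_t ncclcomm;
    cudaStream_t stream;
} MPIR_CCLcomm;
#endif /* ENABLE_NCCL */
```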
src/mpi/ccl/cclcomm.c (outdated)
```c
                        MPIR_ERR_NONE);
    MPIR_ERR_CHECK(mpi_errno);

    ret = cudaSetDevice(0);     //TODO: Fix this for multi-GPU nodes
```
Probably want to move this to collective invocation time and inherit whichever device the user-allocated buffer is located on. We could end up with multiple streams cached on the comm object.
Fixed!
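For reference, inheriting the buffer's device at invocation time might look like this (a minimal sketch; cudaPointerGetAttributes is the standard CUDA runtime query, while the helper name and error convention are hypothetical):

```c
#include <cuda_runtime.h>

/* Hypothetical helper: select the device that owns the user's buffer
 * before issuing NCCL work, instead of hard-coding device 0. */
static int set_device_from_buffer(const void *buf)
{
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess)
        return -1;
    if (attr.type == cudaMemoryTypeDevice || attr.type == cudaMemoryTypeManaged)
        return cudaSetDevice(attr.device) == cudaSuccess ? 0 : -1;
    return -1;  /* host memory: no device to inherit */
}
```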
I think another question down the road is whether or not there is enough commonality in the CCL impls that it can be abstracted in …
@raffenet Thanks for the comments! I will take a look at the individual feedback and make changes.
For my testing, I added it to …
Ah, right 😄. In the case of …
src/mpi/ccl/Makefile.mk (outdated)
```make
##

mpi_core_sources += \
    src/mpi/ccl/cclcomm.c
```
Place it under src/util or src/mpid/common.
What is the logic we should use to decide where to store these new files? Note there are now a few new source files in this directory.
src/mpi/ccl/cclcomm.c (outdated)
```c
    CUDA_ERR_CHECK(ret);
    ret = cudaStreamCreate(&(cclcomm->stream));
    CUDA_ERR_CHECK(ret);
    ret = ncclCommInitRank(&(cclcomm->ncclcomm), comm_size, cclcomm->id, rank);
```
Does NCCL work with duplicate/shared devices? Does it work across multiple nodes?
I know NVIDIA has released some new features recently to enable multiple processes to share a GPU, but I think that is rare. My expectation is that this feature (and GPU-enabled MPI in general) assumes one process per GPU.
NCCL does work across multiple nodes. It natively supports InfiniBand and a few other transports, and there are plugins for others like OFI.
Lazy initialization should work.
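For context, the usual NCCL bootstrap over MPI looks roughly like the following (a sketch; shown with plain MPI_Bcast, whereas inside MPICH this would be an internal MPIR_Bcast, and error handling is simplified):

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

/* Sketch: lazy NCCL communicator setup. Rank 0 creates the unique id
 * and shares it via broadcast; then all ranks join collectively. */
static int ccl_lazy_init(MPI_Comm comm, int rank, int comm_size,
                         ncclComm_t * ncclcomm, cudaStream_t * stream)
{
    ncclUniqueId id;
    if (rank == 0)
        ncclGetUniqueId(&id);   /* only rank 0 creates the id */
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, comm);

    if (ncclCommInitRank(ncclcomm, comm_size, id, rank) != ncclSuccess)
        return -1;              /* collective: all ranks must participate */
    return cudaStreamCreate(stream) == cudaSuccess ? 0 : -1;
}
```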
After #7264, you only need to deal with internal datatypes, which can be placed into a switch statement.
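A sketch of what such a switch might look like (hypothetical helper names and a representative subset of types; the exact set of internal datatypes depends on #7264):

```c
/* Hypothetical mapping helpers from MPI handles to NCCL enums. In
 * MPICH the handles are integer constants, so a switch is legal. */
static int MPIR_NCCL_datatype(MPI_Datatype dtype, ncclDataType_t * out)
{
    switch (dtype) {
        case MPI_INT8_T:   *out = ncclInt8;    break;
        case MPI_UINT8_T:  *out = ncclUint8;   break;
        case MPI_INT32_T:  *out = ncclInt32;   break;
        case MPI_UINT32_T: *out = ncclUint32;  break;
        case MPI_INT64_T:  *out = ncclInt64;   break;
        case MPI_UINT64_T: *out = ncclUint64;  break;
        case MPI_FLOAT:    *out = ncclFloat32; break;
        case MPI_DOUBLE:   *out = ncclFloat64; break;
        default: return MPI_ERR_TYPE;   /* not representable in NCCL */
    }
    return MPI_SUCCESS;
}

static int MPIR_NCCL_op(MPI_Op op, ncclRedOp_t * out)
{
    switch (op) {
        case MPI_SUM:  *out = ncclSum;  break;
        case MPI_PROD: *out = ncclProd; break;
        case MPI_MAX:  *out = ncclMax;  break;
        case MPI_MIN:  *out = ncclMin;  break;
        default: return MPI_ERR_OP;     /* e.g. MPI_LXOR has no NCCL equivalent */
    }
    return MPI_SUCCESS;
}
```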
Something similar to how we wrap the PMIs (…
Force-pushed debe5b4 to ee76028
Force-pushed b237081 to 645aef1
I have updated this prototype and am re-requesting reviews. The implementation is now much more fleshed out. I separated the CCL framework and the NCCL calls, and I have integrated them in the allreduce path via a new "ccl" algorithm. The main thing missing at this point is the autoconf/automake detection of the NCCL path. If someone can point me in the right direction on this, I would greatly appreciate it! Regarding:
It is currently implemented as the second option, but long-term I would like to do the first. Let me know what you think!
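Not a definitive answer to the autoconf question, but a minimal configure.ac sketch using stock autoconf macros (MPICH's own PAC_* helpers would likely be preferred in-tree; the ENABLE_NCCL symbol matches the macro used in this PR, and the install layout is assumed):

```m4
# Sketch: --with-nccl=<path> handling with plain autoconf macros.
AC_ARG_WITH([nccl],
    [AS_HELP_STRING([--with-nccl=PATH], [path to NCCL installation])],
    [], [with_nccl=no])

AS_IF([test "x$with_nccl" != "xno"], [
    AS_IF([test "x$with_nccl" != "xyes"], [
        CPPFLAGS="$CPPFLAGS -I$with_nccl/include"
        LDFLAGS="$LDFLAGS -L$with_nccl/lib"
    ])
    AC_CHECK_HEADER([nccl.h],
        [AC_CHECK_LIB([nccl], [ncclCommInitRank],
            [LIBS="$LIBS -lnccl"
             AC_DEFINE([ENABLE_NCCL], [1], [Define if NCCL is available])],
            [AC_MSG_ERROR([--with-nccl given but libnccl not found])])],
        [AC_MSG_ERROR([--with-nccl given but nccl.h not found])])
])
```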
Pull Request Description
This PR is a proof-of-concept of how we can use NCCL as a backend of MPI collectives. This PR is missing many necessary features and is not meant to be merged. I am interested in gathering feedback before continuing development. Some open questions/points for discussion I have:

* When/where to properly init/free the `CCLcomm` structure
* How to handle operations and datatypes portably (copy+pasting them into a new switch statement feels frail, and I am not 100% sure I covered all of the relevant datatypes)
* How to design this extensibly so we can add RCCL, OneCCL, etc. (My intuition here is a base abstract class + derived classes, but this is not OOP.)
* How to modify `src/mpi/coll/mpir_coll.c` to consider `CCL_Allreduce` (see the sketch after this list)
* How to pull in NCCL from the environment and/or a configure argument `--with-nccl=`
* Function/variable/file names and locations

And anything else you can think of. Let me know what you think!
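To make the `CCL_Allreduce` integration point concrete, here is a sketch of what the NCCL-backed allreduce might reduce to (a hypothetical wrapper; ncclAllReduce and cudaStreamSynchronize are the real library calls, while the function name and error mapping follow this PR's style but are assumptions):

```c
/* Hypothetical sketch of the NCCL-backed allreduce path. Assumes the
 * comm's CCL state was lazily initialized and the datatype/op were
 * already mapped to NCCL enums (see the switch sketch above). */
int MPIR_NCCL_Allreduce(const void *sendbuf, void *recvbuf, MPI_Aint count,
                        ncclDataType_t dtype, ncclRedOp_t op, MPIR_Comm * comm_ptr)
{
    MPIR_CCLcomm *cclcomm = comm_ptr->cclcomm;

    /* enqueue the collective on the cached stream */
    ncclResult_t ret = ncclAllReduce(sendbuf, recvbuf, (size_t) count, dtype, op,
                                     cclcomm->ncclcomm, cclcomm->stream);
    if (ret != ncclSuccess)
        return MPI_ERR_OTHER;

    /* block until the result is visible, preserving MPI completion semantics */
    if (cudaStreamSynchronize(cclcomm->stream) != cudaSuccess)
        return MPI_ERR_OTHER;

    return MPI_SUCCESS;
}
```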
Author Checklist
* Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
* Commits are self-contained and do not do two things at once.
* Commit message is of the form: `module: short description`
* Commit message explains what's in the commit.
* Whitespace checker. Warnings test. Additional tests via comments.
* For non-Argonne authors, check contribution agreement.
* If necessary, request an explicit comment from your company's PR approval manager.