
[DO NOT MERGE] Initial NCCL Allreduce Backend Prototype #7298

Draft · wants to merge 6 commits into main from mpich_nccl

Conversation

mjwilkins18 (Contributor):

Pull Request Description

This PR is a proof-of-concept of how we can use NCCL as a backend for MPI collectives. It is missing many necessary features and is not meant to be merged; I am interested in gathering feedback before continuing development. Some open questions/points for discussion I have:

  • When/where to properly init/free the CCLcomm structure
  • How to handle operations and datatypes portably (copy-pasting them into a new switch statement feels fragile, and I am not 100% sure I covered all of the relevant datatypes)
  • How to design this extensibly so we can add RCCL, OneCCL, etc. (My intuition here is a base abstract class + derived classes, but this is not OOP.)
  • How to modify src/mpi/coll/mpir_coll.c to consider CCL_Allreduce.
  • How to pull in NCCL from the environment and/or a configure argument --with-nccl=
  • Function/variable/file names and locations

And anything else you can think of. Let me know what you think!
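
For readers skimming the diff, the core of the idea is simply to hand a qualifying allreduce to NCCL using an ncclComm_t and cudaStream_t cached on the communicator. A minimal sketch of that shape (the helper name and error handling are illustrative, not the PR's actual code):

```c
/* Sketch only: forward an allreduce to NCCL via the CCLcomm cached on the
 * MPIR_Comm (see the MPIR_CCLcomm struct quoted below).  Assumes datatype/op
 * translation to NCCL enums has already happened. */
static int ccl_allreduce_sketch(const void *sendbuf, void *recvbuf, size_t count,
                                ncclDataType_t nccl_dtype, ncclRedOp_t nccl_op,
                                MPIR_CCLcomm *cclcomm)
{
    if (ncclAllReduce(sendbuf, recvbuf, count, nccl_dtype, nccl_op,
                      cclcomm->ncclcomm, cclcomm->stream) != ncclSuccess)
        return MPI_ERR_OTHER;
    /* NCCL only enqueues work on the stream; synchronize so the call keeps
     * MPI's blocking-collective semantics. */
    if (cudaStreamSynchronize(cclcomm->stream) != cudaSuccess)
        return MPI_ERR_OTHER;
    return MPI_SUCCESS;
}
```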

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

mjwilkins18 force-pushed the mpich_nccl branch 2 times, most recently from 4de713b to debe5b4 on February 7, 2025
@raffenet (Contributor) left a comment:

Looks interesting! I made some comments. It would be great to also see the integration point of the CCL collectives into the allreduce path for evaluation.

MPIR_ERR_CHECK(mpi_errno);

if (!comm_ptr->cclcomm) {
    mpi_errno = MPIR_CCLcomm_init(comm_ptr, comm_ptr->rank);
Contributor:

Setup on first use makes sense. The cleanup call to MPIR_CCLcomm_free is missing from this PR at the moment.

Contributor Author:

I added the free call to src/mpi/comm/commutil.c
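
For context, the cleanup is the mirror image of the lazy init; a minimal sketch of the hook (the exact spot inside commutil.c is an assumption):

```c
/* Sketch only: release the lazily-created CCL state when the MPI
 * communicator is destroyed. */
if (comm_ptr->cclcomm) {
    mpi_errno = MPIR_CCLcomm_free(comm_ptr);
    MPIR_ERR_CHECK(mpi_errno);
}
```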

* CCLcomm functions, currently tied to NCCL
*/

int MPIR_CCLcomm_init(MPIR_Comm * comm, int rank)
Contributor:

To be consistent with the other CCLcomm functions, you might want to move the NCCL-specifics into an MPIR_NCCL_ function.

Contributor:

You can check how we managed PMI wrappers in src/util/mpir_pmi.c and src/util/mpir_pmi[x].inc

Contributor Author:

All of the NCCL-specific code is now separated into its own file and activated by its own macro ENABLE_NCCL
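
A rough sketch of that split, assuming an illustrative MPIR_NCCL_comm_init behind the ENABLE_NCCL guard (not necessarily the PR's exact function names):

```c
/* cclcomm.c: generic CCL entry point; the NCCL-specific implementation lives
 * in its own file and is compiled only when ENABLE_NCCL is defined. */
int MPIR_CCLcomm_init(MPIR_Comm * comm, int rank)
{
#ifdef ENABLE_NCCL
    return MPIR_NCCL_comm_init(comm, rank);
#else
    return MPI_ERR_OTHER;       /* no CCL backend built in */
#endif
}
```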

Comment on lines 9 to 32
#include <nccl.h>

typedef struct MPIR_CCLcomm {
    MPIR_OBJECT_HEADER;
    MPIR_Comm *comm;
    ncclUniqueId id;
    ncclComm_t ncclcomm;
    cudaStream_t stream;
} MPIR_CCLcomm;
Contributor:

Will need the right autoconf/automake defines to only include this stuff when it's available.

Contributor Author:

I agree. I added temporary macros to show where the ifdefs will go in the new files, but I am not sure how to modify the autotools files to properly detect CCLs. I would appreciate your help with this!

MPIR_ERR_NONE);
MPIR_ERR_CHECK(mpi_errno);

ret = cudaSetDevice(0); //TODO: Fix this for multi-GPU nodes
Contributor:

Probably want to move this to coll invocation time and inherit whichever device the user-allocated buffer is located on. We could end up with multiple streams cached on the comm object.
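
One possible shape for that, assuming the CUDA 10+ cudaPointerAttributes fields (sketch only, error handling elided):

```c
/* Sketch: at collective invocation time, pick the device from the user's
 * buffer instead of hard-coding device 0. */
struct cudaPointerAttributes attr;
if (cudaPointerGetAttributes(&attr, recvbuf) == cudaSuccess &&
    attr.type == cudaMemoryTypeDevice) {
    cudaSetDevice(attr.device);
    /* look up or create a per-device stream cached on the comm here */
}
```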

Contributor Author:

Fixed!

@raffenet (Contributor):

I think another question down the road is whether there is enough commonality among the CCL implementations that they can be abstracted in src/mpi/src/gpu behind a single MPIR_GPU_CCL interface that uses whichever GPU configuration is currently built in.
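
If that direction pans out, the caller-facing interface could collapse to something like the following (all names here are hypothetical, purely to illustrate the idea):

```c
/* Hypothetical single entry point that compiles down to whichever CCL backend
 * was configured; callers never see NCCL/RCCL/oneCCL directly. */
int MPIR_GPU_CCL_allreduce(const void *sendbuf, void *recvbuf, MPI_Aint count,
                           MPI_Datatype datatype, MPI_Op op, MPIR_Comm *comm_ptr)
{
#if defined(ENABLE_NCCL)
    return MPIR_NCCL_allreduce(sendbuf, recvbuf, count, datatype, op, comm_ptr);
#elif defined(ENABLE_RCCL)
    return MPIR_RCCL_allreduce(sendbuf, recvbuf, count, datatype, op, comm_ptr);
#else
    return MPI_ERR_OTHER;       /* no CCL backend built in; caller falls back */
#endif
}
```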

@mjwilkins18 (Contributor Author):

@raffenet Thanks for the comments! I will take a look at the individual feedback and make changes.

@mjwilkins18 (Contributor Author):

> Looks interesting! I made some comments. It would be great to also see the integration point of the CCL collectives into the allreduce path for evaluation.

For my testing, I added it to src/mpi/coll/mpir_coll.c, but I am not sure how to properly change an autogenerated file. Guidance here would be appreciated.

@raffenet (Contributor):

> > Looks interesting! I made some comments. It would be great to also see the integration point of the CCL collectives into the allreduce path for evaluation.
>
> For my testing, I added it to src/mpi/coll/mpir_coll.c, but I am not sure how to properly change an autogenerated file. Guidance here would be appreciated.

Ah, right 😄. In the case of mpir_coll.c I think you first want to add a ccl algorithm to coll_algorithms.txt under allreduce-intra. Run python3 maint/gen_coll.py and that will generate the prototype you'll want to implement in your code.
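
For reference, the generated prototype would presumably look something like this (the exact name and argument list are assumptions; the gen_coll.py output is authoritative):

```c
/* Illustrative shape only -- gen_coll.py emits the real prototype. */
int MPIR_Allreduce_intra_ccl(const void *sendbuf, void *recvbuf, MPI_Aint count,
                             MPI_Datatype datatype, MPI_Op op, MPIR_Comm *comm_ptr);
```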

##

mpi_core_sources += \
    src/mpi/ccl/cclcomm.c
Contributor:

Place it under src/util or src/mpid/common

Contributor Author:

What is the logic we should use to decide where to store these new files? Note there are now a few new source files in this directory.


CUDA_ERR_CHECK(ret);
ret = cudaStreamCreate(&(cclcomm->stream));
CUDA_ERR_CHECK(ret);
ret = ncclCommInitRank(&(cclcomm->ncclcomm), comm_size, cclcomm->id, rank);
Contributor:

Does NCCL work with duplicate/shared devices? Does it work across multiple nodes?

Contributor Author:

I know NVIDIA has released some features recently to enable multiple processes to share a GPU, but I think that usage is rare. My expectation is that this feature (and GPU-enabled MPI in general) assumes one process per GPU.

NCCL does work across multiple nodes. It natively supports InfiniBand and a few other transports, and there are plugins for others such as OFI.

@hzhou (Contributor) commented Feb 12, 2025:

> When/where to properly init/free the CCLcomm structure

Lazy initialization should work.

> How to handle operations and datatypes portably (copy-pasting them into a new switch statement feels fragile, and I am not 100% sure I covered all of the relevant datatypes)

After #7264, you only need to deal with internal datatypes, which can be placed into a switch clause.
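
As a rough illustration of the kind of switch this points at (only a few cases shown; the internal datatype handles from #7264 would replace the MPI_ constants, and the helper name is made up):

```c
static int dtype_to_nccl(MPI_Datatype datatype, ncclDataType_t *out)
{
    switch (datatype) {
        case MPI_INT:     *out = ncclInt32;   break;
        case MPI_INT64_T: *out = ncclInt64;   break;
        case MPI_FLOAT:   *out = ncclFloat32; break;
        case MPI_DOUBLE:  *out = ncclFloat64; break;
        default:
            return MPI_ERR_TYPE;   /* unsupported: caller falls back to a CPU algorithm */
    }
    return MPI_SUCCESS;
}
```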

> How to design this extensibly so we can add RCCL, OneCCL, etc. (My intuition here is a base abstract class + derived classes, but this is not OOP.)

Something similar to how we wrap the PMIs (src/util/mpir_pmi.c)

> How to modify src/mpi/coll/mpir_coll.c to consider CCL_Allreduce.

  1. Make a new algorithm that either
    • restricts the inputs (GPU only, builtin datatypes only, etc.) and adds selection logic to the JSON, or
    • is a semi-auto algorithm that dispatches to the CCL algorithm where it makes sense and falls back where it doesn't.
  2. Add a CVAR and JSON entry.

> How to pull in NCCL from the environment and/or a configure argument --with-nccl=

--with-nccl= should work

@mjwilkins18 (Contributor Author) commented Feb 26, 2025:

I have updated this prototype and am re-requesting reviews. The implementation is now much more fleshed out. I separated the CCL framework from the NCCL calls, and I have integrated them into the allreduce path via a new "ccl" algorithm. The main thing missing at this point is the autoconf/automake detection of the NCCL path. If someone can point me in the right direction on this, I would greatly appreciate it!

Regarding:

> 1. Make a new algorithm that either
>    • restricts the inputs (GPU only, builtin datatypes only, etc.) and adds selection logic to the JSON, or
>    • is a semi-auto algorithm that dispatches to the CCL algorithm where it makes sense and falls back where it doesn't.

It is currently implemented as the second option, but long-term I would like to do the first.
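
A rough sketch of how that semi-auto form might read, using the prototype shape from the gen_coll.py discussion above (all helper names are illustrative, not this PR's actual code):

```c
int MPIR_Allreduce_intra_ccl(const void *sendbuf, void *recvbuf, MPI_Aint count,
                             MPI_Datatype datatype, MPI_Op op, MPIR_Comm *comm_ptr)
{
    /* Dispatch to the CCL only where it makes sense: GPU buffers, builtin
     * datatype, supported reduction op. */
    if (MPIR_CCL_check_requirements(sendbuf, recvbuf, datatype, op)) {
        return MPIR_CCL_Allreduce(sendbuf, recvbuf, count, datatype, op, comm_ptr);
    }
    /* Otherwise fall back; the exact fallback call depends on how the "ccl"
     * algorithm is wired into mpir_coll.c, so this is a placeholder. */
    return MPIR_CCL_fallback_allreduce(sendbuf, recvbuf, count, datatype, op, comm_ptr);
}
```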

Let me know what you think!

Mike Wilkins and others added 6 commits March 3, 2025 15:46