
[DO NOT MERGE] Initial NCCL Allreduce Backend Prototype #7298

Draft · wants to merge 6 commits into main from mpich_nccl

Conversation

mjwilkins18 (Contributor):

Pull Request Description

This PR is a proof-of-concept of how we can use NCCL as a backend for MPI collectives. It is missing many necessary features and is not meant to be merged; I am interested in gathering feedback before continuing development. Some open questions/points for discussion I have:

  • When/where to properly init/free the CCLcomm structure
  • How to handle operations and datatypes portably (copy-pasting them into a new switch statement feels fragile, and I am not 100% sure I covered all of the relevant datatypes)
  • How to design this extensibly so we can add RCCL, OneCCL, etc. (My intuition here is a base abstract class + derived classes, but this is not OOP.)
  • How to modify src/mpi/coll/mpir_coll.c to consider CCL_Allreduce.
  • How to pull in NCCL from the environment and/or a configure argument --with-nccl=
  • Function/variable/file names and locations

And anything else you can think of. Let me know what you think!
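
For readers skimming the diff, the core of the idea is simply to hand a qualifying allreduce to NCCL using an ncclComm_t and cudaStream_t cached on the communicator. A minimal sketch of that shape (the helper name and error handling are illustrative, not the PR's actual code):

```c
/* Sketch only: forward an allreduce to NCCL via the CCLcomm cached on the
 * MPIR_Comm (see the MPIR_CCLcomm struct quoted below).  Assumes datatype/op
 * translation to NCCL enums has already happened. */
static int ccl_allreduce_sketch(const void *sendbuf, void *recvbuf, size_t count,
                                ncclDataType_t nccl_dtype, ncclRedOp_t nccl_op,
                                MPIR_CCLcomm *cclcomm)
{
    if (ncclAllReduce(sendbuf, recvbuf, count, nccl_dtype, nccl_op,
                      cclcomm->ncclcomm, cclcomm->stream) != ncclSuccess)
        return MPI_ERR_OTHER;
    /* NCCL only enqueues work on the stream; synchronize so the call keeps
     * MPI's blocking-collective semantics. */
    if (cudaStreamSynchronize(cclcomm->stream) != cudaSuccess)
        return MPI_ERR_OTHER;
    return MPI_SUCCESS;
}
```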

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

mjwilkins18 force-pushed the mpich_nccl branch 2 times, most recently from 4de713b to debe5b4 on February 7, 2025
@raffenet (Contributor) left a comment:

Looks interesting! I made some comments. It would be great to also see the integration point of the CCL collectives into the allreduce path for evaluation.

MPIR_ERR_CHECK(mpi_errno);

if (!comm_ptr->cclcomm) {
    mpi_errno = MPIR_CCLcomm_init(comm_ptr, comm_ptr->rank);
Contributor:

Setup on first use makes sense. The cleanup call to MPIR_CCLcomm_free is missing from this PR at the moment.

Contributor Author:

I added the free call to src/mpi/comm/commutil.c
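
For context, the cleanup is the mirror image of the lazy init; a minimal sketch of the hook (the exact spot inside commutil.c is an assumption):

```c
/* Sketch only: release the lazily-created CCL state when the MPI
 * communicator is destroyed. */
if (comm_ptr->cclcomm) {
    mpi_errno = MPIR_CCLcomm_free(comm_ptr);
    MPIR_ERR_CHECK(mpi_errno);
}
```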

* CCLcomm functions, currently tied to NCCL
*/

int MPIR_CCLcomm_init(MPIR_Comm * comm, int rank)
Contributor:

To be consistent with the other CCLcomm functions, you might want to move the NCCL-specifics into an MPIR_NCCL_ function.

Contributor:

You can check how we managed PMI wrappers in src/util/mpir_pmi.c and src/util/mpir_pmi[x].inc

Contributor Author:

All of the NCCL-specific code is now separated into its own file and activated by its own macro ENABLE_NCCL
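
A rough sketch of that split, assuming an illustrative MPIR_NCCL_comm_init behind the ENABLE_NCCL guard (not necessarily the PR's exact function names):

```c
/* cclcomm.c: generic CCL entry point; the NCCL-specific implementation lives
 * in its own file and is compiled only when ENABLE_NCCL is defined. */
int MPIR_CCLcomm_init(MPIR_Comm * comm, int rank)
{
#ifdef ENABLE_NCCL
    return MPIR_NCCL_comm_init(comm, rank);
#else
    return MPI_ERR_OTHER;       /* no CCL backend built in */
#endif
}
```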

Comment on lines 9 to 32
#include <nccl.h>

typedef struct MPIR_CCLcomm {
    MPIR_OBJECT_HEADER;
    MPIR_Comm *comm;
    ncclUniqueId id;
    ncclComm_t ncclcomm;
    cudaStream_t stream;
} MPIR_CCLcomm;
Contributor:

Will need the right autoconf/automake defines to only include this stuff when it's available.

Contributor Author:

I agree. I added temporary macros to show where the ifdefs will go in the new files, but I am not sure how to modify the autotools files to properly detect CCLs. I would appreciate your help with this!

MPIR_ERR_NONE);
MPIR_ERR_CHECK(mpi_errno);

ret = cudaSetDevice(0); //TODO: Fix this for multi-GPU nodes
Contributor:

Probably want to move this to coll invocation time and inherit whichever device the user-allocated buffer is located on. We could end up with multiple streams cached on the comm object.
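
One possible shape for that, assuming the CUDA 10+ cudaPointerAttributes fields (sketch only, error handling elided):

```c
/* Sketch: at collective invocation time, pick the device from the user's
 * buffer instead of hard-coding device 0. */
struct cudaPointerAttributes attr;
if (cudaPointerGetAttributes(&attr, recvbuf) == cudaSuccess &&
    attr.type == cudaMemoryTypeDevice) {
    cudaSetDevice(attr.device);
    /* look up or create a per-device stream cached on the comm here */
}
```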

Contributor Author:

Fixed!

@raffenet (Contributor):

I think another question down the road is whether there is enough commonality among the CCL implementations that they can be abstracted in src/mpi/src/gpu behind a single MPIR_GPU_CCL interface that uses whichever GPU configuration is currently built in.
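
If that direction pans out, the caller-facing interface could collapse to something like the following (all names here are hypothetical, purely to illustrate the idea):

```c
/* Hypothetical single entry point that compiles down to whichever CCL backend
 * was configured; callers never see NCCL/RCCL/oneCCL directly. */
int MPIR_GPU_CCL_allreduce(const void *sendbuf, void *recvbuf, MPI_Aint count,
                           MPI_Datatype datatype, MPI_Op op, MPIR_Comm *comm_ptr)
{
#if defined(ENABLE_NCCL)
    return MPIR_NCCL_allreduce(sendbuf, recvbuf, count, datatype, op, comm_ptr);
#elif defined(ENABLE_RCCL)
    return MPIR_RCCL_allreduce(sendbuf, recvbuf, count, datatype, op, comm_ptr);
#else
    return MPI_ERR_OTHER;       /* no CCL backend built in; caller falls back */
#endif
}
```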

@mjwilkins18 (Contributor Author):

@raffenet Thanks for the comments! I will take a look at the individual feedback and make changes.

@mjwilkins18 (Contributor Author):

> Looks interesting! I made some comments. It would be great to also see the integration point of the CCL collectives into the allreduce path for evaluation.

For my testing, I added it to src/mpi/coll/mpir_coll.c, but I am not sure how to properly change an autogenerated file. Guidance here would be appreciated.

@raffenet (Contributor):

> > Looks interesting! I made some comments. It would be great to also see the integration point of the CCL collectives into the allreduce path for evaluation.
>
> For my testing, I added it to src/mpi/coll/mpir_coll.c, but I am not sure how to properly change an autogenerated file. Guidance here would be appreciated.

Ah, right 😄. In the case of mpir_coll.c I think you first want to add a ccl algorithm to coll_algorithms.txt under allreduce-intra. Run python3 maint/gen_coll.py and that will generate the prototype you'll want to implement in your code.
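
For reference, the generated prototype would presumably look something like this (the exact name and argument list are assumptions; the gen_coll.py output is authoritative):

```c
/* Illustrative shape only -- gen_coll.py emits the real prototype. */
int MPIR_Allreduce_intra_ccl(const void *sendbuf, void *recvbuf, MPI_Aint count,
                             MPI_Datatype datatype, MPI_Op op, MPIR_Comm *comm_ptr);
```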

##

mpi_core_sources += \
    src/mpi/ccl/cclcomm.c
Contributor:

Place it under src/util or src/mpid/common

Contributor Author:

What is the logic we should use to decide where to store these new files? Note there are now a few new source files in this directory.


CUDA_ERR_CHECK(ret);
ret = cudaStreamCreate(&(cclcomm->stream));
CUDA_ERR_CHECK(ret);
ret = ncclCommInitRank(&(cclcomm->ncclcomm), comm_size, cclcomm->id, rank);
Contributor:

Does NCCL work with duplicate/shared devices? Does it work across multiple nodes?

Contributor Author:

I know NVIDIA has released some features recently to enable multiple processes to share a GPU, but I think that usage is rare. My expectation is that this feature (and GPU-enabled MPI in general) assumes one process per GPU.

NCCL does work across multiple nodes. It natively supports InfiniBand and a few other transports, and there are plugins for others such as OFI.

@hzhou (Contributor) commented Feb 12, 2025:

> When/where to properly init/free the CCLcomm structure

Lazy initialization should work.

> How to handle operations and datatypes portably (copy-pasting them into a new switch statement feels fragile, and I am not 100% sure I covered all of the relevant datatypes)

After #7264, you only need to deal with internal datatypes, which can be placed into a switch clause.
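
As a rough illustration of the kind of switch this points at (only a few cases shown; the internal datatype handles from #7264 would replace the MPI_ constants, and the helper name is made up):

```c
static int dtype_to_nccl(MPI_Datatype datatype, ncclDataType_t *out)
{
    switch (datatype) {
        case MPI_INT:     *out = ncclInt32;   break;
        case MPI_INT64_T: *out = ncclInt64;   break;
        case MPI_FLOAT:   *out = ncclFloat32; break;
        case MPI_DOUBLE:  *out = ncclFloat64; break;
        default:
            return MPI_ERR_TYPE;   /* unsupported: caller falls back to a CPU algorithm */
    }
    return MPI_SUCCESS;
}
```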

> How to design this extensibly so we can add RCCL, OneCCL, etc. (My intuition here is a base abstract class + derived classes, but this is not OOP.)

Something similar to how we wrap the PMIs (src/util/mpir_pmi.c)

> How to modify src/mpi/coll/mpir_coll.c to consider CCL_Allreduce.

  1. Make a new algorithm that either
    • restricts the inputs (GPU only, builtin datatypes only, etc.) and adds selection logic to the JSON, or
    • is a semi-auto algorithm that dispatches to the CCL algorithm where it makes sense and falls back where it doesn't.
  2. Add a CVAR and JSON entry.

> How to pull in NCCL from the environment and/or a configure argument --with-nccl=

--with-nccl= should work

@mjwilkins18 (Contributor Author) commented Feb 26, 2025:

I have updated this prototype and am re-requesting reviews. The implementation is now much more fleshed out. I separated the CCL framework from the NCCL calls, and I have integrated them into the allreduce path via a new "ccl" algorithm. The main thing missing at this point is the autoconf/automake detection of the NCCL path. If someone can point me in the right direction on this, I would greatly appreciate it!

Regarding:

> 1. Make a new algorithm that either
>    • restricts the inputs (GPU only, builtin datatypes only, etc.) and adds selection logic to the JSON, or
>    • is a semi-auto algorithm that dispatches to the CCL algorithm where it makes sense and falls back where it doesn't.

It is currently implemented as the second option, but long-term I would like to do the first.
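
A rough sketch of how that semi-auto form might read, using the prototype shape from the gen_coll.py discussion above (all helper names are illustrative, not this PR's actual code):

```c
int MPIR_Allreduce_intra_ccl(const void *sendbuf, void *recvbuf, MPI_Aint count,
                             MPI_Datatype datatype, MPI_Op op, MPIR_Comm *comm_ptr)
{
    /* Dispatch to the CCL only where it makes sense: GPU buffers, builtin
     * datatype, supported reduction op. */
    if (MPIR_CCL_check_requirements(sendbuf, recvbuf, datatype, op)) {
        return MPIR_CCL_Allreduce(sendbuf, recvbuf, count, datatype, op, comm_ptr);
    }
    /* Otherwise fall back; the exact fallback call depends on how the "ccl"
     * algorithm is wired into mpir_coll.c, so this is a placeholder. */
    return MPIR_CCL_fallback_allreduce(sendbuf, recvbuf, count, datatype, op, comm_ptr);
}
```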

Let me know what you think!

Mike Wilkins and others added 6 commits March 3, 2025 15:46