Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nompi stub #425

Open
wants to merge 57 commits into
base: develop
Choose a base branch
from
Open

Add nompi stub #425

wants to merge 57 commits into from

Conversation

bobpaw
Copy link
Collaborator

@bobpaw bobpaw commented Apr 26, 2024

Add nompi stub

Add a stub implementation of MPI in pcu/pcu_pnompi.c.
Remove direct calls to MPI functions.

This adds a SCOREC_NO_MPI build option for users running on single-process, single-machine setups that do not want to compile with MPI. Message passing is simulated in memory.

@bobpaw
Copy link
Collaborator Author

bobpaw commented Apr 26, 2024

@cwsmith, build is failing for this PR. I think we should add PCU_Init and PCU_Finalize facades that wrap the MPI calls and do a PCU_Comm_Init() and PCU_Comm_Free() to remove the direct dependencies. There are a lot of executables that would need to be changed and this would become a large PR. Alternatively, we could define a replacement MPI_Init and MPI_Finalize in pcu_pnompi and link to those. Or, we could do a hybrid approach and encourage people to use PCU_Init in the future.

@jacobmerson
Copy link
Contributor

@flagdanger and I are working on a very large update of PCU that should make this all easier by getting rid of the singleton. Pull request for that coming within the next week.

@jacobmerson
Copy link
Contributor

@bobpaw sorting out a few things to make the pull request ready to merge, but #388 the draft pull request for replacing the singleton in PCU.

Although we replaced the interface used in the test cases, this pull request is designed to work with no modification to existing codes by using a global PCU state when the old style interface is used with PCU_Comm_Init or it can use local state.

@bobpaw
Copy link
Collaborator Author

bobpaw commented Apr 30, 2024

@jacobmerson, we will plan to use your PR for the future of core, but the sponsor wants this functionality soon, so we will use this branch until yours is finished.

@cwsmith
Copy link
Contributor

cwsmith commented Apr 30, 2024

@bobpaw Does PCU_Comm_Init() check if MPI was initialized?

@bobpaw
Copy link
Collaborator Author

bobpaw commented Apr 30, 2024

@bobpaw Does PCU_Comm_Init() check if MPI was initialized?

No, at the moment it only checks if PCU has been initialized using a global state variable. We could check with MPI_Initialized.

@onkarsahni
Copy link
Contributor

@cwsmith Note Aiden is only doing this when SCOREC_NO_MPI is defined to be true, so there is no actual MPI (e.g., see pcu/pcu_pnompi.c). For now, he is not looking into PCU_init, that will be done after the other PR #388.

@cwsmith
Copy link
Contributor

cwsmith commented Apr 30, 2024

@bobpaw OK. I'm wondering if it will make sense to make the MPI check in that call, and if it fails call MPI_Init. But... handling MPI_Finalize could be a bit more tricky; other libraries may still be running and using MPI beyond the lifetime of PCU. I'd prefer symmetry in the interface so your original suggestion may be best.

@bobpaw bobpaw marked this pull request as ready for review May 6, 2024 15:13
Added pcu_pnompi.c.
Added SCOREC_NO_MPI CMake option.
Removed direct MPI calls in parma, phasta, and PUMI.
Added incomplete reimplementations of those MPI calls.
Add stubs for MPI_Init and MPI_Finalize.
Add pcu_pnompi_types.h for MPI typedefs.
Add status message about SCOREC_NO_MPI flag.

Signed-off-by: Aiden Woodruff <[email protected]>
Replace MPI_Comm_free with PCU_Comm_Free_One.

Signed-off-by: Aiden Woodruff <[email protected]>
This will avoid object collisions when linking to libraries that already use MPI.
- i.e. the intended use case for SCOREC_NO_MPI.

Signed-off-by: Aiden Woodruff <[email protected]>
Changed add_nompi_msg to record receiver.
- Since there's no MPI, sender == receiver but I want to quiet the warnings.

Signed-off-by: Aiden Woodruff <[email protected]>
Add return type for pcu_pmpi_free, pcu_pmpi_split.

Signed-off-by: Aiden Woodruff <[email protected]>
Add linked list traversal to find a matching message.
Removed error message that did not actually indicate errors.
- If no message is found it's perfectly reasonable and in fact correct to
  return 0 and that's how pmpi knows to return false.

Signed-off-by: Aiden Woodruff <[email protected]>
Run smoke tests without mpirun given SCOREC_NO_MPI.
Add SCOREC_NO_MPI to the GitHub Actions test matrix.

Signed-off-by: Aiden Woodruff <[email protected]>
I use clock_gettime, which is one of the many methods mpich uses depending on
the system.

Signed-off-by: Aiden Woodruff <[email protected]>
Also update TriBITS package file.

Signed-off-by: Aiden Woodruff <[email protected]>
- Add mpi_test_depends function to replace set_test_properties. This
  function checks if SCOREC_NO_MPI is defined and assumes nonexistent
  tests were multiproc ones.

Signed-off-by: Aiden Woodruff <[email protected]>
- parma/group/parma_group.cc(runInGroups): remove PCU/MPI free that will
  be carried out by pcu::PCU::~PCU.
- test/xgc_split.cc: remove direct mpi.h inclusion
- test/pumiLoadMesh.cc: remove direct mpi.h inclusion.
- test/modelInfo.cc: remove direct #include <mpi.h>.
- phasta/phstream.cc(::getTime): use new pcu::Time.
- test/fieldReduce.cc: use correct PCUObj.Peers.
- pumi/pumi_sys.cc(pumi_sync): use PCU Barrier instead of MPI Barrier.
- pcu/PCU.h: make PCU_Comm (holdover) functions extern C and PCU_Wtime.
- pumi/pumi_ghost.cc(do_off_part_bridge): fix typo (forgot parens).
- pumi/pumi_mesh.cc: capitalize GetMPIComm.
- pumi/pumi_numbering(pumi_numbering_print): use pcu::PCU::Barrier
  instead of WORLD barrier.
- pcu/PCU_C.h: optionally include mpi or stub.
- pcu/pcu_c.cc: add split stub and include pcu_pmpi.h.
- pcu/pcu_pmpi.c: remove leftover conflict markers.
- pcu/pcu_pnompi.h: update function signatures, make extern C, and
  remove obsolete stubs.
- pcu/pcu_pnompi.c: remove globals and use pcu_mpi_t*.
- clean up formatting.
- remove unused or (void)ed argument names.

Signed-off-by: Aiden Woodruff <[email protected]>
- pcu/PCU.cc(PCU::~PCU): clean up message order if it exists (valgrind
  is happy now).

Signed-off-by: Aiden Woodruff <[email protected]>
- pcu/pcu_pmpi.c: remove pcu_pmpi_switch, pcu_pmpi_comm which were used
  by old nompi stubs.
- (pcu_pmpi_allgather, pcu_pmpi_allreduce): break long lines and clean
  up indentation.

Signed-off-by: Aiden Woodruff <[email protected]>
@bobpaw
Copy link
Collaborator Author

bobpaw commented Feb 19, 2025

I compared the CTest suite resultant meshes in pumi-meshes and build/test/ and they are identical except for build/test/mis_test. However, MIS uses random numbers, so I am figuring they are otherwise identical.

@onkarsahni said:

When srand is used, one should provide option to use a fixed seed (for debugging, testing, regression, etc.) and a different seed (like based on time function), see https://github.com/SCOREC/core/blob/develop/apf/apfMIS.cc#L11.

- example/mpi-nompi/test_coll.cc: add new executable to run the same
  tests as hello world but with assertions.
- example/mpi-nompi/CMakeLists.txt: use new test_coll which allows the
  removal of regexes and no temporal non-determinism.

Signed-off-by: Aiden Woodruff <[email protected]>
Copy link
Contributor

@jacobmerson jacobmerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly focused on the changes in the C++ and CMake code and did not review the C implementation in detail.

I have one major concern that we need to clarify before merging is a more concrete understanding of the semantics of the class's comm ownership. I'm currently against a shared ownership model for this particular case. And, highly against anything that puts the onus of remembering to clean things up on the user. Is there a simple case that can demonstrate why you needed to add the ownscomm logic?

The other more minor concern is that we should get rid of the need for #ifdefs in the user code by providing our own init and finalize routines. That will substantially clean up the executables.

The rest is very minor stuff that should be quick to update.

Otherwise, I think this is very much moving in the right direction. The changes clean up a fair number of places where we have a mix of MPI and PCU use which is bound to cause issues.

- example/mpi-nompi/hello.cc: use std::vector instead of new/delete.
- replace print_array with print_vector.
- example/mpi-nompi/test_coll.cc: use std::vector instead of new/delete.

Signed-off-by: Aiden Woodruff <[email protected]>
@bobpaw
Copy link
Collaborator Author

bobpaw commented Feb 20, 2025

The other more minor concern is that we should get rid of the need for #ifdefs in the user code by providing our own init and finalize routines. That will substantially clean up the executables.

I was considering adding a specialized PCU::PCU(int *argc, char ***argv) which would do MPI_Init and save a flag to run MPI_Finalize later. That specialization would allow users to avoid introducing a scope just for the PCU object because MPI_Finalize will be guaranteed to be called after any pcu_finalize, etc.

Outside of something like that, what is so bad about having #ifdefs? When we discussed, we thought it would help to make it clear what is going on to people looking at tests as examples.

@cwsmith
Copy link
Contributor

cwsmith commented Feb 20, 2025

Are the #ifdef SCOREC_NO_MPI conditionals only used in the examples/drivers?

@bobpaw
Copy link
Collaborator Author

bobpaw commented Feb 20, 2025

Are the #ifdef SCOREC_NO_MPI conditionals only used in the examples/drivers?

@cwsmith, the places with #ifdef SCOREC_NO_MPI are: tests (including "utils"), some exes in phasta/, pcu/pcu_defines.h, and pcu/PCU.h (for pcu::Time and the default constructor).

@jacobmerson
Copy link
Contributor

The thing about ifdefs is it can make the code much more unreadable and the user needs to know under the hood whether they should have MPI or not. If we wrap those ifdefs in our own initialize function then we take care of that subtlety for them.

I was considering adding a specialized PCU::PCU(int *argc, char ***argv) which would do MPI_Init and save a flag to run MPI_Finalize later. That specialization would allow users to avoid introducing a scope just for the PCU object because MPI_Finalize will be guaranteed to be called after any pcu_finalize, etc.

I don't like this solution. The reason we have the PCU object is that we may have many of them lying around with different communicators under the hood, but ultimately still using the same MPI library. I have use cases where this is needed. That constructor means the PCU object may or may not own the MPI initialization. What happens when the owning PCU goes out of scope? All other PCUs that are still alive would break as they would be doing MPI operations after MPI_finalize.

So, there needs to be some library/MPI initialization at a higher level than the PCU object.

@bobpaw
Copy link
Collaborator Author

bobpaw commented Feb 21, 2025

What happens when the owning PCU goes out of scope? All other PCUs that are still alive would break as they would be doing MPI operations after MPI_finalize.

Shouldn't the world-initializing PCU only go out of scope at the end of the main function? Shouldn't all other objects have been destructed by then?

@bobpaw
Copy link
Collaborator Author

bobpaw commented Feb 21, 2025

I am not touching apfCGNS code yet because pcgnslib.h has #include "mpi.h" and therefore CGNS is incompatible with SCOREC_NO_MPI. But I will need to add CMake logic related to the incompatibility and update apfCGNS to not use GetMPIComm.

@bobpaw
Copy link
Collaborator Author

bobpaw commented Feb 21, 2025

We determined that instead of PCU::OwnsComm, we can remove pcu_mpi_t::original_comm entirely and also remove PCU::GetMPIComm and PCU::SwitchMPIComm since those functions only existed because we had to shuffle the MPI_Comm around in the singleton PCU. With the new pcu::PCU C++ object, we do not need such functions.

- pcu/PCU.h: add pcu::PCU_Init and finalize declarations.
- pcu/PCU.cc: add definitions.
- test/*.cc: replace MPI_Init with pcu::PCU_Init and finalize
- phasta/*.cc: replace MPI_Init with PCU_Init.

Signed-off-by: Aiden Woodruff <[email protected]>
- example/mpi-nompi: use static pcu object instead of dynamically allocating.
- test/cap*.cc: use static pcu object instead of dynamic allocation.

Signed-off-by: Aiden Woodruff <[email protected]>
- example/mpi-nompi/hello.cc: update call using PCUObj address.

Signed-off-by: Aiden Woodruff <[email protected]>
- pcu/PCU.h: include <memory> and change PCU::Split signature to
  return std::unique_ptr.
- pcu/PCU.cc: return unique_ptr in PCU::Split.
- parma/group/parma_group.cc: update PCU::Split call.
- phasta/phCook.cc: update to use unique_ptr. remove delete.
- pumi/pumi_mesh.cc: update to use unique_ptr and remove delete.
- test/repartition.cc: update to use unique_ptr.
- test/split.cc: inline getGroupedPCU() and update to use unique_ptr.
- test/ugrid.cc: inline getGroupedPCU() and update to use unique_ptr.
- test/zsplit.cc: inline getGroupedPCU() and update to use unique_ptr.

Signed-off-by: Aiden Woodruff <[email protected]>
- pcu/PCU.h: add PCU::DupComm.
- pcu/PCU.cc: add DupComm definition.
- pcu/pcu_mpi.[ch]: add pcu_mpi layer.
- pcu/pcu_pmpi.[ch]: add pcu_pmpi layer.
- pcu/pcu_pnompi.c: add stub pcu_pmpi_dup implementation.

Signed-off-by: Aiden Woodruff <[email protected]>
@bobpaw
Copy link
Collaborator Author

bobpaw commented Feb 24, 2025

SCOREC_NO_MPI is also likely exclusive with ENABLE_ZOLTAN since zoltan.h has #include <mpi.h>.

@cwsmith
Copy link
Contributor

cwsmith commented Feb 25, 2025

SCOREC_NO_MPI is also likely exclusive with ENABLE_ZOLTAN since zoltan.h has #include <mpi.h>.

Hmmm. Zoltan can be used locally on each process (see the MPI_COMM = SELF) to do what we called 'local part splitting' but that use case then relies on migrating elements to other processes via ... MPI. So, yes, ENABLE_ZOLTAN requires MPI. Good catch.

@bobpaw
Copy link
Collaborator Author

bobpaw commented Feb 25, 2025

For when I update apfZoltanCallbacks.cc,

https://github.com/sandialabs/Zoltan/blob/f6361719dd66cac62db8dbed120704e436a5ee81/src/zz/zz_struct.c#L114-L139

Zoltan duplicates the comm, so I should free my reference after Zoltan create.

- pcu/PCU.h: remove GetMPIComm, GetComm, SwitchMPIComm, SwitchComm,
  OwnsComm, OwnsComm(bool).
- (PCU::Split): change return type to int and set output pointer.
- update documentation.
- pcu/PCU.cc: remove GetMPIComm, GetComm, SwitchMPIComm, SwitchComm,
  OwnsComm.
- (PCU::Split): remove OwnsComm call.
- (PCU::DupComm): return int and set pointer.
- pcu/pcu_mpi.h: remove pcu_mpi_struct original_comm and owned.
- pcu/pcu_pmpi.c (pcu_pmpi_init): do not copy original comm and remove
  self->owned.
- (pcu_pmpi_finalize): remove MPI_Comm_free for owned original_comm.
- (pcu_pmpi_split): split over user_comm.
- pcu/pcu_pnompi.c: update init and finalize.
- pcu/PCU_C.h: remove Switch_Comm and Get_Comm.
- add Comm_Dup and Comm_Split.
- pcu/pcu_c.cc: remove Switch_Comm and Get_Comm. add Comm_Dup with
  pointer to newpcu that releases the unique_ptr and Comm_Split.

Signed-off-by: Aiden Woodruff <[email protected]>
- CMakeLists.txt: add error when SCOREC_NO_MPI and ENABLE_CGNS.
- zoltan/CMakeLists.txt: add error when SCOREC_NO_MPI and ENABLE_ZOLTAN.

Signed-off-by: Aiden Woodruff <[email protected]>
- zoltan/apfZoltanCallbacks.cc: use DupComm and free our duplicate after
  Zoltan_Create.

Signed-off-by: Aiden Woodruff <[email protected]>
- pumi/pumi_mesh.cc (pumi_mesh_load): switch back to original PCU object
  instead of split (fixing error introduced by 46be470).

Signed-off-by: Aiden Woodruff <[email protected]>
@bobpaw
Copy link
Collaborator Author

bobpaw commented Feb 25, 2025

@cwsmith, I'm running into a subtle bug that I'm hoping you can help with. The test case pumi3d-1p is failing intermittently (1/4 or 1/5 of the time) in this new branch. I believe the culprit is a change made in pumi/pumi_mesh.cc, but I cannot figure out how my changes caused this issue. The test is only activated given ENABLE_ZOLTAN, and since this code uses apf::createZoltanSplitter, it is doing the MPI_COMM_SELF splitting like you described. I can additionally share logs of the failing test.

New code:

core/pumi/pumi_mesh.cc

Lines 213 to 241 in 0069474

pMesh pumi_mesh_load(pGeom g, const char* filename, int num_in_part, const char* mesh_type)
{
pcu::PCU *pcu_obj = pumi::instance()->getPCU();
if (strcmp(mesh_type,"mds"))
{
if (!pcu_obj->Self()) std::cout<<"[PUMI ERROR] "<<__func__<<" failed: invalid mesh type "<<mesh_type<<"\n";
return NULL;
}
if (num_in_part==1 && pcu_obj->Peers()>1) // do static partitioning
{
int num_target_part = pcu_obj->Peers()/num_in_part;
bool isMaster = ((pcu_obj->Self() % num_target_part) == 0);
pMesh m = 0;
apf::Migration* plan = 0;
std::unique_ptr<pcu::PCU> split_comm = pcu_obj->Split(
pcu_obj->Self() % num_target_part, pcu_obj->Self() / num_target_part
);
if (isMaster) {
m = apf::loadMdsMesh(g->getGmi(), filename, split_comm.get());
plan = getPlan(m, num_target_part);
}
if (m != nullptr) m->switchPCU(pcu_obj);
pumi::instance()->mesh = apf::repeatMdsMesh(m, g->getGmi(), plan, num_target_part, pcu_obj);
}
else
pumi::instance()->mesh = apf::loadMdsMesh(g->getGmi(), filename, pcu_obj);
pumi_mesh_print(pumi::instance()->mesh);
return pumi::instance()->mesh;
}

Old code:

core/pumi/pumi_mesh.cc

Lines 234 to 260 in a900c47

pMesh pumi_mesh_load(pGeom g, const char* filename, int num_in_part, const char* mesh_type)
{
if (strcmp(mesh_type,"mds"))
{
if (!pumi::instance()->getPCU()->Self()) std::cout<<"[PUMI ERROR] "<<__func__<<" failed: invalid mesh type "<<mesh_type<<"\n";
return NULL;
}
if (num_in_part==1 && pumi::instance()->getPCU()->Peers()>1) // do static partitioning
{
MPI_Comm prevComm = pumi::instance()->getPCU()->GetMPIComm();
int num_target_part = pumi::instance()->getPCU()->Peers()/num_in_part;
bool isMaster = ((pumi::instance()->getPCU()->Self() % num_target_part) == 0);
pMesh m = 0;
apf::Migration* plan = 0;
split_comm(num_target_part, *pumi::instance()->getPCU());
if (isMaster) {
m = apf::loadMdsMesh(g->getGmi(), filename, pumi::instance()->getPCU());
plan = getPlan(m, num_target_part);
}
merge_comm(prevComm, *pumi::instance()->getPCU());
pumi::instance()->mesh = apf::repeatMdsMesh(m, g->getGmi(), plan, num_target_part, pumi::instance()->getPCU());
}
else
pumi::instance()->mesh = apf::loadMdsMesh(g->getGmi(), filename, pumi::instance()->getPCU());
pumi_mesh_print(pumi::instance()->mesh);
return pumi::instance()->mesh;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants