dlopen hijacking ignores rpath
#4001
@yosefe seems like ucm is breaking rpath
When using CUDA, it's required to override dlopen() in order to hook cudaMalloc()/cudaFree() etc. and cache the correct memory type (host/device) for all libraries loaded in the future.
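For readers unfamiliar with the technique, here is a minimal sketch of one common way to interpose dlopen: an LD_PRELOAD shim that forwards to the real symbol via RTLD_NEXT. This is only an illustration of the concept (the shim library name and log message are made up), not necessarily how UCM implements it, but it shows why the object that ends up calling the real dlopen changes.

```c
/* Minimal sketch of dlopen interposition via an LD_PRELOAD shim (glibc).
 * Illustration only -- not UCM's actual mechanism.
 * Build: gcc -shared -fPIC -o libdlhook.so dlhook.c -ldl
 * Run:   LD_PRELOAD=./libdlhook.so ./app */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

void *dlopen(const char *filename, int flags)
{
    /* Look up the next (real) dlopen in the symbol search order. */
    void *(*real_dlopen)(const char *, int) =
        (void *(*)(const char *, int))dlsym(RTLD_NEXT, "dlopen");

    void *handle = real_dlopen(filename, flags);

    /* A memory-hooking library would inspect the newly loaded object here,
     * e.g. to keep cudaMalloc/cudaFree tracked in later-loaded libraries. */
    fprintf(stderr, "hooked dlopen(%s) -> %p\n",
            filename ? filename : "(self)", handle);
    return handle;
}
```

Note that once dlopen is forwarded like this, glibc resolves relative library names using the RPATH/RUNPATH of the object that actually issues the dlopen call (here, the shim) rather than the original caller, which is essentially the behaviour reported in this issue.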
So the problem is that the CUDA libraries don't have an appropriate interface? With the recent NVIDIA acquisition of Mellanox, isn't there a way to work out a better interface here than trying to re-implement a dynamic linker in a message passing library?
I was wondering if one couldn't use (and cache) the result of cudaPointerGetAttributes?
@vchuravy we use cudaPointerGetAttributes when the pointer cache is disabled. We use the cache for better performance, because cudaPointerGetAttributes can have 0.2-0.5 us of overhead.
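For reference, a minimal sketch of the uncached query path, assuming CUDA 10+ (where cudaPointerAttributes has a type field); error handling is trimmed and the memtype helper is a name made up for this example:

```c
/* Sketch: classify a pointer as host/device/managed with no caching. */
#include <cuda_runtime.h>
#include <stdio.h>

static const char *memtype(const void *ptr)
{
    struct cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        cudaGetLastError();          /* clear the sticky error state */
        return "host (not registered)";
    }
    switch (attr.type) {
    case cudaMemoryTypeDevice:  return "device";
    case cudaMemoryTypeManaged: return "managed";
    default:                    return "host";
    }
}

int main(void)
{
    void *d = NULL;
    int   h = 0;
    cudaMalloc(&d, 16);
    printf("device ptr: %s\n", memtype(d));
    printf("host ptr:   %s\n", memtype(&h));
    cudaFree(d);
    return 0;
}
```

Paying this query cost on every message is what the pointer cache is meant to avoid.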
Sure, that makes sense, but you could cache the results of cudaPointerGetAttributes?
@vchuravy It may not be reliable: if cudaFree() happens, the same virtual address could later refer to a different memory type.
@bureddy I think the check in CUDA is also very expensive (a system call?); this is not something we can do in the communication path.
Is there a way to hook into the cache, or another way for programs to provide the device/host info themselves? |
@simonbyrne Even if we introduce some sort of interface that lets us indicate which memory type is used, there is no way to pass this through the MPI interface. Let's assume for a second that you use UCX directly. The allocation of memory can happen in a 3rd-party library (which the language has no control over), and you may not know the origins of the memory.
We are using CUDA-aware OpenMPI and run into the following failure scenario:

Looking at the output of `LD_DEBUG=all`:

When loading MPI:

Without loading MPI:

The loader is attributing the `dlopen` to `libucm.so.0` instead of `libjulia.so.1`. The `RPATH` of `libjulia.so.1` is `$ORIGIN:$ORIGIN/julia`, the second of which is the installation location of `libopenlibm`. Normally the `dlopen` is done through `libjulia` and the `RPATH` is correctly picked up.

We can fix this locally by disabling the memory hooks (as SLURM does): https://github.com/SchedMD/slurm/blob/5fe040f0cca02c8dc92e733e7b10d0067a9fed8a/src/plugins/mpi/pmix/pmixp_dconn_ucx.c#L151-L162

Why does UCX rewrite `dlopen`? That is incredibly invasive.

cc: @simonbyrne
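Besides `LD_DEBUG=all`, the attribution can also be made visible by asking glibc for the search path it associates with each loaded object. The following is a hypothetical, glibc-specific diagnostic (the two library names are taken from the report, and both must already be loaded in the process): `RTLD_DI_SERINFO` returns the directories the linker would search for a `dlopen` of a bare soname issued from that object, which is where the differing `RPATH` shows up.

```c
/* Sketch: print the dlopen search path glibc associates with an object. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <link.h>
#include <stdio.h>
#include <stdlib.h>

static void print_search_path(const char *soname)
{
    /* RTLD_NOLOAD: only succeed if the library is already loaded. */
    void *handle = dlopen(soname, RTLD_LAZY | RTLD_NOLOAD);
    if (handle == NULL) {
        fprintf(stderr, "%s is not loaded: %s\n", soname, dlerror());
        return;
    }

    /* Two-step query: first the required buffer size, then the paths. */
    Dl_serinfo size_info;
    dlinfo(handle, RTLD_DI_SERINFOSIZE, &size_info);

    Dl_serinfo *info = malloc(size_info.dls_size);
    dlinfo(handle, RTLD_DI_SERINFOSIZE, info);
    dlinfo(handle, RTLD_DI_SERINFO, info);

    printf("search path for dlopen() calls attributed to %s:\n", soname);
    for (unsigned i = 0; i < info->dls_cnt; i++)
        printf("  %s\n", info->dls_serpath[i].dls_name);

    free(info);
    dlclose(handle);
}

int main(void)
{
    print_search_path("libjulia.so.1"); /* has RPATH $ORIGIN:$ORIGIN/julia */
    print_search_path("libucm.so.0");   /* lacks that RPATH */
    return 0;
}
```

Comparing the two lists shows why a `dlopen` attributed to `libucm.so.0` cannot find libraries that `libjulia.so.1` would resolve through its own `RPATH`.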