
dlopen hijacking ignores rpath #4001

Closed
vchuravy opened this issue Aug 3, 2019 · 10 comments · Fixed by #4037

vchuravy commented Aug 3, 2019

We are using CUDA-aware OpenMPI and run into the following failure scenario:

julia -e 'ccall((:MPI_Init, :libmpi), Nothing, (Ptr{Cint},Ptr{Cint}), C_NULL, C_NULL); expm1(1.0)'
ERROR: could not load library "libopenlibm"
libopenlibm.so: cannot open shared object file: No such file or directory

Looking at the output of LD_DEBUG=all:
When loading MPI:

     56617:    file=libopenlibm.so [0];  dynamically loaded by /central/software/ucx/1.5.1_cuda-10.0/lib/libucm.so.0 [0]
     56617:    find library=libopenlibm.so [0]; searching
     56617:     search path=/central/software/CUDA/10.0/lib64        (LD_LIBRARY_PATH)
     56617:      trying file=/central/software/CUDA/10.0/lib64/libopenlibm.so
     56617:     search path=/central/software/julia/1.1.0/bin/../lib        (RPATH from file julia)
     56617:      trying file=/central/software/julia/1.1.0/bin/../lib/libopenlibm.so
     56617:     search path=/software/julia/1.1.0//lib:/central/software/OpenMPI/4.0.1_cuda-10.0//lib:/central/software/CUDA/10.0/lib64        (LD_LIBRARY_PATH)
     56617:      trying file=/software/julia/1.1.0//lib/libopenlibm.so
     56617:      trying file=/central/software/OpenMPI/4.0.1_cuda-10.0//lib/libopenlibm.so
     56617:      trying file=/central/software/CUDA/10.0/lib64/libopenlibm.so
     56617:     search cache=/etc/ld.so.cache
     56617:     search path=/lib64/tls:/lib64:/usr/lib64/tls:/usr/lib64        (system search path)
     56617:      trying file=/lib64/tls/libopenlibm.so
     56617:      trying file=/lib64/libopenlibm.so
     56617:      trying file=/usr/lib64/tls/libopenlibm.so
     56617:      trying file=/usr/lib64/libopenlibm.so

Without loading MPI:

     58221:    file=libopenlibm.so [0];  dynamically loaded by /central/software/julia/1.1.0/bin/../lib/libjulia.so.1 [0]
     58221:    find library=libopenlibm.so [0]; searching
     58221:     search path=/central/software/julia/1.1.0/bin/../lib/julia:/central/software/julia/1.1.0/bin/../lib        (RPATH from file /central/software/julia/1.1.0/bin/../lib/libjulia.so.1)
     58221:      trying file=/central/software/julia/1.1.0/bin/../lib/julia/libopenlibm.so
     58221:    
     58221:    file=libopenlibm.so [0];  generating link map
     58221:      dynamic: 0x00002aaad615fd80  base: 0x00002aaad5f32000   size: 0x000000000022f2d0
     58221:        entry: 0x00002aaad5f37290  phdr: 0x00002aaad5f32040  phnum:   

The loader is attributing the dlopen to libucm.so.0 instead of libjulia.so.1.
The RPATH of libjulia.so.1 is $ORIGIN:$ORIGIN/julia, the second of which is the installation location of libopenlibm. Normally the dlopen is done through libjulia, so the RPATH is picked up correctly.
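
For illustration, here is a minimal, hypothetical sketch of a dlopen interposition (e.g. via LD_PRELOAD; UCM installs its hook differently, but the effect on path resolution is the same). It shows why the load request gets attributed to the hooking library, so libjulia's $ORIGIN-based RPATH is never consulted for libopenlibm.so:

/* shim.c: hypothetical dlopen interposer built as its own shared object.
 * A sketch only, not UCX/UCM code. */
#define _GNU_SOURCE
#include <dlfcn.h>

void *dlopen(const char *filename, int flags)
{
    static void *(*real_dlopen)(const char *, int);

    if (real_dlopen == NULL) {
        /* Look up the next (real) dlopen in the symbol lookup chain. */
        real_dlopen = (void *(*)(const char *, int))dlsym(RTLD_NEXT, "dlopen");
    }

    /* ...memory-hook bookkeeping would go here... */

    /* glibc resolves a relative filename against the RPATH/RUNPATH of the
     * object this call comes from, i.e. this shim, not the original caller
     * (libjulia.so.1 with $ORIGIN:$ORIGIN/julia). */
    return real_dlopen(filename, flags);
}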

We can fix this locally by disabling the memory hooks (as SLURM does) https://github.com/SchedMD/slurm/blob/5fe040f0cca02c8dc92e733e7b10d0067a9fed8a/src/plugins/mpi/pmix/pmixp_dconn_ucx.c#L151-L162

Why does UCX rewrite dlopen? That is incredibly invasive.

cc: @simonbyrne


shamisp commented Aug 3, 2019

@yosefe seems like ucm is breaking rpath


yosefe commented Aug 3, 2019

When using CUDA, it's required to override dlopen() to hook cudaMalloc/Free etc. and cache the correct memory type (host/device) for all libraries loaded in the future.
We probably need to make dlopen aware of the rpath of the original binary.
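
As a hedged sketch of what "making dlopen aware of the caller's rpath" could build on (glibc and ELF assumed; print_runpath is a hypothetical helper, not the eventual fix in #4037): an interposed dlopen could read the DT_RPATH/DT_RUNPATH of the object that originally issued the call, e.g. libjulia.so.1, and retry a relative library name against those directories.

/* rpath_lookup.c: hypothetical sketch, glibc-specific (dlinfo/RTLD_DI_LINKMAP). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <link.h>
#include <stdio.h>

/* Print the DT_RPATH/DT_RUNPATH entries of an already-loaded object. */
static void print_runpath(const char *soname)
{
    void *handle = dlopen(soname, RTLD_LAZY | RTLD_NOLOAD);
    if (handle == NULL)
        return;

    struct link_map *map = NULL;
    if (dlinfo(handle, RTLD_DI_LINKMAP, &map) == 0 && map != NULL) {
        const char *strtab = NULL;
        const ElfW(Dyn) *dyn;

        /* First pass: find the dynamic string table (on glibc/Linux the
         * loader has already rewritten this to an absolute address). */
        for (dyn = map->l_ld; dyn->d_tag != DT_NULL; ++dyn)
            if (dyn->d_tag == DT_STRTAB)
                strtab = (const char *)dyn->d_un.d_ptr;

        /* Second pass: RPATH/RUNPATH are offsets into that string table. */
        for (dyn = map->l_ld; strtab != NULL && dyn->d_tag != DT_NULL; ++dyn)
            if (dyn->d_tag == DT_RPATH || dyn->d_tag == DT_RUNPATH)
                printf("%s: %s\n", map->l_name, strtab + dyn->d_un.d_val);
    }

    dlclose(handle);
}

int main(void)
{
    print_runpath("libjulia.so.1");   /* hypothetical target */
    return 0;
}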


Keno commented Aug 3, 2019

So the problem is that the CUDA libraries don't have an appropriate interface? With the recent NVIDIA acquisition of Mellanox, isn't there a way to work out a better interface here than trying to re-implement a dynamic linker in a message passing library?


vchuravy commented Aug 3, 2019

I was wondering if one couldn't use (and cache) the result of cudaPointerGetAttributes.


bureddy commented Aug 4, 2019

@vchuravy we use cudaPointerGetAttributes when the pointer cache is disabled. We use the cache for better performance, because cudaPointerGetAttributes can have 0.2-0.5 us of overhead.


vchuravy commented Aug 4, 2019

Sure, that makes sense, but you could cache the results of cudaPointerGetAttributes instead of hijacking mmap and malloc.
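
For concreteness, a minimal sketch of that suggestion (CUDA runtime API; the names ptr_type_cache and lookup_mem_type are hypothetical, not UCX code). As the next comment points out, cudaFree() can recycle a virtual address with a different memory type, so a real cache would also need invalidation on free/unmap events:

/* ptr_cache.c: hypothetical memoization of cudaPointerGetAttributes results. */
#include <cuda_runtime.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SIZE 4096                 /* toy direct-mapped cache */

typedef struct {
    const void          *ptr;
    enum cudaMemoryType  type;
    int                  valid;
} ptr_type_entry_t;

static ptr_type_entry_t ptr_type_cache[CACHE_SIZE];

static enum cudaMemoryType lookup_mem_type(const void *ptr)
{
    size_t slot = ((uintptr_t)ptr >> 6) % CACHE_SIZE;
    ptr_type_entry_t *e = &ptr_type_cache[slot];

    if (e->valid && e->ptr == ptr)
        return e->type;                 /* cache hit: no CUDA call */

    struct cudaPointerAttributes attr;
    enum cudaMemoryType type = cudaMemoryTypeHost;

    if (cudaPointerGetAttributes(&attr, ptr) == cudaSuccess)
        type = attr.type;               /* CUDA >= 10; older releases expose memoryType */

    e->ptr   = ptr;
    e->type  = type;
    e->valid = 1;
    return type;
}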


bureddy commented Aug 4, 2019

@vchuravy It may not be reliable: if cudaFree() happens, the same virtual address could later be a different memory type.


shamisp commented Aug 5, 2019

@bureddy I think the check in CUDA was also very expensive (a system call?); this is not something we can do in the communication path.

simonbyrne commented:
Is there a way to hook into the cache, or another way for programs to provide the device/host info themselves?


shamisp commented Aug 6, 2019

@simonbyrne Even if we introduce some sort of interface that lets the user indicate which memory type is used, there is no way to pass this through the MPI interface. Let's assume for a second that you use UCX directly. The allocation of memory can happen in a 3rd-party library (which the language has no control over), and you may not know the origins of the memory.
