MPI tests fail on pristine Linux Mint 19 system (64-bit) #216

Closed
ziotom78 opened this issue Sep 28, 2018 · 12 comments · Fixed by #217

Comments

@ziotom78
Contributor

I am experiencing trouble installing MPI on my machine (Linux Mint 19, 64-bit). To isolate the environment, I created a pristine virtual machine running the same system and, after downloading Julia 0.7 from the website, ran the following command:

sudo apt install gfortran cmake openmpi-bin libopenmpi-dev

Installing MPI succeeds, but tests fail:

(v0.7) pkg> test MPI
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Testing MPI
    Status `/tmp/tmpEEJNfv/Manifest.toml`
  [9e28174c] BinDeps v0.8.10
  [34da2185] Compat v1.2.0
  [da04e1cc] MPI v0.7.1
  [30578b45] URIParser v0.4.0
  [2a0f44e3] Base64  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Base64`]
  [ade2ca70] Dates  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Dates`]
  [8bb1440f] DelimitedFiles  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/DelimitedFiles`]
  [8ba89e20] Distributed  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Distributed`]
  [b77e0a4c] InteractiveUtils  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/InteractiveUtils`]
  [76f85450] LibGit2  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/LibGit2`]
  [8f399da3] Libdl  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Libdl`]
  [37e2e46d] LinearAlgebra  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/LinearAlgebra`]
  [56ddb016] Logging  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Logging`]
  [d6f4376e] Markdown  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Markdown`]
  [a63ad114] Mmap  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Mmap`]
  [44cfe95a] Pkg  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Pkg`]
  [de0858da] Printf  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Printf`]
  [3fa0cd96] REPL  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/REPL`]
  [9a3f8284] Random  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Random`]
  [ea8e919c] SHA  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/SHA`]
  [9e88b42a] Serialization  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Serialization`]
  [1a1011a3] SharedArrays  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/SharedArrays`]
  [6462fe0b] Sockets  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Sockets`]
  [2f01184e] SparseArrays  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/SparseArrays`]
  [10745b16] Statistics  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Statistics`]
  [8dfed614] Test  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Test`]
  [cf7118a7] UUIDs  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/UUIDs`]
  [4ec0a83e] Unicode  [`/usr/local/julia-0.7.0/bin/../share/julia/stdlib/v0.7/Unicode`]
Running MPI.jl tests
[tompad:20263] mca_base_component_repository_open: unable to open mca_patcher_overwrite: /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_patcher_overwrite.so: undefined symbol: mca_patcher_base_patch_t_class (ignored)
[tompad:20263] mca_base_component_repository_open: unable to open mca_shmem_posix: /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_shmem_posix.so: undefined symbol: opal_shmem_base_framework (ignored)
[tompad:20263] mca_base_component_repository_open: unable to open mca_shmem_sysv: /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
[tompad:20263] mca_base_component_repository_open: unable to open mca_shmem_mmap: /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[tompad:20263] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------

followed by many similar errors. I have tried to investigate this, but I could not figure out what is causing it. (A student of mine running Ubuntu 18.04 is experiencing the same behaviour.)

@lcw
Member

lcw commented Sep 28, 2018

I am wondering if bb656f8 is causing this issue. Can you try v0.7.0 of MPI.jl?
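
For reference, pinning MPI.jl to that release should be doable from the Pkg REPL; the exact commands below are my assumption, not taken from the thread:

(v0.7) pkg> add MPI@0.7.0
(v0.7) pkg> test MPI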

@barche
Collaborator

barche commented Sep 28, 2018

Yes, I may have concluded too hastily that the RTLD_GLOBAL flag was no longer necessary; I don't have an Ubuntu system to test this on.

@ziotom78
Contributor Author

@lcw, thanks for the suggestion; indeed, MPI.jl 0.7.0 works flawlessly. Interestingly, I also found that installing MPICH instead of OpenMPI makes MPI.jl work:

sudo apt remove openmpi-bin libopenmpi-dev && sudo apt install mpich libmpich-dev
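
After switching MPI implementations, the package generally has to be rebuilt so the new library is picked up. A minimal sketch using the standard Pkg API (these steps are my assumption, not part of the original report):

julia> using Pkg
julia> Pkg.build("MPI")
julia> Pkg.test("MPI")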

@lcw
Member

lcw commented Oct 2, 2018

@ziotom78 Thanks for the update.

@barche, @vchuravy, and others: should we revert bb656f8? Are there other options?

@barche
Collaborator

barche commented Oct 2, 2018

I think it should be sufficient to keep the Libdl.dlopen(libmpi, Libdl.RTLD_LAZY | Libdl.RTLD_GLOBAL) call inside the init function; the eval breaks precompilation of dependent packages if we simply revert it.
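
For illustration, a minimal sketch of that suggestion (my own reading of it, not the actual code merged in #217; the deps.jl file defining libmpi is an assumption about the build step):

module MPI

using Libdl

# Assumed: the build step generates deps/deps.jl, which defines `libmpi`.
include(joinpath(@__DIR__, "..", "deps", "deps.jl"))

function __init__()
    # dlopen libmpi with RTLD_GLOBAL at module load time (not at precompile
    # time), so Open MPI's MCA plugins can resolve symbols like opal_show_help.
    Libdl.dlopen(libmpi, Libdl.RTLD_LAZY | Libdl.RTLD_GLOBAL)
end

end # module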

lcw added a commit to lcw/MPI.jl that referenced this issue Oct 2, 2018
@vchuravy vchuravy reopened this Oct 4, 2018
@vchuravy
Member

vchuravy commented Oct 4, 2018

Still happens with 0.7.2: https://travis-ci.org/JuliaSmoothOptimizers/MUMPS.jl/jobs/436952405

x-ref: JuliaSmoothOptimizers/MUMPS.jl#30
This might still be a MUMPS-specific problem, but who knows.

*** The MPI_Comm_f2c() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[travis-job-a0ef222a-fc2f-4590-8e85-4d82b932a93b:19201] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!

@barche
Collaborator

barche commented Oct 4, 2018

This looks like a different problem from the OP, where a missing-symbol error suggested that the way we dlopen the library might be the culprit. The MPI_INIT error suggests something is wrong with MUMPS precompilation; maybe try it with precompilation off just to verify this? CC @dpo
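
One way to run that check, sketched under the assumption that Julia 0.7's --compiled-modules flag is available and that loading MUMPS directly triggers the error:

julia --compiled-modules=no -e 'using MUMPS'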

@dpo

dpo commented Oct 4, 2018

Turning off precompilation and moving MPI.Init() up didn't help. I get the same error message on the TravisCI VMs. All is well on the CircleCI VMs.

@vchuravy
Member

vchuravy commented Oct 7, 2018

@dpo, have you managed to reproduce this outside of Travis? I would attach gdb to the precompilation process and get a backtrace when MPI_Comm_f2c is called, to see where the rogue call comes from.
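
A possible way to do that, sketched here as an assumption rather than a verified recipe (it requires Open MPI debug symbols to be installed for the breakpoint to be useful):

gdb --args julia --compiled-modules=no -e 'using MUMPS'
(gdb) set breakpoint pending on
(gdb) break MPI_Comm_f2c
(gdb) run
(gdb) backtrace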

@dpo

dpo commented Oct 8, 2018

I have not been able to reproduce the issue. I set up an ArchLinux VM, and all worked well. Another user was able to confirm that all is well on an ArchLinux box, and a third one on an Ubuntu box (though running a newer version of Ubuntu than on TravisCI).

@dpo

dpo commented Oct 8, 2018

I was just able to build and run the PR correctly on an Ubuntu 14.04 VM, so it seems something's up with the setup on TravisCI.

@simonbyrne
Member

Closing this as it seems resolved, and the init code has changed substantially (see #271).
