openblas runs single thread with OMP_PROC_BIND=TRUE or GOMP_CPU_AFFINITY #2238

Closed

puneet336 opened this issue Aug 26, 2019 · 18 comments

@puneet336
puneet336 commented Aug 26, 2019

Hi,
I compiled HPL 2.3 against OpenBLAS 0.3.7 (GCC 9.2 + Open MPI 4.1.0) on CentOS 7. Here is how I was running HPL:


time mpirun -np $_PROCS -N $_PROCS $mpi_options ./appfile.sh

where appfile.sh contains:

#!/bin/bash
# one OpenMP thread per CPU id listed on the first line of mapping.info
export OMP_NUM_THREADS=$(head -n1 mapping.info |cut -d: -f2|sed "s/,/ /g"|wc -w)
export OPENBLAS_NUM_THREADS=4

# look up this rank's line in mapping.info (format: memset:cpu,cpu,...)
_RANK=${OMPI_COMM_WORLD_RANK:=$PMI_RANK}
_FILE_LINE=$(expr ${_RANK} + 1)
_MAPPING=$(sed -n ${_FILE_LINE}p mapping.info)
_GOMP=$(echo $_MAPPING|awk -F: '{print $2}')
_IMPI=$(echo $_MAPPING|awk -F: '{print $2}')
_MEMB=$(echo $_MAPPING|awk -F: '{print $1}')

echo "MESSAGE:rank ${_RANK}($_GOMP) on cpuset=$_IMPI memset=$_MEMB"
# pin the OpenMP threads and the MPI rank to that CPU list
export GOMP_CPU_AFFINITY="$_GOMP"
export I_MPI_PIN_PROCESSOR_LIST=$_IMPI
export OMP_PROC_BIND=TRUE

numactl --physcpubind=$_IMPI --membind=$_MEMB ./xhpl

I noticed that HPL was running on a single thread.

To narrow this issue down, I wrote a simpler test program:

#include <omp.h>
#include "cblas.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    // Beginning of parallel region
    #pragma omp parallel
    {
        printf("omp_get_num_threads = %d openblas_get_num_threads=%d\n",
               omp_get_num_threads(), openblas_get_num_threads());
    }
    // End of parallel region
    return 0;
}
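For anyone reproducing this, a build line along the following lines should work (OPENBLAS_DIR is a placeholder for the install prefix, not a variable from the scripts above):

# OPENBLAS_DIR is a placeholder for the OpenBLAS install prefix
gcc -fopenmp -I$OPENBLAS_DIR/include -o blas_test test.c \
    -L$OPENBLAS_DIR/lib -lopenblas -Wl,-rpath,$OPENBLAS_DIR/lib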

With the aforementioned program, I get the following with mpirun:

...
MESSAGE:rank 21(84,85,86,87) on cpuset=84,85,86,87 memset=5
omp_get_num_threads = 4 openblas_get_num_threads=1
omp_get_num_threads = 4 openblas_get_num_threads=1
omp_get_num_threads = 4 openblas_get_num_threads=1
...

but when I eliminate GOMP_CPU_AFFINITY and OMP_PROC_BIND from my script:

...
MESSAGE:rank 21(84,85,86,87) on cpuset=84,85,86,87 memset=5
omp_get_num_threads = 4 openblas_get_num_threads=4
omp_get_num_threads = 4 openblas_get_num_threads=4
omp_get_num_threads = 4 openblas_get_num_threads=4
omp_get_num_threads = 4 openblas_get_num_threads=4
MESSAGE:rank 26(104,105,106,107) on cpuset=104,105,106,107 memset=6
omp_get_num_threads = 4 openblas_get_num_threads=4
...

Is there a bug in the latest release of OpenBLAS?

Edit:
Tested with 0.3.6; got the same issue. Hope this issue is fixed in 0.3.8.

Note that this issue shows up both with NO_AFFINITY=1 and without.

@brada4
Contributor

brada4 commented Aug 26, 2019

Your first command line does not show how OpenBLAS was compiled.
Probably GOMP sets affinity when the BLAS function is called, and all OpenBLAS threads then continue with the same affinity mask, i.e. a single core.
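One quick way to check this hypothesis is a small standalone program (a hypothetical helper, not from this issue) that prints the affinity mask each OpenMP thread actually runs with on Linux:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        cpu_set_t mask;
        /* 0 = calling thread; fills mask with the CPUs it may run on */
        sched_getaffinity(0, sizeof(mask), &mask);
        #pragma omp critical
        {
            printf("thread %d may run on CPUs:", omp_get_thread_num());
            for (int c = 0; c < CPU_SETSIZE; c++)
                if (CPU_ISSET(c, &mask)) printf(" %d", c);
            printf("\n");
        }
    }
    return 0;
}

If every thread reports the same single CPU, the mask inheritance described above is the culprit.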

@puneet336
Author

puneet336 commented Aug 26, 2019

Yes, the first command shows how I am "running" a script via mpirun ("Here is how I was running HPL").
It seems SanazGeibi faced this issue (Mar 14): #2052

Edit 1:
(As suggested in that issue) with OMP_PLACES=sockets, I am able to see multiple threads per process.

Edit 2:
With OMP_PLACES, I am unable to pin threads to the desired core IDs. For example, I need to place processes on every fourth core, but OMP_PLACES=sockets does not allow me to do so (screenshot of top below, taken before the process spawns threads).

[screenshot: top output]
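For what it's worth, OpenMP's explicit place-list syntax should in principle allow pinning to every fourth core without falling back to sockets granularity; a sketch (core numbers are illustrative, not taken from the mapping file above):

# hypothetical example: four places at cores 0, 4, 8 and 12,
# using the replication syntax {start}:count:stride
export OMP_PLACES='{0}:4:4'
export OMP_PROC_BIND=close
export OMP_NUM_THREADS=4
./blas_test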

@brada4
Contributor

brada4 commented Aug 26, 2019

Caches are shared between multiple cores; there is no technical merit in this effort. OS schedulers have done better than 1:1 pinning ever since the advent of hyperthreading.

@puneet336
Author

puneet336 commented Aug 27, 2019

Hi,
I tested BLIS and got the expected behavior (screenshot attached).

The "OS scheduling policy" seems to work as expected in the case of MKL and BLIS; I am not sure why pinning (GOMP_CPU_AFFINITY/OMP_PROC_BIND) doesn't work with OpenBLAS.

I will keep a tab on upcoming releases of OpenBLAS to see if they solve this issue.

[screenshot: top output with BLIS]

@brada4
Contributor

brada4 commented Aug 27, 2019

You still did not share your OpenBLAS build options. For the pthread version you need to set the number of threads to be used in BLAS calls yourself; the OpenMP version collapses to single-threaded inside an OMP parallel section.
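For illustration, with a pthread (non-OpenMP) build the thread count has to come from OpenBLAS itself, e.g. (assuming the standard OPENBLAS_NUM_THREADS variable and openblas_set_num_threads API):

# pthread build: OpenMP settings are ignored, so set the BLAS thread count directly
export OPENBLAS_NUM_THREADS=4
# or, from code: openblas_set_num_threads(4);
./blas_test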

@puneet336
Author

puneet336 commented Aug 27, 2019

Here are the command lines I used for compiling OpenBLAS 0.3.7 (I have tried linking both versions with HPL):

# build 1: -O3, with NO_AFFINITY=1
OPT_FLAGS="-W -Wall -fopenmp -O3 -ftree-vectorize -march=znver2 -mtune=znver2 -mfma -mavx2 -mavx"
export COMMON_OPT="$OPT_FLAGS"
make CC="$(which gcc)" FC="$(which gfortran)" CFLAGS="$OPT_FLAGS" FFLAGS="$OPT_FLAGS" TARGET=ZEN USE_OPENMP=1 NO_AFFINITY=1 -j 64 2>&1|tee make_logs_$(hostname)
make PREFIX=/home/puneet/MySoftwares/LIBRARY/OpenBLAS/0.3.7_gcc9.2.0v03 install 2>&1|tee make_install_$(hostname)

# build 2: -O2, without NO_AFFINITY
export OPT_FLAGS="-W -Wall -fopenmp -O2 -ftree-vectorize -march=znver2 -mtune=znver2 -mfma -mavx2 -mavx"
export COMMON_OPT="$OPT_FLAGS"
make CC="$(which gcc)" FC="$(which gfortran)" CFLAGS="$OPT_FLAGS" FFLAGS="$OPT_FLAGS" TARGET=ZEN USE_OPENMP=1 -j 64 2>&1|tee make_logs_$(hostname)
make PREFIX=/home/puneet/MySoftwares/LIBRARY/OpenBLAS/0.3.7_gcc9.2.0v01 install 2>&1|tee make_install_$(hostname)

Please let me know if there are issues with the compilation method.

@martin-frbg
Collaborator

No issues, except that you probably do not want NO_AFFINITY defined in this context. This looks to be a duplicate of #2052, as you already noted; just curious why you are supplying your own CFLAGS?
As MKL and BLIS appear to do the right thing already, there is probably no need to check the core assignment by setting the environment variable OMP_DISPLAY_ENV as suggested in http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-affinity.html

@brada4
Contributor

brada4 commented Aug 27, 2019

Actually, it is quite dangerous to tinker with FFLAGS: when they are overridden, the LAPACK part is no longer thread-safe.
https://github.com/xianyi/OpenBLAS/pull/1857/files

@jussienko
Contributor

jussienko commented Jul 6, 2020

There is a similar issue: OpenBLAS runs single-threaded with the OpenMP affinity setting OMP_PLACES=threads. Looking at the code in the develop branch, the reason seems to be the following:

During initialization, gotoblas_init calls get_num_procs, which on Linux uses sched_getaffinity to determine the maximum number of logical CPUs available and stores the result in blas_cpu_number. If the OpenMP runtime has set an affinity mask (i.e. via the OMP_PLACES or OMP_PROC_BIND variables), blas_cpu_number will be smaller than OMP_NUM_THREADS; in the case of OMP_PLACES=threads, the master thread is bound to a single hardware thread and blas_cpu_number=1.

Later on, OpenBLAS in principle tries to use the number of threads specified in OMP_NUM_THREADS by calling omp_get_max_threads in the function num_cpu_avail in the file common_thread.h:

static __inline int num_cpu_avail(int level) {

#ifdef USE_OPENMP
        int openmp_nthreads=0;
#endif

  if (blas_cpu_number == 1

#ifdef USE_OPENMP
      || omp_in_parallel()
#endif
      ) return 1;

#ifdef USE_OPENMP
  openmp_nthreads=omp_get_max_threads();
  if (blas_cpu_number != openmp_nthreads) {
          goto_set_num_threads(openmp_nthreads);
  }
#endif

  return blas_cpu_number;
}

However, due to the if (blas_cpu_number == 1 ...) statement, the function returns early and ignores OMP_NUM_THREADS when OMP_PLACES=threads.

The issue could be resolved by not using the affinity mask during gotoblas_init, or by obeying OMP_NUM_THREADS in num_cpu_avail also when blas_cpu_number == 1. However, I am not sure whether these might cause side effects.
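For illustration, a minimal sketch of the second option (this is not the actual patch in PR #2706; it assumes the OpenBLAS internals quoted above):

static __inline int num_cpu_avail(int level) {

#ifdef USE_OPENMP
  int openmp_nthreads = 0;

  /* inside a parallel region, BLAS calls must stay single-threaded */
  if (omp_in_parallel()) return 1;

  /* trust the OpenMP runtime (which honours OMP_NUM_THREADS) rather than
     the affinity-derived blas_cpu_number, so an init-time value of 1 no
     longer short-circuits the thread count */
  openmp_nthreads = omp_get_max_threads();
  if (blas_cpu_number != openmp_nthreads)
    goto_set_num_threads(openmp_nthreads);
#else
  if (blas_cpu_number == 1) return 1;
#endif

  return blas_cpu_number;
}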

@martin-frbg
Collaborator

@jussienko Thanks, this is certainly worth investigating. From #2380 (comment) it is also possible that the code hits a design flaw in libgomp.

@jussienko
Contributor

When running OpenBLAS under a debugger, one sees (on my system, which supports hyperthreading) that with OMP_PLACES=cores, during initialization blas_cpu_number==2 (as the affinity mask set by the OMP runtime allows both hyperthreads) and num_cpu_avail uses the correct omp_get_max_threads value, while OMP_PLACES=threads produces the problem described above.

I just made a pull request (#2706) which contains one possible fix.

@martin-frbg
Collaborator

Hm. If the actual problem is "only" with OMP_PLACES=threads, I'd rather test for that case (i.e. add another getenv)? (I already prepared that last night, just did not have a chance to test it.)

@jussienko
Contributor

I guess OMP_PLACES=threads is the most common, but in principle any setting that limits the master thread to a single hyperthread reproduces the issue. For example, on my four-core workstation (0-3 are the first hyperthreads within the physical cores), all of the following settings will make OpenBLAS run with only a single thread:

  • OMP_PLACES='{0},{1},{2},{3}' ./blas_test
  • OMP_PLACES=cores taskset -c 0,1,2,3 ./blas_test
  • OMP_PLACES=cores numactl -C 0-3 ./blas_test

@martin-frbg
Collaborator

You are right, my half-baked attempt is not going to work then.

@jussienko
Contributor

Also, on systems without hyperthreading, just OMP_PLACES=cores produces the issue (I just ran a short test to confirm that).

@martin-frbg
Collaborator

Thanks for doing these tests. As far as I can tell, the relevant code in common_thread.h is essentially unchanged from the libGoto2 of ten years ago, so it saw OpenMP 3.1 (which introduced OMP_PROC_BIND, IIRC) at best.

@jussienko
Contributor

Actually, the master branch works OK (I tested that first, before realizing that it is pretty old and develop is the correct one), but the reason is that there get_num_procs (used in gotoblas_init) uses only sysconf(_SC_NPROCESSORS_CONF) to set blas_cpu_number, not sched_getaffinity.
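A small sketch of the difference (it mirrors the two approaches; it is not the OpenBLAS code itself):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;
    sched_getaffinity(0, sizeof(mask), &mask);
    /* master: all configured CPUs, any affinity mask is ignored */
    printf("sysconf(_SC_NPROCESSORS_CONF) = %ld\n",
           sysconf(_SC_NPROCESSORS_CONF));
    /* develop: only CPUs in the current mask; becomes 1 under OMP_PLACES=threads */
    printf("CPU_COUNT(sched_getaffinity)  = %d\n", CPU_COUNT(&mask));
    return 0;
}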

@martin-frbg
Collaborator

Err, well, I certainly fumbled with that after 0.2.20. So perhaps sched_getaffinity should not be used in an OpenMP context, at least not when OMP_PLACES or any other "modern" form of OpenMP-based affinity handling is in effect.

jussienko added a commit to CSCfi/csc-user-guide that referenced this issue Aug 26, 2020
attesillanpaa pushed a commit to CSCfi/csc-user-guide that referenced this issue Aug 26, 2020

* Due to bug in OpenBLAS remove thread binding from defaults

The bug OpenMathLib/OpenBLAS#2238 is fixed in the development version

* Add example with thread binding (with OpenBLAS caveat)