Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runtime error when using theta-l_kokkos dycore with DEBUG mode turned on #15

Closed
sjsprecious opened this issue Feb 1, 2025 · 22 comments · Fixed by #16
Closed

Runtime error when using theta-l_kokkos dycore with DEBUG mode turned on #15

sjsprecious opened this issue Feb 1, 2025 · 22 comments · Fixed by #16
Assignees
Labels
bug Something isn't working

Comments

@sjsprecious
Copy link
Collaborator

What happened?

@jtruesdal reported that when using the theta-l_kokkos dycore with debug mode turned on, we could build the code successfully on Derecho but encountered a runtime error as shown below:

[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 0: Serial Runtime Configuration:
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 0: HOMMEXX VECTOR_SIZE: 8
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 0: HOMMEXX vector tag: SIMD
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 0: HOMMEXX active AVX set: - AVX2 - AVX
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 0: HOMMEXX MPI_ON_DEVICE: 1
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 0: HOMMEXX CUDA_SHARE_BUFFER: off
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 0: HOMMEXX CUDA_(MIN/MAX)_WARP_PER_TEAM: 1 / 1
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 0: HOMMEXX doesn't have vector pragmas
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 0: HOMMEXX provided best default values in Config.hpp
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 0: compose> nelemd 12 qsize 12 hv_q 1 hv_subcycle_q 3 lim 9 independent_time_steps 1
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: forrtl: error (65): floating invalid
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: Image              PC                Routine            Line        Source             
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: libpthread-2.31.s  000014AF63DC78C0  Unknown               Unknown  Unknown
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           0000000001F012DF  Unknown               Unknown  Unknown
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           00000000011EA128  prim_driver_mod_m         266  prim_driver_mod.F90
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           00000000011EA7CB  prim_driver_mod_m         336  prim_driver_mod.F90
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           00000000011E2101  prim_driver_mod_m          60  prim_driver_mod.F90
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           0000000000B56DE7  dyn_comp_mp_dyn_i         842  dyn_comp.F90
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           000000000088EABB  cam_comp_mp_cam_i         184  cam_comp.F90
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           000000000084CD34  atm_comp_mct_mp_a         249  atm_comp_mct.F90
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           0000000000485222  component_mod_mp_         257  component_mod.F90
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           0000000000433607  cime_comp_mod_mp_        1415  cime_comp_mod.F90
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           000000000047C23E  MAIN__                    122  cime_driver.F90
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           0000000000420E2D  Unknown               Unknown  Unknown
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: [libc-2.31.so](http://libc-2.31.so/)       000014AF604B829D  __libc_start_main     Unknown  Unknown
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 91: cesm.exe           0000000000420D5A  Unknown               Unknown  Unknown
[dec1981.hsn.de.hpc.ucar.edu](http://dec1981.hsn.de.hpc.ucar.edu/) 85: forrtl: error (65): floating invalid

The error is back traced to the following function call in the prim_driver_mod.F90 code:

    elem_state_v_ptr         = c_loc(elem_state_v)
    elem_state_w_i_ptr       = c_loc(elem_state_w_i)
    elem_state_vtheta_dp_ptr = c_loc(elem_state_vtheta_dp)
    elem_state_phinh_i_ptr   = c_loc(elem_state_phinh_i)
    elem_state_dp3d_ptr      = c_loc(elem_state_dp3d)
    elem_state_Qdp_ptr       = c_loc(elem_state_Qdp)
    elem_state_ps_v_ptr      = c_loc(elem_state_ps_v)
    call init_elements_states_c (elem_state_v_ptr, elem_state_w_i_ptr, elem_state_vtheta_dp_ptr,   &
                                 elem_state_phinh_i_ptr, elem_state_dp3d_ptr, elem_state_ps_v_ptr, &
                                 elem_state_Qdp_ptr)

What are the steps to reproduce the bug?

See the Wiki page and set CAM_TARGET=theta-l_kokkos and DEBUG=TRUE.

What CAM tag were you using?

stormspeed branch

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

No response

Will you be addressing this bug yourself?

Yes

Extra info

No response

@sjsprecious sjsprecious added the bug Something isn't working label Feb 1, 2025
@sjsprecious sjsprecious self-assigned this Feb 1, 2025
@mt5555
Copy link

mt5555 commented Feb 3, 2025

you are probably already aware, but just in case: note that the theta-l-kokkos dycore in E3SM was never hooked up to the fortran interface - it's only used by the standalone HOMME or the EAMxx C++ atmosphere model. I'm not sure what's involved getting it to be called from the dp_coupling layer fortran interface - hopefully just calls to copy the state variables into the kokkos struct and copy out at the end of the dycore step?

@sjsprecious
Copy link
Collaborator Author

Thanks Mark for the additional information. This error is caught by the -fpe0 compiler option of the Intel compiler (only enabled when DEBUG mode is turned on), which means there is a floating-point exception such as division by zero, overflow and NaN values. @jtruesdal suspected that some arrays were not initialized properly.

I further tracked down the error to this Kokkos function and I guess that either qdp or dp is pointing to an invalid value.

Looking into the C++ code further, I found that when initializing qdp, two dimension names are referred to at different places (NUM_LEV and NUM_PHYSICAL_LEV). These two dimensions are not the same if HOMMEXX_VECTOR_SIZE does not equal to 1 (https://github.com/sjsprecious/E3SM/blob/master/components/homme/src/share/cxx/Dimensions.hpp#L46-L48). Is this intended as a Kokkos feature or is this a typo?

I could confirm that the runtime error is gone by setting HOMMEXX_VECTOR_SIZE=1 manually but I do not know why it works. Could you please let me know what the variable HOMMEXX_VECTOR_SIZE is used for?

@mt5555
Copy link

mt5555 commented Feb 3, 2025

for theta-l-kokkos, HOMME_VECTOR_SIZE=1 for GPUs, and 8 for CPUs ( its the loop blocking size, to enable CPU vectorization).

@sjsprecious
Copy link
Collaborator Author

Thanks @mt5555 . When HOMMEXX_VECTOR_SIZE does not equal to 1, NUM_PHYSICAL_LEVEL and NUM_LEV are then different. Both dimensions are referred to by the qdp variable (https://github.com/NCAR/StormSPEED/blob/stormspeed/src/dynamics/se/share/cxx/Tracers.hpp#L37 and https://github.com/NCAR/StormSPEED/blob/stormspeed/src/dynamics/se/share/cxx/Tracers.cpp#L60), which seems suspicious to me. Shall we use NUM_PHYSICAL_LEVEL or NUM_LEV for qdp consistently, or is this implementation correct as part of the Kokkos features?

@sjsprecious
Copy link
Collaborator Author

Hi @mt5555 , hopefully I am getting closer to the root cause here. The runtime error is gone when I set HOMMEXX_VECTOR_SIZE to 1 or 2, but is triggered when I set it to 4 or 8.

The FKESSLER test case uses PLEV=30, which can be divided evenly by 1 or 2, but not 4 or 8. Do you know whether there is a requirement that the vertical levels must be divided evenly by the loop blocking size? Thanks.

@mt5555
Copy link

mt5555 commented Feb 3, 2025

I'm not that familiar with this part of the code, but I believe NUM_PHYSICAL_LEVEL is what it says - the number of physical levels, while NUM_LEV probably includes the padding (so that the array is divisible by the vector length). running with a vector size of 1 is fine - it's just means loops wont vectorize on CPUs.

@sjsprecious
Copy link
Collaborator Author

Thanks @mt5555 . I guess we still want to vectorize the loops on the CPUs if possible for performance purpose.

Based on the following form:

NUM_LEV = (NUM_PHYSICAL_LEV + VECTOR_SIZE - 1) / VECTOR_SIZE

If NUM_PHYSICAL_LEV =30 and VECTOR_SIZE=1, we have NUM_LEV=30; if VECTOR_SIZE=2, we have NUM_LEV=15; if VECTOR_SIZE=8, we have NUM_LEV=4.

Thus NUM_LEV seems to be the dimension of the subarray where number of physical levels has been divided by the vector size + padding.

I suspect that padding is somehow not handled correctly here but I am not familiar with Kokkos. Do you know whether there is a person I could ask for this specific question? Thanks.

@mt5555
Copy link

mt5555 commented Feb 3, 2025

so for loops which are vectorized, looks like NUM_LEV is the size of the outerloop, and then the innner loop with be of size VECTOR_SIZE (and will be done with a AVX512 or similar type instruction). This would imply that all the kokkos arrays have two indicies for the vertical direction, (NUM_LEV,VECTOR_SIZE).

@bartgol might be able to quickly confirm.

@sjsprecious
Copy link
Collaborator Author

Thanks @mt5555 for the details. In this case, does Kokkos initialize the elements in the padding to zero by default? If so, this then explains the floating point exception for this function since dp is in the denominator. Or does Kokkos skip the elements in the padding automatically?

@bartgol
Copy link

bartgol commented Feb 3, 2025

Thanks @mt5555 . When HOMMEXX_VECTOR_SIZE does not equal to 1, NUM_PHYSICAL_LEVEL and NUM_LEV are then different. Both dimensions are referred to by the qdp variable (stormspeed/src/dynamics/se/share/cxx/Tracers.hpp#L37 and stormspeed/src/dynamics/se/share/cxx/Tracers.cpp#L60), which seems suspicious to me. Shall we use NUM_PHYSICAL_LEVEL or NUM_LEV for qdp consistently, or is this implementation correct as part of the Kokkos features?

The reason you see both NUM_LEV and NUM_PHYSICAL_LEV is b/c the views with the latter dim is simply wrapping a (host) pointer of the f90 data. Since f90 arrays have the physical dim, it must use that one (with raw Real data type, and not the Scalar simd-like type). Then, the sync_to_device utility copies the entries from the (host) view of the f90 data into the corresponding entry of the (device) simd-like view. The padding entries are not touched, and may contain garbage.

And in E3SM, the EAMxx model runs with VECTOR_SIZE=16, and we do run tests with 72 levels, which makes the padding non-trivial. So it's not necessary that VECTOR_SIZE divides NUM_PHYSICAL_LEV.

I think the issue stems from FPE's being enabled. When running with VECTOR_SIZE>1, we cannot allow FPE to throw, since padding virtually always leads to NaN entries. E.g., the pow function used to compute exner will throw if the basis is negative. To avoid this, EAMxx disables FPEs when control passes from the comp coupler to eamxx. While this may seem underwhelming (you would like the code to halt if NaN/Inf are encountered), the code will likely crash soon anyways: the dycore does check that the state is valid, and since NaN are likely to get assert-like checks to fail, you will likely have a crash very soon.

Disabling FPEs is the compromise you have to pay if you want to use simd-like structures that may be padded and contain garbage at the end.

Edit: to disable FPEs, you need to do something like (for gnu)

  feclearexcept(0); // clear excepts if already set
  feenableexcept(0); // disable ALL FE exception from now on

In EAMxx we store the current FPE mask before disabling exceptions, so that we can re-enable the same mask before returning control to the coupler. Something like

void my_func (...) {
 auto mask = fegetexcept();
 feclearexcept(0);
 feenableexcept(0);
 {
   // RUN CODE HERE
 }
 feenableexcept(mask);
}

@sjsprecious
Copy link
Collaborator Author

Thanks Luca for your detailed explanations. They help a lot!

Yes, this error is only triggered when FPE is enabled and what you have described makes sense to me. Do I understand it correctly that when disabling FPE, if the code is correct, having NaN values in the padding does not matter because they will be discarded or not used anyway? And if the code is wrong, it will crash due to other assert checks even without FPE?

Regarding the example you provided about the FPE mask, sorry that I do not know anything about Kokkos so I want to make sure I understand it correctly. Say that I am going to disable the exceptions for this Kokkos function, should I do something like below:

 ...some code...

 feclearexcept(0);
 feenableexcept(0);

  Kokkos::parallel_for(Kokkos::RangePolicy<ExecSpace>(0,size),
                       KOKKOS_LAMBDA(const int idx) {
    const int ie  =  idx / (qsize*NP*NP*NUM_LEV);
    const int iq  = (idx / (NP*NP*NUM_LEV)) % qsize;
    const int igp = (idx / (NP*NUM_LEV)) % NP;
    const int jgp = (idx / NUM_LEV) % NP;
    const int k   =  idx % NUM_LEV;

    q (ie,iq,igp,jgp,k) = qdp (ie,n0_qdp,iq,igp,jgp,k) / dp(ie,n0,igp,jgp,k);
  });

 feenableexcept(mask);

 ...rest of the code...

And is there a specific compiler flag I should use to enable it? Thanks.

@bartgol
Copy link

bartgol commented Feb 3, 2025

Do I understand it correctly that when disabling FPE, if the code is correct, having NaN values in the padding does not matter because they will be discarded or not used anyway? And if the code is wrong, it will crash due to other assert checks even without FPE?

Yes, that's correct. The values in the padding, do not matter for the simulation. And if you develop a true FPE in the part before the padding, the odds are the equation of state will crap out at some point (assuming the FPE's are in a state variable, or a variable indirectly affecting it).

We do test for FPEs in our code, but we explicitly set VECTOR_SIZE=1 for those tests, so that we don't get "spurious" fpes...

Regarding the example you provided about the FPE mask, sorry that I do not know anything about Kokkos so I want to make sure I understand it correctly. Say that I am going to disable the exceptions for this Kokkos function, should I do something like below:

 ...some code...

 feclearexcept(0);
 feenableexcept(0);

  Kokkos::parallel_for(Kokkos::RangePolicy<ExecSpace>(0,size),
                       KOKKOS_LAMBDA(const int idx) {
    const int ie  =  idx / (qsize*NP*NP*NUM_LEV);
    const int iq  = (idx / (NP*NP*NUM_LEV)) % qsize;
    const int igp = (idx / (NP*NUM_LEV)) % NP;
    const int jgp = (idx / NUM_LEV) % NP;
    const int k   =  idx % NUM_LEV;

    q (ie,iq,igp,jgp,k) = qdp (ie,n0_qdp,iq,igp,jgp,k) / dp(ie,n0,igp,jgp,k);
  });

 feenableexcept(mask);

 ...rest of the code...

And is there a specific compiler flag I should use to enable it? Thanks.
In this example, mask is unset. You should retrieve it from the fenv environment before clearing it, via auto mask = fegetexcept();. Other than that, this should work.

The feenableexcept stuff is from the fenv.h header, which should be part of the c++ standard library. It should be supported by any POSIX-compliant system. You can find more details e.g. here. You should not need any flag to use it, that I know of. But I do know we've had some issues on iOS systems.

@sjsprecious
Copy link
Collaborator Author

Thanks @bartgol so much for your detailed explanation. I could confirm that I can use the fenv.h stuff to disable the FPE explicitly for an invalid operation like 1.0 / 0.0. However, when I tried the same approach for the Kokkos function I mentioned above, the FPE was still triggered somehow. Do you have an example of how E3SM/EAMxx skips the FPE check for the Kokkos functions using the fenv.h stuff?

Also thanks for sharing how E3SM tests FPE and I think it is better to set VECTOR_SIZE=1 in CAM as well to avoid spurious FPE as you suggested.

@bartgol
Copy link

bartgol commented Feb 4, 2025

Thanks @bartgol so much for your detailed explanation. I could confirm that I can use the fenv.h stuff to disable the FPE explicitly for an invalid operation like 1.0 / 0.0. However, when I tried the same approach for the Kokkos function I mentioned above, the FPE was still triggered somehow. Do you have an example of how E3SM/EAMxx skips the FPE check for the Kokkos functions using the fenv.h stuff?

What is your kokkos backend? If you are running on a GPU, that may be the prolem. I'm not sure setting the fenv mask on host will affect the device handling of FPEs. However, on GPU backends, we always set VECTOR_SIZE=1, for other reasons, so there isn't really an FPE issue on GPU.

If, on the other hand, you are running with a CPU backend, then I'm puzzled.

Also thanks for sharing how E3SM tests FPE and I think it is better to set VECTOR_SIZE=1 in CAM as well to avoid spurious FPE as you suggested.

No problem!

@sjsprecious
Copy link
Collaborator Author

Hi @bartgol , thanks for your quick reply. I am using the CPU backend.

Here is a simple Kokkos example that may explain what I am doing better.

#include <iostream>
#include <cfenv>
#include <Kokkos_Core.hpp>

int main() {

    Kokkos::initialize();

    Kokkos::View<double*> numerator("numerator", 10);
    Kokkos::View<double*> denominator("denominator", 10);
    Kokkos::View<double*> result("result", 10);

    Kokkos::parallel_for("InitializeArrays", Kokkos::RangePolicy<>(0, 10), KOKKOS_LAMBDA(const int& i) {
        numerator(i) = 1.0;
        denominator(i) = 0.0;
    });

    feenableexcept(FE_ALL_EXCEPT);
    auto mask = fegetexcept();
    feclearexcept(0);
    feenableexcept(0);

    Kokkos::parallel_for("DivideByZeroCheck", Kokkos::RangePolicy<>(0, 10), KOKKOS_LAMBDA(const int& i) {
        result(i) = numerator(i) / denominator(i);
    });
    
    // Restore the previous exception flags
    feenableexcept(mask);

    Kokkos::parallel_for("DivideByZeroCheck", Kokkos::RangePolicy<>(0, 10), KOKKOS_LAMBDA(const int& i) {
        result(i) = numerator(i) / denominator(i);
    }); 

    // Deallocation before finalizing Kokkos
    result = Kokkos::View<double*>();
    denominator = Kokkos::View<double*>();
    numerator = Kokkos::View<double*>();

    Kokkos::finalize();

    return 0;
}

When I uncomment the line feenableexcept(FE_ALL_EXCEPT);, the code fails at the first Kokkos kernel even if I have set feenableexcept(0); before it; when I comment out the line feenableexcept(FE_ALL_EXCEPT);, the code runs to the end without a FPE, even if I am dividing by zero here. Am I doing something wrong here? Thanks.

@bartgol
Copy link

bartgol commented Feb 4, 2025

Are there any compiler flags that can affect FPE behavior? E.g., something like -ftrapv may cause FPEs to be raised regardless of what one does with the fenv stuff.

@sjsprecious
Copy link
Collaborator Author

Thanks @bartgol . I just compiled the example above with the intel compiler through:

icpx -O0 -g -lkokkoscore test.cpp

No other additional compiler flags are used here. I am using the kokkos/4.2.01 version, does it matter here?

@bartgol
Copy link

bartgol commented Feb 5, 2025

I don't think the kokkos version matters (that's roughly what we currently use in eamxx too).

I am running out of ideas. There are a few things to investigate:

  • kokkos may intercept some FPEs, causing the issue. You could ask the kokkos team on their slack space, pasting your snippet.
  • check the icpx specs to see if there's any mention of FPEs. Maybe icpx has a special interaction with fenv?
  • Maybe try with a good old GNU compiler, to see if everything else is sound.

@sjsprecious
Copy link
Collaborator Author

Thanks @bartgol . I will try your suggestions later and thank you again for pointing out the issue between vector length and FPEs here.

@sjsprecious
Copy link
Collaborator Author

Hi @bartgol , I just figured out that I should use the fegetexcept and feenableexcept functions defined from the ekat_feutils.hpp header to get the Kokkos example work correctly. It is the same for a naive 1.0/0.0 example as well but I did not notice that before.

@bartgol
Copy link

bartgol commented Feb 6, 2025

Ah, that may point to a fenv.h header that does not contain the info we expect. We do test that in our cmake with

check_cxx_symbol_exists(feenableexcept "fenv.h" EKAT_HAVE_FEENABLEEXCEPT)

and if it fails (and EKAT_HAVE_FEENABLEEXCEPT is undefined), the file ekat_feutils.hpp does provide an impl for a different fenv implementation. Specifically, that file was developed as a drop-in replacement for fenv on iOS. I thought you should not need it on Linux, but maybe you do? Do things work when you use ekat_feutils.hpp? Maybe the enable/disable fcns are not universal across fenv.h in C stdlibs...

@sjsprecious
Copy link
Collaborator Author

Thanks @bartgol . Yes, I guess the fenv.h in the C stdlibs on my Linux system is somehow different and the ekat_feutils.hpp works in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants