Runtime error when using theta-l_kokkos dycore with DEBUG mode turned on #15
Comments
You are probably already aware, but just in case: the theta-l-kokkos dycore in E3SM was never hooked up to the Fortran interface. It is only used by standalone HOMME or the EAMxx C++ atmosphere model. I'm not sure what's involved in getting it called from the dp_coupling layer Fortran interface; hopefully just calls to copy the state variables into the Kokkos struct and copy them out at the end of the dycore step? |
Thanks Mark for the additional information. This error is caught by the I further tracked down the error to this Kokkos function and I guess that either Looking into the C++ code further, I found that when initializing I could confirm that the runtime error is gone by setting |
For theta-l-kokkos, HOMME_VECTOR_SIZE=1 for GPUs and 8 for CPUs (it's the loop blocking size, to enable CPU vectorization). |
Thanks @mt5555 . When |
Hi @mt5555 , hopefully I am getting closer to the root cause here. The runtime error is gone when I set The FKESSLER test case uses |
I'm not that familiar with this part of the code, but I believe NUM_PHYSICAL_LEVEL is what it says: the number of physical levels, while NUM_LEV probably includes the padding (so that the array is divisible by the vector length). Running with a vector size of 1 is fine; it just means loops won't vectorize on CPUs. |
Thanks @mt5555 . I guess we still want to vectorize the loops on the CPUs if possible, for performance reasons. Based on the following form:
If Thus I suspect that padding is somehow not handled correctly here but I am not familiar with Kokkos. Do you know whether there is a person I could ask for this specific question? Thanks. |
So for loops which are vectorized, it looks like NUM_LEV is the size of the outer loop, and the inner loop will be of size VECTOR_SIZE (and will be done with an AVX512 or similar instruction). This would imply that all the Kokkos arrays have two indices for the vertical direction, (NUM_LEV, VECTOR_SIZE). @bartgol might be able to quickly confirm. |
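To make the (NUM_LEV, VECTOR_SIZE) layout concrete, here is a minimal sketch of the index arithmetic, assuming NUM_LEV is the number of vector-sized packs needed to hold all physical levels. The helper names (`num_packs`, `num_padding`, `pack_index`) and the level counts are illustrative only and are not taken from the HOMME source.

```cpp
#include <cstddef>

// Hypothetical helper: number of packs (the assumed meaning of NUM_LEV)
// needed to hold num_physical_lev levels, using ceiling division.
constexpr std::size_t num_packs(std::size_t num_physical_lev, std::size_t vector_size) {
  return (num_physical_lev + vector_size - 1) / vector_size;
}

// Hypothetical helper: number of padded slots at the end of the last pack.
constexpr std::size_t num_padding(std::size_t num_physical_lev, std::size_t vector_size) {
  return num_packs(num_physical_lev, vector_size) * vector_size - num_physical_lev;
}

// Position of physical level k in a two-index (pack, lane) layout.
struct PackIndex {
  std::size_t pack;
  std::size_t lane;
};

constexpr PackIndex pack_index(std::size_t k, std::size_t vector_size) {
  return {k / vector_size, k % vector_size};
}

// Illustrative numbers: 72 levels with a vector size of 16 need 5 packs,
// which leaves 8 padded slots; with a vector size of 8 there is no padding.
static_assert(num_packs(72, 16) == 5, "72 levels fit in 5 packs of 16");
static_assert(num_padding(72, 16) == 8, "80 slots minus 72 levels");
static_assert(num_padding(72, 8) == 0, "8 divides 72 exactly");
static_assert(num_packs(72, 1) == 72, "vector size 1 means one level per pack");
```

With a vector size of 1 the two quantities coincide and there are no padded slots, which is consistent with the error disappearing in that configuration.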
Thanks @mt5555 for the details. In this case, does Kokkos initialize the elements in the padding to zero by default? If so, this then explains the floating point exception for this function since |
The reason you see both
And in E3SM, the EAMxx model runs with VECTOR_SIZE=16, and we do run tests with 72 levels, which makes the padding non-trivial. So VECTOR_SIZE does not need to divide NUM_PHYSICAL_LEV.
I think the issue stems from FPEs being enabled. When running with VECTOR_SIZE>1, we cannot allow FPEs to throw, since padding virtually always leads to NaN entries. E.g., the pow function used to compute exner will throw if the base is negative. To avoid this, EAMxx disables FPEs when control passes from the component coupler to EAMxx. While this may seem underwhelming (you would like the code to halt if NaN/Inf are encountered), the code will likely crash soon anyway: the dycore does check that the state is valid, and since NaNs are likely to make assert-like checks fail, you will likely have a crash very soon. Disabling FPEs is the price you pay if you want to use SIMD-like structures that may be padded and contain garbage at the end.
Edit: to disable FPEs, you need to do something like (for gnu)
feclearexcept(0); // clear excepts if already set
feenableexcept(0); // disable ALL FE exception from now on
In EAMxx we store the current FPE mask before disabling exceptions, so that we can re-enable the same mask before returning control to the coupler. Something like
void my_func (...) {
  auto mask = fegetexcept();
  feclearexcept(0);
  feenableexcept(0);
  {
    // RUN CODE HERE
  }
  feenableexcept(mask);
} |
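For reference, the save/disable/restore pattern described above can be sketched as a small RAII guard. This assumes a glibc toolchain, where `feenableexcept`, `fedisableexcept` and `fegetexcept` are GNU extensions declared in `fenv.h`. In glibc, `feenableexcept` adds the given exceptions to the set that trap and `fedisableexcept` removes them, so this sketch switches trapping off with `fedisableexcept(FE_ALL_EXCEPT)`. `ScopedFpeDisable` and `pow_of_negative_base` are hypothetical names and are not EAMxx code.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes the glibc fenv extensions; g++ usually defines it already
#endif
#include <fenv.h>
#include <cmath>

// Hypothetical RAII guard (not EAMxx code): masks all floating-point traps
// on construction and restores the previous trap mask on destruction.
class ScopedFpeDisable {
public:
  ScopedFpeDisable() : saved_mask_(fegetexcept()) {
    fedisableexcept(FE_ALL_EXCEPT);
  }
  ~ScopedFpeDisable() {
    feclearexcept(FE_ALL_EXCEPT);  // drop flags raised while traps were masked
    feenableexcept(saved_mask_);   // re-arm only the traps that were enabled before
  }
  ScopedFpeDisable(const ScopedFpeDisable&) = delete;
  ScopedFpeDisable& operator=(const ScopedFpeDisable&) = delete;

private:
  int saved_mask_;
};

// Stand-in for work on a padded array: pow with a negative base and a
// non-integer exponent returns NaN and raises FE_INVALID.
double pow_of_negative_base() {
  ScopedFpeDisable guard;
  volatile double base = -1.0;  // volatile keeps the compiler from folding the call
  return std::pow(base, 0.5);
}
```

The guard restores the mask on every exit path, including early returns and exceptions, which a manual re-enable at the end of the function would miss.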
Thanks Luca for your detailed explanations. They help a lot! Yes, this error is only triggered when FPE is enabled and what you have described makes sense to me. Do I understand it correctly that when disabling FPE, if the code is correct, having NaN values in the padding does not matter because they will be discarded or not used anyway? And if the code is wrong, it will crash due to other assert checks even without FPE? Regarding the example you provided about the FPE mask, sorry that I do not know anything about Kokkos so I want to make sure I understand it correctly. Say that I am going to disable the exceptions for this Kokkos function, should I do something like below:
And is there a specific compiler flag I should use to enable it? Thanks. |
Yes, that's correct. The values in the padding do not matter for the simulation. And if you develop a true FPE in the part before the padding, the odds are the equation of state will crap out at some point (assuming the FPEs are in a state variable, or a variable indirectly affecting it). We do test for FPEs in our code, but we explicitly set VECTOR_SIZE=1 for those tests, so that we don't get "spurious" FPEs...
The |
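The point about padding can be illustrated with a small standalone sketch: a padded column is processed in full, the padded slots produce NaN and raise the FE_INVALID flag, and the physical levels are unaffected. The sizes, the pressure values, the negative "garbage" in the padding and the names (`ExnerResult`, `exner_on_padded_column`) are all assumptions made for illustration. The sketch also assumes floating-point traps are not enabled when it runs, so the flag is only recorded.

```cpp
#include <cfenv>
#include <cmath>
#include <vector>

// Hypothetical sizes: 5 physical levels padded up to 8 slots (2 packs of 4).
constexpr int kNumPhysicalLev = 5;
constexpr int kVectorSize = 4;
constexpr int kNumSlots =
    ((kNumPhysicalLev + kVectorSize - 1) / kVectorSize) * kVectorSize;

struct ExnerResult {
  std::vector<double> exner;  // one value per slot, padding included
  bool invalid_raised;        // true if FE_INVALID was flagged during the loop
};

ExnerResult exner_on_padded_column() {
  const double p0 = 100000.0;
  const double kappa = 2.0 / 7.0;

  // Padded slots hold garbage; a negative value is assumed here.
  std::vector<double> p(kNumSlots, -1.0);
  for (int k = 0; k < kNumPhysicalLev; ++k) {
    p[k] = 20000.0 * (k + 1);
  }

  std::feclearexcept(FE_ALL_EXCEPT);
  std::vector<double> exner(kNumSlots);
  for (int k = 0; k < kNumSlots; ++k) {  // loops over padding too, as a SIMD loop would
    volatile double ratio = p[k] / p0;   // volatile prevents compile-time folding
    exner[k] = std::pow(ratio, kappa);
  }
  const bool invalid = std::fetestexcept(FE_INVALID) != 0;
  return {exner, invalid};
}
```

Only the first `kNumPhysicalLev` entries would ever be read by the rest of a model, so the NaN values in the padding are harmless unless FE_INVALID is set to trap.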
Thanks @bartgol so much for your detailed explanation. I could confirm that I can use the Also thanks for sharing how E3SM tests FPE and I think it is better to set |
What is your Kokkos backend? If you are running on a GPU, that may be the problem. I'm not sure setting the fenv mask on host will affect the device handling of FPEs. However, on GPU backends, we always set VECTOR_SIZE=1, for other reasons, so there isn't really an FPE issue on GPU. If, on the other hand, you are running with a CPU backend, then I'm puzzled.
No problem! |
Hi @bartgol , thanks for your quick reply. I am using the CPU backend. Here is a simple Kokkos example that may explain what I am doing better.
When I uncomment the line |
Are there any compiler flags that can affect FPE behavior? E.g., something like |
Thanks @bartgol . I just compiled the example above with the intel compiler through:
No other additional compiler flags are used here. I am using the |
I don't think the kokkos version matters (that's roughly what we currently use in eamxx too). I am running out of ideas. There are a few things to investigate:
|
Thanks @bartgol . I will try your suggestions later and thank you again for pointing out the issue between vector length and FPEs here. |
Hi @bartgol , I just figured out that I should use the |
Ah, that may point to a fenv.h header that does not contain the info we expect. We do test that in our cmake with check_cxx_symbol_exists(feenableexcept "fenv.h" EKAT_HAVE_FEENABLEEXCEPT) and if it fails (and |
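On the C++ side, the result of such a configure-time check is typically consumed through a preprocessor guard. The sketch below is one way this could look, not the EKAT implementation: `HAVE_FEENABLEEXCEPT` is a hypothetical stand-in for the macro a build system would define, it is guessed from `__GLIBC__` here so the sketch compiles on its own, and the no-op fallback is an assumption about what to do when the symbol is missing. The function names are hypothetical as well.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes the glibc fenv extensions; g++ usually defines it already
#endif
#include <fenv.h>

// Hypothetical macro: a real build would define it from the configure check.
#if !defined(HAVE_FEENABLEEXCEPT) && defined(__GLIBC__)
#define HAVE_FEENABLEEXCEPT 1
#endif

// Returns the set of exceptions that currently trap, or 0 when unsupported.
inline int get_enabled_fpes() {
#ifdef HAVE_FEENABLEEXCEPT
  return fegetexcept();
#else
  return 0;
#endif
}

// Makes the exceptions in mask trap; does nothing when unsupported.
inline void enable_fpes(int mask) {
#ifdef HAVE_FEENABLEEXCEPT
  feclearexcept(FE_ALL_EXCEPT);  // avoid trapping on a stale flag
  feenableexcept(mask);
#else
  (void)mask;
#endif
}

// Stops the exceptions in mask from trapping; does nothing when unsupported.
inline void disable_fpes(int mask) {
#ifdef HAVE_FEENABLEEXCEPT
  fedisableexcept(mask);
#else
  (void)mask;
#endif
}
```

A silent no-op fallback keeps the code portable, but it also means a build without the symbol never traps, so a configure-time warning would be a reasonable addition.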
Thanks @bartgol . Yes, I guess the |
What happened?
@jtruesdal reported that when using the theta-l_kokkos dycore with debug mode turned on, we could build the code successfully on Derecho but encountered a runtime error as shown below:
The error is traced back to the following function call in the prim_driver_mod.F90 code:
What are the steps to reproduce the bug?
See the Wiki page and set CAM_TARGET=theta-l_kokkos and DEBUG=TRUE.
What CAM tag were you using?
stormspeed branch
What machine were you running CAM on?
CISL machine (e.g. cheyenne)
What compiler were you using?
Intel
Path to a case directory, if applicable
No response
Will you be addressing this bug yourself?
Yes
Extra info
No response