Only make a retry backup when it is needed #812
Merged
Conversation
This reverts commit 1d1cd5c.
tests: http://groot.astro.sunysb.edu/Castro/test-suite/gfortran/2020-03-12-001/index.html (detonation did not output)
I've fixed some bugs in the simplified SDC implementation. Note: detonation will still not match; the problem is very sensitive, and my changes will cause it to look somewhat different and likely end on a different timestep.
PR summary
Presently, if use_retry is enabled, we make a pre-emptive copy of the StateData right at the beginning of the timestep. If a retry is triggered, we restore from that copy of the state and try again with a shorter timestep. Since use_retry proved so useful, we made it the default, but this meant we were always doubling the memory requirement of the state, which is a problem on GPUs due to their limited memory capacity. To avoid this, we placed the backup state in CPU pinned memory. Making the backup then became relatively expensive (the bandwidth is now that of the CPU-GPU interconnect, which is much lower than the bandwidth of GPU DRAM), but at the time (commit 0a95b63) the cost of the advance on the GPU was much larger than the cost of the copy. However, our recent C++ ports have made the advance so fast that saving the backup state is now the single largest cost in the timestep. So we need a strategy that does not require saving the backup state on every step.
The proposed solution is to make a backup only when we detect that a retry is needed. That is, we first attempt the advance, and then make a backup of the state if the advance failed. The backup state (called prev_state in the code) is thus a copy of the failed advance, but this is acceptable because we never need the new data* from the backup (just the old data), and the old data should be approximately the same regardless of whether we retry with a smaller timestep.

In fact, this approach is better than the status quo. At present, if we have to retry, we back up to a copy of the state made before any advance was done. The reason we need a backup of the old data at all is that we want the old data at the end of the timestep, after all subcycles have completed, to be the data at the start of the whole timestep, not the start of the last subcycle (this is necessary for valid time interpolations on the fine level). But by backing up to the old data before any advance was done, we lose the reset of the old data that happens in the first subcycle, which we want so that there is fully valid old-time data (that is, for the same reason we have an old-time update in the timestep to begin with, rather than just accepting the new-time data from the last step). This does mean we will see changes in the test suite for problems with retries.

A second benefit is that we no longer need to back up all the way to the beginning of a step when we hit a retry; we simply retry the same subcycle with a shorter timestep. This is valid as long as the CTU advance is written in an idempotent way, where doing it multiple times in a row does not produce unintended results, and that is true given the way we have written it: the old source term data simply gets recomputed and overwritten, and the new data gets copied from the old data and then added to. The only exception is the source term predictor discussed below, which has been dealt with.
The only caveat with this approach is that one piece of data depends directly on data from the last timestep: the source term predictor in CTU, or the last iteration's correction term in simplified SDC. The strategy above would not give us a valid way to restore the source term predictor after a retry, since we have already erased the data from the last timestep. So what we do now for CTU is save the source term predictor into a separate array (source_corrector) in the first pass through the timestep. The saved source corrector is established before any changes are made to the StateData, so it is indeed the data we want. If we do a retry, we refer back to that saved data rather than recalculate it (which would now be impossible). This does add some additional memory requirements, but we can make up for it later (for example, sources_for_hydro no longer really needs to be stored for the full timestep).

For simplified SDC, we cannot quite do this because the failure could occur in the second SDC iteration, at which point we would have lost the lagged solution from the previous timestep, and the lagged solution from the failed advance cannot be trusted. So the solution is to zero out the lagged sources and then take an extra SDC iteration to compensate.

*This does mean we are saving more data than we need. As a later optimization, we can save just the old data.
PR checklist
The CHANGES file has been updated.