Only make a retry backup when it is needed #812

Merged — 22 commits merged into development from source_corrector on Mar 16, 2020

Conversation

@maximumcats (Member) commented Mar 11, 2020

PR summary

Presently, if use_retry is enabled, we make a pre-emptive copy of the StateData right at the beginning of the timestep. If a retry is triggered, we restore the state from that copy and try again with a shorter timestep. Since use_retry proved so useful, we made it the default, but that meant we were always doubling the memory requirement of the state, which is a problem on GPUs with their limited memory capacity. To avoid this problem we placed the backup state in CPU pinned memory. This made the backup relatively expensive (the copy is paced by the bandwidth of the CPU-GPU interconnect, which is much lower than the bandwidth of GPU DRAM), but at the time we did it (commit 0a95b63) the cost of the advance on the GPU was much larger than the cost of the copy. However, our recent C++ ports have made the advance so fast that saving the backup state is now the single largest cost in the timestep. So we need a strategy that does not require saving the backup state on every step.
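
For context, here is a minimal sketch (AMReX-style C++) of what the eager pinned-memory backup amounts to; `make_pre_step_backup` is an illustrative name, not the actual Castro code:

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_Arena.H>

using namespace amrex;

// Eager backup (status quo): copy the state into CPU pinned memory at
// the start of every timestep. The copy runs at CPU-GPU interconnect
// bandwidth, so once the advance itself is fast, this copy dominates.
void make_pre_step_backup (const MultiFab& S_old, MultiFab& backup)
{
    if (backup.empty()) {
        // Allocate the backup in pinned host memory so it doesn't
        // double the GPU memory footprint.
        backup.define(S_old.boxArray(), S_old.DistributionMap(),
                      S_old.nComp(), S_old.nGrow(),
                      MFInfo().SetArena(The_Pinned_Arena()));
    }
    MultiFab::Copy(backup, S_old, 0, 0, S_old.nComp(), S_old.nGrow());
}
```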

The proposed solution is to make the backup only when we detect that a retry is needed: we first attempt the advance, and we back up the state only if that advance fails. The backup state (called prev_state in the code) is thus a copy of the failed advance, but this is acceptable because we never need the new data* from the backup (just the old data), and the old data should be approximately the same regardless of whether we retry with a smaller timestep.

In fact, this approach is better than the status quo. At present, if we have to retry, we back up to a copy of the state made before any advance was done. The reason we need a backup of the old data at all is that we want the old data at the end of the timestep, after all subcycles have been completed, to be the data at the start of the whole timestep, not the start of the last subcycle (this is necessary for valid time interpolation on the fine level). But by backing up to the old data from before any advance was done, we lose the reset of the old data that happens in the first subcycle, which is something we want so that there is fully valid old-time data (for the same reason we do an old-time update in the timestep at all, rather than just accepting the new-time data from the last step). This does mean we will see changes in the test suite for problems with retries.
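
Schematically, the control flow moves from "back up, then advance" to the sketch below; every function name here is a hypothetical stand-in for Castro's actual advance machinery:

```cpp
#include <AMReX_REAL.H>

using amrex::Real;

// Hypothetical hooks standing in for the real advance machinery.
bool do_advance (Real time, Real dt);   // returns false on failure
void copy_state_to_prev_state ();       // lazy backup into prev_state
Real shorten_timestep (Real dt);        // pick the retry timestep

bool advance_with_retry (Real time, Real dt)
{
    bool success = do_advance(time, dt);   // first attempt: no backup made

    if (!success) {
        // Only now do we pay for the backup. The new-time data in this
        // snapshot comes from the failed advance, but a retry only ever
        // reads the old-time data back out of prev_state, and that is
        // essentially independent of the timestep we tried.
        copy_state_to_prev_state();

        while (!success) {
            dt = shorten_timestep(dt);
            success = do_advance(time, dt);
        }
    }
    return success;
}
```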

A second benefit is that we no longer need to back up all the way to the beginning of a step when we hit a retry. We simply retry the same subcycle with a shorter timestep. This is valid as long as the CTU advance is written idempotently, so that running it multiple times in a row does not produce unintended results, and that is true given the way we have written it: the old source term data simply gets recomputed and overwritten, and the new data gets copied from the old data and then added to. The only exception is the source term corrector discussed below, which has been dealt with.
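
A sketch of why re-running a subcycle is safe, with `compute_old_sources` and `add_hydro_and_source_updates` as hypothetical stand-ins for the real CTU pieces:

```cpp
#include <AMReX_MultiFab.H>

using namespace amrex;

// Hypothetical stand-ins for the pieces of the CTU advance.
void compute_old_sources (MultiFab& old_sources, const MultiFab& S_old);
void add_hydro_and_source_updates (MultiFab& S_new,
                                   const MultiFab& old_sources, Real dt);

// Each pass rebuilds its outputs from the old-time data, so running it
// twice in a row is harmless: nothing accumulates across failed passes.
void do_subcycle (const MultiFab& S_old, MultiFab& S_new,
                  MultiFab& old_sources, Real dt)
{
    // Old source terms are recomputed from S_old, overwriting whatever
    // a failed pass left behind.
    compute_old_sources(old_sources, S_old);

    // The new-time state starts as a fresh copy of the old-time state,
    // erasing any partial update from a failed pass...
    MultiFab::Copy(S_new, S_old, 0, 0, S_new.nComp(), 0);

    // ...and the hydro and source updates are then added to that copy.
    add_hydro_and_source_updates(S_new, old_sources, dt);
}
```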

The only caveat with this approach is that there is one piece of data that depends directly on data from the last timestep: the source term predictor in CTU, or the last iteration's correction term in simplified SDC. The strategy above gives us no valid way to restore the source term predictor after a retry, since we have already erased the data from the last timestep. So what we do now for CTU is to save the source term predictor into a separate array (source_corrector) in the first pass through the timestep. The saved source corrector is established before any changes are made to the StateData, so it is indeed the data we want. If we do a retry, we refer back to that saved data rather than recalculate it (which would now be impossible). This does add some additional memory requirements, but we can make up for it later (for example, sources_for_hydro no longer really needs to be stored for the full timestep).

For simplified SDC we cannot quite do this, because the failure could occur in the second SDC iteration, at which point we would have lost the lagged solution from the previous timestep, and the lagged solution from the failed advance cannot be trusted. The solution is to zero out the lagged sources and then take an extra SDC iteration to compensate.
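
A minimal sketch of the CTU side of this; `compute_source_predictor` is a hypothetical stand-in (only source_corrector is a name from the actual PR):

```cpp
#include <AMReX_MultiFab.H>

using namespace amrex;

// Hypothetical stand-in: build the CTU source term predictor from
// last-step data (only possible before the StateData is modified).
void compute_source_predictor (MultiFab& predictor);

// Save the predictor once, on the first pass through the step, and
// reuse the saved copy on any retry: by retry time, the last-step data
// needed to recompute it is gone.
void apply_source_corrector (MultiFab& sources, MultiFab& source_corrector,
                             bool first_pass)
{
    if (first_pass) {
        compute_source_predictor(source_corrector);
    }
    MultiFab::Add(sources, source_corrector, 0, 0, sources.nComp(), 0);
}
```

The simplified SDC path, by contrast, discards the lagged sources entirely and compensates with an extra iteration, as described above.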

*This does mean we're saving more data than we need; as a later optimization, we can save just the old data.

PR checklist

  • test suite needs to be run on this PR
  • this PR will change answers in the test suite
  • all functions have docstrings as per the coding conventions
  • the CHANGES file has been updated
  • if appropriate, this change is described in the docs

@maximumcats changed the title from "Source corrector" to "Only make a retry backup when it is needed" on Mar 12, 2020
@maximumcats marked this pull request as ready for review on March 12, 2020 02:47
@zingale (Member) commented Mar 12, 2020
@maximumcats (Member, Author) commented:
I've fixed some bugs in the simplified SDC implementation. Note: Detonation will still not match; the problem is very sensitive, and my changes will cause it to look somewhat different and likely end on a different timestep.

@zingale merged commit 3adf1a4 into development on Mar 16, 2020
@zingale deleted the source_corrector branch on March 16, 2020 22:13