Only make a retry backup when it is needed #812
Merged
Conversation
This reverts commit 1d1cd5c.
tests: http://groot.astro.sunysb.edu/Castro/test-suite/gfortran/2020-03-12-001/index.html (detonation did not output)
I've fixed some bugs in the simplified SDC implementation. Note: detonation will still not match; the problem is very sensitive, and my changes will cause it to look somewhat different and likely end on a different timestep.
PR summary
Presently, if use_retry is enabled, we make a pre-emptive copy of the StateData right at the beginning of the timestep. If a retry is triggered, we restore from that copy of the state and try again with a shorter timestep. Since use_retry proved so useful, we made it the default, but this meant we were always doubling the memory requirement of the state, which is a problem on GPUs due to their limited memory capacity. To avoid this, we placed the backup state in CPU pinned memory. Making the backup then became relatively expensive (the bandwidth is now that of the CPU-GPU interconnect, which is much lower than the bandwidth of GPU DRAM), but at the time (commit 0a95b63) the cost of the advance on the GPU was much larger than the cost of the copy. However, our recent C++ ports have made the advance so fast that saving the backup state is now the single largest cost in the timestep. So we need a strategy that does not require saving the backup state on every step.
The proposed solution is to make a backup only when we detect that a retry is needed. That is, we first attempt the advance, and then make a backup of the state if the advance failed. The backup state (called prev_state in the code) is thus a copy of the failed advance, but this is acceptable because we never need the new data* from the backup (just the old data), and the old data should be approximately the same regardless of whether we retry with a smaller timestep.

In fact, this approach is better than the status quo. At present, if we have to retry, we back up to a copy of the state made before any advance was done. The reason we need a backup of the old data at all is that we want the old data at the end of the timestep, after all subcycles have completed, to be the data at the start of the whole timestep, not the start of the last subcycle (this is necessary for valid time interpolations on the fine level). But by backing up to the old data before any advance was done, we lose the reset of the old data that happens in the first subcycle, which we want so that there is fully valid old-time data (that is, for the same reason we have an old-time update in the timestep to begin with, rather than just accepting the new-time data from the last step). This does mean we will see changes in the test suite for problems with retries.

A second benefit is that we no longer need to back up all the way to the beginning of a step when we hit a retry; we simply retry the same subcycle with a shorter timestep. This is valid as long as the CTU advance is written in an idempotent way, where doing it multiple times in a row does not produce unintended results, and that is true given the way we have written it: the old source term data simply gets recomputed and overwritten, and the new data gets copied from the old data and then added to. The only exception is the source term predictor discussed below, which has been dealt with.
The only caveat with this approach is that one piece of data depends directly on data from the last timestep: the source term predictor in CTU, or the last iteration's correction term in simplified SDC. The strategy above would not give us a valid way to restore the source term predictor after a retry, since we have already erased the data from the last timestep. So what we do now for CTU is save the source term predictor into a separate array (source_corrector) in the first pass through the timestep. The saved source corrector is established before any changes are made to the StateData, so it is indeed the data we want. If we do a retry, we refer back to that saved data rather than recalculate it (which would now be impossible). This does add some additional memory requirements, but we can make up for it later (for example, sources_for_hydro no longer really needs to be stored for the full timestep).

For simplified SDC, we cannot quite do this because the failure could occur in the second SDC iteration, at which point we would have lost the lagged solution from the previous timestep, and the lagged solution from the failed advance cannot be trusted. So the solution is to zero out the lagged sources and then take an extra SDC iteration to compensate.

*This does mean we are saving more data than we need. As a later optimization, we can save just the old data.
PR checklist
The CHANGES file has been updated.