Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NOX] Clean up of NOX::Solver::SingleStep #379

Open
vovannikov opened this issue Feb 14, 2025 · 5 comments
Open

[NOX] Clean up of NOX::Solver::SingleStep #379

vovannikov opened this issue Feb 14, 2025 · 5 comments
Labels
taskforce: tpetra Issues related to the migration from Epetra to Tpetra team: solvers type: enhancement A new feature or enhancement to be implemented

Comments

@vovannikov
Copy link
Contributor

vovannikov commented Feb 14, 2025

Description

The more I look into the code, the more I figure out that certain "non-optimal" design desicions in 4C were made due to the mistakes in the design of Trilinos itself. And I think this is the right time to figure out how they should be tackled properly.

There is SingleStep solver in 4C which is based on NOX::Solver::SingleStep from Trilinos and its solve() function looks like this:

enum Inpar::Solid::ConvergenceStatus Solid::Nln::SOLVER::SingleStep::solve()
{
check_init_setup();
auto& nln_group = dynamic_cast<NOX::Nln::Group&>(group());
const auto& x_epetra = dynamic_cast<const ::NOX::Epetra::Vector&>(group().getX());
nln_group.set_is_valid_newton(true); // to circumvent the check in ::NOX::Solver::SingleStep
nln_group.set_is_valid_rhs(false); // force to compute the RHS
nln_group.reset_x(); // to initialize the solution vector to zero
//// do one non-linear step using solve
::NOX::StatusTest::StatusType stepstatus = nlnsolver_->solve();
// copy the solution group into the class variable
group() = nlnsolver_->getSolutionGroup();
integrator().set_state(Core::LinAlg::Vector<double>(x_epetra.getEpetraVector()));
return convert_final_status(stepstatus);
}

You may see that the step() function is customized with some strange calls:

  nln_group.set_is_valid_newton(true);  // to circumvent the check in ::NOX::Solver::SingleStep
  nln_group.set_is_valid_rhs(false);    // force to compute the RHS
  nln_group.reset_x();                  // to initialize the solution vector to zero

Note that set_is_valid_newton() and set_is_valid_rhs() are used only in the implementation of the single step solver, not anywhere else in the code, which makes me think that it was a clear hack to resolve some problem.

Indeed, this implementation is dictated by this piece of code
https://github.com/trilinos/Trilinos/blob/62ff542b83d63ad9a56b91ef7af177c485459dc2/packages/nox/src/NOX_Solver_SingleStep.C#L131-L138

Here the author optimized memory usage. Well, const_cast is not nice, but what makes it worse is that getNewton() can thrown an exception:
https://github.com/trilinos/Trilinos/blob/62ff542b83d63ad9a56b91ef7af177c485459dc2/packages/nox/src-epetra/NOX_Epetra_Group.C#L547-L555

And the only way to set flag isValidNewton to true is to call Group::computeNewton(), and that call is not performed in NOX::Solver::SingleStep. Aparently, this makes NOX::Epetra::Group absolutely incompatible with NOX::Solver::SingleStep. And what it makes the situation even worse is that this behavior is not consistent between different group implementations in Trilinos itself. For instance, getNewton() in Thyra::Group can not throw a similar exception because it has not similar checks inside: https://github.com/trilinos/Trilinos/blob/62ff542b83d63ad9a56b91ef7af177c485459dc2/packages/nox/src-thyra/NOX_Thyra_Group.C#L750-L753. Well, in fact, it can throw an exception, but due to dereferencing a nullptr (if that happens), but not due that the flag isNewtonValid=false.

Then a question may arrise: how did Trilinos authors not notice this ever? I think the answer is that they have never used Epetra::Group a lot themselves and mainly used Thyra, at least there is only one test in Trilinos for the single step solver and it uses Thyra: https://github.com/trilinos/Trilinos/blob/62ff542b83d63ad9a56b91ef7af177c485459dc2/packages/nox/test/epetra/Thyra/Thyra_SingleStepSolver_1D.C.

Possible solutions

  1. Do not use this memory optimization in SingleStep solver in Trilinos, should be harmless except from the slightly larger memory consumption, the vector can be created in ctor and then reused. I would go for this option, I think this is a bad design decision in Trilinos.
  2. Remove check isNewton() from Epetra::Group class in Trilinos, but that can be risky since this is a very old legacy code and we do not know whether other codes rely on this behavior.
  3. Continue using hacks in 4C. (no way)
  4. Use our own group class.
  5. Wrappers around Trilinos classes and no inheritance from Trilinos classes (like @sebproell did for LinAlg::Vector).

Conclusion

This is a single example. I bet there are more of this, if I keep on tracking the code. While at the beginning I was a bit sceptical about getting rid of Epetra::Group as one of the first steps, I think that the need for switching to our own group will get more and more pressing. Epetra::Group is very much stateful, its behavior is so much dependent on the interplay between its flags (which are a lot), that makes it difficult to use it properly for various solution scenarious which may imply evaluation of only certain parts of the group.

I think that probably switching to our own group would still be a bit too much at the moment, since some parts of the code can still be optimized without this move.

@vryy @mayrmt @amgebauer @sebproell @maxfirmbach @georghammerl

@vovannikov vovannikov added the type: enhancement A new feature or enhancement to be implemented label Feb 14, 2025
@sebproell sebproell added taskforce: tpetra Issues related to the migration from Epetra to Tpetra team: solvers labels Feb 14, 2025
@maxfirmbach
Copy link
Contributor

@vovannikov Thanks for starting to look at this so quickly. Maybe I can share my view. As far as I know, the single step thing is not used much at all, which basically explains the bad testing … to me it seems like a very special implementation for some Sandia internal application code. The Epetra::Group instead should have been in very frequent use … the flags try to make it more flexible, but thus also quite complicated … some codes only provide Jacobian+Residual calculation in one function … the flags are there to also cover those cases (I guess you already know 😅).

But I agree with all your points … mainly consistency is missing.

@maxfirmbach
Copy link
Contributor

In addition it would be interesting to see how important the single step functionality really is … not quite sure about it.

@vovannikov
Copy link
Contributor Author

@maxfirmbach I guess this functionality could be not the most frequently used one, but this is still a tricky aspect. If we want to have a generalized wrapper for the buildSolver() fabric from Trilinos - this is one approach, but if we need to exclude certain options from the parameters list, then we need to take care of this on our side - this is a different approach that looks like less neat. In my opinion, sacrificing this kind of basic generality (of course we should not overengineering too much) is a risk prone path that can lead to apperance of new hacks at a certain point.

@vryy
Copy link
Contributor

vryy commented Feb 18, 2025

@maxfirmbach the single step solver is used in the explicit time integration where Newton Raphson is not necessary.

@vovannikov I did not fully recall the details, but for the single step solver we tried to avoid calling computeNewton() again because it is called previously in the predictor step. These calls that you mentioned as hack is the only way I found to work when using NOX::Solver::SingleStep. Of course, if there is a better way to do that, I'm happy to switch.

Here the author optimized memory usage. Well, const_cast is not nice, but what makes it worse...

I did not get your point here. The original intention is not memory optimization, but to tell explicitly that we do not change the direction vector in the process, i.e., group().getX(), at least in the current call context. The compiler may optimize the memory, then subsequent call may lead to segmentation fault error?? then it's the design problem of the calling package, here is NOX and I would suggest to fix it from there.

@vovannikov
Copy link
Contributor Author

@vryy thank you for your comments. Indeed, explicit time integration is an important use case.

In fact, not all predictors call computeNewton() (and this is absolutely fine, why should they be obliged to do so?) and that is exactly why the solver crashes without having flag isValidNewton manully set up as currently done in NOX::Solver::SingleStep. In fact, it is quite understandable why you chose this way to resolve this issue: you may see that the other 2 options I listed require modifications to be made in Trilinos and the remaining 2 are quite big fundamental modifications which would propagate outside of that specific class NOX::Solver::SingleStep.

By memory usage optimization I meant this trick in NOX itself:

  // Reuse memory in group instead of new allocation for dir
  NOX::Abstract::Vector& dir = const_cast<NOX::Abstract::Vector&>(solnPtr->getNewton());

https://github.com/trilinos/Trilinos/blob/62ff542b83d63ad9a56b91ef7af177c485459dc2/packages/nox/src/NOX_Solver_SingleStep.C#L131-L138

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
taskforce: tpetra Issues related to the migration from Epetra to Tpetra team: solvers type: enhancement A new feature or enhancement to be implemented
Projects
None yet
Development

No branches or pull requests

4 participants