[NOX] Clean up of `NOX::Solver::SingleStep` #379

vovannikov · 2025-02-14T14:30:08Z

Description

The more I look into the code, the more I figure out that certain "non-optimal" design desicions in 4C were made due to the mistakes in the design of Trilinos itself. And I think this is the right time to figure out how they should be tackled properly.

There is SingleStep solver in 4C which is based on NOX::Solver::SingleStep from Trilinos and its solve() function looks like this:

4C/src/structure_new/src/nonlinear_solver/4C_structure_new_nln_solver_singlestep.cpp

Lines 103 to 124 in 67d3f53

    
           enum Inpar::Solid::ConvergenceStatus Solid::Nln::SOLVER::SingleStep::solve() 
        
           { 
        
             check_init_setup(); 
        
             auto& nln_group = dynamic_cast<NOX::Nln::Group&>(group()); 
        
             const auto& x_epetra = dynamic_cast<const ::NOX::Epetra::Vector&>(group().getX()); 
        
             nln_group.set_is_valid_newton(true);  // to circumvent the check in ::NOX::Solver::SingleStep 
        
             nln_group.set_is_valid_rhs(false);    // force to compute the RHS 
        
             nln_group.reset_x();                  // to initialize the solution vector to zero 
        
             //// do one non-linear step using solve 
        
             ::NOX::StatusTest::StatusType stepstatus = nlnsolver_->solve(); 
        
             // copy the solution group into the class variable 
        
             group() = nlnsolver_->getSolutionGroup(); 
        
             integrator().set_state(Core::LinAlg::Vector<double>(x_epetra.getEpetraVector())); 
        
             return convert_final_status(stepstatus); 
        
           }

You may see that the step() function is customized with some strange calls:

  nln_group.set_is_valid_newton(true);  // to circumvent the check in ::NOX::Solver::SingleStep
  nln_group.set_is_valid_rhs(false);    // force to compute the RHS
  nln_group.reset_x();                  // to initialize the solution vector to zero

Note that set_is_valid_newton() and set_is_valid_rhs() are used only in the implementation of the single step solver, not anywhere else in the code, which makes me think that it was a clear hack to resolve some problem.

Indeed, this implementation is dictated by this piece of code
https://github.com/trilinos/Trilinos/blob/62ff542b83d63ad9a56b91ef7af177c485459dc2/packages/nox/src/NOX_Solver_SingleStep.C#L131-L138

Here the author optimized memory usage. Well, const_cast is not nice, but what makes it worse is that getNewton() can thrown an exception:
https://github.com/trilinos/Trilinos/blob/62ff542b83d63ad9a56b91ef7af177c485459dc2/packages/nox/src-epetra/NOX_Epetra_Group.C#L547-L555

And the only way to set flag isValidNewton to true is to call Group::computeNewton(), and that call is not performed in NOX::Solver::SingleStep. Aparently, this makes NOX::Epetra::Group absolutely incompatible with NOX::Solver::SingleStep. And what it makes the situation even worse is that this behavior is not consistent between different group implementations in Trilinos itself. For instance, getNewton() in Thyra::Group can not throw a similar exception because it has not similar checks inside: https://github.com/trilinos/Trilinos/blob/62ff542b83d63ad9a56b91ef7af177c485459dc2/packages/nox/src-thyra/NOX_Thyra_Group.C#L750-L753. Well, in fact, it can throw an exception, but due to dereferencing a nullptr (if that happens), but not due that the flag isNewtonValid=false.

Then a question may arrise: how did Trilinos authors not notice this ever? I think the answer is that they have never used Epetra::Group a lot themselves and mainly used Thyra, at least there is only one test in Trilinos for the single step solver and it uses Thyra: https://github.com/trilinos/Trilinos/blob/62ff542b83d63ad9a56b91ef7af177c485459dc2/packages/nox/test/epetra/Thyra/Thyra_SingleStepSolver_1D.C.

Possible solutions

Do not use this memory optimization in SingleStep solver in Trilinos, should be harmless except from the slightly larger memory consumption, the vector can be created in ctor and then reused. I would go for this option, I think this is a bad design decision in Trilinos.
Remove check isNewton() from Epetra::Group class in Trilinos, but that can be risky since this is a very old legacy code and we do not know whether other codes rely on this behavior.
~~Continue using hacks in 4C.~~ (no way)
Use our own group class.
Wrappers around Trilinos classes and no inheritance from Trilinos classes (like @sebproell did for LinAlg::Vector).

Conclusion

This is a single example. I bet there are more of this, if I keep on tracking the code. While at the beginning I was a bit sceptical about getting rid of Epetra::Group as one of the first steps, I think that the need for switching to our own group will get more and more pressing. Epetra::Group is very much stateful, its behavior is so much dependent on the interplay between its flags (which are a lot), that makes it difficult to use it properly for various solution scenarious which may imply evaluation of only certain parts of the group.

I think that probably switching to our own group would still be a bit too much at the moment, since some parts of the code can still be optimized without this move.

@vryy @mayrmt @amgebauer @sebproell @maxfirmbach @georghammerl

The text was updated successfully, but these errors were encountered:

maxfirmbach · 2025-02-14T18:50:52Z

@vovannikov Thanks for starting to look at this so quickly. Maybe I can share my view. As far as I know, the single step thing is not used much at all, which basically explains the bad testing … to me it seems like a very special implementation for some Sandia internal application code. The Epetra::Group instead should have been in very frequent use … the flags try to make it more flexible, but thus also quite complicated … some codes only provide Jacobian+Residual calculation in one function … the flags are there to also cover those cases (I guess you already know 😅).

But I agree with all your points … mainly consistency is missing.

maxfirmbach · 2025-02-14T18:55:41Z

In addition it would be interesting to see how important the single step functionality really is … not quite sure about it.

vovannikov · 2025-02-14T22:18:59Z

@maxfirmbach I guess this functionality could be not the most frequently used one, but this is still a tricky aspect. If we want to have a generalized wrapper for the buildSolver() fabric from Trilinos - this is one approach, but if we need to exclude certain options from the parameters list, then we need to take care of this on our side - this is a different approach that looks like less neat. In my opinion, sacrificing this kind of basic generality (of course we should not overengineering too much) is a risk prone path that can lead to apperance of new hacks at a certain point.

vryy · 2025-02-18T21:56:09Z

@maxfirmbach the single step solver is used in the explicit time integration where Newton Raphson is not necessary.

@vovannikov I did not fully recall the details, but for the single step solver we tried to avoid calling computeNewton() again because it is called previously in the predictor step. These calls that you mentioned as hack is the only way I found to work when using NOX::Solver::SingleStep. Of course, if there is a better way to do that, I'm happy to switch.

Here the author optimized memory usage. Well, const_cast is not nice, but what makes it worse...

I did not get your point here. The original intention is not memory optimization, but to tell explicitly that we do not change the direction vector in the process, i.e., group().getX(), at least in the current call context. The compiler may optimize the memory, then subsequent call may lead to segmentation fault error?? then it's the design problem of the calling package, here is NOX and I would suggest to fix it from there.

vovannikov · 2025-02-18T22:41:03Z

@vryy thank you for your comments. Indeed, explicit time integration is an important use case.

In fact, not all predictors call computeNewton() (and this is absolutely fine, why should they be obliged to do so?) and that is exactly why the solver crashes without having flag isValidNewton manully set up as currently done in NOX::Solver::SingleStep. In fact, it is quite understandable why you chose this way to resolve this issue: you may see that the other 2 options I listed require modifications to be made in Trilinos and the remaining 2 are quite big fundamental modifications which would propagate outside of that specific class NOX::Solver::SingleStep.

By memory usage optimization I meant this trick in NOX itself:

  // Reuse memory in group instead of new allocation for dir
  NOX::Abstract::Vector& dir = const_cast<NOX::Abstract::Vector&>(solnPtr->getNewton());

https://github.com/trilinos/Trilinos/blob/62ff542b83d63ad9a56b91ef7af177c485459dc2/packages/nox/src/NOX_Solver_SingleStep.C#L131-L138

vovannikov added the type: enhancement A new feature or enhancement to be implemented label Feb 14, 2025

sebproell added taskforce: tpetra Issues related to the migration from Epetra to Tpetra team: solvers labels Feb 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NOX] Clean up of `NOX::Solver::SingleStep` #379

[NOX] Clean up of `NOX::Solver::SingleStep` #379

vovannikov commented Feb 14, 2025 •

edited

Loading

maxfirmbach commented Feb 14, 2025

maxfirmbach commented Feb 14, 2025

vovannikov commented Feb 14, 2025

vryy commented Feb 18, 2025

vovannikov commented Feb 18, 2025

[NOX] Clean up of NOX::Solver::SingleStep #379

[NOX] Clean up of NOX::Solver::SingleStep #379

Comments

vovannikov commented Feb 14, 2025 • edited Loading

Description

Possible solutions

Conclusion

maxfirmbach commented Feb 14, 2025

maxfirmbach commented Feb 14, 2025

vovannikov commented Feb 14, 2025

vryy commented Feb 18, 2025

vovannikov commented Feb 18, 2025

[NOX] Clean up of `NOX::Solver::SingleStep` #379

[NOX] Clean up of `NOX::Solver::SingleStep` #379

vovannikov commented Feb 14, 2025 •

edited

Loading