In my experience, HPC systems in academic settings can have very restrictive wall time limits. It may be difficult in these environments to follow your recommendation to keep the same pool of reserved workers for an entire end-to-end project.
You may have already implemented workarounds; I do not know. But just in case, here are a couple of ideas.
If a worker times out, launch a new one and make it attempt the work of its predecessor. In fact, it may be nice to do this for crashed workers in general, up to a given number of retries.
If a worker has been running for a certain (user-defined) length of time, make it restart before accepting any new jobs. This would be amazing to have for drake.
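To make the two ideas concrete, here is a rough, language-agnostic sketch in Python. It is only a toy simulation, not how any existing package works: `Worker`, `run_jobs`, `wall_time`, `restart_after`, and `max_retries` are all hypothetical names, and time is counted in abstract ticks rather than real seconds.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical worker whose age is measured in abstract ticks."""
    wall_time: int       # hard limit imposed by the cluster scheduler
    restart_after: int   # user-defined age at which the worker recycles itself
    age: int = 0

    def should_restart(self) -> bool:
        # Idea 2: restart before accepting a new job once the worker is "old".
        return self.age >= self.restart_after

    def run(self, cost: int) -> bool:
        """Return True if the job finishes before the wall time limit."""
        if self.age + cost > self.wall_time:
            self.age = self.wall_time  # the scheduler kills the worker
            return False
        self.age += cost
        return True


def run_jobs(jobs, wall_time, restart_after, max_retries):
    """Run (name, cost) jobs on one worker slot, replacing the worker as needed.

    Returns (done, failed, launched), where launched counts workers started.
    """
    queue = deque((name, cost, 0) for name, cost in jobs)
    worker = Worker(wall_time, restart_after)
    launched = 1
    done, failed = [], []
    while queue:
        name, cost, attempts = queue.popleft()
        if worker.should_restart():
            worker = Worker(wall_time, restart_after)
            launched += 1
        if worker.run(cost):
            done.append(name)
            continue
        # Idea 1: the worker timed out, so launch a successor and
        # let it retry the same job, up to max_retries times.
        worker = Worker(wall_time, restart_after)
        launched += 1
        if attempts < max_retries:
            queue.appendleft((name, cost, attempts + 1))
        else:
            failed.append(name)
    return done, failed, launched


if __name__ == "__main__":
    jobs = [("a", 4), ("b", 4), ("c", 4), ("d", 20)]
    print(run_jobs(jobs, wall_time=10, restart_after=6, max_retries=2))
```

In this sketch, a job that can never fit inside the wall time (like `d`) is given up on after the retry budget is spent, while setting `restart_after` below `wall_time` lets a worker recycle itself between jobs instead of being killed in the middle of one.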