This issue was moved to a discussion.

Redeploy workers that time out too soon? #101

Closed
wlandau opened this issue Aug 24, 2018 · 2 comments

Comments

@wlandau
Contributor

wlandau commented Aug 24, 2018

In my experience, HPC systems in academic settings can have very restrictive wall time limits. It may be difficult in these environments to follow your recommendation to keep the same pool of reserved workers for an entire end-to-end project.

You may have already implemented workarounds; I do not know. But just in case, here are a couple of ideas.

  1. If a worker times out, launch a new one and make it attempt the work of its predecessor. In fact, it may be nice to do this for crashed workers in general for a given number of retries.
  2. If a worker has been running for a certain (user-defined) length of time, make it restart before accepting any new jobs. This would be amazing to have for drake.
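
A minimal sketch of how the two ideas might fit together, written as a toy Python simulation rather than actual clustermq or drake code. Every name here (`Worker`, `run_jobs`, `wall_limit`, `max_age`, `max_retries`) is hypothetical, time is measured in abstract units, and the pool is assumed to hold a single worker to keep the logic visible.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical worker with a hard wall-time limit (abstract time units)."""
    wall_limit: float   # scheduler kills the worker once its age exceeds this
    max_age: float      # user-defined age after which the worker restarts
    age: float = 0.0


def run_jobs(durations, wall_limit, max_age, max_retries=2):
    """Simulate one worker slot; return (completed, failed, workers_launched)."""
    # Each queue entry is (job id, number of attempts already made).
    queue = deque((job_id, 0) for job_id in range(len(durations)))
    completed, failed = [], []
    worker = Worker(wall_limit, max_age)
    launched = 1
    while queue:
        # Idea 2: restart a worker that is too old before it accepts new work.
        if worker.age >= worker.max_age:
            worker = Worker(wall_limit, max_age)
            launched += 1
        job_id, attempts = queue.popleft()
        needed = durations[job_id]
        if worker.age + needed > worker.wall_limit:
            # Idea 1: the worker timed out mid-job, so launch a replacement
            # and let it retry the job, up to max_retries times.
            worker = Worker(wall_limit, max_age)
            launched += 1
            if attempts < max_retries:
                queue.appendleft((job_id, attempts + 1))
            else:
                failed.append(job_id)
            continue
        worker.age += needed
        completed.append(job_id)
    return completed, failed, launched
```

In this sketch, a job that is longer than the wall-time limit itself can never succeed, so it lands in `failed` once its retries are used up. That is the case where splitting the work into smaller chunks would still be needed.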
@mschubert
Owner

My first thought about this is: I have never come across a system where you couldn't request at least a couple of days' worth of walltime.

I'm inclined to treat this as the user's responsibility: either request an appropriate amount of walltime, or process the workflow in chunks that fit within it.

However, if this affects many users I'd be willing to reconsider.

@wlandau
Contributor Author

wlandau commented Oct 5, 2018

Related: ropensci/drake#349
