Redeploy workers that time out too soon? #101

wlandau · 2018-08-24T15:26:35Z

In my experience, HPC systems in academic settings can have very restrictive wall time limits. It may be difficult in these environments to follow your recommendation to keep the same pool of reserved workers for an entire end-to-end project.

You may have already implemented workarounds; I do not know. But just in case, here are a couple of ideas.

If a worker times out, launch a new one and make it attempt the work of its predecessor. In fact, it may be nice to do this for crashed workers in general for a given number of retries.
If a worker has been running for a certain (user-defined) length of time, make it restart before accepting any new jobs. This would be amazing to have for drake.
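
The two ideas above can be sketched as a small simulation. This is only an illustration in plain Python, not clustermq or drake code: the `Worker` class, `WorkerTimeout`, `run_jobs`, and the parameters `wall_limit`, `max_age`, and `max_retries` are all hypothetical names, and wall time is simulated by a per-job `cost` rather than measured.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class WorkerTimeout(Exception):
    """Raised when a simulated worker would exceed its wall time limit."""


@dataclass
class Worker:
    wall_limit: float     # wall time the scheduler grants this worker (hypothetical)
    max_age: float        # user-defined age after which the worker is recycled
    elapsed: float = 0.0  # simulated running time so far

    def run(self, cost: float, fn: Callable[[], object]) -> object:
        # Simulate the job consuming `cost` units of wall time.
        if self.elapsed + cost > self.wall_limit:
            self.elapsed = self.wall_limit
            raise WorkerTimeout()
        self.elapsed += cost
        return fn()

    def should_restart(self) -> bool:
        return self.elapsed >= self.max_age


def run_jobs(
    jobs: List[Tuple[float, Callable[[], object]]],
    wall_limit: float,
    max_age: float,
    max_retries: int = 2,
) -> Tuple[List[object], int]:
    """Run (cost, fn) jobs in order; return results and the number of workers launched."""
    worker = Worker(wall_limit, max_age)
    launched = 1
    results = []
    for cost, fn in jobs:
        attempts = 0
        while True:
            # Second idea: recycle an old worker before it accepts a new job.
            if worker.should_restart():
                worker = Worker(wall_limit, max_age)
                launched += 1
            try:
                results.append(worker.run(cost, fn))
                break
            except WorkerTimeout:
                # First idea: launch a replacement and retry the predecessor's job,
                # giving up after `max_retries` retries.
                attempts += 1
                if attempts > max_retries:
                    raise
                worker = Worker(wall_limit, max_age)
                launched += 1
    return results, launched
```

In this sketch a job that can never fit inside `wall_limit` still fails after the retries are exhausted, so retries only help with jobs that were cut off by time already spent on earlier work.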

mschubert · 2018-09-08T16:57:40Z

My first thought is that I have never come across a system where you couldn't request at least a couple of days' worth of walltime.

I'm inclined to treat this as the user's responsibility: either request the appropriate time or process the workflow in chunks that fit within it.

However, if this affects many users I'd be willing to reconsider.
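
As a rough illustration of processing a workflow in chunks that fit the walltime, the sketch below greedily groups tasks by estimated duration. It is plain Python rather than clustermq functionality; `chunk_by_walltime` and its arguments are hypothetical, and it assumes per-task duration estimates are available.

```python
from typing import List


def chunk_by_walltime(durations: List[float], walltime: float) -> List[List[int]]:
    """Group task indices, in order, so each group's total duration fits in `walltime`."""
    chunks: List[List[int]] = []
    current: List[int] = []
    used = 0.0
    for i, d in enumerate(durations):
        if d > walltime:
            # A single task longer than the limit cannot fit in any chunk.
            raise ValueError(f"task {i} needs {d}, which exceeds walltime {walltime}")
        if current and used + d > walltime:
            # Close the current chunk and start a new one for this task.
            chunks.append(current)
            current, used = [], 0.0
        current.append(i)
        used += d
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be submitted as its own batch with a walltime request no larger than the limit. The grouping preserves task order and is not guaranteed to use the minimum number of chunks.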

wlandau · 2018-10-05T02:06:02Z

Related: ropensci/drake#349

mschubert added the idea label Oct 2, 2018

mschubert mentioned this issue Oct 4, 2018

sequential job submission #105

Closed

wlandau mentioned this issue Oct 27, 2018

Remove all non-clustermq parallel backends? ropensci/drake#561

Closed

This was referenced Jan 27, 2020

Timeout clarification/issue ropensci/drake#1146

Closed

Hanging within dynamic targets ropensci/drake#1150

Closed

Parallel make randomly hangs ropensci/drake#1148

Closed

wlandau mentioned this issue Jun 25, 2020

send a target to a second worker in clustermq parallelism ropensci/drake#1287

Closed

2 tasks

mschubert closed this as completed Mar 29, 2021

Repository owner locked and limited conversation to collaborators Mar 29, 2021

This issue was moved to a discussion.
