Timeout clarification/issue #1146
Comments
If I kill the …
Maybe I should add that if I let the plan run through to the end I get the following warning message:
…
As you say, re #170, this is an issue with …
To be clear, the problems you face are related to …
Hmm... from mschubert/clustermq#172 (comment), it looks like the PR will not be merged after all. I recommend following up with @mschubert to see how he prefers to resolve this.
This looks like a timeout and not an interface issue, so this is independent of clustermq#172. I have changed the title of this issue to be clearer about what the problem is. What I think is happening is that drake sends a lot of data, and the workers only wait for up to 10 minutes to receive it. Can you try if the timeout branch fixes it? `devtools::install_github("mschubert/clustermq@timeout")`
Sure thing. Just to make sure I have the right version though, do I just load it as normal, i.e. `library(clustermq)`? When I check the version with `packageVersion()` it is the same as before: ‘0.8.8.1’.
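For reference, a minimal sketch of the install-and-check step being suggested, assuming `devtools` is available (the install command itself is quoted from the comment above):

```r
# Install the experimental timeout branch of clustermq
# (command quoted from the comment above; assumes devtools is installed).
devtools::install_github("mschubert/clustermq@timeout")

# After restarting R: the version string may still read 0.8.8.1, so the
# version number alone does not confirm which branch is installed.
packageVersion("clustermq")
```

As noted in the next comment, inspecting the worker code itself is a more reliable check than the version number.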
That's fine, if you do `head(clustermq:::worker)` it should show `timeout=3600` instead of `timeout=600`.
Yes, that’s true, it does show the timeout at 3600 now. I suppose there could still be cases where that is not enough, though? If we have really long non-dynamic targets that are not using the nodes and downstream dynamic targets that need to use the nodes… In this case I guess the plan could be rearranged so that resources aren’t being wasted. Or is there any way to restart the nodes midway through a drake plan?
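One way to keep workers from idling past the timeout is to stage the work so that clustermq workers are only launched once the parallel targets are ready to build. A rough sketch of that idea, using drake's `targets` argument to `make()` and the per-target `hpc` option; the plan and its functions are hypothetical, not from this thread:

```r
library(drake)

# Placeholder functions standing in for the real work (hypothetical).
slow_local_work <- function() { Sys.sleep(5); "done" }
fan_out <- function(x) toupper(x)

plan <- drake_plan(
  long_serial_step = target(slow_local_work(), hpc = FALSE),  # built locally
  parallel_step    = target(fan_out(long_serial_step), hpc = TRUE)
)

# Stage 1: build the long serial target without launching any HPC workers.
make(plan, targets = "long_serial_step", parallelism = "loop")

# Stage 2: launch clustermq workers only now, when the parallel target is
# ready, so they do not sit idle (and hit the worker timeout) while the
# serial step runs.
make(plan, parallelism = "clustermq", jobs = 8)
```

Whether this is preferable to simply raising the worker timeout depends on how long the serial stretches of the plan are.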
Successive …
I'm also getting this; my worker nodes are expiring fairly frequently with the following lockfile entry for clustermq:

```json
"clustermq": {
  "Package": "clustermq",
  "Version": "0.8.8",
  "Source": "Repository",
  "Repository": "CRAN",
  "Hash": "644dc578f786be4e69f0281c1246e1e6"
}
```

I am updating to the following, as per @mschubert's suggestion:

```json
"clustermq": {
  "Package": "clustermq",
  "Version": "0.8.8.1",
  "Source": "GitHub",
  "RemoteType": "github",
  "RemoteHost": "api.github.com",
  "RemoteRepo": "clustermq",
  "RemoteUsername": "mschubert",
  "RemoteRef": "timeout",
  "RemoteSha": "9ba37a3cd17d82e39fda8133a4e1cd36cc76b50d",
  "Hash": "c3593195b1eddd92c2db94ded132eb9f"
},
```

Below is a typical log file for a worker in a job array under the CRAN version:

…

I'll rerun the workflow using the timeout branch and update on the result.
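For anyone else managing the project with renv, a hedged sketch of how the switch to the timeout branch might be made and recorded (assuming renv is already initialized in the project):

```r
# Install the timeout branch from GitHub into the project library
# (renv accepts the "user/repo@ref" shorthand).
renv::install("mschubert/clustermq@timeout")

# Record the new GitHub source, ref, and SHA in renv.lock, producing an
# entry like the second lockfile block above.
renv::snapshot()
```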
@wlandau, if you have a bottleneck (e.g. 10 workers but only one job that can be processed), do you send … ? The latter would also explain the timeouts (but this should be fixed with moving away from timeouts in …).
If there are dependent jobs waiting for that one to finish, then yes I do. If all the jobs are running or done, those workers terminate. (See lines 79 to 83 and lines 115 to 126 of the drake source at commit 757497f.)
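A toy illustration of the policy described above (a simplified sketch, not drake's actual implementation): a persistent worker is kept waiting only while unbuilt targets are still queued behind the jobs currently running; once everything is running or done, idle workers terminate.

```r
# Hypothetical helper sketching the policy described above (not drake's code).
should_keep_worker_alive <- function(queued_targets) {
  # Keep the worker only if some target is still waiting on running jobs.
  length(queued_targets) > 0
}

should_keep_worker_alive(c("run_something_in_parallel2"))  # TRUE: keep waiting
should_keep_worker_alive(character(0))                     # FALSE: terminate
```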
Awesome! I am looking forward to …
Prework

- Read and abide by drake's code of conduct.
- For general questions, use the drake-r-package tag. (If you anticipate extended follow-up and discussion, you are already in the right place!)

Question
Hi Will,
I'm running into some timeout failures. I see the section on timeouts in the manual suggests playing with the `cpu`, `elapsed`, and `retries` arguments, but if I'm reading the defaults correctly, aren't they already set at `Inf`? This is my issue: I have a plan kind of like this:

…

`run_something_in_parallel1` is running on workers as expected and `aggregate` is running on my local session, but then I see the workers time out, and when it comes time to create `run_something_in_parallel2`, it is only built locally. What parameter should I change to avoid this, since they are already at `Inf`? My `slurm` out file looks like this:

…

The issue here also discusses this error, but I think it might be a different case.
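Since the original plan and log did not survive the formatting here, below is a hedged sketch of what a plan like the one described might look like; the function bodies are placeholders, and only the target names come from the question above:

```r
library(drake)

# Placeholder functions standing in for the real work (not from the original post).
heavy_step <- function(x) { Sys.sleep(1); x }
combine    <- function(x) sum(unlist(x))

plan <- drake_plan(
  run_something_in_parallel1 = target(heavy_step(1), hpc = TRUE),
  aggregate                  = target(combine(run_something_in_parallel1), hpc = FALSE),
  run_something_in_parallel2 = target(heavy_step(aggregate), hpc = TRUE)
)

# cpu and elapsed already default to Inf, so the 10-minute limit discussed in
# this thread comes from the clustermq worker timeout, not from these arguments.
make(plan, parallelism = "clustermq", jobs = 2,
     cpu = Inf, elapsed = Inf, retries = 0)
```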