
Timeout clarification/issue #1146

Closed · 3 tasks done
jennysjaarda opened this issue Jan 23, 2020 · 13 comments

jennysjaarda commented Jan 23, 2020

Prework

  • Read and abide by drake's code of conduct.
  • Search for duplicates among the existing issues, both open and closed.
  • If you think your question has a quick and definite answer, consider posting to Stack Overflow under the drake-r-package tag. (If you anticipate extended follow-up and discussion, you are already in the right place!)

Question

Hi Will,

I'm running into some timeout failures. I see that the section on timeouts in the manual suggests playing with the cpu, elapsed, and retries arguments, but if I'm reading the defaults correctly, aren't they already set to Inf? This is my issue:
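
For reference, here is a minimal sketch (a placeholder call, not my real one) of where those drake-level knobs live and what I understand the defaults to be:

# drake's own per-target limits, set in make():
make(
  plan,
  elapsed = Inf, # wall-clock seconds allowed per target (default Inf)
  cpu = Inf,     # CPU seconds allowed per target (default Inf)
  retries = 0    # times to retry a failed target (default 0, if I read the docs right)
)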

I have a plan kind of like this:

plan <- drake_plan(
  my_map = tibble(i = 1:1000),
  run_something_in_parallel1 = target(
    long_func1(my_map$i),
    dynamic = map(my_map) # runs on the workers
  ),
  aggregate = target({
    run_something_in_parallel1 # these results are a dependency of `aggregate`
    write_results()
  }, hpc = FALSE), # does not run on the workers but still takes some time (~20 minutes)
  run_something_in_parallel2 = target(
    long_func2(my_map$i),
    dynamic = map(my_map) # runs on the workers
  )
)
make(
  plan,
  parallelism = "clustermq",
  console_log_file = "log.out",
  memory_strategy = "lookahead",
  garbage_collection = TRUE, ## try with preclean?
  jobs = 100,
  template = list(cpus = 1, partition = "sgg", log_file = "log_%a_clustermq.out")
)

run_something_in_parallel1 runs on the workers as expected and aggregate runs in my local session, but then I see the workers time out, and when it comes time to build run_something_in_parallel2, it is only built locally. Which parameter should I change to avoid this, given that they are already set to Inf? My SLURM output file looks like this:

> clustermq:::worker("tcp://node16:6345")
2020-01-23 08:53:58.924755 | Master: tcp://node16:6345
2020-01-23 08:53:58.939260 | WORKER_UP to: tcp://node16:6345
Error in clustermq:::worker("tcp://node16:6345") : 
  Timeout reached, terminating
Execution halted

Issue #170 also discusses this error, but I think it might be a different case.

jennysjaarda (Author)

If I kill the R job and restart it, run_something_in_parallel2 registers on the workers as expected, because make() recognizes that run_something_in_parallel1 and aggregate are already up to date.

jennysjaarda (Author)

Maybe I should add that if I let the plan run through to the end I get the following warning message:

Warning in config$workers$cleanup() :
  74/98 workers did not shut down properly
Master: [2113.3s 29.6% CPU]; Worker: [avg 0.0% CPU, max 248.1 Mb]

wlandau (Member) commented Jan 23, 2020

As you say, re #170, this is an issue with clustermq. Looks like a solution has been proposed: mschubert/clustermq#172. It is only a matter of time before the maintainer gets to it.

wlandau closed this as completed Jan 23, 2020
wlandau (Member) commented Jan 24, 2020

To be clear, the problems you face are related to clustermq/ZeroMQ timeouts, not drake's timeouts (e.g. the elapsed or cpu arguments of make()).

wlandau (Member) commented Jan 27, 2020

Hmm... from mschubert/clustermq#172 (comment), it looks like the PR will not be merged after all. I recommend following up with @mschubert to see how he prefers to resolve this.

mschubert commented Jan 27, 2020

This looks like a timeout and not an interface issue, so it is independent of clustermq#172. I have changed the title of this issue to be clearer about what the problem is.

What I think is happening is that drake sends a lot of data, and the workers only wait for up to 10 minutes to receive it.

Can you check whether the timeout branch fixes it? (This increases the worker timeout to 1 hour.)

devtools::install_github("mschubert/clustermq@timeout")

jennysjaarda (Author) commented Jan 27, 2020 via email

mschubert

That's fine. If you run

head(clustermq:::worker)

it should show timeout = 3600 instead of timeout = 600.
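
A quick supplementary check (not from the branch instructions, just base R) to confirm the GitHub build is the one actually loaded:

packageVersion("clustermq") # the timeout branch reports 0.8.8.1 (see the lockfile entry below)
head(clustermq:::worker)    # should now show timeout = 3600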

jennysjaarda (Author) commented Jan 27, 2020 via email

wlandau (Member) commented Jan 27, 2020

Successive make()s restart the workflow beginning at incomplete/outdated targets, which could help. I do not know of another way to restart workers. Related: #349, mschubert/clustermq#101.
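
A minimal sketch of that restart pattern (same arguments as the original call; outdated() is optional and only previews what would rebuild):

# After the workers time out, call make() again: up-to-date targets are skipped
# and only the incomplete/outdated ones (run_something_in_parallel2 here) are
# rebuilt, on freshly submitted workers.
outdated(drake_config(plan)) # optional preview
make(plan, parallelism = "clustermq", console_log_file = "log.out",
  memory_strategy = "lookahead", garbage_collection = TRUE,
  jobs = 100, template = list(cpus = 1, partition = "sgg",
  log_file = "log_%a_clustermq.out"))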

strazto (Contributor) commented Jan 30, 2020

I'm also getting this; my worker nodes are expiring fairly frequently with the following lockfile entry for clustermq:

"clustermq": {
      "Package": "clustermq",
      "Version": "0.8.8",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "644dc578f786be4e69f0281c1246e1e6"
}

I am updating to the following, as per @mschubert's suggestion:

"clustermq": {
      "Package": "clustermq",
      "Version": "0.8.8.1",
      "Source": "GitHub",
      "RemoteType": "github",
      "RemoteHost": "api.github.com",
      "RemoteRepo": "clustermq",
      "RemoteUsername": "mschubert",
      "RemoteRef": "timeout",
      "RemoteSha": "9ba37a3cd17d82e39fda8133a4e1cd36cc76b50d",
      "Hash": "c3593195b1eddd92c2db94ded132eb9f"
    },
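
Under renv, the update that produces an entry like that is roughly (a sketch, assuming renv manages this project's library):

renv::install("mschubert/clustermq@timeout") # install the branch from GitHub
renv::snapshot()                             # pin the RemoteRef/RemoteSha in the lockfile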

Below is a typical log file for a worker in a job array under the CRAN version:
example_worker_log.log

I'll rerun the workflow using the timeout branch and report back with the result.

mschubert
@wlandau, if you have a bottleneck (e.g. 10 workers but only one job that can be processed), do you send WORKER_WAIT via the API or do you not send any signal until this one job is done?

The latter would also explain the timeouts (but this should be fixed with moving away from timeouts in 0.9 in any case).

wlandau (Member) commented Jan 31, 2020

> @wlandau, if you have a bottleneck (e.g. 10 workers but only one job that can be processed), do you send WORKER_WAIT via the API or do you not send any signal until this one job is done?

If there are dependent jobs waiting for that one to finish, then yes I do. If all the jobs are running or done, those workers terminate.

} else if (!config$queue$empty()) {
  cmq_next_target(config)
} else {
  config$workers$send_shutdown_worker()
}

drake/R/backend_clustermq.R, lines 115 to 126 at 757497f:

target <- config$queue$pop0()
# Longer tests will catch this:
if (!length(target)) {
  config$workers$send_wait() # nocov
  return() # nocov
}
if (no_hpc(target, config)) {
  config$workers$send_wait()
  cmq_local_build(target, config)
} else {
  cmq_send_target(target, config)
}

> The latter would also explain the timeouts (but this should be fixed with moving away from timeouts in 0.9 in any case).

Awesome! I am looking forward to 0.9.
