
Timeout clarification/issue #1146

Closed · 3 tasks done
jennysjaarda opened this issue Jan 23, 2020 · 13 comments

jennysjaarda commented Jan 23, 2020

Prework

  • Read and abide by drake's code of conduct.
  • Search for duplicates among the existing issues, both open and closed.
  • If you think your question has a quick and definite answer, consider posting to Stack Overflow under the drake-r-package tag. (If you anticipate extended follow-up and discussion, you are already in the right place!)

Question

Hi Will,

I'm running into some timeout failures. I see that the section on timeouts in the manual suggests playing with the cpu, elapsed, and retries arguments, but if I'm reading the defaults correctly, aren't they already set to Inf? This is my issue:
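
For reference, here is a minimal sketch (a placeholder call, not my real one) of where those drake-level knobs live and what I understand the defaults to be:

# drake's own per-target limits, set in make():
make(
  plan,
  elapsed = Inf, # wall-clock seconds allowed per target (default Inf)
  cpu = Inf,     # CPU seconds allowed per target (default Inf)
  retries = 0    # times to retry a failed target (default 0, if I read the docs right)
)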

I have a plan kind of like this:

plan <- drake_plan(
  my_map = tibble(i = 1:1000),
  run_something_in_parallel1 = target(
    long_func1(my_map$i),
    dynamic = map(my_map) # runs on the workers
  ),
  aggregate = target({
    run_something_in_parallel1 # these results are a dependency of `aggregate`
    write_results()
  }, hpc = FALSE), # does not run on the workers but still takes some time (~20 minutes)
  run_something_in_parallel2 = target(
    long_func2(my_map$i),
    dynamic = map(my_map) # runs on the workers
  )
)
make(
  plan,
  parallelism = "clustermq",
  console_log_file = "log.out",
  memory_strategy = "lookahead",
  garbage_collection = TRUE, ## try with preclean?
  jobs = 100,
  template = list(cpus = 1, partition = "sgg", log_file = "log_%a_clustermq.out")
)

run_something_in_parallel1 runs on the workers as expected and aggregate runs in my local session, but then I see the workers time out, and when it comes time to build run_something_in_parallel2, it is only built locally. Which parameter should I change to avoid this, given that they are already set to Inf? My SLURM output file looks like this:

> clustermq:::worker("tcp://node16:6345")
2020-01-23 08:53:58.924755 | Master: tcp://node16:6345
2020-01-23 08:53:58.939260 | WORKER_UP to: tcp://node16:6345
Error in clustermq:::worker("tcp://node16:6345") : 
  Timeout reached, terminating
Execution halted

Issue #170 also discusses this error, but I think it might be a different case.

jennysjaarda (Author)

If I kill the R job and restart it, run_something_in_parallel2 registers on the workers as expected, because make() recognizes that run_something_in_parallel1 and aggregate are already up to date.

jennysjaarda (Author)

Maybe I should add that if I let the plan run through to the end I get the following warning message:

Warning in config$workers$cleanup() :
  74/98 workers did not shut down properly
Master: [2113.3s 29.6% CPU]; Worker: [avg 0.0% CPU, max 248.1 Mb]

wlandau (Member) commented Jan 23, 2020

As you say, re #170, this is an issue with clustermq. Looks like a solution has been proposed: mschubert/clustermq#172. It is only a matter of time before the maintainer gets to it.

wlandau closed this as completed Jan 23, 2020
wlandau (Member) commented Jan 24, 2020

To be clear, the problems you face are related to clustermq/ZeroMQ timeouts, not drake's timeouts (e.g. the elapsed or cpu arguments of make()).

wlandau (Member) commented Jan 27, 2020

Hmm... from mschubert/clustermq#172 (comment), it looks like the PR will not be merged after all. I recommend following up with @mschubert to see how he prefers to resolve this.

mschubert commented Jan 27, 2020

This looks like a timeout and not an interface issue, so it is independent of clustermq#172. I have changed the title of this issue to be clearer about what the problem is.

What I think is happening is that drake sends a lot of data, and the workers only wait for up to 10 minutes to receive it.

Can you check whether the timeout branch fixes it? (This increases the worker timeout to 1 hour.)

devtools::install_github("mschubert/clustermq@timeout")

jennysjaarda (Author) commented Jan 27, 2020 via email

mschubert

That's fine. If you run

head(clustermq:::worker)

it should show timeout = 3600 instead of timeout = 600.
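
A quick supplementary check (not from the branch instructions, just base R) to confirm the GitHub build is the one actually loaded:

packageVersion("clustermq") # the timeout branch reports 0.8.8.1 (see the lockfile entry below)
head(clustermq:::worker)    # should now show timeout = 3600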

jennysjaarda (Author) commented Jan 27, 2020 via email

wlandau (Member) commented Jan 27, 2020

Successive make()s restart the workflow beginning at incomplete/outdated targets, which could help. I do not know of another way to restart workers. Related: #349, mschubert/clustermq#101.
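
A minimal sketch of that restart pattern (same arguments as the original call; outdated() is optional and only previews what would rebuild):

# After the workers time out, call make() again: up-to-date targets are skipped
# and only the incomplete/outdated ones (run_something_in_parallel2 here) are
# rebuilt, on freshly submitted workers.
outdated(drake_config(plan)) # optional preview
make(plan, parallelism = "clustermq", console_log_file = "log.out",
  memory_strategy = "lookahead", garbage_collection = TRUE,
  jobs = 100, template = list(cpus = 1, partition = "sgg",
  log_file = "log_%a_clustermq.out"))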

strazto (Contributor) commented Jan 30, 2020

I'm also getting this; my worker nodes are expiring fairly frequently with the following lockfile entry for clustermq:

"clustermq": {
      "Package": "clustermq",
      "Version": "0.8.8",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "644dc578f786be4e69f0281c1246e1e6"
}

I am updating to the following, as per @mschubert's suggestion:

"clustermq": {
      "Package": "clustermq",
      "Version": "0.8.8.1",
      "Source": "GitHub",
      "RemoteType": "github",
      "RemoteHost": "api.github.com",
      "RemoteRepo": "clustermq",
      "RemoteUsername": "mschubert",
      "RemoteRef": "timeout",
      "RemoteSha": "9ba37a3cd17d82e39fda8133a4e1cd36cc76b50d",
      "Hash": "c3593195b1eddd92c2db94ded132eb9f"
    },
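
Under renv, the update that produces an entry like that is roughly (a sketch, assuming renv manages this project's library):

renv::install("mschubert/clustermq@timeout") # install the branch from GitHub
renv::snapshot()                             # pin the RemoteRef/RemoteSha in the lockfile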

Below is a typical log file for a worker in a job array under the CRAN version:
example_worker_log.log

I'll rerun the workflow using the timeout branch and report back with the result.

mschubert
@wlandau, if you have a bottleneck (e.g. 10 workers but only one job that can be processed), do you send WORKER_WAIT via the API or do you not send any signal until this one job is done?

The latter would also explain the timeouts (but this should be fixed with moving away from timeouts in 0.9 in any case).

wlandau (Member) commented Jan 31, 2020

> @wlandau, if you have a bottleneck (e.g. 10 workers but only one job that can be processed), do you send WORKER_WAIT via the API or do you not send any signal until this one job is done?

If there are dependent jobs waiting for that one to finish, then yes I do. If all the jobs are running or done, those workers terminate.

} else if (!config$queue$empty()) {
  cmq_next_target(config)
} else {
  config$workers$send_shutdown_worker()
}

drake/R/backend_clustermq.R, lines 115 to 126 at 757497f:

target <- config$queue$pop0()
# Longer tests will catch this:
if (!length(target)) {
  config$workers$send_wait() # nocov
  return() # nocov
}
if (no_hpc(target, config)) {
  config$workers$send_wait()
  cmq_local_build(target, config)
} else {
  cmq_send_target(target, config)
}

> The latter would also explain the timeouts (but this should be fixed with moving away from timeouts in 0.9 in any case).

Awesome! I am looking forward to 0.9.
