Parallel make randomly hangs #1148

Closed · mike-lawrence opened this issue Jan 24, 2020 · 7 comments

mike-lawrence commented Jan 24, 2020

Prework

Description

Apologies for the absent reprex, but I'm having trouble creating one. I have a plan with lots of dynamic targets that runs fine in serial, but whenever I try to run it in parallel I encounter hangs at seemingly random stages of the make build. This occurs both in RStudio and when running R from the Unix command line, and occurs with only the clustermq parallel backend. When I watch my system resource usage during the build, memory usage jumps dramatically to 100% right before the hang. When I kill the process and run drake::make() again, the target that was computing at the moment of the hang restarts, and memory usually stays at regular levels (50%-ish) until I hit a hang at a random later target. If you have any suggestions on how I might build a reprex for this or further investigate what's going awry, I'm happy to provide more info.
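In case it helps, here is a rough sketch of the shape of my setup; the target names, the number of sub-targets, and some_expensive_function() are placeholders rather than my real code:

library(drake)
options(clustermq.scheduler = "multicore")

plan <- drake_plan(
  settings = seq_len(200),
  # dynamic branching: one sub-target per element of settings
  results = target(
    some_expensive_function(settings),
    dynamic = map(settings)
  )
)

make(plan, parallelism = "clustermq", jobs = 12, memory_strategy = "autoclean")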

Session info

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 19.10
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
#>  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
#>  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.4.0
#>  [5] yaml_2.2.0      Rcpp_1.0.2      stringi_1.4.3   rmarkdown_1.16 
#>  [9] highr_0.8       knitr_1.25      stringr_1.4.0   xfun_0.10      
#> [13] digest_0.6.23   rlang_0.4.2     evaluate_0.14

# Get the drake GitHub SHA1 recorded in the installed DESCRIPTION
drake_description <- readLines(system.file("DESCRIPTION", package = "drake"))
drake_description[substr(drake_description, 1, 10) == "GithubSHA1"]
#> [1] "GithubSHA1: 59e67a7a14bce66aa20afad726429280497325fd"

Created on 2020-01-24 by the reprex package (v0.3.0)

mike-lawrence (Author) commented:

I don't know if this helps narrow things down, but I just switched from memory_strategy = 'autoclean' to memory_strategy = 'speed', and instead of hanging, the process kills itself (running in the terminal).
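For reference, the comparison I ran was roughly the following (same placeholder plan as sketched above):

# hangs at a seemingly random target:
make(plan, parallelism = "clustermq", jobs = 12, memory_strategy = "autoclean")

# no hang, but the R process dies instead:
make(plan, parallelism = "clustermq", jobs = 12, memory_strategy = "speed")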

mike-lawrence (Author) commented:

Correction: these hangs/crashes don't occur with the future backend, just clustermq. I do note that the future backend seems far slower; in my system monitor it seems to have only 3-4 processes active at a time despite my requesting 12 jobs (on a 6-core hyperthreaded CPU), while with clustermq I get exactly the number of requested jobs active.
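Roughly how I invoked each backend (the future::plan() topology below is my best guess at a comparable setup, not necessarily exactly what I ran; plan as sketched above):

# clustermq backend: all 12 requested workers stay busy, but hangs/crashes occur
options(clustermq.scheduler = "multicore")
make(plan, parallelism = "clustermq", jobs = 12)

# future backend: no hangs, but only 3-4 workers appear active at a time
future::plan(future::multisession, workers = 12)
make(plan, parallelism = "future", jobs = 12)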

wlandau commented Jan 25, 2020

Would it be possible to post a reprex? Also, can you replicate the hanging using just clustermq and not drake? Either way, it would also help to see the worker logs. With options(clustermq.scheduler = "multicore"), you can set verbose = TRUE in clustermq::Q(). For other schedulers, you can set the log file in the template: include a log_file placeholder in the template file and call make() with something like template = list(log_file = "worker.log").
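Concretely, something like this; the toy function and log file name are just examples:

# Replicate with clustermq alone, no drake, and watch the worker output:
options(clustermq.scheduler = "multicore")
clustermq::Q(function(x) x^2, x = 1:12, n_jobs = 12, verbose = TRUE)

# With a scheduler template file, drake passes the template list through to
# clustermq, so a log_file placeholder in the .tmpl file gets filled in here:
make(plan, parallelism = "clustermq", jobs = 12,
  template = list(log_file = "worker.log")
)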

wlandau commented Jan 27, 2020

Looks like this is due to clustermq worker timeouts: #1146, #1150, and mschubert/clustermq#101. Unfortunately, there is nothing I can do in drake itself.

mike-lawrence (Author) commented:

Thanks for looking into this! Would you still like to see the worker logs?

wlandau commented Jan 29, 2020

I wouldn't work too hard on it. The logs might help later on, so if it won't take too much work on your end, then sure. Otherwise, I think we can wait for developments in clustermq.

mike-lawrence (Author) commented:

For posterity, this is now fixed by setting a high timeout via options(clustermq.worker.timeout = my_timeout_val).
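In code, something like the following; the specific value is whatever comfortably exceeds the slowest target (the timeout is in seconds, as far as I can tell):

options(clustermq.worker.timeout = 3600)  # e.g. allow each worker an hour
make(plan, parallelism = "clustermq", jobs = 12)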
