Parallel make randomly hangs #1148

Closed · mike-lawrence opened this issue Jan 24, 2020 · 7 comments

mike-lawrence commented Jan 24, 2020

Prework

Description

Apologies for the absent reprex, but I'm having trouble creating one. I have a plan with lots of dynamic targets that runs fine in serial, but whenever I try to run it in parallel I encounter hangs at seemingly random stages of the make build. This occurs both in RStudio and when running R from the Unix command line, and occurs with only the clustermq parallel backend. When I watch my system resource usage during the build, memory usage jumps dramatically to 100% right before the hang. When I kill the process and run drake::make() again, the target that was computing at the moment of the hang restarts, and memory usually stays at regular levels (50%-ish) until I hit a hang at a random later target. If you have any suggestions on how I might build a reprex for this or further investigate what's going awry, I'm happy to provide more info.
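In case it helps, here is a rough sketch of the shape of my setup; the target names, the number of sub-targets, and some_expensive_function() are placeholders rather than my real code:

library(drake)
options(clustermq.scheduler = "multicore")

plan <- drake_plan(
  settings = seq_len(200),
  # dynamic branching: one sub-target per element of settings
  results = target(
    some_expensive_function(settings),
    dynamic = map(settings)
  )
)

make(plan, parallelism = "clustermq", jobs = 12, memory_strategy = "autoclean")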

Session info

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 19.10
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
#>  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
#>  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.4.0
#>  [5] yaml_2.2.0      Rcpp_1.0.2      stringi_1.4.3   rmarkdown_1.16 
#>  [9] highr_0.8       knitr_1.25      stringr_1.4.0   xfun_0.10      
#> [13] digest_0.6.23   rlang_0.4.2     evaluate_0.14

# Get the drake GitHub SHA1 recorded in the installed DESCRIPTION
drake_description <- readLines(system.file("DESCRIPTION", package = "drake"))
drake_description[substr(drake_description, 1, 10) == "GithubSHA1"]
#> [1] "GithubSHA1: 59e67a7a14bce66aa20afad726429280497325fd"

Created on 2020-01-24 by the reprex package (v0.3.0)

mike-lawrence (Author) commented:

I don't know if this helps narrow things down, but I just switched from memory_strategy = 'autoclean' to memory_strategy = 'speed', and instead of hanging, the process kills itself (running in the terminal).
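For reference, the comparison I ran was roughly the following (same placeholder plan as sketched above):

# hangs at a seemingly random target:
make(plan, parallelism = "clustermq", jobs = 12, memory_strategy = "autoclean")

# no hang, but the R process dies instead:
make(plan, parallelism = "clustermq", jobs = 12, memory_strategy = "speed")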

mike-lawrence (Author) commented:

Correction: these hangs/crashes don't occur with the future backend, just clustermq. I do note that the future backend seems far slower; in my system monitor it seems to have only 3-4 processes active at a time despite my requesting 12 jobs (on a 6-core hyperthreaded CPU), while with clustermq I get exactly the number of requested jobs active.
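Roughly how I invoked each backend (the future::plan() topology below is my best guess at a comparable setup, not necessarily exactly what I ran; plan as sketched above):

# clustermq backend: all 12 requested workers stay busy, but hangs/crashes occur
options(clustermq.scheduler = "multicore")
make(plan, parallelism = "clustermq", jobs = 12)

# future backend: no hangs, but only 3-4 workers appear active at a time
future::plan(future::multisession, workers = 12)
make(plan, parallelism = "future", jobs = 12)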

wlandau commented Jan 25, 2020

Would it be possible to post a reprex? Also, can you replicate the hanging using just clustermq and not drake? Either way, it would also help to see the worker logs. With options(clustermq.scheduler = "multicore"), you can set verbose = TRUE in clustermq::Q(). For other schedulers, you can set the log file in the template: include a log_file placeholder in the template file and call make() with something like template = list(log_file = "worker.log").
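Concretely, something like this; the toy function and log file name are just examples:

# Replicate with clustermq alone, no drake, and watch the worker output:
options(clustermq.scheduler = "multicore")
clustermq::Q(function(x) x^2, x = 1:12, n_jobs = 12, verbose = TRUE)

# With a scheduler template file, drake passes the template list through to
# clustermq, so a log_file placeholder in the .tmpl file gets filled in here:
make(plan, parallelism = "clustermq", jobs = 12,
  template = list(log_file = "worker.log")
)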

wlandau commented Jan 27, 2020

Looks like this is due to clustermq worker timeouts: #1146, #1150, and mschubert/clustermq#101. Unfortunately, there is nothing I can do in drake itself.

mike-lawrence (Author) commented:

Thanks for looking into this! Would you still like to see the worker logs?

wlandau commented Jan 29, 2020

I wouldn't work too hard on it. The logs might help later on, so if it won't take too much work on your end, then sure. Otherwise, I think we can wait for developments in clustermq.

mike-lawrence (Author) commented:

For posterity, this is now fixed by setting a high timeout via options(clustermq.worker.timeout = my_timeout_val).
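In code, something like the following; the specific value is whatever comfortably exceeds the slowest target (the timeout is in seconds, as far as I can tell):

options(clustermq.worker.timeout = 3600)  # e.g. allow each worker an hour
make(plan, parallelism = "clustermq", jobs = 12)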
