Debugging a future_lapply()-powered SLURM workflow #115
@kendonB thank you for the interest! Integration of the future package is in progress, and there is indeed existing functionality in drake that can help.

Idea 1: Makefile parallelism

The idea is to have multiple calls to make():

library(drake)
simulate <- function(n){
rnorm(n)
}
# workflow() is not yet defined in current CRAN release (4.3.0).
# It will replace drake::plan() due to a name conflict with future::plan().
# I will still keep drake::plan() exported and deprecated for a long time.
my_plan <- workflow(
primer = simulate(20),
data1 = primer + 1,
data2 = primer + 2,
result = mean(c(data1, data2))
)
my_plan

## target command
## 1 primer simulate(20)
## 2 data1 primer + 1
## 3 data2 primer + 2
## 4 result mean(c(data1, data2))

Suppose the datasets and the downstream result have different memory requirements. First, build the datasets with the low-memory recipe:

make(
plan = my_plan,
targets = c("data1", "data2"), # `primer` is built too
parallelism = "Makefile",
jobs = 2,
recipe_command = "echo 'low memory'; Rscript -e 'R_RECIPE'"
)

## check 1 item: rnorm
## import rnorm
## check 1 item: simulate
## import simulate
## echo 'low memory'; Rscript -e 'drake::mk(target = "primer", cache_path = "/home/wlandau/Desktop/.drake")'
## low memory
## target primer
## echo 'low memory'; Rscript -e 'drake::mk(target = "data1", cache_path = "/home/wlandau/Desktop/.drake")'
## echo 'low memory'; Rscript -e 'drake::mk(target = "data2", cache_path = "/home/wlandau/Desktop/.drake")'
## low memory
## low memory
## load 1 item: primer
## load 1 item: primer
## target data1
## target data2

Then, build the result with the high-memory recipe:

make(
plan = my_plan,
targets = "result",
parallelism = "Makefile",
recipe_command = "echo 'high memory'; Rscript -e 'R_RECIPE'"
)

## check 3 items: c, mean, rnorm
## import c
## import mean
## import rnorm
## check 1 item: simulate
## import simulate
## echo 'high memory'; Rscript -e 'drake::mk(target = "result", cache_path = "/home/wlandau/Desktop/.drake")'
## high memory
## load 2 items: data1, data2
## target result

Idea 2:
|
Also, unless you are using |
Another thing: what sort of native dependencies would you like to leverage in SLURM? The ways that
Does this meet your needs? |
Thanks for the detailed response. I think I should be able to figure this out now. I'm not sure how that part would be automated: drake would have to capture the jobids of the earlier jobs and plug them in. The advantage to using sbatch like this would be that the jobs only briefly rely on the host R process. All the jobs would get scheduled and live on SLURM right away. |
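For what it's worth, the jobid capture described above is plausibly just text processing on sbatch's output. A minimal sketch, assuming sbatch's documented "Submitted batch job <id>" submission message; the stage scripts are hypothetical placeholders, not files from this thread:

```shell
# Sketch of capturing a SLURM job id and chaining a dependent job.
# Assumes sbatch prints "Submitted batch job <id>" on submission.
parse_jobid() {
  # The numeric job id is the fourth field of the submission message.
  echo "$1" | awk '{print $4}'
}

# Real usage would look like:
#   jid=$(parse_jobid "$(sbatch stage1.sh)")
#   sbatch --dependency=afterok:"$jid" stage2.sh
# Demonstrate the id capture on the documented output format:
parse_jobid "Submitted batch job 65947498"
```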
Yeah, it does sound like that approach has its advantages. Please let me know how the rest of the setup goes. Since you said you should be able to figure it out now, I am closing this issue, but we can continue the thread if you like. |
@wlandau-lilly I'm trying to get this working now, and both "ideas" above fail for me. For the first one, I get the error shown below the code:

library(drake)
simulate <- function(n){
rnorm(n)
print("simulating 3")
Sys.sleep(20)
}
my_plan <- workflow(
primer1 = simulate(20),
primer2 = simulate(10),
data1 = primer1 + 1,
data2 = primer2 + 2,
result = mean(c(data1, data2))
)
make(
plan = my_plan,
targets = c("data1", "data2"), # `primer` is built too
parallelism = "Makefile",
jobs = 2,
prepend = c(
"#!/bin/bash",
"#SBATCH -J testing",
"#SBATCH -A landcare00063",
"#SBATCH --time=1:00:00",
"#SBATCH --cpus-per-task=1",
"#SBATCH --begin=now",
"#SBATCH --mem=1G",
"#SBATCH -C sb",
"module load R"
),
recipe_command = "srun Rscript -e 'R_RECIPE'"
)
Makefile:9: *** missing separator. Stop.

The second one runs great and seems to nicely create multiple jobs on SLURM. However, I can't seem to find where the log files end up, so it's hard to see what actually happened. Do you know? @wlandau-lilly, did you miss this one? |
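For reference, GNU Make's "missing separator" error means a line the Makefile parser cannot classify: recipe lines must begin with a literal tab, and a bare shell command like the `module load R` line from `prepend` above is not valid at the top level of a Makefile (the `#SBATCH` lines, by contrast, are harmless comments to make). A minimal reproduction, independent of drake; the file path is an arbitrary choice:

```shell
# Reproduce "missing separator": a bare shell command outside any rule
# is not valid Make syntax, so make stops at that line.
printf 'module load R\nall:\n\techo ok\n' > /tmp/demo_makefile
make -f /tmp/demo_makefile 2>&1 | head -n 1
```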
I've noticed a deal-breaking drawback with using this approach. Right now, I see:
which I presume is just a simple text processing task.
The scheduler isn't thrilled about allocating all those resources and thus the task takes far longer than it should. |
Yes, for |
By the way, it goes without saying that this is a super important thing for me to be aware of. Thank you for bringing it to my attention. |
By the way, if you have |
The fix seemed to work for the above problem. Great! Tried it again and got a pretty unhelpful error message. Does it make any sense to you?

Error: BatchtoolsExpiration: Future ('<none>') expired (registry path /gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983).

The last few lines of the logged output:
44: try(execJob(job))
45: doJobCollection.JobCollection(obj, output = output)
46: doJobCollection.character("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983/jobs/job2f26f50cddc38aa1f2fc2bef4606efe9.rds")
47: batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1469997983/jobs/job2f26f50cddc38aa1f2fc2bef4606efe9.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65947498/slurm_script: line 22: 25373 Illegal instruction (core dumped) Rscript -e 'batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171
In addition: Warning message:
In waitForJobs(ids = jobid, timeout = timeout, sleep = sleep_fcn, :
Some jobs disappeared from the system
Digging further, I found the associated log file:

### [bt 2017-10-28 18:04:13]: This is batchtools v0.9.6
### [bt 2017-10-28 18:04:13]: Starting calculation of 1 jobs
### [bt 2017-10-28 18:04:13]: Setting working directory to '/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate'
Loading required package: drake
Loading required package: methods
### [bt 2017-10-28 18:04:13]: Memory measurement disabled
### [bt 2017-10-28 18:04:16]: Starting job [batchtools job.id=1]
*** caught illegal operation ***
address 0x2ae5ae328a68, cause 'illegal operand'
Traceback:
1: dyn.load(file, DLLpath = DLLpath, ...)
2: library.dynam(lib, package, package.lib)
3: loadNamespace(name)
4: doTryCatch(return(expr), name, parentenv, handler)
5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
6: tryCatchList(expr, classes, parentenv, handlers)
7: tryCatch(loadNamespace(name), error = function(e) { warning(gettextf("namespace %s is not available and has been replaced\nby .GlobalEnv when processing object %s", sQuote(name)[1L], sQuote(where)), domain = NA, call. = FALSE, immediate. = TRUE) \
.GlobalEnv})
8: ..getNamespace(c("dplyr", "0.7.4"), "")
9: readRDS(self$name_hash(hash))
10: self$driver$get_object(hash)
11: self$get_value(self$get_hash(key, namespace), use_cache)
12: cache$get("config", namespace = "distributed")
13: ...future.FUN(...future.x_jj, ...)
14: FUN(X[[i]], ...)
15: lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...)})
16: (function (...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) })})(cache_path = "/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake")
17: do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) })}, args = future.call.arguments)
18: eval(quote({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments)}), new.env())
19: eval(quote({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments)}), new.env())
20: eval(expr, p)
21: eval(expr, p)
22: eval.parent(substitute(eval(quote(expr), envir)))
23: local({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments)})
24: tryCatchList(expr, classes, parentenv, handlers)
25: tryCatch({ local({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.\
arguments) })}, finally = { { { NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), \
workers = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = templat\
e, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) }})
26: eval(quote({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMissing = "error") { { NULL local({ for (pkg in "drake") { lo\
adNamespace(pkg) library(pkg, character.only = TRUE) } }) } future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { \
lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { \
{ { NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), \
workers = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = template\
, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) } })}), new.env())
27: eval(quote({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMissing = "error") { { NULL local({ for (pkg in "drake") { lo\
adNamespace(pkg) library(pkg, character.only = TRUE) } }) } future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { \
lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { \
{ { NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), \
workers = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = template\
, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) } })}), new.env())
28: eval(expr, p)
29: eval(expr, p)
30: eval.parent(substitute(eval(quote(expr), envir)))
31: local({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMissing = "error") { { NULL local({ for (pkg in "drake") { loadNam\
espace(pkg) library(pkg, character.only = TRUE) } }) } future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { \
lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] ...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { { \
{ NULL future::plan(list(function (expr, envir = parent.frame(), substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), worke\
rs = Inf, ...) { if (substitute) expr <- substitute(expr) batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, temp\
late = template, type = "slurm", resources = resources, workers = workers, ...) }), .cleanup = FALSE, .init = FALSE) } options(...future.oldOptions) } })})
32: eval(expr, envir = envir)
33: eval(expr, envir = envir)
34: (function (expr, substitute = FALSE, envir = .GlobalEnv, ...) { if (substitute) expr <- substitute(expr) eval(expr, envir = envir)})(local({ { ...future.oldOptions <- options(future.startup.loadScript = FALSE, future.globals.onMis\
sing = "error") { { NULL local({ for (pkg in "drake") { loadNamespace(pkg) library(pkg, character.only = TRUE) } }) } \
future::plan("default", .cleanup = FALSE, .init = FALSE) } } tryCatch({ local({ do.call(function(...) { lapply(seq_along(...future.x_ii), FUN = function(jj) { ...future.x_jj <- ...future.x_ii[[jj]] \
...future.FUN(...future.x_jj, ...) }) }, args = future.call.arguments) }) }, finally = { { { NULL future::plan(list(function (expr, envir = parent.frame(), \
substitute = TRUE, globals = TRUE, label = NULL, template = "batchtools_slurm.tmpl", resources = list(), workers = Inf, ...) { if (substitute) expr <- substitute(expr) \
batchtools_by_template(expr, envir = envir, substitute = FALSE, globals = globals, label = label, template = template, type = "slurm", resources = resources, workers = workers, ...) }), .cle\
anup = FALSE, .init = FALSE) } options(...future.oldOptions) } })}), substitute = TRUE)
35: do.call(job$fun, job$pars, envir = .GlobalEnv)
36: with_preserve_seed({ set.seed(seed) code})
37: with_seed(job$seed, do.call(job$fun, job$pars, envir = .GlobalEnv))
38: execJob.Job(job)
39: execJob(job)
40: doTryCatch(return(expr), name, parentenv, handler)
41: tryCatchOne(expr, names, parentenv, handlers[[1L]])
42: tryCatchList(expr, classes, parentenv, handlers)
43: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") \
LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") \
if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { \
cat(msg, file = outFile) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
44: try(execJob(job))
45: doJobCollection.JobCollection(obj, output = output)
46: doJobCollection.character("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf0c91b37dd1deae3f4b129cf8189c303.rds")
47: batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf0c91b37dd1deae3f4b129cf8189c303.rds")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job65947499/slurm_script: line 22: 25563 Illegal instruction (core dumped) Rscript -e 'batchtools::doJobCollection("/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.future/20171028_180231-5y4tkp/batchtools_1064049984/jobs/jobf\
0c91b37dd1deae3f4b129cf8189c303.rds")'
|
Hmm.... good to know, but over my head until I learn more about batchtools. |
Other than account name and wall time, I have the same config as in your |
@kendonB would you check |
Reopening this issue with a different title. Right now, it's really about debugging a SLURM workflow. |
sessionInfo()s for the calling session and drake below. They appear to be the same. FWIW, I'm certain that the R environments on the build and compute nodes are identical and have access to the same files/packages when they're loaded.

drake::session()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)
Matrix products: default
BLAS/LAPACK: /gpfs1m/apps/easybuild/RHEL6.3/sandybridge/software/imkl/2017.1.132-gimpi-2017a/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] prism_0.0.7 xtable_1.8-2 climateimpacts_0.1.0
[4] dtplyr_0.0.2 data.table_1.10.4-2 stringr_1.2.0
[7] plm_1.6-5 Formula_1.2-2 lfe_2.5-1998
[10] Matrix_1.2-10 feather_0.3.1 lubridate_1.6.0
[13] assertive_0.3-5 gistools_1.0 weatherdata_0.1.0
[16] raster_2.5-8 sp_1.2-5 bindrcpp_0.2
[19] future.batchtools_0.6.0 future_1.6.2 dplyr_0.7.4
[22] purrr_0.2.4 readr_1.1.1 tidyr_0.7.2
[25] tibble_1.3.4 ggplot2_2.2.1 tidyverse_1.1.1
[28] drake_4.3.1.9000
loaded via a namespace (and not attached):
[1] minqa_1.2.4 assertive.base_0.0-7
[3] colorspace_1.3-2 rprojroot_1.2
[5] listenv_0.6.0 MatrixModels_0.4-1
[7] assertive.sets_0.0-3 xml2_1.1.1
[9] splines_3.4.0 assertive.data.uk_0.0-1
[11] codetools_0.2-15 R.methodsS3_1.7.1
[13] mnormt_1.5-5 knitr_1.17
[15] jsonlite_1.5 nloptr_1.0.4
[17] assertive.data.us_0.0-1 pbkrtest_0.4-7
[19] broom_0.4.2 R.oo_1.21.0
[21] compiler_3.4.0 httr_1.3.1
[23] backports_1.1.0 assertthat_0.2.0
[25] lazyeval_0.2.0 quantreg_5.33
[27] visNetwork_2.0.1 htmltools_0.3.6
[29] prettyunits_1.0.2 tools_3.4.0
[31] igraph_1.1.2 gtable_0.2.0
[33] glue_1.1.1 reshape2_1.4.2
[35] batchtools_0.9.6 rappdirs_0.3.1
[37] Rcpp_0.12.13 cellranger_1.1.0
[39] nlme_3.1-131 assertive.files_0.0-2
[41] assertive.datetimes_0.0-2 assertive.models_0.0-1
[43] lmtest_0.9-35 psych_1.7.5
[45] globals_0.10.3 lme4_1.1-13
[47] testthat_1.0.2 rvest_0.3.2
[49] eply_0.1.0 MASS_7.3-47
[51] zoo_1.8-0 scales_0.4.1
[53] hms_0.3 sandwich_2.4-0
[55] SparseM_1.77 assertive.matrices_0.0-1
[57] assertive.strings_0.0-3 geosphere_1.5-5
[59] bdsmatrix_1.3-2 stringi_1.1.5
[61] checkmate_1.8.5 storr_1.1.2
[63] rlang_0.1.2 pkgconfig_2.0.1
[65] evaluate_0.10.1 lattice_0.20-35
[67] assertive.data_0.0-1 bindr_0.1
[69] htmlwidgets_0.8 assertive.properties_0.0-4
[71] assertive.code_0.0-1 plyr_1.8.4
[73] magrittr_1.5 R6_2.2.2
[75] base64url_1.2 DBI_0.6-1
[77] mgcv_1.8-17 haven_1.1.0
[79] foreign_0.8-68 withr_2.0.0
[81] assertive.numbers_0.0-2 nnet_7.3-12
[83] car_2.1-4 modelr_0.1.0
[85] crayon_1.3.4 assertive.types_0.0-3
[87] progress_1.1.2 grid_3.4.0
[89] readxl_1.0.0 forcats_0.2.0
[91] digest_0.6.12 brew_1.0-6
[93] R.utils_2.5.0 munsell_0.4.3
[95] assertive.reflection_0.0-4
sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)
Matrix products: default
BLAS/LAPACK: /gpfs1m/apps/easybuild/RHEL6.3/sandybridge/software/imkl/2017.1.132-gimpi-2017a/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] prism_0.0.7 xtable_1.8-2 climateimpacts_0.1.0
[4] dtplyr_0.0.2 data.table_1.10.4-2 stringr_1.2.0
[7] plm_1.6-5 Formula_1.2-2 lfe_2.5-1998
[10] Matrix_1.2-10 feather_0.3.1 lubridate_1.6.0
[13] assertive_0.3-5 gistools_1.0 weatherdata_0.1.0
[16] raster_2.5-8 sp_1.2-5 bindrcpp_0.2
[19] future.batchtools_0.6.0 future_1.6.2 dplyr_0.7.4
[22] purrr_0.2.4 readr_1.1.1 tidyr_0.7.2
[25] tibble_1.3.4 ggplot2_2.2.1 tidyverse_1.1.1
[28] drake_4.3.1.9000
loaded via a namespace (and not attached):
[1] minqa_1.2.4 assertive.base_0.0-7
[3] colorspace_1.3-2 rprojroot_1.2
[5] listenv_0.6.0 MatrixModels_0.4-1
[7] assertive.sets_0.0-3 xml2_1.1.1
[9] splines_3.4.0 assertive.data.uk_0.0-1
[11] codetools_0.2-15 R.methodsS3_1.7.1
[13] mnormt_1.5-5 knitr_1.17
[15] jsonlite_1.5 nloptr_1.0.4
[17] assertive.data.us_0.0-1 pbkrtest_0.4-7
[19] broom_0.4.2 R.oo_1.21.0
[21] compiler_3.4.0 httr_1.3.1
[23] backports_1.1.0 assertthat_0.2.0
[25] lazyeval_0.2.0 quantreg_5.33
[27] visNetwork_2.0.1 htmltools_0.3.6
[29] prettyunits_1.0.2 tools_3.4.0
[31] igraph_1.1.2 gtable_0.2.0
[33] glue_1.1.1 reshape2_1.4.2
[35] batchtools_0.9.6 rappdirs_0.3.1
[37] Rcpp_0.12.13 cellranger_1.1.0
[39] nlme_3.1-131 assertive.files_0.0-2
[41] assertive.datetimes_0.0-2 assertive.models_0.0-1
[43] lmtest_0.9-35 psych_1.7.5
[45] globals_0.10.3 lme4_1.1-13
[47] testthat_1.0.2 rvest_0.3.2
[49] eply_0.1.0 MASS_7.3-47
[51] zoo_1.8-0 scales_0.4.1
[53] hms_0.3 sandwich_2.4-0
[55] SparseM_1.77 assertive.matrices_0.0-1
[57] assertive.strings_0.0-3 geosphere_1.5-5
[59] bdsmatrix_1.3-2 stringi_1.1.5
[61] checkmate_1.8.5 storr_1.1.2
[63] rlang_0.1.2 pkgconfig_2.0.1
[65] evaluate_0.10.1 lattice_0.20-35
[67] assertive.data_0.0-1 bindr_0.1
[69] htmlwidgets_0.8 assertive.properties_0.0-4
[71] assertive.code_0.0-1 plyr_1.8.4
[73] magrittr_1.5 R6_2.2.2
[75] base64url_1.2 DBI_0.6-1
[77] mgcv_1.8-17 haven_1.1.0
[79] foreign_0.8-68 withr_2.0.0
[81] assertive.numbers_0.0-2 nnet_7.3-12
[83] car_2.1-4 modelr_0.1.0
[85] crayon_1.3.4 assertive.types_0.0-3
[87] progress_1.1.2 grid_3.4.0
[89] readxl_1.0.0 forcats_0.2.0
[91] digest_0.6.12 brew_1.0-6
[93] R.utils_2.5.0 munsell_0.4.3
[95] assertive.reflection_0.0-4
|
Yup, the log helps. It looks like the root problem is a failed attempt to load a package's compiled code (the dyn.load() call at the bottom of the traceback, apparently while loading dplyr). I can think of a couple things to try, but they probably won't be sufficient.
|
By the way, are you loading |
Fantastic that you got SLURM to work! To start, the above file path is just the location where I was running the drake example, so that's exactly where it should be looking for stuff. Unfortunately, I still see the same error. This was using your new
|
How well do you know your way around batchtools itself? I am a complete novice, so I will need to do more digging before I can offer further suggestions. That, and we should ask @mllg and @HenrikBengtsson for help. What are your versions of |
Wait... I see from your session info that your versions agree with mine. |
Batchtools novice here, unfortunately. At some point, I'll try some minimal examples from that package. |
Hmm... come to think of it, would you check whether the same error occurs with future.batchtools alone, without drake?

library(future.batchtools)
plan(batchtools_slurm(template = "batchtools.slurm.tmpl")) # future::plan(), not drake::plan()
future_lapply(1:2, cat)

Sorry about the awkward back-and-forth again. I guess the trick for me is getting a SLURM installation just buggy enough to fail at the right times. |
Yep same error - will report in future.batchtools |
Good, now we know it's not actually drake itself. |
As per futureverse/future.batchtools#11, I solved it with another configuration flag that was missing ( |
Best news I have heard all week! (Not saying much for a Monday, but you get the idea.) Totally worth the time. I will close this issue, but I have a couple more questions.
|
The solution

For completeness, from @kendonB via futureverse/future.batchtools#11:
|
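The quoted fix itself did not survive the copy above, so for reference, here is a rough sketch of what a working batchtools SLURM template tends to look like, modeled on the example templates shipped with batchtools. The exact #SBATCH flags, including whichever one was missing in this case, are cluster-specific assumptions, not the actual fix from the linked issue:

```
#!/bin/bash
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= resources$walltime %>
#SBATCH --mem=<%= resources$memory %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
module load R
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
```

The `<%= ... %>` chunks are brew expressions that batchtools fills in at submission time; `uri` points to the serialized job collection, much like the doJobCollection() calls visible in the traceback above.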
@kendonB in case it makes you feel any better, I was planning to go to the trouble to install SLURM anyway so I could test the minimal example and fix some of the issues my colleagues from grad school were having. Speaking of whom: @jarad, @emittman, and @nachalca, |
(@nachalca, have you had a chance to check out the hpc resources at your new job?) |
|
@wlandau-lilly, I find when running my project using future_lapply, after the SLURM jobs complete, the host R process's memory usage blows up (slowly) in htop. Even if this isn't real memory usage, it's still problematic, as I'm running the host process on the shared build node. Is this the behavior you expect? Is the host process bringing back all the data to the host before writing to disk? |
Hmm... I thought I had avoided that problem. I even prune the environment to make sure unnecessary targets are removed from memory at each parallelizable stage. Do you have the same memory issues if you call make() on an up-to-date project? |
If you think #117 might work for you, you might compare the host memory usage there. |
Just how slowly does host memory blow up? What is the progression? |
In short, |
If this is happening "slowly" and "after the slurm jobs complete", it may suggest that it occurs in the step where the values from all the jobs are gathered and brought back to the master R process by |
I think I know what the problem is: build_distributed() returns the whole configuration list. I was unwisely using this to keep track of which targets were attempted. I will fix this today. |
Re: "Do you have the same memory issues if you call make() on an up-to-date project?" I will try to remember to check this once the project is up to date.

Re: "Just how slowly does host memory blow up? What is the progression?" I first looked around the time the last job finished, which was around 20 minutes after the first job finished, and it was using 10GB in htop. Next, I watched for about another 20 minutes, and it grew to an ultimate ~20GB.

Re: "you've mentioned 'large number of jobs' elsewhere - what is 'large' in your examples?" In this particular example, I was building 200 targets which are each 163MB. Thankfully the implied total is certainly higher than 20GB.

Re: "1e5daed fixes the memory issues, pending confirmation from @kendonB." I will try this today. |
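As a quick sanity check on the numbers above (assuming decimal megabytes), the implied total if every target's value were held by the master at once:

```shell
# 200 targets at 163 MB each -> total gathered data, in GB (decimal).
awk 'BEGIN { printf "total: %.1f GB\n", 200 * 163 / 1000 }'
# -> total: 32.6 GB
```

That is indeed comfortably above the ~20 GB observed in htop, consistent with only part of the data being resident at any one time.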
I can see that you've been working on integrating the future package which is an exciting development.
I have a project with a stage that requires a lot of memory per CPU and another stage that requires a lot less. Ideally, I'd be able to get drake to schedule a bunch of SLURM jobs for the first stage with a lot of memory per CPU, have drake/future wait for them to finish, then schedule a bunch more SLURM jobs with less memory per CPU.
SLURM can also express job dependencies natively, which would be nice to have automated through drake.
Is this already possible?
I should also note that my HPC has a limit of 1000 array jobs, and I would expect other science organizations to have similar limits. Breaking up the call into a separate sbatch/srun call per target would work, I think.