
Hanging within dynamic targets #1150

Closed
3 tasks done
jennysjaarda opened this issue Jan 27, 2020 · 3 comments
@jennysjaarda

Prework

Description

I have a drake plan with dynamic targets:

my_plan <- drake_plan(
  models_to_run = {
    summ_stats_out
    define_models(!!traits)
  },
  gwas = target(
    {
      household_GWAS(
        models_to_run$i, models_to_run$trait_ID, models_to_run$exposure_sex,
        models_to_run$phenotype_file, models_to_run$phenotype_col,
        models_to_run$phenotype_description, models_to_run$IV_file,
        sample_file, pheno_cov = !!joint_model_adjustments,
        !!household_intervals, !!household_time_munge,
        models_to_run$gwas_outcome_file
      )
    },
    dynamic = map(models_to_run)
  )
)

make(
  my_plan,
  parallelism = "clustermq",
  console_log_file = "proxymr.log",
  cache_log_file = "cache_log.csv",
  memory_strategy = "lookahead",
  garbage_collection = TRUE,
  jobs = 100,
  template = list(
    cpus = 1,
    partition = "sgg",
    log_file = "/data/sgg2/jenny/projects/proxyMR/proxymr_%a_clustermq.out"
  )
)

household_GWAS builds data frames and returns them in a nested list:

household_GWAS <- function(...) {
  ...
  out <- list(list(outcome_gwas_out = outcome_gwas_out))
  return(out)
}

All 258 dynamic sub-targets get sent to workers, but only a portion of them are built. Counting events in the console log:

grep "2020-01-27" proxymr.log | grep "gwas_.* | time" | wc -l
# 186

grep "2020-01-27" proxymr.log | grep "gwas_.* | store" | wc -l
# 186

grep "2020-01-27" proxymr.log | grep "gwas_.* | build" | wc -l
# 185

grep "2020-01-27" proxymr.log | grep "gwas_.* | subtarget" | wc -l
# 258

When I look at each of the SLURM log files (log_file = "/data/sgg2/jenny/projects/proxyMR/proxymr_%a_clustermq.out"), the workers are all either waiting or killed because of a timeout:

2020-01-27 16:55:48.550356 | > WORKER_WAIT (0.000s wait)
2020-01-27 16:55:48.551064 | waiting 5.00s

OR

Error in clustermq:::worker("tcp://node04:7804") : 
  Timeout reached, terminating
Execution halted

Any idea why these targets wouldn't be building or stored?

@wlandau

wlandau commented Jan 27, 2020

As with #349, #1146, #1148, and mschubert/clustermq#101, this looks to be due to the clustermq worker timeouts we are familiar with. cc @mschubert.

Can you reproduce it with just clustermq without drake? Maybe something like clustermq::Q_rows(fun = household_GWAS, df = models_to_run, ...).
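A reproduction along those lines might look like the following sketch. It assumes models_to_run is a data frame with one row per model (column names taken from the plan above); everything not in a column, such as household_GWAS itself and the other inputs from the plan, has to be exported to the workers explicitly since drake is no longer handling that:

```r
library(clustermq)

# Register the same SLURM backend drake was using.
options(clustermq.scheduler = "slurm")

# One call of household_GWAS per row of models_to_run, on 100 workers,
# with drake removed from the equation to isolate the worker timeouts.
results <- Q_rows(
  df = models_to_run,
  fun = function(i, trait_ID, exposure_sex, phenotype_file, phenotype_col,
                 phenotype_description, IV_file, gwas_outcome_file, ...) {
    household_GWAS(i, trait_ID, exposure_sex, phenotype_file, phenotype_col,
                   phenotype_description, IV_file, sample_file,
                   pheno_cov = joint_model_adjustments,
                   household_intervals, household_time_munge,
                   gwas_outcome_file)
  },
  export = list(
    household_GWAS = household_GWAS,
    sample_file = sample_file,
    joint_model_adjustments = joint_model_adjustments,
    household_intervals = household_intervals,
    household_time_munge = household_time_munge
  ),
  n_jobs = 100,
  template = list(cpus = 1, partition = "sgg")
)
```

If the same workers stall or hit the timeout here, that would confirm the problem is on the clustermq side rather than in drake's dynamic branching.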

@wlandau

wlandau commented Jan 27, 2020

Closing because I strongly suspect the solution is going to need to come from clustermq. In fact, @mschubert already increased the default worker timeout in https://github.com/mschubert/clustermq/tree/timeout.

@mschubert

> When I look at each of the slurm log files, they are all either waiting or killed because of timeout

I don't see how this could happen. With 100 workers and 258 targets, no worker should be sent a wait signal.

Can you provide a minimal example that reproduces this behavior using clustermq without drake?
