
Hanging within dynamic targets #1150

Closed
3 tasks done
jennysjaarda opened this issue Jan 27, 2020 · 3 comments
@jennysjaarda

Prework

Description

I have a drake plan with dynamic targets:

my_plan <- drake_plan(
  models_to_run = {
    summ_stats_out
    define_models(!!traits)
  },
  gwas = target(
    {
      household_GWAS(
        models_to_run$i, models_to_run$trait_ID, models_to_run$exposure_sex,
        models_to_run$phenotype_file, models_to_run$phenotype_col,
        models_to_run$phenotype_description, models_to_run$IV_file,
        sample_file, pheno_cov = !!joint_model_adjustments,
        !!household_intervals, !!household_time_munge,
        models_to_run$gwas_outcome_file
      )
    },
    dynamic = map(models_to_run)
  )
)

make(
  my_plan,
  parallelism = "clustermq",
  console_log_file = "proxymr.log",
  cache_log_file = "cache_log.csv",
  memory_strategy = "lookahead",
  garbage_collection = TRUE,
  jobs = 100,
  template = list(
    cpus = 1,
    partition = "sgg",
    log_file = "/data/sgg2/jenny/projects/proxyMR/proxymr_%a_clustermq.out"
  )
)

household_GWAS builds data frames and returns them in a nested list:

household_GWAS <- function(...) {
  ...
  out <- list(list(outcome_gwas_out = outcome_gwas_out))
  return(out)
}

All 258 dynamic sub-targets get sent to workers, but only a portion of them are built. Counting events in the console log:

grep "2020-01-27" proxymr.log | grep "gwas_.* | time" | wc -l
# 186

grep "2020-01-27" proxymr.log | grep "gwas_.* | store" | wc -l
# 186

grep "2020-01-27" proxymr.log | grep "gwas_.* | build" | wc -l
# 185

grep "2020-01-27" proxymr.log | grep "gwas_.* | subtarget" | wc -l
# 258

When I look at each of the SLURM log files (log_file = "/data/sgg2/jenny/projects/proxyMR/proxymr_%a_clustermq.out"), the workers are all either waiting or killed because of a timeout:

2020-01-27 16:55:48.550356 | > WORKER_WAIT (0.000s wait)
2020-01-27 16:55:48.551064 | waiting 5.00s

OR

Error in clustermq:::worker("tcp://node04:7804") : 
  Timeout reached, terminating
Execution halted

Any idea why these targets wouldn't be building or stored?

@wlandau

wlandau commented Jan 27, 2020

As with #349, #1146, #1148, and mschubert/clustermq#101, this looks to be due to the clustermq worker timeouts we are familiar with. cc @mschubert.

Can you reproduce it with just clustermq without drake? Maybe something like clustermq::Q_rows(fun = household_GWAS, df = models_to_run, ...).
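A reproduction along those lines might look like the following sketch. It assumes models_to_run is a data frame with one row per model (column names taken from the plan above); everything not in a column, such as household_GWAS itself and the other inputs from the plan, has to be exported to the workers explicitly since drake is no longer handling that:

```r
library(clustermq)

# Register the same SLURM backend drake was using.
options(clustermq.scheduler = "slurm")

# One call of household_GWAS per row of models_to_run, on 100 workers,
# with drake removed from the equation to isolate the worker timeouts.
results <- Q_rows(
  df = models_to_run,
  fun = function(i, trait_ID, exposure_sex, phenotype_file, phenotype_col,
                 phenotype_description, IV_file, gwas_outcome_file, ...) {
    household_GWAS(i, trait_ID, exposure_sex, phenotype_file, phenotype_col,
                   phenotype_description, IV_file, sample_file,
                   pheno_cov = joint_model_adjustments,
                   household_intervals, household_time_munge,
                   gwas_outcome_file)
  },
  export = list(
    household_GWAS = household_GWAS,
    sample_file = sample_file,
    joint_model_adjustments = joint_model_adjustments,
    household_intervals = household_intervals,
    household_time_munge = household_time_munge
  ),
  n_jobs = 100,
  template = list(cpus = 1, partition = "sgg")
)
```

If the same workers stall or hit the timeout here, that would confirm the problem is on the clustermq side rather than in drake's dynamic branching.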

@wlandau

wlandau commented Jan 27, 2020

Closing because I strongly suspect the solution is going to need to come from clustermq. In fact, @mschubert already increased the default worker timeout in https://github.com/mschubert/clustermq/tree/timeout.

@mschubert

> When I look at each of the slurm log files, they are all either waiting or killed because of timeout

I don't see how this could happen. With 100 workers and 258 targets, no worker should be sent a wait signal.

Can you provide a minimal example that reproduces this behavior using clustermq without drake?
