future_lapply workers don't start working right away #449

Closed
kendonB opened this issue Jul 3, 2018 · 28 comments

kendonB commented Jul 3, 2018

I am running make() with future_lapply parallelism on a SLURM cluster via future.batchtools, with jobs = 100.

When I examine the batchtools logs after 10 minutes, I still don't see the target ... messages indicating that the workers have started working.

When I run the same make() with 2 targets and 2 jobs, I see the target ... messages a few seconds after the workers start.

Can you give me some ideas for how to troubleshoot this?
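
For reference, here is a minimal sketch of the kind of setup I mean (the template file name and the plan are placeholders):

library(drake)
library(future.batchtools)

# Placeholder SLURM template; the real one carries cluster-specific resources.
future::plan(future.batchtools::batchtools_slurm, template = "batchtools.slurm.tmpl")

make(my_plan, parallelism = "future_lapply", jobs = 100)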

kendonB commented Jul 3, 2018

I just tried limiting the number of possible targets to 100, and the build started going after about 3 minutes. I'm guessing it's the loop in #435, which I believe scales with the number of targets being built?

wlandau commented Jul 3, 2018

It could be #435, but I think it is more likely related to futureverse/future#232. It is difficult to capture stdout and stderr over a network, especially for persistent future-based workers. I have thought about locally logging target ... messages from the master process instead, but the timing with actual builds would be even worse. Most of the relevant code is in mclapply.R, mc_utils.R, and future_lapply.R. I will reopen this issue if you have specific solutions to try.

wlandau closed this as completed Jul 3, 2018

kendonB commented Jul 3, 2018

My understanding is that the stdout messages are being captured by batchtools and written to log files. In December, when I was running another big project, I saw the "target" messages in the batchtools logs very quickly after the workers appeared, so I doubt this has anything to do with capturing stdout.

kendonB commented Jul 3, 2018

This run didn't seem to get going during the full 5 hours it ran overnight. Only 105/9851 targets got built, and each should take about 4 minutes of CPU time for the building stage. That's just 420 out of roughly 30,000 available CPU minutes spent building targets. :/

kendonB commented Jul 3, 2018

When I run fl_worker() while debugging in the master session, I see that it gets stuck in a while loop here:

while (nrow(msg <- ready_queue$list(1)) < 1){

As in, I see:

Browse[5]> n
debug: Sys.sleep(mc_wait)
Browse[5]> n
debug: (while) nrow(msg <- ready_queue$list(1)) < 1
Browse[5]> n
debug: gc()
Browse[5]> n
debug: Sys.sleep(mc_wait)
Browse[5]> n
debug: (while) nrow(msg <- ready_queue$list(1)) < 1
Browse[5]> n
debug: gc()
Browse[5]> n
debug: Sys.sleep(mc_wait)
Browse[5]> n
debug: (while) nrow(msg <- ready_queue$list(1)) < 1

@wlandau, what behaviour do you expect here and what exactly is it waiting for?

wlandau commented Jul 3, 2018

My understanding is that the stdout messages are being captured by batchtools and written to log files.

Back in December, future_lapply parallelism used transient staged workers (one worker per target). Now, future_lapply workers stay running. I guess it depends on how often batchtools writes the log files. If batchtools writes the logs only intermittently before the end of the worker, then this behavior is expected. If the logs are written as frequently as possible over the lifetime of the worker, my explanation may not be accurate.

When I run fl_worker while debugging in the master session, I see that it gets stuck in a while loop

mc_worker() (and by extension fl_worker()) waits to receive targets from the master. Communication is managed with the txtq package. While txtq is safe for network file systems, it does use the file system for interprocess communication, so I do expect minor delays. It should not wait in the loop indefinitely. When you were debugging interactively, did you submit an asynchronous master process?

drake/R/future_lapply.R

Lines 11 to 15 in 9abd398

tmp <- system2(
  rscript,
  shQuote(c("-e", paste0("drake::fl_master('", path, "')"))),
  wait = FALSE
)
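
For illustration, here is a rough, self-contained sketch of the txtq handshake pattern (the queue path and message contents are made up for this example, not drake's actual internals):

library(txtq)

q <- txtq(tempfile())  # drake keeps its queues inside the cache directory

# Master side: enqueue a target for a worker to build.
q$push(title = "target", message = "some_target")

# Worker side: poll until a message shows up, like the loop in the debug transcript above.
while (nrow(msg <- q$list(1)) < 1) {
  Sys.sleep(0.1)  # analogous to mc_wait
}
q$pop(1)  # consume the message once the target is claimed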

kendonB commented Jul 3, 2018

I did submit the asynchronous master process, and I can see it there in htop. It is consuming CPU resources. I will try debugging that one in a second interactive R session.

kendonB commented Jul 3, 2018

It may be related to the fl_master zombie processes floating around on the master node...:

...64/R/bin/exec/R --slave --no-restore -e drake::fl_master('/gpfs1m/projects/landcare00063/projects_ac/propertyrights/.drake')
...64/R/bin/exec/R --slave --no-restore -e drake::fl_master('/gpfs1m/projects/landcare00063/projects_ac/propertyrights/.drake')

kendonB commented Jul 3, 2018

@wlandau From what I can tell, this code waits for all the workers to say "I'm ready!" by checking that their ready-file paths exist? Does this mean that the master process will wait in that while loop until all the workers are active? If something goes wrong with one worker, then none of them start going?

mc_ensure_workers <- function(config){
  paths <- vapply(
    X = config$cache$list(namespace = "mc_ready_db"),
    FUN = function(worker){
      config$cache$get(key = worker, namespace = "mc_ready_db")
    },
    FUN.VALUE = character(1)
  )
  while (!all(file.exists(paths))){
    Sys.sleep(mc_wait) # nocov
  }
}
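
For illustration, a hypothetical variant (not drake's actual code) that fails fast instead of waiting forever when one worker never reports ready:

mc_ensure_workers_with_timeout <- function(config, timeout = 600, mc_wait = 0.1) {
  paths <- vapply(
    X = config$cache$list(namespace = "mc_ready_db"),
    FUN = function(worker) {
      config$cache$get(key = worker, namespace = "mc_ready_db")
    },
    FUN.VALUE = character(1)
  )
  start <- Sys.time()
  while (!all(file.exists(paths))) {
    if (difftime(Sys.time(), start, units = "secs") > timeout) {
      stop(
        "workers still not ready after ", timeout, " seconds: ",
        paste(basename(paths[!file.exists(paths)]), collapse = ", ")
      )
    }
    Sys.sleep(mc_wait)
  }
}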

kendonB commented Jul 3, 2018

Also, if a worker is in the SLURM queue it wouldn't show as "ready"?

kendonB commented Jul 3, 2018

Since all 10000 targets in the first stage have a common dependency, it's also possible that this is because all 100 workers are calling drake_meta at the same time and trying to hash the common dependency. However, I'm still seeing this behavior when using skip_imports = TRUE and trigger = "always".
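
For reference, the call I'm testing looks roughly like this (the plan name is a placeholder):

make(
  my_plan,
  parallelism = "future_lapply",
  jobs = 100,
  skip_imports = TRUE,  # don't process imports first
  trigger = "always"    # force every target to (re)build
)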

wlandau commented Jul 4, 2018

@wlandau From what I can tell, this code waits for all the workers to say "I'm ready!" by checking that their ready-file paths exist? Does this mean that the master process will wait in that while loop until all the workers are active? If something goes wrong with one worker, then none of them start going?

Yes, I thought this policy would maximize safety and ensure decent load balancing (see also #453 (comment)).

Also, if a worker is in the SLURM queue it wouldn't show as "ready"?

Correct. The worker needs to actually start and send a "ready" message to the master via txtq.

Since all 10000 targets in the first stage have a common dependency, it's also possible that this is because all 100 workers are calling drake_meta at the same time and trying to hash the common dependency.

Could be, but I think it is unlikely with persistent future_lapply workers.

For what it's worth, now that #452 is merged, make(parallelism = "clustermq_staged") can spin up workers a whole lot faster. Caching happens on the master process, but you can speed it up by adding more import-level jobs. I will have documentation in the manual soon, but here is a taste for now.

library(drake)
drake_hpc_template_file("slurm_clustermq.tmpl") # modify manually
# See https://github.com/mschubert/clustermq/issues/88
options(clustermq.scheduler = "slurm", clustermq.template = "slurm_clustermq.tmpl")
load_mtcars_example()
make(my_plan, parallelism = "clustermq_staged", jobs = 4, verbose = 4)

kendonB commented Jul 4, 2018

@wlandau OK, with verbose = 4 it's easier to see what's going on. Lots of skip messages are getting printed to the logs, but the targets are being processed really slowly.

The workers are spending ages getting through the targets that are already up to date. By comparing log lengths, it looks like a worker takes about 2-3 seconds to process an up-to-date target using future_lapply with SLURM. That's about 30 times what it seems to take when using outdated().
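
For comparison, checking things locally with outdated() looks roughly like this (sketch; my_plan is a placeholder):

config <- drake_config(my_plan)
outdated(config)  # lists only the targets that still need building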

Does the clustermq_staged method move that target-level overhead up the chain?

kendonB commented Jul 4, 2018

I'm going to try investigating further by running the last worker in an interactive session.

kendonB commented Jul 4, 2018

@wlandau Does the clustermq_staged parallelism send data back to the master for caching continuously? Or all at the end?

wlandau commented Jul 4, 2018

At every stage.

kendonB commented Jul 4, 2018

Some feedback: I tested making 600 ~5-minute targets using 200 jobs, and they were built (including caching) in about 20 minutes. Pretty good!

However, I also tried making 1000 ~5-minute targets using 500 jobs, and it failed to complete within an hour. I can't tell whether the first 500 finished, because the run failed before getting a chance to send content back for caching.

I then tried 3000 targets using 200 jobs and left it for about 20 minutes. I saw the Running 3,000 calculations message, but there were no jobs in the SLURM queue.

Finally, I'm trying the 3000 again and I still see the SLURM jobs 10 minutes in.

Doesn't seem like clustermq is quite robust enough to handle large HPC jobs yet. Maybe allowing for caching by the workers could improve things.

kendonB commented Jul 4, 2018

Edit: the jobs disappeared between 10 and 20 minutes when trying to build 3000 targets using 200 jobs.

wlandau commented Jul 4, 2018

Did at least some of the targets complete? You probably did not hit your wall time limit, but I have to ask anyway. Although the targets are processed in stages, they all share the same common pool of persistent workers (ref: mschubert/clustermq#86 (comment)).

kendonB commented Jul 4, 2018

None of the targets got cached, but I couldn't tell if any targets were completed before they were sent back to the master for caching. And correct, I wasn't near my wall time limit.

wlandau commented Jul 4, 2018

Would you be willing to try those same experiments on clustermq by itself? It's super user friendly.

mschubert commented

@kendonB Could it be that you ran out of memory? Please try clustermq with log_worker = TRUE to see the cause of the disappearing jobs.

That should not happen, but I found that the guard using ulimit is not reliable on all systems. If memory is the cause of jobs crashing, please file a bug.
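
A bare-bones clustermq test along those lines might look like this (sketch; the sleep and numbers stand in for real targets):

library(clustermq)
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "slurm_clustermq.tmpl"  # same kind of template as above
)

fx <- function(x) {
  Sys.sleep(60)  # stand-in for a roughly one-minute target
  x^2
}

# log_worker = TRUE writes a log file per worker, so crashes leave a trace.
res <- Q(fx, x = 1:3000, n_jobs = 200, log_worker = TRUE)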

kendonB commented Jul 6, 2018

Doesn't look like memory. Looks like the system administrators aren't into it:

[349] "slurmstepd: error: *** JOB 67654100 ON compute-d1-068 CANCELLED AT 2018-07-06T19:42:20 ***"
[350] "slurmstepd: error: *** JOB 67654030 ON compute-d1-068 CANCELLED AT 2018-07-06T19:42:19 ***"
[351] "slurmstepd: error: *** JOB 67654043 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[352] "slurmstepd: error: *** JOB 67654052 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[353] "slurmstepd: error: *** JOB 67654045 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[354] "slurmstepd: error: *** JOB 67654051 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[355] "slurmstepd: error: *** JOB 67654054 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[356] "slurmstepd: error: *** JOB 67654040 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[357] "slurmstepd: error: *** JOB 67654042 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"

wlandau commented Jul 6, 2018

I wonder what the objection could be. Please let me know if I should modify drake's example SLURM/clustermq template file.

wlandau commented Oct 10, 2018

@kendonB have you been able to find out why these clustermq jobs were cancelled? I still find it strange that this happened. clustermq uses job arrays, which I thought sys admins tended to prefer.

kendonB commented Oct 10, 2018

I never got to the bottom of it. The HPC has since undergone a revamp, so it might work better now. The admins did say that cluster jobs shouldn't usually use the network, so it might be network constraints. Of course, using the network saves on disk usage, which seems to bite far more often than network usage does. I'll have to go through the hassle of installing ZeroMQ again, so I'll try this out again some time down the line.

wlandau commented Oct 12, 2018

Yeah, clustermq/ZeroMQ do rely on the network pretty heavily. But for your workflows, you would want to use make(caching = "worker") anyway, so the actual targets will not travel over ZeroMQ sockets in your case. Anyway, please keep me posted.
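
Roughly something like this (the plan name and jobs count are placeholders):

options(clustermq.scheduler = "slurm", clustermq.template = "slurm_clustermq.tmpl")
make(
  my_plan,
  parallelism = "clustermq",  # persistent clustermq workers
  jobs = 200,
  caching = "worker"  # workers write to the cache themselves instead of shipping target data back over ZeroMQ
)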

wlandau commented Nov 3, 2019

Dynamic branching reduces the number of static targets, which helps everything get started faster.
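
For example, a dynamic branching sketch (run_model() and the plan below are illustrative only):

library(drake)
plan <- drake_plan(
  index = seq_len(10000),
  result = target(
    run_model(index),      # hypothetical user function
    dynamic = map(index)   # sub-targets are declared at runtime, one per element
  )
)
make(plan, parallelism = "clustermq", jobs = 100)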
