future_lapply workers don't start working right away #449

Closed
kendonB opened this issue Jul 3, 2018 · 28 comments

kendonB commented Jul 3, 2018

I am running make() with future_lapply parallelism on a SLURM cluster via future.batchtools, with jobs = 100.

When I examine the batchtools logs after 10 minutes, I still don't see the target ... messages indicating that the workers have started working.

When I run the same make() with 2 targets and 2 jobs, I see the target ... messages a few seconds after the workers start.

Can you give me some ideas for how to troubleshoot this?
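
For reference, here is a minimal sketch of the kind of setup I mean (the template file name and the plan are placeholders):

library(drake)
library(future.batchtools)

# Placeholder SLURM template; the real one carries cluster-specific resources.
future::plan(future.batchtools::batchtools_slurm, template = "batchtools.slurm.tmpl")

make(my_plan, parallelism = "future_lapply", jobs = 100)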

kendonB commented Jul 3, 2018

I just tried limiting the number of possible targets to 100, and the build started going after about 3 minutes. I'm guessing it's the loop in #435, which I believe scales with the number of targets being built?

wlandau commented Jul 3, 2018

It could be #435, but I think it is more likely related to futureverse/future#232. It is difficult to capture stdout and stderr over a network, especially for persistent future-based workers. I have thought about locally logging target ... messages from the master process instead, but the timing with actual builds would be even worse. Most of the relevant code is in mclapply.R, mc_utils.R, and future_lapply.R. I will reopen this issue if you have specific solutions to try.

wlandau closed this as completed Jul 3, 2018

kendonB commented Jul 3, 2018

My understanding is that the stdout messages are being captured by batchtools and written to log files. In December, when I was running another big project, I saw the "target" messages in the batchtools logs very quickly after the workers appeared, so I doubt this has anything to do with capturing stdout.

kendonB commented Jul 3, 2018

This run didn't seem to get going during the full 5 hours it ran overnight. Only 105/9851 targets got built, and each should take about 4 minutes of CPU time for the building stage. That's just 420 out of roughly 30,000 available CPU minutes spent building targets. :/

kendonB commented Jul 3, 2018

When I run fl_worker() while debugging in the master session, I see that it gets stuck in a while loop here:

while (nrow(msg <- ready_queue$list(1)) < 1){

As in, I see:

Browse[5]> n
debug: Sys.sleep(mc_wait)
Browse[5]> n
debug: (while) nrow(msg <- ready_queue$list(1)) < 1
Browse[5]> n
debug: gc()
Browse[5]> n
debug: Sys.sleep(mc_wait)
Browse[5]> n
debug: (while) nrow(msg <- ready_queue$list(1)) < 1
Browse[5]> n
debug: gc()
Browse[5]> n
debug: Sys.sleep(mc_wait)
Browse[5]> n
debug: (while) nrow(msg <- ready_queue$list(1)) < 1

@wlandau, what behaviour do you expect here and what exactly is it waiting for?

wlandau commented Jul 3, 2018

My understanding is that the stdout messages are being captured by batchtools and written to log files.

Back in December, future_lapply parallelism used transient staged workers (one worker per target). Now, future_lapply workers stay running. I guess it depends on how often batchtools writes the log files. If batchtools writes the logs only intermittently before the end of the worker, then this behavior is expected. If the logs are written as frequently as possible over the lifetime of the worker, my explanation may not be accurate.

When I run fl_worker while debugging in the master session, I see that it gets stuck in a while loop

mc_worker() (and by extension fl_worker()) waits to receive targets from the master. Communication is managed with the txtq package. While txtq is safe for network file systems, it does use the file system for interprocess communication, so I do expect minor delays. It should not wait in the loop indefinitely. When you were debugging interactively, did you submit an asynchronous master process?

drake/R/future_lapply.R

Lines 11 to 15 in 9abd398

tmp <- system2(
  rscript,
  shQuote(c("-e", paste0("drake::fl_master('", path, "')"))),
  wait = FALSE
)
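
For illustration, here is a rough, self-contained sketch of the txtq handshake pattern (the queue path and message contents are made up for this example, not drake's actual internals):

library(txtq)

q <- txtq(tempfile())  # drake keeps its queues inside the cache directory

# Master side: enqueue a target for a worker to build.
q$push(title = "target", message = "some_target")

# Worker side: poll until a message shows up, like the loop in the debug transcript above.
while (nrow(msg <- q$list(1)) < 1) {
  Sys.sleep(0.1)  # analogous to mc_wait
}
q$pop(1)  # consume the message once the target is claimed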

kendonB commented Jul 3, 2018

I did submit the asynchronous master process, and I can see it there in htop. It is consuming CPU resources. I will try debugging that one in a second interactive R session.

kendonB commented Jul 3, 2018

It may be related to the fl_master zombie processes floating around on the master node...:

...64/R/bin/exec/R --slave --no-restore -e drake::fl_master('/gpfs1m/projects/landcare00063/projects_ac/propertyrights/.drake')
...64/R/bin/exec/R --slave --no-restore -e drake::fl_master('/gpfs1m/projects/landcare00063/projects_ac/propertyrights/.drake')

kendonB commented Jul 3, 2018

@wlandau From what I can tell, this code waits for all the workers to say "I'm ready!" by checking that their ready-file paths exist? Does this mean that the master process will wait in that while loop until all the workers are active? If something goes wrong with one worker, then none of them start going?

mc_ensure_workers <- function(config){
  paths <- vapply(
    X = config$cache$list(namespace = "mc_ready_db"),
    FUN = function(worker){
      config$cache$get(key = worker, namespace = "mc_ready_db")
    },
    FUN.VALUE = character(1)
  )
  while (!all(file.exists(paths))){
    Sys.sleep(mc_wait) # nocov
  }
}
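
For illustration, a hypothetical variant (not drake's actual code) that fails fast instead of waiting forever when one worker never reports ready:

mc_ensure_workers_with_timeout <- function(config, timeout = 600, mc_wait = 0.1) {
  paths <- vapply(
    X = config$cache$list(namespace = "mc_ready_db"),
    FUN = function(worker) {
      config$cache$get(key = worker, namespace = "mc_ready_db")
    },
    FUN.VALUE = character(1)
  )
  start <- Sys.time()
  while (!all(file.exists(paths))) {
    if (difftime(Sys.time(), start, units = "secs") > timeout) {
      stop(
        "workers still not ready after ", timeout, " seconds: ",
        paste(basename(paths[!file.exists(paths)]), collapse = ", ")
      )
    }
    Sys.sleep(mc_wait)
  }
}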

kendonB commented Jul 3, 2018

Also, if a worker is in the SLURM queue it wouldn't show as "ready"?

kendonB commented Jul 3, 2018

Since all 10000 targets in the first stage have a common dependency, it's also possible that this is because all 100 workers are calling drake_meta at the same time and trying to hash the common dependency. However, I'm still seeing this behavior when using skip_imports = TRUE and trigger = "always".
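
For reference, the call I'm testing looks roughly like this (the plan name is a placeholder):

make(
  my_plan,
  parallelism = "future_lapply",
  jobs = 100,
  skip_imports = TRUE,  # don't process imports first
  trigger = "always"    # force every target to (re)build
)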

wlandau commented Jul 4, 2018

@wlandau From what I can tell, this code waits for all the workers to say "I'm ready!" by checking that their ready-file paths exist? Does this mean that the master process will wait in that while loop until all the workers are active? If something goes wrong with one worker, then none of them start going?

Yes, I thought this policy would maximize safety and ensure decent load balancing (see also #453 (comment)).

Also, if a worker is in the SLURM queue it wouldn't show as "ready"?

Correct. The worker needs to actually start and send a "ready" message to the master via txtq.

Since all 10000 targets in the first stage have a common dependency, it's also possible that this is because all 100 workers are calling drake_meta at the same time and trying to hash the common dependency.

Could be, but I think it is unlikely with persistent future_lapply workers.

For what it's worth, now that #452 is merged, make(parallelism = "clustermq_staged") can spin up workers a whole lot faster. Caching happens on the master process, but you can speed it up by adding more import-level jobs. I will have documentation in the manual soon, but here is a taste for now.

library(drake)
drake_hpc_template_file("slurm_clustermq.tmpl") # modify manually
# See https://github.com/mschubert/clustermq/issues/88
options(clustermq.scheduler = "slurm", clustermq.template = "slurm_clustermq.tmpl")
load_mtcars_example()
make(my_plan, parallelism = "clustermq_staged", jobs = 4, verbose = 4)

kendonB commented Jul 4, 2018

@wlandau OK, with verbose = 4 it's easier to see what's going on. Lots of skip messages are getting printed to the logs, but the targets are being processed really slowly.

The workers are spending ages getting through the targets that are already up to date. By comparing log lengths, it looks like a worker takes about 2-3 seconds to process an up-to-date target using future_lapply with SLURM. That's about 30 times what it seems to take when using outdated().
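
For comparison, checking things locally with outdated() looks roughly like this (sketch; my_plan is a placeholder):

config <- drake_config(my_plan)
outdated(config)  # lists only the targets that still need building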

Does the clustermq_staged method move that target-level overhead up the chain?

kendonB commented Jul 4, 2018

I'm going to try investigating further by running the last worker in an interactive session.

kendonB commented Jul 4, 2018

@wlandau Does the clustermq_staged parallelism send data back to the master for caching continuously? Or all at the end?

wlandau commented Jul 4, 2018

At every stage.

kendonB commented Jul 4, 2018

Some feedback: I tested making 600 ~5-minute targets using 200 jobs, and they were built (including caching) in about 20 minutes. Pretty good!

However, I also tried making 1000 ~5-minute targets using 500 jobs, and it failed to complete within an hour. I can't tell whether the first 500 finished, because the run failed before getting a chance to send content back for caching.

I then tried 3000 targets using 200 jobs and left it for about 20 minutes. I saw the Running 3,000 calculations message, but there were no jobs in the SLURM queue.

Finally, I'm trying the 3000 again and I still see the SLURM jobs 10 minutes in.

Doesn't seem like clustermq is quite robust enough to handle large HPC jobs yet. Maybe allowing for caching by the workers could improve things.

kendonB commented Jul 4, 2018

Edit: the jobs disappeared between 10 and 20 minutes when trying to build 3000 targets using 200 jobs.

wlandau commented Jul 4, 2018

Did at least some of the targets complete? You probably did not hit your wall time limit, but I have to ask anyway. Although the targets are processed in stages, they all share the same common pool of persistent workers (ref: mschubert/clustermq#86 (comment)).

kendonB commented Jul 4, 2018

None of the targets got cached, but I couldn't tell if any targets were completed before they were sent back to the master for caching. And correct, I wasn't near my wall time limit.

wlandau commented Jul 4, 2018

Would you be willing to try those same experiments on clustermq by itself? It's super user friendly.

mschubert commented

@kendonB Could it be that you ran out of memory? Please try clustermq with log_worker = TRUE to see the cause of the disappearing jobs.

That should not happen, but I found that the guard using ulimit is not reliable on all systems. If memory is the cause of jobs crashing, please file a bug.
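
A bare-bones clustermq test along those lines might look like this (sketch; the sleep and numbers stand in for real targets):

library(clustermq)
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "slurm_clustermq.tmpl"  # same kind of template as above
)

fx <- function(x) {
  Sys.sleep(60)  # stand-in for a roughly one-minute target
  x^2
}

# log_worker = TRUE writes a log file per worker, so crashes leave a trace.
res <- Q(fx, x = 1:3000, n_jobs = 200, log_worker = TRUE)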

kendonB commented Jul 6, 2018

Doesn't look like memory. Looks like the system administrators aren't into it:

[349] "slurmstepd: error: *** JOB 67654100 ON compute-d1-068 CANCELLED AT 2018-07-06T19:42:20 ***"
[350] "slurmstepd: error: *** JOB 67654030 ON compute-d1-068 CANCELLED AT 2018-07-06T19:42:19 ***"
[351] "slurmstepd: error: *** JOB 67654043 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[352] "slurmstepd: error: *** JOB 67654052 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[353] "slurmstepd: error: *** JOB 67654045 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[354] "slurmstepd: error: *** JOB 67654051 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[355] "slurmstepd: error: *** JOB 67654054 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[356] "slurmstepd: error: *** JOB 67654040 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"
[357] "slurmstepd: error: *** JOB 67654042 ON compute-d1-005 CANCELLED AT 2018-07-06T19:42:19 ***"

wlandau commented Jul 6, 2018

I wonder what the objection could be. Please let me know if I should modify drake's example SLURM/clustermq template file.

wlandau commented Oct 10, 2018

@kendonB have you been able to find out why these clustermq jobs were cancelled? I still find it strange that this happened. clustermq uses job arrays, which I thought sys admins tended to prefer.

kendonB commented Oct 10, 2018

I never got to the bottom of it. The HPC has since undergone a revamp, so it might work better now. The admins did say that cluster jobs shouldn't usually use the network, so it might be network constraints. Of course, using the network saves on disk usage, which seems to bite far more often than network usage does. I'll have to go through the hassle of installing ZeroMQ again, so I'll try this out again some time down the line.

wlandau commented Oct 12, 2018

Yeah, clustermq/ZeroMQ do rely on the network pretty heavily. But for your workflows, you would want to use make(caching = "worker") anyway, so the actual targets will not travel over ZeroMQ sockets in your case. Anyway, please keep me posted.
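
Roughly something like this (the plan name and jobs count are placeholders):

options(clustermq.scheduler = "slurm", clustermq.template = "slurm_clustermq.tmpl")
make(
  my_plan,
  parallelism = "clustermq",  # persistent clustermq workers
  jobs = 200,
  caching = "worker"  # workers write to the cache themselves instead of shipping target data back over ZeroMQ
)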

wlandau commented Nov 3, 2019

Dynamic branching reduces the number of static targets, which helps everything get started faster.
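
For example, a dynamic branching sketch (run_model() and the plan below are illustrative only):

library(drake)
plan <- drake_plan(
  index = seq_len(10000),
  result = target(
    run_model(index),      # hypothetical user function
    dynamic = map(index)   # sub-targets are declared at runtime, one per element
  )
)
make(plan, parallelism = "clustermq", jobs = 100)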
