future_lapply workers don't start working right away #449
Just tried limiting the number of possible targets to 100, and it started going after about 3 minutes. I'm guessing it's the loop in #435, which I believe scales with the number of targets being built?
It could be #435, but I think it is more likely related to futureverse/future#232. It is difficult to capture stdout and stderr over a network, especially for persistent workers.
My understanding is that the stdout messages are being captured by batchtools and written to log files. In December, when I was running another big project, I saw the "target" messages in the batchtools logs very quickly after the worker appeared, so I doubt this has anything to do with capturing stdout.
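Those per-worker logs can be inspected directly with batchtools, assuming you can locate the registry directory that future.batchtools created (the path below is a placeholder):

```r
library(batchtools)
# Attach read-only to an existing registry and pull a worker's log,
# where the "target ..." lines should show up once work begins.
reg <- loadRegistry(file.dir = ".future/registry", writeable = FALSE)  # placeholder path
getStatus(reg = reg)       # submitted / started / done / error counts
getLog(id = 1, reg = reg)  # raw log lines for job 1
```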
This one didn't seem to get going for the full 5 hours it ran overnight. Just 105/9851 targets got built, and each one should take about 4 minutes of CPU time for the building stage. That's just 420 out of 30000 minutes of CPU time spent building targets. :/
When I run Line 75 in e4df333, it just cycles. As in, I see:

```r
Browse[5]> n
debug: Sys.sleep(mc_wait)
Browse[5]> n
debug: (while) nrow(msg <- ready_queue$list(1)) < 1
Browse[5]> n
debug: gc()
Browse[5]> n
debug: Sys.sleep(mc_wait)
Browse[5]> n
debug: (while) nrow(msg <- ready_queue$list(1)) < 1
Browse[5]> n
debug: gc()
Browse[5]> n
debug: Sys.sleep(mc_wait)
Browse[5]> n
debug: (while) nrow(msg <- ready_queue$list(1)) < 1
```

@wlandau, what behaviour do you expect here, and what exactly is it waiting for?
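That transcript corresponds to a polling loop along these lines (a paraphrase for readability, not the exact drake source; `mc_wait` and `ready_queue` come from the transcript above, and the function wrapper is illustrative):

```r
# Paraphrase of the loop seen in the debugger: the master blocks until the
# "ready" queue holds at least one message from a worker, garbage-collecting
# and sleeping between polls. A worker that never posts keeps this spinning.
poll_ready <- function(ready_queue, mc_wait) {
  while (nrow(msg <- ready_queue$list(1)) < 1) {
    gc()                # free memory between polls
    Sys.sleep(mc_wait)  # avoid a busy-wait
  }
  msg                   # the first "ready" message
}
```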
Back in December: Lines 11 to 15 in 9abd398.
I did submit the asynchronous master process, and I can see it there in the SLURM queue.
It may be related to the fl_master zombie processes floating around on the master node...
@wlandau From what I can tell, this code is waiting for all the workers to say "I'm ready!" via the existence of their ready paths? Does this mean that the master process will wait in that while loop until all the workers are active? If something goes wrong with one worker, do none of them start going?

Also, if a worker is still sitting in the SLURM queue, it wouldn't show as "ready"?
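A minimal sketch of the handshake being asked about, assuming the master simply waits for each worker's ready file to appear (`wait_for_workers`, the paths, and the timeout are illustrative, not drake's actual code). Under that reading, one worker stuck in the SLURM queue would indeed stall everyone:

```r
# Hypothetical master-side wait: block until every worker's ready file
# exists. A queued or crashed worker never creates its file, so without
# a timeout this loop never exits -- matching the observed hang.
wait_for_workers <- function(ready_paths, poll = 1, timeout = Inf) {
  start <- Sys.time()
  while (!all(file.exists(ready_paths))) {
    if (difftime(Sys.time(), start, units = "secs") > timeout) {
      stop("workers never signaled ready: ",
           paste(ready_paths[!file.exists(ready_paths)], collapse = ", "))
    }
    Sys.sleep(poll)  # avoid a busy-wait between filesystem checks
  }
}
```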
Since all 10000 targets in the first stage have a common dependency, it's also possible that this is because all 100 workers are calling
Yes, I thought this policy would maximize safety and ensure decent load balancing (see also #453 (comment)).
Correct. The worker needs to actually start and send a "ready" message to the master.

Could be, but I think it is unlikely with persistent workers. For what it's worth, now that #452 is merged, you can try clustermq_staged parallelism:

```r
library(drake)
drake_hpc_template_file("slurm_clustermq.tmpl") # modify manually
# See https://github.com/mschubert/clustermq/issues/88
options(clustermq.scheduler = "slurm", clustermq.template = "slurm_clustermq.tmpl")
load_mtcars_example()
make(my_plan, parallelism = "clustermq_staged", jobs = 4, verbose = 4)
```
@wlandau OK. The workers are spending ages getting through the targets that are already up to date. By comparing log lengths, it looks like a worker is taking about 2-3 seconds to process an up-to-date target using future_lapply with SLURM. This is about 30 times what it seems to take when running locally.
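For reference, the 2-3 seconds figure above came from comparing log lengths. A rough way to reproduce that estimate, assuming each processed target appends one line to a worker log (`worker-1.log` is a placeholder path):

```r
# Back-of-envelope throughput estimate from a growing worker log:
# sample the line count twice and divide elapsed time by new lines.
count_lines <- function(path) length(readLines(path))
log_path <- "worker-1.log"                 # placeholder path
n0 <- count_lines(log_path); t0 <- Sys.time()
Sys.sleep(60)                              # wait a minute between samples
n1 <- count_lines(log_path); t1 <- Sys.time()
as.numeric(difftime(t1, t0, units = "secs")) / max(n1 - n0, 1)  # sec/target
```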
I'm going to try investigating further by running the last worker in an interactive session.
@wlandau Does the clustermq_staged parallelism send data back to the master for caching continuously, or all at the end?
At every stage.
For some feedback: I tested making 600 ~5 min targets using 200 jobs, and they completed (including caching) in about 20 minutes. Pretty good! However, I also tried making 1000 ~5 min targets using 500 jobs, and it failed to complete in an hour. I can't tell whether they got through the first 500, because they may have failed before getting a chance to send content back for caching. I then tried 3000 targets using 200 jobs, left it for about 20 mins, and then saw the SLURM jobs disappear. Finally, I'm trying the 3000 again, and I still see the SLURM jobs 10 minutes in. It doesn't seem like clustermq is quite robust enough to handle large HPC jobs yet. Maybe allowing for caching by the workers could improve things.
Edit: the jobs disappeared between 10 and 20 minutes in when trying to build 3000 targets using 200 jobs.
Did at least some of the targets complete? You probably did not hit your wall time limit, but I have to ask anyway. Although the targets are processed in stages, they all share the same common pool of persistent workers (ref: mschubert/clustermq#86 (comment)).
None of the targets got cached, but I couldn't tell whether any targets completed before being sent back to the master for caching. And correct, I wasn't near my wall time limit.
Would you be willing to try those same experiments on
@kendonB Could it be that you ran out of memory? Please try
That should not happen, but I found that the guard using
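One way to rule memory in or out on SLURM, assuming `sacct` is available (the job id below is a placeholder): a job killed for memory typically shows an OUT_OF_MEMORY or CANCELLED state with a MaxRSS near ReqMem.

```r
# Query SLURM accounting for a finished job's peak memory and final state.
system2("sacct", c("-j", "123456",
                   "--format=JobID,State,MaxRSS,ReqMem,Elapsed"))
```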
Doesn't look like memory. Looks like the system administration isn't into it:
I wonder what the objection could be. Please let me know if I should modify
@kendonB, have you been able to find out why these
Never got to the bottom of it. The HPC has since undergone a revamp, so it might work better now. They did say that cluster jobs shouldn't usually use the network, so it might be network constraints. Of course, using the network saves on disk usage, which seems to bite far more often than network usage does. I'd have to go through the hassle of installing zeromq again, so I'll try this out again some time down the line.
Yeah,
Dynamic branching reduces the number of static targets, which helps everything get started faster. |
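A small sketch of that idea, assuming drake's dynamic branching API (`dynamic = map(...)`): the plan stays at two static targets no matter how many sub-targets fan out at runtime.

```r
library(drake)
# One static target ("result") dynamically fans out into 10000 sub-targets,
# so the scheduler no longer has to enumerate 10000 static targets up front.
plan <- drake_plan(
  index  = seq_len(10000),
  result = target(sqrt(index), dynamic = map(index))
)
make(plan)
```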
I am running make using `future_lapply` parallelism on a SLURM cluster via `future.batchtools` with `jobs = 100`. When examining the batchtools logs after 10 minutes, I still don't see the `target ...` message indicating that the worker has started working. When running the make with 2 targets and 2 jobs, I see the `target ...` message a few seconds after they start. Can you give me some ideas for how to troubleshoot this?
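For concreteness, the setup described above probably looked something like this (the template file name and `my_plan` are placeholders; `parallelism = "future_lapply"` was the drake backend in question at the time):

```r
library(drake)
library(future.batchtools)
# Route each of the 100 workers to SLURM via a batchtools template file.
future::plan(batchtools_slurm, template = "batchtools.slurm.tmpl")
make(my_plan, parallelism = "future_lapply", jobs = 100)
```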