
sequential job submission #105

Closed

marcosci opened this issue Oct 4, 2018 · 3 comments

Comments

@marcosci
Contributor

marcosci commented Oct 4, 2018

Hi Michael,

thanks a ton for clustermq, it has been really nice to play around with it so far!
I have a question about how one could use Q to submit a large number of jobs. I don't know if this is specific to our HPC (which uses LSF), but with clustermq all the jobs are submitted instantly. This is a huge advantage over batchtools and reduces the computation time significantly.

What I want to achieve is to submit each row of a data frame (with say 5000 rows) as a single job.
Apparently, I can only submit ~500 jobs at once to our HPC, so I would like to send 500 jobs there and, after they are finished, send the next 500 jobs.
However, clustermq seems to hold the jobs open and keeps sending new rows to the jobs it opened at the function call. This means that a single job could come quite close to our cluster's walltime limit.

Is it possible to submit, for example, 5000 jobs in chunks of 500 single jobs that run at once, where Q only sends the next 500 jobs after the previous chunk has finished?

With future I would do something like this to achieve that:

library("future")
library("future.batchtools")
library("furrr")

login <- tweak(remote, workers = "xxx", user = 'xxx')

bsub <- tweak(batchtools_lsf, template = 'lsf.tmpl', 
              resources = list(job.name = 'params_nlmr',
                               log.file = 'params_nlmr.log',
                               queue = 'mpi-short',
                               walltime = '00:30',
                               processes = 1)) 

plan(list(
  login,
  bsub,
  sequential 
))

# this submits 500 jobs, one after the other
hpc_sysinfo %<-% future_map(seq_len(500), ~ Sys.info())

Cheers
Marco

@wlandau
Contributor

wlandau commented Oct 4, 2018

Interesting. What about the n_jobs and job_size arguments to Q() and Q_rows()? Is there a reason n_jobs = 500 does not meet your use case?
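For instance, something along these lines (just a sketch; my_fun and df stand in for your actual function and parameter data frame):

library(clustermq)

# hypothetical per-row function -- replace with your real simulation call
my_fun = function(...) Sys.info()[["nodename"]]

# submits the work to at most 500 workers; rows are distributed among them
results = Q_rows(df, my_fun, n_jobs = 500)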

Also, I'm curious: are all your jobs independent? In other words, would you be able to submit all 5000 jobs at once if your sys admin let you?

@mschubert
Owner

mschubert commented Oct 4, 2018

Hi Marco,
Good to hear that you find the package (somewhat) useful!

if I use clustermq all the jobs are submitted instantly

That's likely because we use job arrays instead of submitting each job sequentially. I think batchtools in principle supports this too, but there are some hoops to jump through.

This means that a single job could come the walltime of our cluster quite close

I'm still wondering how common that is. How long does one function call last, and what is the maximum wall time your cluster allows? Related to #101.

Is it possible to submit for example 5000 jobs in chunks of 500 single jobs that are submitted at once, but after finishing them Q sends 500 new jobs?

This is currently not supported. The simplest workaround would be:

result = list()
for (i in seq(1, nrow(df), 500))
    result = c(result, Q_rows(df[i:min(i+499, nrow(df)),], ..., job_size=1))
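Spelled out a bit more (again just a sketch, assuming my_fun is your per-row function and df is the 5000-row parameter data frame): each iteration submits one chunk of up to 500 single-row jobs and only continues once Q_rows has collected all of its results.

chunk_size = 500
result = list()
for (i in seq(1, nrow(df), by = chunk_size)) {
    last = min(i + chunk_size - 1, nrow(df))              # last row of this chunk
    chunk = Q_rows(df[i:last, , drop = FALSE], my_fun, job_size = 1)
    result = c(result, chunk)                             # Q_rows returns a list of results
}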

@marcosci
Contributor Author

marcosci commented Oct 4, 2018

That's likely because we use job arrays instead of submitting each job sequentially. I think batchtools in principle supports this too, but there are some hoops to jump through.

More than somewhat - I don't know how that works, but the scheduling is insanely fast (from pending to running, not only spawning the jobs).

I'm still wondering how common that is. How long does one function call last, and what is the maximum wall time your cluster allows? Related to #101.

I think I am using all of this in a quite unconventional way, so probably not that common. I am simulating and analysing artificial landscapes, which takes quite some time, so a single function call can last up to 4.5 hours. The max walltime I can get out of our multi-purpose queue is 48 h. If I have more than 10 times the number of jobs I can submit at once, I reach that limit (10 rounds × 4.5 h is already ~45 h)...

This is currently not supported. The simplest workaround would be:

That makes sense, thanks 👍

@wlandau Yup, my jobs are all independent (or embarrassingly parallel 😋 ). I am doing a parameter space exploration, and the scheduling with LSF rewards me for sending small jobs (getting an exclusive node takes a day or two here, whereas getting a single core here and there is no problem at all).

marcosci closed this as completed Oct 4, 2018