
sequential job submission #105

Closed

marcosci opened this issue Oct 4, 2018 · 3 comments

Comments

@marcosci
Contributor

marcosci commented Oct 4, 2018

Hi Michael,

thanks a ton for clustermq, it has been really nice to play around with it so far!
I have a question about how one could use Q to submit a large number of jobs. I don't know if this is specific to our HPC (which uses LSF), but with clustermq all the jobs are submitted instantly. This is a huge advantage over batchtools and reduces the computation time significantly.

What I want to achieve is to submit each row of a data frame (with say 5000 rows) as a single job.
Apparently, I can only submit ~500 jobs at once to our HPC, so I would like to send 500 jobs there and, after they are finished, send the next 500 jobs.
However, clustermq seems to hold the jobs open and keeps sending new rows to the jobs it opened at the function call. This means that a single job could come quite close to our cluster's walltime limit.

Is it possible to submit, for example, 5000 jobs in chunks of 500 single jobs that run at once, where Q only sends the next 500 jobs after the previous chunk has finished?

With future I would do something like this to achieve that:

library("future")
library("future.batchtools")
library("furrr")

login <- tweak(remote, workers = "xxx", user = 'xxx')

bsub <- tweak(batchtools_lsf, template = 'lsf.tmpl', 
              resources = list(job.name = 'params_nlmr',
                               log.file = 'params_nlmr.log',
                               queue = 'mpi-short',
                               walltime = '00:30',
                               processes = 1)) 

plan(list(
  login,
  bsub,
  sequential 
))

# this submits 500 jobs, one after the other
hpc_sysinfo %<-% future_map(seq_len(500), ~ Sys.info())

Cheers
Marco

@wlandau
Contributor

wlandau commented Oct 4, 2018

Interesting. What about the n_jobs and job_size arguments to Q() and Q_rows()? Is there a reason n_jobs = 500 does not meet your use case?
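For instance, something along these lines (just a sketch; my_fun and df stand in for your actual function and parameter data frame):

library(clustermq)

# hypothetical per-row function -- replace with your real simulation call
my_fun = function(...) Sys.info()[["nodename"]]

# submits the work to at most 500 workers; rows are distributed among them
results = Q_rows(df, my_fun, n_jobs = 500)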

Also, I'm curious: are all your jobs independent? In other words, would you be able to submit all 5000 jobs at once if your sys admin let you?

@mschubert
Owner

mschubert commented Oct 4, 2018

Hi Marco,
Good to hear that you find the package (somewhat) useful!

if I use clustermq all the jobs are submitted instantly

That's likely because we use job arrays instead of submitting each job sequentially. I think batchtools in principle supports this too, but there are some hoops to jump through.

This means that a single job could come the walltime of our cluster quite close

I'm still wondering how common that is. How long does one function call last, and what is the maximum wall time your cluster allows? Related to #101.

Is it possible to submit for example 5000 jobs in chunks of 500 single jobs that are submitted at once, but after finishing them Q sends 500 new jobs?

This is currently not supported. The simplest workaround would be:

result = list()
for (i in seq(1, nrow(df), 500))
    result = c(result, Q_rows(df[i:min(i+499, nrow(df)),], ..., job_size=1))
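Spelled out a bit more (again just a sketch, assuming my_fun is your per-row function and df is the 5000-row parameter data frame): each iteration submits one chunk of up to 500 single-row jobs and only continues once Q_rows has collected all of its results.

chunk_size = 500
result = list()
for (i in seq(1, nrow(df), by = chunk_size)) {
    last = min(i + chunk_size - 1, nrow(df))              # last row of this chunk
    chunk = Q_rows(df[i:last, , drop = FALSE], my_fun, job_size = 1)
    result = c(result, chunk)                             # Q_rows returns a list of results
}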

@marcosci
Contributor Author

marcosci commented Oct 4, 2018

That's likely because we use job arrays instead of submitting each job sequentially. I think batchtools in principle supports this too, but there are some hoops to jump through.

More than somewhat - I don't know how that works, but the scheduling is insanely fast (from pending to running, not only spawning the jobs).

I'm still wondering how common that is. How long does one function call last, and what is the maximum wall time your cluster allows? Related to #101.

I think I am using all of this in a quite unconventional way, so probably not that common. I am simulating and analysing artificial landscapes, which takes quite some time, so a single function call can last up to 4.5 hours. The max walltime I can get out of our multi-purpose queue is 48 h. If I have more than 10 times the number of jobs I can submit at once, I reach that limit (10 rounds × 4.5 h is already ~45 h)...

This is currently not supported. The simplest workaround would be:

That makes sense, thanks 👍

@wlandau Yup, my jobs are all independent (or embarrassingly parallel 😋 ). I am doing a parameter space exploration, and the scheduling with LSF rewards me for sending small jobs (getting an exclusive node takes a day or two here, whereas getting a single core here and there is no problem at all).

marcosci closed this as completed Oct 4, 2018