
Troubleshooting Makefile parallelism for SLURM #117

Closed
kendonB opened this issue Oct 28, 2017 · 26 comments

kendonB commented Oct 28, 2017

As described here: #115

I am trying to get Makefile parallelism working using SLURM.

First, I get the error Makefile:9: *** missing separator.  Stop. from this code:

library(drake)

simulate <- function(n){
  rnorm(n)
  print("simulating 3")
  Sys.sleep(20)
}

my_plan <- workplan(
  primer1 = simulate(20),
  primer2 = simulate(10),
  data1 = primer1 + 1,
  data2 = primer2 + 2,
  result = mean(c(data1, data2))
)

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing", 
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G", 
    "#SBATCH -C sb",
    "module load R"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)
Makefile:9: *** missing separator.  Stop.

I can't seem to find the Makefile itself to see what it's actually producing. Is there a way to produce the Makefile only, without running it?

@wlandau-lilly
Collaborator

make(my_plan, parallelism = "Makefile", args = c("--touch", "--silent"))

I should put that one in the parallelism vignette. Thanks for another spot-on idea.

Unfortunately, the Makefile is not really human readable. I am using dummy timestamp files to hack Make into only submitting the necessary jobs. It's @krlmlr's idea from wlandau/parallelRemake#4.
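Roughly, the trick is that Make never sees the real targets at all, only dummy timestamp files: Make's dependency resolution decides which recipes need to run, and each recipe calls back into drake to build the real target in the cache. A hypothetical sketch of the generated rules (the actual file uses hashed timestamp names under .drake/ts/, so this is illustration only):

```make
# Hypothetical sketch -- not the literal generated Makefile.
# Make tracks dummy timestamp files; drake's cache holds the real data.
all: .drake/ts/data1 .drake/ts/data2

.drake/ts/data1: .drake/ts/primer1
	Rscript -e 'drake::mk(target = "data1", cache_path = ".drake")'

.drake/ts/primer1:
	Rscript -e 'drake::mk(target = "primer1", cache_path = ".drake")'
```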

For the purposes of this thread, did you only want a better way to troubleshoot, or do you also want a Makefile configured for SLURM? Because I want that too, but I cannot help with that at the moment.

wlandau-lilly added a commit that referenced this issue Oct 28, 2017
As @kendonB brought up, users need to know how
to print the Makefile without actually running it:
make(my_plan, parallelism = "Makefile", args = c("--touch", "--silent"))
Now in the parallelism vignette.
@kendonB
Contributor Author

kendonB commented Oct 28, 2017

Sorry, to clarify: I'm trying to troubleshoot the error Makefile:9: *** missing separator.  Stop. with the ultimate goal of configuring my Makefile for submitting SLURM jobs.

I've added the arguments and my example ran; but where is the makefile? The .makefile folder is empty after I run this.

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 28, 2017

After a passable night's sleep, I think I know what the problem is. GNU Make thinks module load R is a recipe, and it thinks you need a tab to indent it. You may be able to fool it with something like

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing", 
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G", 
    "#SBATCH -C sb",
    "TMP=`module load R`"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)
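For intuition about the original failure: drake writes the prepend lines verbatim at the top of the Makefile. The first eight lines begin with #, which Make treats as comments, but line 9 is module load R, which is neither a rule, a variable assignment, nor a directive — hence Makefile:9: *** missing separator. A hypothetical reconstruction of the top of the generated file:

```make
#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
```

The TMP=`module load R` workaround parses because a variable assignment is legal at the top level of a Makefile, though Make never executes the backquoted command on its own, so the module would still not actually be loaded.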

But I think you need a special shell.sh, as described here. You can generate a starter with shell_file(). Your shell.sh should probably look something like this.

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
shift
echo "module load R; $*" | srun

And then you call make(..., prepend = "SHELL=./shell.sh").

I have not tried this, but an alternative might be the regular shell.sh file written by shell_file():

#!/bin/bash
shift
echo "module load R; $*" | srun

with

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "#!/bin/bash",
    "#SBATCH -J testing", 
    "#SBATCH -A landcare00063",
    "#SBATCH --time=1:00:00",
    "#SBATCH --cpus-per-task=1",
    "#SBATCH --begin=now",
    "#SBATCH --mem=1G", 
    "#SBATCH -C sb",
    "SHELL=./shell.sh"
  ),
  recipe_command = "srun Rscript -e 'R_RECIPE'"
)
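To see why the shift is there: with SHELL=./shell.sh, Make invokes every recipe line as ./shell.sh -c '<recipe>', so $1 is -c and the recipe text follows. The shift discards the -c, leaving the recipe in $*. A quick local check, with echo standing in for the pipe into srun (fake_shell.sh is purely for illustration):

```shell
# Mimic how Make calls a custom SHELL: $(SHELL) -c '<recipe text>'
cat > fake_shell.sh <<'EOF'
#!/bin/bash
shift                      # discard the -c flag Make always passes
echo "module load R; $*"   # print the command instead of piping into srun
EOF
chmod +x fake_shell.sh
./fake_shell.sh -c 'Rscript -e "1+1"'
# prints: module load R; Rscript -e "1+1"
```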

...and you really don't see a Makefile? That's odd, it should write to your working directory at the time you call make(). I don't see how you could get Makefile:9: *** missing separator. Stop. otherwise.

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 28, 2017

By the way, if you get it working, I have colleagues from grad school who would really benefit. It would be a great help if you shared your solution, maybe here in the parallelism vignette, maybe in an example like Makefile-cluster.

@wlandau-lilly wlandau-lilly added this to the CRAN release 4.4.0 milestone Oct 28, 2017
@kendonB
Contributor Author

kendonB commented Oct 28, 2017

Alright, we're progressing! Found the makefile; thanks!

I tried creating a shell.sh with this in it:

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
shift
echo "module load R; $*" | srun

And I get this:

make(
  plan = my_plan,
  targets = c("data1", "data2"), # `primer` is built too
  parallelism = "Makefile",
  jobs = 2,
  recipe_command = "srun Rscript -e 'R_RECIPE'", 
  prepend="SHELL=./shell.sh"
)
check 3 items: print, rnorm, Sys.sleep
import print
import rnorm
import Sys.sleep
check 1 item: simulate
import simulate
srun Rscript -e 'drake::mk(target = "primer1", cache_path = "<wd>/.drake")'
srun Rscript -e 'drake::mk(target = "primer2", cache_path = "<wd>/.drake")'
srun: fatal: No command given to execute.
srun: fatal: No command given to execute.
make: *** [<wd>/.drake/ts/3c356dca4040e3c4] Error 1
make: *** Waiting for unfinished jobs....
make: *** [<wd>/.drake/ts/b3a79b8e12e4bcd5] Error 1

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 29, 2017

Maybe recipe_command = "srun bash -c Rscript -e 'R_RECIPE'"? I wish I could test it myself.

FWIW, this approach dates back to this blog post. My colleagues and I were using that approach in grad school, and it was super convenient at the time. But then they told me it had apparently stopped working, and by then I had graduated and could no longer access the cluster.

@wlandau-lilly
Collaborator

No shell file required, but might not work: make(..., prepend = c("SHELL=srun", ".SHELLFLAGS= -n1 -n1 bash -c")).

@kendonB
Contributor Author

kendonB commented Oct 29, 2017

With recipe_command = "srun bash -c Rscript -e 'R_RECIPE'", I get the same error as above.

With make(..., prepend = c("SHELL=srun", ".SHELLFLAGS= -n1 -n1 bash -c")) and my configuration lines added to prepend, I get the missing separator error again.

This might help: https://mussolblog.wordpress.com/2013/07/17/setting-up-a-testing-slurm-cluster/

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 29, 2017

That's unfortunate. If srun accepts commands, there should be a way to tell it that Rscript is one too. I would very much prefer this solution. If you can reproduce it with a single srun Rscript -e 'print(1234)', it would be great if you would push this to Stack Overflow.

Do you still have a module load R line in your prepend? I think that is what is behind the missing separator. But come to think of it, module load R won't be executed on the nodes at all if it is just a line prepended to the Makefile. In fact, all those prepend lines really belong with each individual job submission, ideally as direct arguments to srun via the recipe_command; they should have no effect as part of the Makefile.

Thank you for sending the Vagrant example. Unfortunately, copying over my munge key timed out.

sudo scp /etc/munge/munge.key vagrant@server:/home/vagrant/
ssh: connect to host server port 22: Connection timed out
lost connection

With the trouble I'm having installing job schedulers, maybe learning Docker or Vagrant is the next step in all this.

@kendonB
Contributor Author

kendonB commented Oct 29, 2017

I can't run srun Rscript -e 'print(1234)' without wrapping it in a shell script, if that's what you were asking for.

It might help to make sure we're on the same page about how I usually do this.

I write a testing.sl file with:

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
srun Rscript workflow.R

Then submit the job using:

sbatch testing.sl

The sbatch command reads the configuration commands and submits the srun(s) to the scheduler.

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 29, 2017

Can testing.sl accept arguments like an ordinary shell script? Maybe something like:

#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
srun Rscript -e '$1'

with

make(
  your_plan,
  parallelism = "Makefile",
  jobs = 8,
  recipe_command = "sbatch testing.sl 'R_RECIPE'"
)

@wlandau-lilly
Collaborator

On second thought, rather than deal with shell scripts that take arguments, it may be better to go back to your earlier attempt with shell.sh and replace srun with sbatch (leaving recipe_command alone). You could try moving the #SBATCH parameters inline with sbatch, or putting them at the top of the Makefile in case I was wrong about that bit.

@wlandau-lilly wlandau-lilly changed the title Troubleshooting Makefile parallelism Troubleshooting Makefile parallelism for SLURM Oct 29, 2017
@wlandau-lilly wlandau-lilly removed this from the CRAN release 4.4.0 milestone Oct 29, 2017
@kendonB
Contributor Author

kendonB commented Oct 29, 2017

I tried your second most recent suggestion and it successfully submits jobs. However, they all failed with the following error:

Error: unexpected '$' in "$"
Execution halted
srun: error: compute-d1-020: task 0: Exited with exit code 1

All 5 jobs got submitted at once as well, so the solution didn't seem to obey the dependency rules.

I'm not sure I understand your most recent suggestion.

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 29, 2017

If all 5 jobs got submitted at once, that makes me think we should always be using srun (blocking) rather than sbatch (non-blocking). And as for the $ error, that's probably a minor syntax mistake.

So maybe this?

# testing.sl
#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
`$1`
# in R
make(
  your_plan,
  parallelism = "Makefile",
  jobs = 8,
  recipe_command = "srun testing.sl 'R_RECIPE'"
)

The latter suggestion probably won't work anyway.
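Incidentally, the Error: unexpected '$' in "$" from before is consistent with a quoting problem: inside testing.sl, srun Rscript -e '$1' hands the literal characters $1 to R, because single quotes suppress parameter expansion. A local check with a stand-in script (no SLURM needed; fake.sl is hypothetical):

```shell
cat > fake.sl <<'EOF'
#!/bin/bash
echo "double quotes: $1"   # $1 expands to the first argument
echo 'single quotes: $1'   # the literal text $1 -- this is what R saw
EOF
chmod +x fake.sl
./fake.sl 'drake::mk(target = "primer1")'
# prints:
#   double quotes: drake::mk(target = "primer1")
#   single quotes: $1
```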

@kendonB
Contributor Author

kendonB commented Oct 29, 2017

It doesn't seem to register the account when running with srun.

srun testing.sl 'drake::mk(target = "primer1", cache_path = "/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake")'
srun testing.sl 'drake::mk(target = "primer2", cache_path = "/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake")'
srun: error: Invalid account used
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
srun: error: Invalid account used
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
make: *** [/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake/ts/b3a79b8e12e4bcd5] Error 1
make: *** Waiting for unfinished jobs....
make: *** [/gpfs1m/projects/landcare00063/projects_ac/measurementerrorandclimate/.drake/ts/3c356dca4040e3c4] Error 

Should it not include multiple srun commands within the .sl file and run with sbatch at the terminal?
See Running Multiple Parallel Jobs Sequentially here: http://www.nersc.gov/users/computational-systems/cori/running-jobs/example-batch-scripts/

@wlandau-lilly
Collaborator

Then maybe SLURM doesn't see the #SBATCH args. Maybe revert and try different ways to quote $1? Or not quote $1 at all?

# testing.sl
#!/bin/bash
#SBATCH -J testing
#SBATCH -A landcare00063
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
#SBATCH --begin=now
#SBATCH --mem=1G
#SBATCH -C sb
module load R
srun Rscript -e '\"$1\"' # Maybe play around here.
# in R
make(
  your_plan,
  parallelism = "Makefile",
  jobs = 8,
  recipe_command = "sbatch testing.sl 'R_RECIPE'"
)

@kendonB
Contributor Author

kendonB commented Oct 30, 2017

The above runs, but it again submits all 5 jobs at once. I tried a bunch of permutations of '\"$1\"' and nothing worked. I think at this point the best way forward is for you to get a test SLURM scheduler working for yourself, as the back and forth is quite inefficient. I'm sorry I wasn't able to get this all the way to the end!

@wlandau-lilly
Collaborator

Yeah, that sounds like the best plan. I really am trying: I went through several resources, plus several forums, but could not get any of them to work. I still get

$ slurmd
slurmd: fatal: Frontend not configured correctly in slurm.conf.  See man slurm.conf look for frontendname.

When I do get SLURM working, probably the first thing I will do is test the --wait flag for sbatch, as in recipe_command = "sbatch --wait testing.sl 'R_RECIPE'".

@wlandau-lilly
Collaborator

I really need all the help I can get to get SLURM working on Ubuntu 16.04.

@wlandau-lilly
Collaborator

As I mentioned in #115, I got SLURM to run on a Debian VM. (I followed this guide, substituting my own user name for slurm and setting both the master and node names to Debian64, the VM's hostname.) The following worked perfectly for me.

library(drake)
load_basic_example()
make(
  my_plan,
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "SHELL=srun",
    ".SHELLFLAGS=-N1 -n1 bash -c"
  )
)
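The reason this works: Make launches each recipe line as $(SHELL) $(.SHELLFLAGS) '<recipe>', so every line becomes a blocking srun call, and Make's dependency graph plus the jobs setting control how many run at once. Schematically:

```make
SHELL=srun
.SHELLFLAGS=-N1 -n1 bash -c

# A recipe line such as
#     Rscript -e 'drake::mk(target = "primer1", cache_path = ".drake")'
# is then executed as
#     srun -N1 -n1 bash -c "Rscript -e 'drake::mk(...)'"
# and srun blocks until the job finishes, so dependency order is obeyed.
```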

I am simultaneously stoked that something this simple actually worked and bothered that I cannot reproduce everyone's errors. I thought it might be because I listed myself in /etc/sudoers, but then it still worked when I took myself off and tried again. It could be something different about the real cluster environment.

@kendonB
Contributor Author

kendonB commented Oct 30, 2017

How would one add the SBATCH configuration in the above?

@wlandau-lilly
Collaborator

Command-line arguments to srun should cover it here. For example, I can still set the job name (though it's silly to have so many jobs with the same name).

make(
  my_plan,
  parallelism = "Makefile",
  jobs = 2,
  prepend = c(
    "SHELL=srun",
    ".SHELLFLAGS=-J testjob -N1 -n1 bash -c"
  )
)

squeue showed me that they're all named testjob. But anyway, I pushed it to Stack Overflow.

@wlandau-lilly
Collaborator

@kendonB Please see the response on Stack Overflow. srun should be able to meet our needs.

@wlandau-lilly
Collaborator

wlandau-lilly commented Oct 30, 2017

Not surprisingly, SLURM arrays are not an option with this approach. The new rslurm package would cover this as a separate special backend. Given the other bottlenecks from drake itself, accommodating this special case may or may not be worth the efficiency gains.

@wlandau-lilly
Collaborator

@kendonB, from what you learned solving #115, do you think #117 could be solved the same way? Is it even worth the time now that you have #115? If you no longer need #117 to work, please let me know. Makefiles with srun seem to work for me, so I would prefer to either troubleshoot more with you or close the issue.

@kendonB
Contributor Author

kendonB commented Oct 31, 2017

I had that sb config flag on this one, so this thread would have been a separate problem. Since #115 seems to be working for me, let's just close this issue until someone says they have the same problem.
